Is your feature request related to a problem?
In OpenSearch we support some sentence-transformers models as pretrained models. Registering a pretrained model is much more convenient, and users don't need to change the cluster setting plugins.ml_commons.allow_registering_model_via_url.
With the research and engineering progress in the IR domain, there are now much stronger text_embedding models in the open-source community (leaderboard ref). However, users still have to track down these models and generate the model tarball manually, which is a heavy workload, especially for those with little machine-learning background.
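To illustrate the current burden, here is a rough sketch of what registering such a model via URL involves today. This is hedged: it assumes an unsecured local cluster, and the zip URL, content hash, and model_config values are placeholders that the user has to produce after packaging the model themselves.

```python
import requests

OPENSEARCH = "http://localhost:9200"  # assumed local, unsecured cluster endpoint

# 1. Enable URL-based model registration (cluster-level setting).
requests.put(
    f"{OPENSEARCH}/_cluster/settings",
    json={"persistent": {"plugins.ml_commons.allow_registering_model_via_url": True}},
).raise_for_status()

# 2. Register the manually packaged model from a self-hosted zip.
#    The URL, hash, and config values below are placeholders for illustration.
resp = requests.post(
    f"{OPENSEARCH}/_plugins/_ml/models/_register",
    json={
        "name": "BAAI/bge-small-en-v1.5",
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_content_hash_value": "<sha256-of-the-zip>",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "sentence_transformers",
        },
        "url": "https://<your-host>/bge-small-en-v1.5.zip",
    },
)
print(resp.json())  # returns a task_id; poll the tasks API until the model_id is available
```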
What solution would you like?
BGE models (https://huggingface.co/BAAI/bge-small-en-v1.5, https://huggingface.co/BAAI/bge-base-en-v1.5, https://huggingface.co/BAAI/bge-large-en-v1.5) provide very strong text_embedding representations among models of the same size, and they can be used in the same way as the other sentence-transformers text_embedding models.
Considering that the models will consume resources when deployed locally, we can support bge-small-en-v1.5 and bge-base-en-v1.5 as pretrained models in OpenSearch.
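If these models were onboarded as pretrained models, registration could look the same as for the existing sentence-transformers pretrained models, with no cluster-setting change and no manual packaging. A minimal sketch, assuming a local cluster; the exact pretrained model name and version strings below are assumptions until the models are actually onboarded.

```python
import requests

OPENSEARCH = "http://localhost:9200"  # assumed local cluster endpoint

# Register by pretrained model name only; no url, hash, or model_config needed.
# The name and version strings here are illustrative assumptions.
resp = requests.post(
    f"{OPENSEARCH}/_plugins/_ml/models/_register",
    json={
        "name": "huggingface/BAAI/bge-small-en-v1.5",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
    },
)
print(resp.json())  # task_id to poll until the model is registered and ready to deploy
```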
What alternatives have you considered?
Do you have any additional context?
https://github.com/opensearch-project/ml-commons/issues/2210