opensearch-project / opensearch-py-ml

Apache License 2.0

[FEATURE] enhance model_uploader workflow to support BGE models from huggingface #387

Closed zhichao-aws closed 5 months ago

zhichao-aws commented 7 months ago

Is your feature request related to a problem? In OpenSearch, we support some sentence-transformers models as pretrained models. Registering a pretrained model is much more convenient, and users don't need to change the cluster setting plugins.ml_commons.allow_registering_model_via_url.

As research and engineering in the IR domain have advanced, much stronger text_embedding models have become available in the open source community. (leaderboard ref) However, users still need to trace these models and generate the tarball manually, which is a heavy workload, especially for those with little machine-learning background.
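For context, the manual route requires enabling the cluster setting named above before a model can be registered from a URL. A minimal sketch of that step, assuming the opensearch-py client is used; the helper names `allow_url_registration_settings` and `apply_settings` are hypothetical and only illustrate the request body:

```python
def allow_url_registration_settings(enabled=True):
    """Build the cluster-settings body that toggles model registration via URL.

    The setting name comes from the issue text; the helper itself is illustrative.
    """
    return {
        "persistent": {
            "plugins.ml_commons.allow_registering_model_via_url": enabled
        }
    }


def apply_settings(client, enabled=True):
    """Send the settings with an opensearch-py client (assumed to be already connected)."""
    return client.cluster.put_settings(body=allow_url_registration_settings(enabled))
```

Pretrained models skip this step entirely, which is the convenience this request is after.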

What solution would you like? The BGE models (https://huggingface.co/BAAI/bge-small-en-v1.5, https://huggingface.co/BAAI/bge-base-en-v1.5, https://huggingface.co/BAAI/bge-large-en-v1.5) provide very strong text_embedding representations among models of the same size, and we can use them in the same way as other sentence-transformers text_embedding models.

Considering that the models consume resources in local deployment, we can support bge-small-en-v1.5 and bge-base-en-v1.5 as pretrained models in OpenSearch.
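To illustrate what "consistent with other sentence-transformers models" could look like on the client side, here is a rough sketch that loads a BGE checkpoint through the sentence-transformers library and encodes text. It assumes the `sentence_transformers` package is installed and the model can be downloaded; `load_encoder` and `embed` are hypothetical helper names, not part of opensearch-py-ml.

```python
# Model IDs taken from the Hugging Face links in this issue.
BGE_MODEL_IDS = [
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-base-en-v1.5",
    "BAAI/bge-large-en-v1.5",
]


def load_encoder(model_id):
    """Load a checkpoint with sentence-transformers (imported lazily; needs network access)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_id)


def embed(texts, encoder):
    """Encode texts into unit-length embeddings using any object with an encode() method."""
    return encoder.encode(texts, normalize_embeddings=True)
```

Usage would be along the lines of `embed(["hello world"], load_encoder(BGE_MODEL_IDS[0]))`, the same call pattern used for any other sentence-transformers text-embedding model.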

What alternatives have you considered? A clear and concise description of any alternative solutions or features you've considered.

Do you have any additional context? https://github.com/opensearch-project/ml-commons/issues/2210

dblock commented 5 months ago

Catch All Triage - 1 2 3 4 5 6

zhichao-aws commented 5 months ago

We need to deprecate this work item because the models use Reddit data as training data.