opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.

[FEATURE] support BGE models in OpenSearch pretrained models list #2210

Open zhichao-aws opened 4 months ago

zhichao-aws commented 4 months ago

Is your feature request related to a problem? In OpenSearch we support some sentence-transformers models as pretrained models. Registering a pretrained model is much more convenient, and users don't need to change the cluster setting plugins.ml_commons.allow_registering_model_via_url.
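For reference, a minimal sketch of the extra step that URL-based registration requires, assuming a local development cluster on localhost:9200 with security disabled (the model name, hash, and URL are placeholders):

```python
import requests

BASE = "http://localhost:9200"  # assumed local dev cluster, security disabled

# URL-based registration first requires relaxing this cluster setting.
requests.put(
    f"{BASE}/_cluster/settings",
    json={"persistent": {"plugins.ml_commons.allow_registering_model_via_url": True}},
)

# Only then can a custom text_embedding model be registered from a URL.
requests.post(
    f"{BASE}/_plugins/_ml/models/_register",
    json={
        "name": "my-custom-embedding-model",  # placeholder
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "sentence_transformers",
        },
        "model_content_hash_value": "<sha256-of-tarball>",  # placeholder
        "url": "https://example.com/my-model.zip",           # placeholder
    },
)
```

Pretrained models skip both the setting change and the manual artifact hosting.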

With the ongoing research and engineering progress in the IR domain, there are now much stronger text_embedding models in the open-source community. (leaderboard ref) However, users still need to trace these models and generate the tarball manually, which is a heavy workload, especially for those with little machine-learning background.

What solution would you like? The BGE models (https://huggingface.co/BAAI/bge-small-en-v1.5, https://huggingface.co/BAAI/bge-base-en-v1.5, https://huggingface.co/BAAI/bge-large-en-v1.5) provide very strong text_embedding representations among models of the same size, and they can be used in the same way as other sentence-transformers text_embedding models. We can support these models as pretrained models in OpenSearch.
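If BGE models were published as pretrained models, registering one should look like the existing pretrained flow, with no cluster-setting change and no manual tarball; the name and version below are hypothetical until the artifacts are actually published:

```python
import requests

BASE = "http://localhost:9200"  # assumed local dev cluster, security disabled

# Hypothetical pretrained registration for a BGE model; the exact name and
# version will be whatever gets published to the OpenSearch artifacts site.
requests.post(
    f"{BASE}/_plugins/_ml/models/_register",
    json={
        "name": "huggingface/BAAI/bge-small-en-v1.5",  # hypothetical name
        "version": "1.0.1",                            # hypothetical version
        "model_format": "TORCH_SCRIPT",
    },
)
```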

What alternatives have you considered? N/A

Do you have any additional context? N/A

zhichao-aws commented 4 months ago

Please assign this issue to me. We can try to get this done before the 2.13 release.

This feature doesn't require any code changes in ml-commons. Based on my understanding of the code, once we build the model tarball artifacts and upload them to the OpenSearch public artifacts website https://artifacts.opensearch.org/models/ml-models/, we can use them like other pretrained text_embedding models. Please correct me if it doesn't work as I expect.
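A rough sketch of what producing such an artifact could look like with opensearch-py-ml, assuming its SentenceTransformerModel tracing helpers work for BGE checkpoints the same way they do for other sentence-transformers models (paths are illustrative):

```python
# Sketch: trace a BGE model into an OpenSearch-compatible artifact with
# opensearch-py-ml. Assumes SentenceTransformerModel handles BGE checkpoints
# like any other sentence-transformers model.
from opensearch_py_ml.ml_models import SentenceTransformerModel

model = SentenceTransformerModel(
    model_id="BAAI/bge-small-en-v1.5",
    folder_path="./bge-small-en-v1.5-artifact",  # illustrative output path
    overwrite=True,
)

# Trace to TorchScript and package it as the zip/tarball to upload.
model_path = model.save_as_pt(
    model_id="BAAI/bge-small-en-v1.5",
    sentences=["Example sentence used for tracing."],
)

# Generate the model config JSON that ml-commons expects alongside the artifact.
config_path = model.make_model_config_json()
print(model_path, config_path)
```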

dhrubo-os commented 4 months ago

Let's cut an issue here: https://github.com/opensearch-project/opensearch-py-ml/issues

And I believe we need to update this workflow: https://github.com/opensearch-project/opensearch-py-ml/actions/workflows/model_uploader.yml

lkyhfx commented 4 months ago

Could we consider adding support for bge-m3 as well? bge-m3 is a model that can generate both dense and sparse vectors, and it has gained significant popularity within the Hugging Face community, with approximately 1.2 million downloads last month. Additionally, BAAI has recently published a cross-encoder model called bge-reranker-v2-m3 based on bge-m3. Given its growing popularity, it's foreseeable that more all-in-one models supporting various information retrieval (IR) tasks will emerge in the near future. Hence, integrating bge-m3 as a dense vector model, sparse vector model, and cross-encoder model could be a promising initiative.

P.S.: The latest version of Milvus has already incorporated bge-m3 as a hybrid search model: hello_hybrid_sparse_dense.py.
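For reference, a small sketch of bge-m3's hybrid output through the FlagEmbedding package (assuming its BGEM3FlagModel API behaves as documented upstream), which is why a single deployment would need to return both dense and sparse vectors:

```python
# Sketch of bge-m3's hybrid dense/sparse output via the FlagEmbedding package;
# assumes BGEM3FlagModel and its encode() flags behave as documented upstream.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

output = model.encode(
    ["What is OpenSearch?"],
    return_dense=True,   # dense vectors for k-NN / neural search
    return_sparse=True,  # per-token weights for sparse retrieval
)

dense_vecs = output["dense_vecs"]           # (num_sentences, 1024) array
sparse_weights = output["lexical_weights"]  # list of {token: weight} dicts
```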

zhichao-aws commented 3 months ago

bge-m3 sounds great. But based on my knowledge, the current ml-commons framework cannot support models with hybrid dense/sparse output. We would need further design work on local or remote deployment to support these hybrid models, and we may need to enhance the neural-search ingest processors too. These topics are complex enough to be tracked separately from the current onboarding pipeline, so we can create a new issue for them.

BTW, from my perspective, bge-m3 has a very large model size (2.27 GB) and a very long context length (8192 tokens). With local deployment, I'm concerned the model would take too many resources and could impact other tasks in the cluster. It makes more sense to deploy the model on dedicated resources and use a remote connector, e.g. SageMaker.
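A very rough sketch of the remote-deployment direction, assuming the model is already served behind a SageMaker endpoint; the endpoint URL, credentials, and request-body mapping are placeholders and depend on how the model is actually hosted:

```python
import requests

BASE = "http://localhost:9200"  # assumed local dev cluster, security disabled

# Sketch: create a remote connector pointing at a SageMaker endpoint serving
# bge-m3. All endpoint details and the request_body mapping are placeholders.
requests.post(
    f"{BASE}/_plugins/_ml/connectors/_create",
    json={
        "name": "bge-m3-sagemaker-connector",
        "description": "Remote connector to a SageMaker endpoint serving bge-m3",
        "version": 1,
        "protocol": "aws_sigv4",
        "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
        "credential": {"access_key": "<key>", "secret_key": "<secret>"},
        "actions": [
            {
                "action_type": "predict",
                "method": "POST",
                "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint>/invocations",
                "request_body": "{\"inputs\": ${parameters.input}}",
            }
        ],
    },
)
```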