feat: update milvus model to support jina embeddings 3

milvus-io / milvus-model

The embedding/reranking model zoo helps users convert their unstructured data into embeddings

Apache License 2.0

feat: update milvus model to support jina embeddings 3 #38

Closed: bwanglzu closed this 2 months ago

bwanglzu commented 2 months ago

This PR should allow pymilvus to support `jina-embeddings-v3`, with 2 additional fields:

+ `task_type`: used to load the proper adapter.
+ `dimensions`: allows the user to truncate the embedding dimension with MRL.
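
For readers unfamiliar with MRL (Matryoshka Representation Learning) truncation, here is a minimal sketch of the idea behind `dimensions`. It assumes that truncation keeps the leading components of the vector and re-normalizes the result to unit length. The helper name `truncate_embedding` is hypothetical and is not part of this PR or of pymilvus.

```python
import math


def truncate_embedding(vector, dimensions):
    """Keep the first `dimensions` components and re-normalize to unit length.

    Hypothetical helper for illustration only; not part of milvus-model.
    """
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    # Guard against an all-zero prefix to avoid dividing by zero.
    if norm == 0.0:
        return head
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)  # two components, unit length
```

Whether truncation happens on the client or on the service side is not stated in this thread.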

wxywb commented 2 months ago

Is there any doc about the jina v3 embedding? What kinds of `task_type` can it use?

bwanglzu commented 2 months ago

Hi @wxywb, yes, I'm working on another PR to milvus-docs with detailed documentation. The tasks are:

+ **retrieval.query**: Used to encode user queries or questions in retrieval tasks.
+ **retrieval.passage**: Used to encode large documents in retrieval tasks at indexing time.
+ **classification**: Used to encode text for text classification tasks.
+ **text-matching**: Used to encode text for similarity matching, such as measuring similarity between two sentences.
+ **separation**: Used for clustering or reranking tasks.
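
Since `task_type` only accepts the five values listed above, one option is to validate it early. The following is a sketch only: the names `JINA_V3_TASKS` and `validate_task_type` are hypothetical and do not come from this PR.

```python
# Task names copied from the list above; the constant name is hypothetical.
JINA_V3_TASKS = frozenset(
    {
        "retrieval.query",
        "retrieval.passage",
        "classification",
        "text-matching",
        "separation",
    }
)


def validate_task_type(task_type):
    """Return task_type unchanged if it is a known task, else raise ValueError."""
    if task_type not in JINA_V3_TASKS:
        allowed = ", ".join(sorted(JINA_V3_TASKS))
        raise ValueError(f"unknown task_type {task_type!r}; expected one of: {allowed}")
    return task_type
```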

bwanglzu commented 2 months ago

I need to figure out a good approach for this PR. My problem is that we have encode_query and encode_document functions built into each embedding function, such as JinaEmbeddingFunction. I guess in these two functions I need to use task_type=retrieval.query and task_type=retrieval.passage respectively.

For the other tasks, I'm not sure what the best way to pass this task_type is, so I overrode the `__call__` function (not sure if that is desired).

Can you give me some suggestions?
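
To make the question concrete, here is a minimal sketch of the routing described above. It assumes the two retrieval methods pin their own task while `__call__` uses a configurable one. The class `TaskRoutingEmbedder`, its method names, and the injected `backend` callable are all hypothetical stand-ins, not the actual JinaEmbeddingFunction code.

```python
from typing import Callable, List

# Hypothetical backend signature: (texts, task_type) -> one vector per text.
Backend = Callable[[List[str], str], List[List[float]]]


class TaskRoutingEmbedder:
    """Illustrative only: routes each entry point to a task type."""

    def __init__(self, backend: Backend, task_type: str = "text-matching"):
        self._backend = backend
        self._task_type = task_type

    def encode_queries(self, texts: List[str]) -> List[List[float]]:
        # Queries always use the query-side retrieval adapter.
        return self._backend(texts, "retrieval.query")

    def encode_documents(self, texts: List[str]) -> List[List[float]]:
        # Documents always use the passage-side retrieval adapter.
        return self._backend(texts, "retrieval.passage")

    def __call__(self, texts: List[str]) -> List[List[float]]:
        # Every other use falls back to the task chosen at construction.
        return self._backend(texts, self._task_type)


def fake_backend(texts: List[str], task_type: str) -> List[List[float]]:
    # Stand-in for a real embedding call; returns dummy one-element vectors.
    return [[float(len(text))] for text in texts]


embedder = TaskRoutingEmbedder(fake_backend, task_type="classification")
query_vectors = embedder.encode_queries(["what is milvus?"])
```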

bwanglzu commented 2 months ago

(btw if you are in the jina_ai-x-milvus joint slack channel, i can give you more detailed information :)

wxywb commented 2 months ago

The current design puts `task_type` in the `__init__` method, and the `__call__` method uses this setting. If someone needs to change it, they can create a new embedding function. You can see an example in the Cohere embedding function: https://github.com/milvus-io/milvus-model/blob/main/milvus_model/dense/cohere.py
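
A minimal sketch of that pattern, where the settings are fixed at construction and a different task means a new object. The `EmbeddingConfig` class is hypothetical and the default values are assumptions for illustration, not taken from cohere.py or from this PR.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EmbeddingConfig:
    """Hypothetical immutable settings, fixed when the function is created."""

    model_name: str = "jina-embeddings-v3"
    task_type: str = "retrieval.passage"  # assumed default, for illustration
    dimensions: int = 1024  # assumed default, for illustration


# One configuration per use case, instead of mutating a shared instance.
indexing = EmbeddingConfig()
clustering = replace(indexing, task_type="separation")
```

Because the instances are frozen, changing the task always goes through creating a new object, which mirrors the "create a new embedding function" suggestion.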