michaelfeil / infinity

Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
https://michaelfeil.github.io/infinity/
MIT License

Pass kwargs to encoder #482

Open RichaMax opened 13 hours ago

RichaMax commented 13 hours ago

Feature request

Models like https://huggingface.co/BAAI/bge-m3 and https://huggingface.co/jinaai/jina-embeddings-v3 accept extra kwargs in their encode function, such as task=... for Jina v3 or return_dense=False/True for bge-m3.

It would be great if we could pass these kwargs either when using the async engine via the Python API, e.g. engine.embed(sentences=[...], **kwargs),

or when sending requests to an endpoint created using your docker image:

r = requests.post("http://0.0.0.0:7997/embeddings", json={"model":"test_model","input":["Two cute cats."], "task": "text-matching"})
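To make the request above concrete, here is a minimal sketch of how a server could separate the standard fields from the extra kwargs in such a payload. This is only an illustration of the idea, not how Infinity handles requests: `KNOWN_FIELDS` and `split_payload` are hypothetical names, and the set of standard fields is an assumption based on the example request.

```python
# Hypothetical helper, not part of Infinity's API.
# Assumption: "model" and "input" are the standard fields of the request body;
# every other key is treated as an extra kwarg intended for the encoder.
KNOWN_FIELDS = {"model", "input"}


def split_payload(payload: dict) -> tuple[dict, dict]:
    """Split a request body into (standard fields, extra encoder kwargs)."""
    standard = {k: v for k, v in payload.items() if k in KNOWN_FIELDS}
    extra = {k: v for k, v in payload.items() if k not in KNOWN_FIELDS}
    return standard, extra


payload = {
    "model": "test_model",
    "input": ["Two cute cats."],
    "task": "text-matching",
}
standard, extra = split_payload(payload)
print(standard)
print(extra)
```

With this split, `extra` would hold `{"task": "text-matching"}`, which is the part that would need to reach the model's encode call.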

Motivation

This could also be used to handle truncate_dim for Matryoshka embeddings.

Might be linked to: #476

Your contribution

I could try to implement it in my free time, but I do not have much at the moment and I'm still finding my way around the code. Any pointers on where to start are welcome.

michaelfeil commented 8 hours ago

Matryoshka embeddings are handled in another issue.

Because of dynamic batching, models such as Jina would need to handle the prompt template at the instance level, i.e. per input. However, this is currently done at the batch level, and a single batch may combine inputs from multiple requests/tenants. Therefore this request is not possible.
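The conflict described above can be illustrated with a small, self-contained simulation. It is a sketch under stated assumptions, not Infinity's actual batching code: `Item`, `form_batch`, and `batch_level_kwargs` are hypothetical names, and the task values are only example strings. It shows that once inputs from two requests with different kwargs land in the same batch, no single batch-level kwargs dict can serve both.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One input sentence waiting in the queue (hypothetical structure)."""
    text: str
    request_id: int
    kwargs: dict = field(default_factory=dict)


def form_batch(queue: list, batch_size: int) -> list:
    """Dynamic batching, simplified: take the next items regardless of request."""
    return queue[:batch_size]


def batch_level_kwargs(batch: list) -> dict:
    """Return the one kwargs dict a batch-level encode call could use.

    Raises ValueError when the items in the batch disagree on their kwargs.
    """
    distinct = {tuple(sorted(item.kwargs.items())) for item in batch}
    if len(distinct) > 1:
        raise ValueError("items in this batch need different kwargs")
    return dict(next(iter(distinct))) if distinct else {}


# Two tenants send requests with different task values at the same time.
queue = [
    Item("Two cute cats.", request_id=1, kwargs={"task": "text-matching"}),
    Item("What is a cat?", request_id=2, kwargs={"task": "retrieval.query"}),
]

batch = form_batch(queue, batch_size=2)
try:
    batch_level_kwargs(batch)
    conflict = False
except ValueError:
    conflict = True

print("conflict in mixed batch:", conflict)
```

A batch containing only items from one request (for example `queue[:1]`) resolves to a single kwargs dict without error, which is why the limitation only shows up when batches span requests.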