Open joaomsimoes opened 2 weeks ago
To be honest, I haven't decided yet. For CLIP, I don't see the benefit of supporting the image encoder in TEI: ViTs are very different and can be easily saturated. The text encoder, for sure, would be interesting to support natively.
As for VLMs, I don't see them being used for embeddings right now, but if you have examples I could look into them. Having VLM support in TEI would make sense, more so than supporting CLIP, because VLMs use cross-attention between the visual features and the text features.
I'm seeing this being applied more with vector databases: https://docs.trychroma.com/guides/multimodal
For example, what can happen with VLMs is that the user asks a question like "what is this?", and if the image was not in the model's training data, the model will not be able to respond. Using a vector database would help avoid fine-tuning for every new image.
This is just one example of where it can be useful.
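The pattern above can be sketched in a few lines. This is a toy, hypothetical illustration (not the Chroma API from the linked guide): hand-made 3-d vectors stand in for real CLIP/VLM image embeddings, and a plain list stands in for the vector database. At question time, the query embedding retrieves the closest image's metadata, which could then be handed to the VLM as context.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in "vector database": (embedding, metadata) pairs. In practice the
# embeddings would come from a multimodal encoder and live in a real store.
index = [
    ([0.9, 0.1, 0.0], {"uri": "cat.png", "caption": "a tabby cat"}),
    ([0.0, 0.8, 0.2], {"uri": "plane.png", "caption": "a jet airliner"}),
]

def retrieve(query_embedding, k=1):
    # Return the metadata of the k nearest stored images.
    ranked = sorted(index, key=lambda e: cosine(e[0], query_embedding), reverse=True)
    return [meta for _, meta in ranked[:k]]

# The user asks "what is this?" about an unseen cat photo; its embedding
# lands near the stored cat image, so the caption can be fed to the VLM.
hits = retrieve([0.85, 0.15, 0.05])
print(hits[0]["caption"])  # → a tabby cat
```

The same flow is what the Chroma multimodal guide automates: embed images once, then resolve visual questions by nearest-neighbor lookup instead of fine-tuning per image.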
Feature request
Is it on the roadmap to support image embedding models?
Motivation
This would be very useful since many VLMs are coming out.
Your contribution
Anything that is needed.