Open joaomsimoes opened 2 weeks ago
To be honest, I haven't decided yet. For CLIP, I don't see the benefit of supporting the image encoder in TEI: ViTs are very different and can be easily saturated. The text encoder, for sure, would be interesting to support natively.
As for VLMs, I don't see them being used for embeddings right now, but if you have examples I could look into them. Having VLM support in TEI would make sense, more so than supporting CLIP, because VLMs use cross-attention between the visual features and the text features.
I'm seeing this being applied more with vector databases: https://docs.trychroma.com/guides/multimodal
For example, what can happen with VLMs is that the user asks a question like "what is this?", and if the image was not in the model's training data, the model will not be able to respond. Using a vector database would help avoid fine-tuning for every new image.
This is just one example of where it can be useful.
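The pattern above can be sketched in a few lines. This is a toy, hypothetical illustration (not the Chroma API from the linked guide): hand-made 3-d vectors stand in for real CLIP/VLM image embeddings, and a plain list stands in for the vector database. At question time, the query embedding retrieves the closest image's metadata, which could then be handed to the VLM as context.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in "vector database": (embedding, metadata) pairs. In practice the
# embeddings would come from a multimodal encoder and live in a real store.
index = [
    ([0.9, 0.1, 0.0], {"uri": "cat.png", "caption": "a tabby cat"}),
    ([0.0, 0.8, 0.2], {"uri": "plane.png", "caption": "a jet airliner"}),
]

def retrieve(query_embedding, k=1):
    # Return the metadata of the k nearest stored images.
    ranked = sorted(index, key=lambda e: cosine(e[0], query_embedding), reverse=True)
    return [meta for _, meta in ranked[:k]]

# The user asks "what is this?" about an unseen cat photo; its embedding
# lands near the stored cat image, so the caption can be fed to the VLM.
hits = retrieve([0.85, 0.15, 0.05])
print(hits[0]["caption"])  # → a tabby cat
```

The same flow is what the Chroma multimodal guide automates: embed images once, then resolve visual questions by nearest-neighbor lookup instead of fine-tuning per image.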
Feature request
Is it on the roadmap to support image embedding models?
Motivation
This would be very useful since many VLMs are coming out.
Your contribution
Anything that is needed.