Open eostis opened 11 months ago
Makes sense. CLIP has two parts, an image encoder and a text encoder, which are handled by two different neural networks.
We could fit the text transformer model into the existing embed framework, as already done in several Vespa sample applications, but image encoding would not fit into the existing embed functionality, which takes a string or an array of strings as input.
So if you are fine with just having the text-to-image space model in Vespa, we can create that type of example using HF-embedder functionality.
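To illustrate the two-tower idea discussed above, here is a minimal, self-contained Python sketch. The encoders are toy stand-ins (hashing characters, normalizing precomputed stats), not real CLIP weights: the point is only that text and images are projected into a shared space by separate networks, so a text-only embedder covers the query side while image vectors have to be produced elsewhere.

```python
import math

def normalize(v):
    """L2-normalize a vector so that a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy stand-ins for CLIP's two towers. In a real system the text tower could
# run inside Vespa's embed framework, while the image tower runs offline.
def text_encoder(text):
    # Hypothetical: hash characters into a fixed 4-dim space.
    v = [0.0] * 4
    for i, ch in enumerate(text.lower()):
        v[i % 4] += ord(ch)
    return normalize(v)

def image_encoder(pixel_stats):
    # Hypothetical: an image summarized by 4 precomputed statistics.
    return normalize(pixel_stats)

def similarity(text, pixel_stats):
    """Cosine similarity between a query string and an image,
    computed in the shared text-image space."""
    t = text_encoder(text)
    i = image_encoder(pixel_stats)
    return sum(a * b for a, b in zip(t, i))
```

At query time only `text_encoder` runs; the image vectors are indexed ahead of time, which is exactly why a text-only embedder is enough for text-to-image search.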
With the same process ?
To handle image data, we would have to create a new type of embedder functionality.
Exactly! It would also prepare Vespa for further modalities: audio, video ...
I was a bit ahead of time apparently. 7-modality is here.
ImageBind is interesting, but I do recommend looking at the licensing :)
Indeed, non commercial license. https://creativecommons.org/licenses/by-nc-sa/4.0/ https://github.com/facebookresearch/ImageBind/blob/main/LICENSE
Does vespa support multimodality currently?
Hey @AriMKatz,
We currently do not expose any provided embedders that are multimodal. The provided embedder models are text-only.
This doesn't mean that you cannot use multimodal representations with Vespa, for example here is a recent example of a multimodal model PDF Retrieval with Vision Language Models (ColPali).
My goal is to build a unique multimodal WooCommerce search experience with Vespa multivectors and a hybrid ranking on text BM25, text vectors, and image vectors.
For instance, e-commerce sites can use:
Of course, sounds and videos are also a possibility.
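The hybrid ranking described above can be sketched as a simple weighted fusion of the three signals. The weights and the BM25 normalization below are illustrative assumptions, not Vespa rank-profile syntax:

```python
def hybrid_score(bm25, text_cos, image_cos,
                 w_bm25=0.4, w_text=0.35, w_image=0.25):
    """Linear fusion of a lexical score and two vector similarities.
    BM25 is unbounded, so squash it into [0, 1) before mixing."""
    bm25_norm = bm25 / (bm25 + 10.0)  # simple saturation; k=10 is arbitrary
    return w_bm25 * bm25_norm + w_text * text_cos + w_image * image_cos

# A product matching on both text and image should outrank a text-only match.
both = hybrid_score(bm25=12.0, text_cos=0.8, image_cos=0.7)
text_only = hybrid_score(bm25=12.0, text_cos=0.8, image_cos=0.0)
```

In Vespa this kind of fusion would live in a rank profile's first-phase expression, combining rank features such as `bm25(...)` with nearest-neighbor closeness; the Python above is only a model of the arithmetic.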
Currently, I implemented a text-to-text demo: https://demo-woocommerce-cloudways-2k-vespa-transformers.wpsolr.com/shop/
But image HF embedders are not available yet, as far as I can tell from the documentation and blog posts.
The blog examples require external Python code to produce the image vectors.
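That external workflow boils down to computing the vector offline and feeding it to Vespa as a tensor field in a document. A minimal sketch of building such a feed operation (the `product` document type, `image_embedding` field, and the 4-dim vector are made-up placeholders, not a real schema or real CLIP output):

```python
import json

def make_feed_document(doc_id, title, image_vector):
    """Build a Vespa feed operation carrying a precomputed image vector.
    'shop', 'product', and 'image_embedding' are hypothetical schema names."""
    return {
        "put": f"id:shop:product::{doc_id}",
        "fields": {
            "title": title,
            # {"values": [...]} is Vespa's document-JSON short form for a
            # dense (indexed) tensor such as tensor<float>(x[4]).
            "image_embedding": {"values": image_vector},
        },
    }

doc = make_feed_document("sku-123", "Red running shoe", [0.1, 0.2, 0.3, 0.4])
feed_line = json.dumps(doc)
```

Once an image embedder is exposed server-side, this offline step could disappear and the raw image reference could be embedded at feed time instead.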