Open eostis opened 11 months ago
Makes sense. CLIP has two parts, an image encoder and a text encoder, which are handled by two different neural networks.
We could fit the text transformer model into the existing embed framework, as already done in several Vespa sample applications, but image encoding would not fit into the existing embed functionality, which takes a string or an array of strings as input.
So if you are fine with just having the text-to-image space model in Vespa, we can create that type of example using HF-embedder functionality.
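To illustrate the two-tower idea discussed above, here is a minimal, self-contained Python sketch. The encoders are toy stand-ins (hashing characters, normalizing precomputed stats), not real CLIP weights: the point is only that text and images are projected into a shared space by separate networks, so a text-only embedder covers the query side while image vectors have to be produced elsewhere.

```python
import math

def normalize(v):
    """L2-normalize a vector so that a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy stand-ins for CLIP's two towers. In a real system the text tower could
# run inside Vespa's embed framework, while the image tower runs offline.
def text_encoder(text):
    # Hypothetical: hash characters into a fixed 4-dim space.
    v = [0.0] * 4
    for i, ch in enumerate(text.lower()):
        v[i % 4] += ord(ch)
    return normalize(v)

def image_encoder(pixel_stats):
    # Hypothetical: an image summarized by 4 precomputed statistics.
    return normalize(pixel_stats)

def similarity(text, pixel_stats):
    """Cosine similarity between a query string and an image,
    computed in the shared text-image space."""
    t = text_encoder(text)
    i = image_encoder(pixel_stats)
    return sum(a * b for a, b in zip(t, i))
```

At query time only `text_encoder` runs; the image vectors are indexed ahead of time, which is exactly why a text-only embedder is enough for text-to-image search.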
With the same process ?
To handle image data, we would have to create a new type of embedder functionality.
Exactly! It would also prepare Vespa for further modalities: audio, video ...
I was a bit ahead of time apparently. 7-modality is here.
ImageBind is interesting, but I do recommend looking at the licensing :)
Indeed, non commercial license. https://creativecommons.org/licenses/by-nc-sa/4.0/ https://github.com/facebookresearch/ImageBind/blob/main/LICENSE
Does vespa support multimodality currently?
Hey @AriMKatz,
We currently do not expose any provided embedders that are multimodal. The provided embedder models are text-only.
This doesn't mean that you cannot use multimodal representations with Vespa, for example here is a recent example of a multimodal model PDF Retrieval with Vision Language Models (ColPali).
My goal is to build a unique multimodal WooCommerce search experience with Vespa multivectors and a hybrid ranking on text BM25, text vectors, and image vectors.
For instance, e-commerce sites can use:
Of course, sounds and videos are also a possibility.
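The hybrid ranking described above can be sketched as a simple weighted fusion of the three signals. The weights and the BM25 normalization below are illustrative assumptions, not Vespa rank-profile syntax:

```python
def hybrid_score(bm25, text_cos, image_cos,
                 w_bm25=0.4, w_text=0.35, w_image=0.25):
    """Linear fusion of a lexical score and two vector similarities.
    BM25 is unbounded, so squash it into [0, 1) before mixing."""
    bm25_norm = bm25 / (bm25 + 10.0)  # simple saturation; k=10 is arbitrary
    return w_bm25 * bm25_norm + w_text * text_cos + w_image * image_cos

# A product matching on both text and image should outrank a text-only match.
both = hybrid_score(bm25=12.0, text_cos=0.8, image_cos=0.7)
text_only = hybrid_score(bm25=12.0, text_cos=0.8, image_cos=0.0)
```

In Vespa this kind of fusion would live in a rank profile's first-phase expression, combining rank features such as `bm25(...)` with nearest-neighbor closeness; the Python above is only a model of the arithmetic.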
Currently, I implemented a text-to-text demo: https://demo-woocommerce-cloudways-2k-vespa-transformers.wpsolr.com/shop/
But image HF embedders are not available yet, as far as I can tell from the documentation and blog posts.
The blog examples require external Python code to produce the image vectors.
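That external workflow boils down to computing the vector offline and feeding it to Vespa as a tensor field in a document. A minimal sketch of building such a feed operation (the `product` document type, `image_embedding` field, and the 4-dim vector are made-up placeholders, not a real schema or real CLIP output):

```python
import json

def make_feed_document(doc_id, title, image_vector):
    """Build a Vespa feed operation carrying a precomputed image vector.
    'shop', 'product', and 'image_embedding' are hypothetical schema names."""
    return {
        "put": f"id:shop:product::{doc_id}",
        "fields": {
            "title": title,
            # {"values": [...]} is Vespa's document-JSON short form for a
            # dense (indexed) tensor such as tensor<float>(x[4]).
            "image_embedding": {"values": image_vector},
        },
    }

doc = make_feed_document("sku-123", "Red running shoe", [0.1, 0.2, 0.3, 0.4])
feed_line = json.dumps(doc)
```

Once an image embedder is exposed server-side, this offline step could disappear and the raw image reference could be embedded at feed time instead.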