unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

Can I use the joint_embedding for Composed Image Retrieval (CIR) ? #78

Closed. javiabellan closed this issue 5 months ago

javiabellan commented 5 months ago

After reading https://www.unum.cloud/blog/2023-02-20-efficient-multimodality, I found the multimodal encoder very interesting. My first thought was that it would produce an embedding in the same latent space as the visual and textual embeddings, which would solve the CIR problem:

joint_embedding = model.encode_multimodal(image=image_info, text=text_info)

But after further examination of the loss functions (ALBEF and ViCHA), I'm not sure that is the case.
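For reference, this is roughly what I tried (just a minimal sketch following the `uform` API from the README; the model name, file path, modification text, and the commented-out gallery search are all placeholders):

```python
import torch
import torch.nn.functional as F
import uform
from PIL import Image

# Model name taken from the README; swap in the multilingual checkpoint if needed
model = uform.get_model('unum-cloud/uform-vl-english')

query_image = Image.open('query.jpg')         # reference image (placeholder path)
modification = 'the same dress, but in blue'  # modification text (made-up example)

image_data = model.preprocess_image(query_image)
text_data = model.preprocess_text(modification)

# Unimodal embeddings, contrastively aligned with each other
image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)

# Late-fusion joint embedding -- the open question is whether it lives
# in the same latent space as the unimodal embeddings above
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

# Naive CIR attempt: rank a gallery of pre-computed image embeddings
# (one model.encode_image() vector per catalog image) by cosine similarity
# gallery = torch.stack([...])                            # (N, dim)
# scores = F.cosine_similarity(joint_embedding, gallery)
# top_matches = scores.topk(10).indices
```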

My idea is to use this late-fusion multimodal encoder and align the joint embedding to the same latent space as the image and text embeddings (see the sketch after the screenshot below).

There are several papers on this problem, and the UForm multimodal encoder looks very similar to the "fusion" family of CIR methods:

(Screenshot from https://arxiv.org/abs/2303.11916v3)
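To make the alignment idea concrete, this is the kind of fine-tuning objective I have in mind (just a sketch of a standard InfoNCE loss, not anything from UForm itself): the fused embedding of (reference image, modification text) is pulled towards the unimodal embedding of the target image and pushed away from the other targets in the batch.

```python
import torch
import torch.nn.functional as F

def cir_alignment_loss(joint_query_emb, target_image_emb, temperature=0.07):
    """InfoNCE-style alignment: joint_query_emb and target_image_emb are both
    (batch, dim); row i of the first should match row i of the second."""
    q = F.normalize(joint_query_emb, dim=-1)
    t = F.normalize(target_image_emb, dim=-1)
    logits = q @ t.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```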

VoVoR commented 5 months ago

Hi,

You are correct, our multimodal encoder was not trained for the CIR problem. We trained it for a retrieval + reranking search approach, so we expect users to use it to reorder or filter search results.
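In rough pseudo-code, the intended flow is something like this (just a sketch: the retrieval step is whatever vector index you already use, `top_k_candidates` and `query_text` are placeholders, and `get_matching_scores` is the matching head shown in the README):

```python
# 1. Retrieval: search a vector index of unimodal image embeddings
#    (model.encode_image) with the text query embedding (model.encode_text)
#    to get the top-K candidate images.
# 2. Reranking: score each (candidate image, query text) pair with the
#    multimodal encoder and reorder or filter by that score.

text_data = model.preprocess_text(query_text)
reranked = []
for candidate_image in top_k_candidates:                 # placeholders from step 1
    image_data = model.preprocess_image(candidate_image)
    joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
    score = model.get_matching_scores(joint_embedding)   # matching head from the README
    reranked.append((float(score), candidate_image))

reranked.sort(key=lambda pair: pair[0], reverse=True)
```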

We are considering training something for CIR, but we want to wait until MagicLens releases its models and code!

By the way, you can still get an embedding from the multimodal encoder and try it on your data, but I am not sure how well it will work.

javiabellan commented 5 months ago

Nice, I appreciate your comment. I'm waiting for the MagicLens weights too :)