unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

Can I use the joint_embedding for Composed Image Retrieval (CIR) ? #78

Closed. javiabellan closed this issue 5 months ago

javiabellan commented 5 months ago

After reading https://www.unum.cloud/blog/2023-02-20-efficient-multimodality, I found the multimodal encoder very interesting. My first thought was that it would produce an embedding in the same latent space as the visual and textual embeddings, which would solve the CIR problem:

joint_embedding = model.encode_multimodal(image=image_info, text=text_info)

But after further examination of the loss functions (ALBEF and ViCHA), I'm not sure that is the case.
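For reference, this is roughly what I tried (just a minimal sketch following the `uform` API from the README; the model name, file path, modification text, and the commented-out gallery search are all placeholders):

```python
import torch
import torch.nn.functional as F
import uform
from PIL import Image

# Model name taken from the README; swap in the multilingual checkpoint if needed
model = uform.get_model('unum-cloud/uform-vl-english')

query_image = Image.open('query.jpg')         # reference image (placeholder path)
modification = 'the same dress, but in blue'  # modification text (made-up example)

image_data = model.preprocess_image(query_image)
text_data = model.preprocess_text(modification)

# Unimodal embeddings, contrastively aligned with each other
image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)

# Late-fusion joint embedding -- the open question is whether it lives
# in the same latent space as the unimodal embeddings above
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

# Naive CIR attempt: rank a gallery of pre-computed image embeddings
# (one model.encode_image() vector per catalog image) by cosine similarity
# gallery = torch.stack([...])                            # (N, dim)
# scores = F.cosine_similarity(joint_embedding, gallery)
# top_matches = scores.topk(10).indices
```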

My idea is to use this late-fusion multimodal encoder and align the joint embedding to the same latent space as the image and text embeddings (see the sketch after the screenshot below).

There are several papers on this problem, and the UForm multimodal encoder looks very similar to the "fusion" family of CIR methods:

(Screenshot from https://arxiv.org/abs/2303.11916v3)
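To make the alignment idea concrete, this is the kind of fine-tuning objective I have in mind (just a sketch of a standard InfoNCE loss, not anything from UForm itself): the fused embedding of (reference image, modification text) is pulled towards the unimodal embedding of the target image and pushed away from the other targets in the batch.

```python
import torch
import torch.nn.functional as F

def cir_alignment_loss(joint_query_emb, target_image_emb, temperature=0.07):
    """InfoNCE-style alignment: joint_query_emb and target_image_emb are both
    (batch, dim); row i of the first should match row i of the second."""
    q = F.normalize(joint_query_emb, dim=-1)
    t = F.normalize(target_image_emb, dim=-1)
    logits = q @ t.T / temperature                     # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```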

VoVoR commented 5 months ago

Hi,

You are correct, our multimodal encoder was not trained for the CIR problem. We trained it for a retrieval + reranking search approach, so we expect users to use it to reorder or filter search results.
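In rough pseudo-code, the intended flow is something like this (just a sketch: the retrieval step is whatever vector index you already use, `top_k_candidates` and `query_text` are placeholders, and `get_matching_scores` is the matching head shown in the README):

```python
# 1. Retrieval: search a vector index of unimodal image embeddings
#    (model.encode_image) with the text query embedding (model.encode_text)
#    to get the top-K candidate images.
# 2. Reranking: score each (candidate image, query text) pair with the
#    multimodal encoder and reorder or filter by that score.

text_data = model.preprocess_text(query_text)
reranked = []
for candidate_image in top_k_candidates:                 # placeholders from step 1
    image_data = model.preprocess_image(candidate_image)
    joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
    score = model.get_matching_scores(joint_embedding)   # matching head from the README
    reranked.append((float(score), candidate_image))

reranked.sort(key=lambda pair: pair[0], reverse=True)
```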

We are considering training something for CIR, but we want to wait until MagicLens releases its models and code!

By the way, you can still get an embedding from the multimodal encoder and try it on your data, but I am not sure how well it will work.

javiabellan commented 5 months ago

Nice, I appreciate your comment. I'm waiting for the MagicLens weights too :)