Closed javiabellan closed 5 months ago
Hi,
You are correct, our multimodal encoder was not trained for the CIR problem. We trained it for a retrieval + reranking search approach, so we expect users to use it to reorder or filter the search results.
We are considering training something for CIR, but we wanted to wait until MagicLens releases its models and code!
Btw, you can still get an embedding from the multimodal encoder and try it on your data, but I am not sure it will work well.
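To make the retrieval + reranking approach above concrete, here is a minimal sketch of the two-stage pipeline. Random vectors stand in for real UForm encoder outputs, and `match_scores` is a placeholder for the joint encoder's matching scores; none of the names here come from the actual UForm API.

```python
# Sketch of retrieval + reranking with random stand-ins for encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
dim = 256

def normalize(x):
    # L2-normalize along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stage 1: a unimodal query embedding retrieves the top-k candidates.
gallery = normalize(rng.normal(size=(1000, dim)))  # precomputed image embeddings
query = normalize(rng.normal(size=dim))            # query embedding (stand-in)
scores = gallery @ query                           # cosine similarity to all items
top_k = np.argsort(-scores)[:50]                   # indices of the 50 best matches

# Stage 2: rerank only those candidates with multimodal matching scores
# (random placeholders here; the joint encoder would produce these).
match_scores = rng.uniform(size=top_k.shape)
reranked = top_k[np.argsort(-match_scores)]
```

The key point is that the multimodal encoder only ever scores the small candidate set from stage 1, which is why it can reorder or filter results without being usable as a standalone CIR retriever.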
Nice, I appreciate your comment. I'm waiting for the MagicLens weights too :)
After reading https://www.unum.cloud/blog/2023-02-20-efficient-multimodality I found the multimodal encoder very interesting. My first thought was that it would produce an embedding in the same latent space as the visual and textual embeddings, which would solve the CIR problem:
But after further examination of the loss functions (ALBEF and ViCHA), I'm not sure that is the case.
My idea is to use this late-fusion multimodal encoder and align its joint embedding to the same latent space as the image and text embeddings:
There are several papers about this problem, but the UForm multimodal encoder looks very similar to the "fusion" family of CIR methods:
(Screenshot from https://arxiv.org/abs/2303.11916v3)
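As a rough illustration of the "fusion" family of CIR methods mentioned above, here is a minimal sketch: the reference-image and modification-text embeddings are composed into a single query vector that lives in the same space as the target-image embeddings, which is exactly the alignment the joint encoder would need. Random vectors stand in for encoder outputs, and the normalized-sum fusion is the simplest possible choice (real fusion methods learn this combination).

```python
# Minimal late-fusion CIR sketch with random stand-ins for embeddings.
import numpy as np

rng = np.random.default_rng(1)
dim = 128

def normalize(x):
    # L2-normalize so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb = normalize(rng.normal(size=dim))  # reference image embedding
txt_emb = normalize(rng.normal(size=dim))  # modification text embedding

# Simplest fusion: normalized sum of the two embeddings.
fused_query = normalize(img_emb + txt_emb)

# Search the target-image embedding space with the fused query.
candidates = normalize(rng.normal(size=(500, dim)))
best = int(np.argmax(candidates @ fused_query))
```

This only works if `fused_query` and `candidates` genuinely share one latent space, which is why the question of what the ALBEF/ViCHA losses align matters for CIR.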