kongds / E5-V

E5-V: Universal Embeddings with Multimodal Large Language Models
https://arxiv.org/abs/2407.12580

Unimodal contrastive learning #3

Open NBitBuilder opened 3 weeks ago

NBitBuilder commented 3 weeks ago

Thank you for sharing such an interesting idea!

Since there is no longer a modality gap in the embeddings, we can transfer single-modality representation capabilities to multimodal embeddings by training on text pairs only.

Could we use image contrastive learning to leverage much larger-scale data than NLI in this regard?

`image \n Summary above image in one word: image [CLS]`
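
For context, a minimal sketch of how a prompt like this can be turned into an embedding with a LLaVA-style MLLM through Hugging Face transformers. The checkpoint name, chat template, and last-token pooling below are assumptions for illustration, not necessarily the exact E5-V setup:

```python
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed checkpoint; swap in the E5-V weights or any LLaVA-NeXT model.
model_name = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_name)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Prompt in the spirit of the one above: ask for a one-word summary of the
# image, then read the hidden state of the final prompt token as the embedding.
prompt = "[INST] <image>\nSummary above image in one word: [/INST]"
image = Image.open("example.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)

# Last layer, last token -> one vector per input.
embedding = outputs.hidden_states[-1][:, -1, :]
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```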

kongds commented 3 weeks ago

Thank you for your interest in our work.

I think contrastive learning on image pairs should also work. However, it presents several challenges compared to contrastive learning on text pairs.

First, the number of visual tokens is often larger than that of text tokens. For instance, LLaVA uses 576 visual tokens for a single image, which results in higher GPU memory and computational resource requirements for image-based training.

Second, it is easier to define positive and negative pairs for text than for images. We can utilize hard negative samples in NLI to learn better representations, which is more challenging with images.
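
To make the text-pair objective concrete, here is a minimal sketch of InfoNCE over NLI triplets, where the entailment sentence is the positive and the contradiction sentence is a hard negative. The function name, shapes, and temperature are illustrative, not the repo's training code:

```python
import torch
import torch.nn.functional as F

def nli_contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """InfoNCE over NLI triplets.

    anchor, positive, hard_negative: (batch, dim) embeddings of the premise,
    its entailment (positive), and its contradiction (hard negative).
    In-batch positives of other anchors also act as negatives.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negative = F.normalize(hard_negative, dim=-1)

    # Similarity of each anchor to every positive and every hard negative.
    pos_sim = anchor @ positive.t() / temperature        # (batch, batch)
    neg_sim = anchor @ hard_negative.t() / temperature   # (batch, batch)
    logits = torch.cat([pos_sim, neg_sim], dim=1)        # (batch, 2*batch)

    # The correct "class" for anchor i is its own positive at column i.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```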

NBitBuilder commented 3 weeks ago

Yes, image contrastive learning is indeed more computationally demanding. Still, it is easy to construct positive and negative pairs through augmentations, so it can scale and extend to other image domains without relying on a fixed dataset like NLI. Moreover, although E5-V claims that it does not require image-text pairs, it is built on models that were trained with image-text pairs, so the final model is not trained with a single modality only. The last step merely narrows the modality gap; it does not build the multimodal embeddings from scratch.
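
A minimal sketch of the augmentation-based pair construction described above, in the SimCLR style; the specific transforms and their parameters are illustrative:

```python
from torchvision import transforms

# Two random augmented views of the same image form a positive pair;
# views of the other images in the batch serve as negatives.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

class TwoViews:
    """Wrap a transform so each image yields a positive pair of views."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        return self.transform(image), self.transform(image)
```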

kongds commented 3 weeks ago

We compared multimodal training in our paper and found that it performs worse than single-modality training. I also don't think that contrastive learning on image pairs can achieve better performance than contrastive learning on text pairs or image-text pairs.

We want to clarify that the multimodal embeddings we discuss are not built from scratch with single-modality training; they are based on MLLMs (as the title mentions). The focus of E5-V is on leveraging existing MLLMs to obtain multimodal embeddings, and we have done so using only single-modality training with text-pair contrastive learning.

NBitBuilder commented 3 weeks ago

Thank you so much for your explanations!