kongds / E5-V

E5-V: Universal Embeddings with Multimodal Large Language Models
https://arxiv.org/abs/2407.12580

How does E5V prevent catastrophic forgetting on image modality? #5

Open huangyjhust opened 1 week ago

huangyjhust commented 1 week ago

Hi, this is really nice work that shows the potential of embedding anything with LLMs.

In Section 3.1, you explain that with a summary prompt, both vision and text inputs can be embedded into the next token, and that this next-token embedding bridges the modality gap better than the previous last-token embedding. This should hold, given that LLaVA's visual encoder/projector and LLM are matched through end-to-end training.
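To make sure I read Section 3.1 correctly, here is a rough sketch of how I imagine the next-token embedding is extracted with an off-the-shelf LLaVA-NeXT checkpoint from `transformers`. The checkpoint name, prompt wording, and pooling details are my own assumptions for illustration, not necessarily your exact setup:

```python
import requests
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed base checkpoint; the paper's actual backbone may differ.
name = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(name)
model = LlavaNextForConditionalGeneration.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

def next_token_embedding(prompt, image=None):
    # The embedding is the last hidden state at the final prompt position,
    # i.e. the representation the model would use to predict the next
    # (one-word summary) token.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

# Both modalities are pushed through the same "summarize in one word" bottleneck,
# so their embeddings live at the same next-token position of the same LLM.
text_emb = next_token_embedding(
    "A dog runs on the beach.\nSummary of the above sentence in one word: "
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
img_emb = next_token_embedding(
    "<image>\nSummary of the above image in one word: ", image=image
)
print(text_emb.shape, img_emb.shape)  # both (1, hidden_size), directly comparable
```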

However, since the LLM no longer sees visual tokens during text-only training, I would typically expect catastrophic forgetting on the vision side that makes the LLM unable to recognize visual tokens: the visual encoder/projector and the LLM are no longer matched once the LLM parameters shift toward the text-only modality during training.

It would be interesting to understand how this potential problem is prevented in this paper. Is it through QLoRA, carefully designed layer-wise decay, or are LLaVA's visual tokens simply too similar to text tokens to be forgotten?

kongds commented 1 week ago

Thank you for your interest in our work and your insightful question.

Indeed, we do not use any auxiliary techniques to prevent forgetting of the visual understanding abilities in MLLMs.

In my opinion, visual tokens can be viewed as another form of foreign language: just as LLMs can still understand foreign languages after being trained mostly on English datasets, MLLMs can still understand visual tokens even when trained primarily on text.
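One rough way to check this empirically is to embed an image and two candidate captions with the same one-word-summary prompts and compare cosine similarities; if visual understanding had been lost during text-only training, the matching caption would not consistently score higher. In the sketch below, the checkpoint name (`royokong/e5-v`), prompt wording, and image URL are illustrative assumptions rather than our exact evaluation setup:

```python
import requests
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

ckpt = "royokong/e5-v"  # assumed fine-tuned checkpoint on the Hugging Face Hub
processor = LlavaNextProcessor.from_pretrained(ckpt)
model = LlavaNextForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.float16, device_map="auto"
)

def embed(prompt, image=None):
    # Next-token pooling: last hidden state at the final prompt position.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    return F.normalize(out.hidden_states[-1][:, -1, :], dim=-1)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats on a couch
image = Image.open(requests.get(url, stream=True).raw)

img_emb = embed("<image>\nSummary of the above image in one word: ", image=image)
good = embed("Two cats lying on a couch.\nSummary of the above sentence in one word: ")
bad = embed("A red sports car on a highway.\nSummary of the above sentence in one word: ")

print("matching caption:   ", (img_emb @ good.T).item())
print("mismatching caption:", (img_emb @ bad.T).item())
```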