How does E5V prevent catastrophic forgetting on image modality?

Hi this is really a nice work that shows potential on embedding anything using LLMs.

In section 3.1, you explained that by a summary prompt, both vision and text can be embedded into next token. And by this next-token embedding, the modality gap can be better bridged compared to former last-token embedding. This should be true given LLaVA's visual encoder/projector and LLM are matched by end-to-end training.

However, since LLM does not see visual tokens anymore during text-only training, typically I would expect catastrophic forgetting on the vision side that makes LLM no longer recognize visual tokens, as visual encoder/projector and LLM are no longer matched after the LLM pamarters shift towards text-only modality after training.

It would be interesting to understand how this potential problem is prevented in this paper. Is it by QLoRA or carefully designed layer decays, or LLAVA visual tokens are just too similar to text tokens to be forgotten?

kongds / E5-V

How does E5V prevent catastrophic forgetting on image modality? #5