Thanks for the wonderful survey. We would like to suggest adding a new work: Matryoshka Query Transformer for Large Vision-Language Models. Paper: https://arxiv.org/abs/2405.19315 Code: https://github.com/gordonhu608/MQT-LLaVA
This model offers a new perspective on efficiently utilizing visual tokens in Multimodal LLMs. Thank you in advance for considering our work.