
Efficient-Multimodal-LLMs-Survey

Efficient Multimodal Large Language Models: A Survey [arXiv]

Yizhang Jin<sup>1,2</sup>, Jian Li<sup>1</sup>, Yexin Liu<sup>3</sup>, Tianjun Gu<sup>4</sup>, Kai Wu<sup>1</sup>, Zhengkai Jiang<sup>1</sup>, Muyang He<sup>3</sup>, Bo Zhao<sup>3</sup>, Xin Tan<sup>4</sup>, Zhenye Gan<sup>1</sup>, Yabiao Wang<sup>1</sup>, Chengjie Wang<sup>1</sup>, Lizhuang Ma<sup>2</sup>

<sup>1</sup>Tencent YouTu Lab, <sup>2</sup>SJTU, <sup>3</sup>BAAI, <sup>4</sup>ECNU

⚡ We will actively maintain this repository and incorporate new research as it emerges. If you have any questions, please contact swordli@tencent.com. We welcome collaboration on academic research and co-authoring papers.

```bibtex
@misc{jin2024efficient,
      title={Efficient Multimodal Large Language Models: A Survey},
      author={Yizhang Jin and Jian Li and Yexin Liu and Tianjun Gu and Kai Wu and Zhengkai Jiang and Muyang He and Bo Zhao and Xin Tan and Zhenye Gan and Yabiao Wang and Chengjie Wang and Lizhuang Ma},
      year={2024},
      eprint={2405.10739},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

📌 What is This Survey About?

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning. However, their large model sizes and high training and inference costs have hindered the widespread adoption of MLLMs in academia and industry. Studying efficient and lightweight MLLMs therefore holds enormous potential, especially for edge-computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions.

Summary of 22 Mainstream Efficient MLLMs

| Model | Vision Encoder | Resolution | Vision Encoder Parameter Size | LLM | LLM Parameter Size | Vision-LLM Projector | Timeline |
|---|---|---|---|---|---|---|---|
| MobileVLM | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDP | 2023-12 |
| LLaVA-Phi | CLIP ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | MLP | 2024-01 |
| Imp-v1 | SigLIP | 384 | 0.4B | Phi-2 | 2.7B | - | 2024-02 |
| TinyLLaVA | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
| Bunny | SigLIP-SO | 384 | 0.4B | Phi-2 | 2.7B | MLP | 2024-02 |
| MobileVLM-v2-3B | CLIP ViT-L/14 | 336 | 0.3B | MobileLLaMA | 2.7B | LDPv2 | 2024-02 |
| MoE-LLaVA-3.6B | CLIP-Large | 384 | - | Phi-2 | 2.7B | MLP | 2024-02 |
| Cobra | DINOv2, SigLIP-SO | 384 | 0.3B+0.4B | Mamba-2.8b-Zephyr | 2.8B | MLP | 2024-03 |
| Mini-Gemini | CLIP-Large | 336 | - | Gemma | 2B | MLP | 2024-03 |
| Vary-toy | CLIP | 224 | - | Qwen | 1.8B | - | 2024-01 |
| TinyGPT-V | EVA | 224/448 | - | Phi-2 | 2.7B | Q-Former | 2024-01 |
| SPHINX-Tiny | DINOv2, CLIP-ConvNeXt | 448 | - | TinyLlama | 1.1B | - | 2024-02 |
| ALLaVA-Longer | CLIP ViT-L/14 | 336 | 0.3B | Phi-2 | 2.7B | - | 2024-02 |
| MM1-3B-MoE-Chat | CLIP_DFN-ViT-H | 378 | - | - | 3B | C-Abstractor | 2024-03 |
| LLaVA-Gemma | DINOv2 | - | - | Gemma-2b-it | 2B | - | 2024-03 |
| Mipha-3B | SigLIP | 384 | - | Phi-2 | 2.7B | - | 2024-03 |
| VL-Mamba | SigLIP-SO | 384 | - | Mamba-2.8B-Slimpj | 2.8B | VSS-L2 | 2024-03 |
| MiniCPM-V 2.0 | SigLIP | - | 0.4B | MiniCPM | 2.7B | Perceiver Resampler | 2024-03 |
| DeepSeek-VL | SigLIP-L | 384 | 0.4B | DeepSeek-LLM | 1.3B | MLP | 2024-03 |
| KarmaVLM | SigLIP-SO | 384 | 0.4B | Qwen1.5 | 0.5B | - | 2024-02 |
| moondream2 | SigLIP | - | - | Phi-1.5 | 1.3B | - | 2024-03 |
| Bunny-v1.1-4B | SigLIP | 1152 | - | Phi-3-Mini-4K | 3.8B | - | 2024-02 |
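
Despite their differences, most models in the table share the same three-part pipeline: a lightweight vision encoder produces patch tokens, a projector maps them into the LLM embedding space, and a small language model consumes the concatenated image-text sequence. The PyTorch sketch below is a generic illustration of that pipeline; every module and dimension (e.g. `ToyEfficientMLLM`, `vision_dim=1152`) is a placeholder, not the implementation of any listed model.

```python
import torch
import torch.nn as nn

class ToyEfficientMLLM(nn.Module):
    """Illustrative vision-encoder -> projector -> small-LLM pipeline.
    All module choices and dimensions are placeholders."""
    def __init__(self, vision_dim=1152, llm_dim=2560, vocab_size=32000):
        super().__init__()
        # Stand-in for a lightweight vision encoder (e.g. a CLIP/SigLIP ViT).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2)
        # MLP projector maps vision tokens into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        # Stand-in for a small language model backbone.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        vision_tokens = self.vision_encoder(image_patches)         # (B, Nv, Dv)
        vision_tokens = self.projector(vision_tokens)              # (B, Nv, Dl)
        text_tokens = self.text_embed(text_ids)                    # (B, Nt, Dl)
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)  # prepend image tokens
        return self.lm_head(self.llm(sequence))

# Tiny smoke test with random inputs.
model = ToyEfficientMLLM()
logits = model(torch.randn(1, 16, 1152), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```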

Efficient MLLMs

Architecture

Vision Encoder

Multiple Vision Encoders
Lightweight Vision Encoder
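
Some entries in the table above (e.g. Cobra, SPHINX-Tiny) pair two lightweight vision encoders and fuse their per-patch features, commonly by channel-wise concatenation before projection. A minimal sketch under that assumption, with dummy linear layers standing in for the real encoders:

```python
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Fuse per-patch features from two vision encoders by channel concatenation.
    The two Linear layers are stand-ins for real encoders (e.g. DINOv2 + SigLIP)."""
    def __init__(self, dim_a=1024, dim_b=1152, out_dim=2560):
        super().__init__()
        self.encoder_a = nn.Linear(3 * 14 * 14, dim_a)   # dummy patch embedder A
        self.encoder_b = nn.Linear(3 * 14 * 14, dim_b)   # dummy patch embedder B
        self.projector = nn.Linear(dim_a + dim_b, out_dim)

    def forward(self, patches):                          # patches: (B, N, 3*14*14)
        fused = torch.cat([self.encoder_a(patches), self.encoder_b(patches)], dim=-1)
        return self.projector(fused)                     # (B, N, out_dim)

tokens = DualEncoderFusion()(torch.randn(2, 196, 3 * 14 * 14))
print(tokens.shape)  # torch.Size([2, 196, 2560])
```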

Vision-Language Projector

MLP-based
Attention-based
CNN-based
Mamba-based
Hybrid Structure
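
The projector families above differ mainly in whether the vision token count is preserved (MLP-based) or reduced to a fixed number of learned queries (attention-based, in the spirit of Q-Former or a Perceiver Resampler). A hedged sketch of both, with made-up dimensions:

```python
import torch
import torch.nn as nn

# MLP-based projector: keeps the token count, only changes the feature dimension.
mlp_projector = nn.Sequential(nn.Linear(1152, 2560), nn.GELU(), nn.Linear(2560, 2560))

class AttentionResampler(nn.Module):
    """Attention-based projector: a fixed set of learned queries cross-attends to
    the vision tokens, so the LLM always receives `num_queries` tokens per image."""
    def __init__(self, vision_dim=1152, llm_dim=2560, num_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        self.to_kv = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, vision_tokens):                      # (B, Nv, vision_dim)
        kv = self.to_kv(vision_tokens)                     # (B, Nv, llm_dim)
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)                      # (B, num_queries, llm_dim)
        return out

vision_tokens = torch.randn(1, 576, 1152)
print(mlp_projector(vision_tokens).shape)                  # torch.Size([1, 576, 2560])
print(AttentionResampler()(vision_tokens).shape)           # torch.Size([1, 64, 2560])
```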

Small Language Models

Vision Token Compression

Multi-view Input
Token Processing
Multi-Scale Information Fusion
Vision Expert Agents
Video-Specific Methods
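
Vision token compression targets the main cost of high-resolution inputs: the number of image tokens fed to the LLM. As one simple illustration (not the exact method of any paper listed here), spatial average pooling over the patch grid cuts the token count quadratically in the stride:

```python
import torch
import torch.nn.functional as F

def compress_vision_tokens(tokens, grid=24, stride=2):
    """Average-pool a (B, grid*grid, D) patch-token sequence by `stride` in each
    spatial direction, e.g. 576 tokens -> 144 tokens for grid=24, stride=2."""
    b, n, d = tokens.shape
    assert n == grid * grid, "tokens must form a square patch grid"
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)    # (B, D, H, W)
    x = F.avg_pool2d(x, kernel_size=stride, stride=stride)  # (B, D, H/s, W/s)
    return x.flatten(2).transpose(1, 2)                     # (B, n/stride^2, D)

compressed = compress_vision_tokens(torch.randn(1, 576, 1152))
print(compressed.shape)  # torch.Size([1, 144, 1152])
```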

Efficient Structures

Mixture of Experts
Mamba
Inference Acceleration
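
Mixture-of-Experts layers (used by models such as MoE-LLaVA) keep per-token compute low by activating only the top-k experts selected by a router. A minimal top-k routing sketch, not tied to any specific implementation in this list:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse MoE feed-forward block: each token is routed to its top-k experts,
    whose outputs are mixed by the renormalized router weights."""
    def __init__(self, dim=512, hidden=1024, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x):                                    # x: (B, N, dim)
        scores = self.router(x).softmax(dim=-1)              # (B, N, E)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)       # (B, N, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize over top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[..., slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(2, 8, 512))
print(y.shape)  # torch.Size([2, 8, 512])
```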

Training

Pre-Training

Which part to unfreeze
Multi-stage pre-training
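
A common recipe is to freeze the vision encoder and the LLM during stage-1 pre-training so that only the projector is aligned, and to unfreeze (part of) the LLM in a later stage. A self-contained sketch of that freezing logic; the placeholder modules and names below are illustrative, not from any specific codebase:

```python
import torch.nn as nn

# Hypothetical placeholder model with the usual three submodules.
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(1152, 1152),
    "projector": nn.Sequential(nn.Linear(1152, 2560), nn.GELU(), nn.Linear(2560, 2560)),
    "llm": nn.Linear(2560, 2560),
})

def set_trainable(model, stage):
    """Stage 1: freeze vision encoder and LLM, train only the projector.
    Stage 2: additionally unfreeze the LLM for instruction-style data."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model["projector"].parameters():
        p.requires_grad = True
    if stage >= 2:
        for p in model["llm"].parameters():
            p.requires_grad = True

set_trainable(model, stage=1)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # projector only
```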

Instruction Tuning

Efficient Instruction Tuning

Diverse Training Steps

Parameter Efficient Transfer Learning
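
Parameter-efficient transfer learning keeps the backbone weights frozen and trains only small add-ons, most commonly low-rank (LoRA-style) updates to selected linear layers. A generic sketch, not the API of any particular PEFT library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # backbone stays frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(2560, 2560))
out = layer(torch.randn(4, 2560))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # torch.Size([4, 2560]) 40960
```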

Applications

Biomedical Analysis

Document Understanding

Video Comprehension