Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution #48


runhani commented 1 month ago

Some Links

Model Zoo

| Model | Link | Size | Visual Encoder | LLM Type | Intermediate Model |
| --- | --- | --- | --- | --- | --- |
| Oryx-7B | Huggingface | 7B | Oryx-ViT | Qwen-2-7B | Oryx-7B-Image |
| Oryx-34B | Huggingface | 34B | Oryx-ViT | Yi-1.5-34B | Oryx-34B-Image |

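For reference, a hypothetical loading sketch. The Hugging Face repo id `THUdyh/Oryx-7B` and the `trust_remote_code` pattern are assumptions inferred from the Model Zoo links above, not verified against the official inference code:

```python
# Hypothetical sketch, NOT the official Oryx inference code.
# The repo id below is an assumption based on the Model Zoo links;
# the project ships its own inference pipeline, which may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUdyh/Oryx-7B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom architecture lives in the repo
    device_map="auto",       # requires `accelerate`
)
```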

It handles image understanding and video understanding at the same time?

Input vs Tokens

(figure: input vs. tokens)

At this point, arbitrary image resolutions and arbitrary video frame lengths have become the norm...

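As a rough illustration of why native resolution changes the token budget: instead of resizing every input to one fixed square, a native-resolution encoder produces a token count that tracks the input size. A minimal sketch, where the patch size and downsampling factor are illustrative assumptions, not the paper's exact values:

```python
# Minimal sketch (not the official Oryx code): with native-resolution
# patchification, the number of visual tokens scales with input size
# instead of being fixed by a square resize.

def num_visual_tokens(height: int, width: int,
                      patch: int = 16, downsample: int = 1) -> int:
    """Tokens produced by patchifying an image at its native resolution."""
    h_patches = -(-height // patch)  # ceiling division: pad to full patches
    w_patches = -(-width // patch)
    return (h_patches // downsample) * (w_patches // downsample)

for h, w in [(448, 448), (768, 1344), (1080, 1920)]:
    print(f"{h}x{w}: {num_visual_tokens(h, w)} tokens")

# A video of N frames multiplies the per-frame count, which is why long
# videos need much stronger token compression than single images.
frames, h, w = 64, 336, 596
print("video tokens:", frames * num_visual_tokens(h, w, downsample=2))
```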

Training: 2 stages. Stage 1 trains on image data and produces the intermediate Oryx-*-Image checkpoints listed in the Model Zoo; stage 2 continues from them with joint image and video data (see the sketch below).
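A rough schedule sketch; the stage names and data mixes are illustrative assumptions, not the paper's exact recipe:

```python
# Illustrative two-stage schedule; names and data mixes are assumptions,
# not the paper's exact hyperparameters.
STAGES = [
    {   # stage 1: image-only training
        "name": "stage1_image",
        "init_from": "Oryx-ViT + base LLM",
        "data": ["image_instruction_data"],
        "output": "Oryx-7B-Image",  # intermediate checkpoint from the Model Zoo
    },
    {   # stage 2: joint training on images and videos
        "name": "stage2_joint",
        "init_from": "Oryx-7B-Image",
        "data": ["image_instruction_data", "video_instruction_data"],
        "output": "Oryx-7B",
    },
]

for stage in STAGES:
    print(f"{stage['name']}: {stage['init_from']} -> {stage['output']}")
```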

General Video Understanding

| Model | Size | VideoMME (w/o / w subs) | NextQA | MVBench | PercepTest | MMB-Video | VCG | VDC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Proprietary Models* | | | | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 59.9/63.3 | - | 43.7 | - | 1.53 | 4.06 | 4.00 |
| GPT-4o (OpenAI, 2024) | - | 71.9/77.2 | - | - | - | 1.63 | - | - |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | 75.0/81.3 | - | - | - | 1.30 | - | - |
| *Open-Sourced Video MLLMs* | | | | | | | | |
| VideoChat2-HD (Li et al., 2024c) | 7B | 45.3/55.7 | 79.5 | 62.3 | 47.3 | 1.18 | 3.10 | - |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 47.9/50.3 | - | 54.6 | 51.4 | - | 3.13 | - |
| LLaVA-OneVision (Li et al., 2024a) | 7B | 58.2/61.5 | 79.4 | 56.7 | 49.7 | - | 3.51 | 3.75 |
| Kangaroo (Liu et al., 2024e) | 8B | 56.0/57.6 | - | 61.1 | - | 1.44 | - | - |
| VideoCCAM (Fei et al., 2024) | 9B | 53.9/56.1 | - | 64.6 | - | - | - | - |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | 52.0/54.9 | 70.2 | - | 51.6 | - | 3.34 | 3.48 |
| PLLaVA (Xu et al., 2024a) | 34B | - | - | 58.1 | - | - | 3.48 | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 60.1/61.1 | 67.9 | - | 54.0 | - | 3.36 | 3.37 |
| VideoLLaMA2 (Cheng et al., 2024) | 72B | 61.4/63.1 | - | 62.0 | 57.5 | - | 3.16 | - |
| LLaVA-OneVision (Li et al., 2024a) | 72B | 66.2/69.5 | 80.2 | 59.4 | 66.9 | - | 3.62 | 3.60 |
| Oryx | 7B | 58.3/62.6 | 81.9 | 63.9 | 68.6 | 1.47 | 3.53 | 3.76 |
| Oryx | 34B | 63.2/67.4 | 83.5 | 64.7 | 71.4 | 1.49 | 3.51 | 3.66 |

General Video Understanding (Long)

| Model | Size | MLVU | LongVideoBench | VideoMME-Long (w/o subs) | VideoMME-Long (w subs) |
| --- | --- | --- | --- | --- | --- |
| *Proprietary Models* | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 49.2 | 60.7 | 53.5 | 56.9 |
| GPT-4o (OpenAI, 2024) | - | 64.6 | 66.7 | 65.3 | 72.1 |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | - | 64.4 | 67.4 | 77.4 |
| *Open-Sourced Video MLLMs* | | | | | |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 48.5 | - | 42.1 | 43.8 |
| LongVA (Zhang et al., 2024a) | 7B | 56.3 | - | 46.2 | 47.6 |
| LLaVA-OneVision (Li et al., 2024a) | 7B | 64.7 | - | - | - |
| Kangaroo (Liu et al., 2024e) | 8B | 61.0 | 54.8 | 46.6 | 49.3 |
| LongVILA (Xue et al., 2024) | 8B | - | - | 39.7 | - |
| VideoCCAM (Fei et al., 2024) | 14B | 63.1 | - | 46.7 | 49.9 |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | - | 50.5 | - | - |
| PLLaVA (Xu et al., 2024a) | 34B | - | 53.2 | - | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 56.7 | - | 53.8 | 55.7 |
| LLaVA-OneVision (Li et al., 2024a) | 72B | 66.4 | 61.3 | 60.0 | 62.4 |
| Oryx | 7B | 67.5 | 55.3 | 50.3 | 55.8 |
| Oryx | 34B | 70.8 | 62.2 | 53.9 | 58.0 |

General Image Understanding

| Model | Size | MMBench | MMMU | DocVQA | OCRBench | AI2D | TextVQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Deepseek-VL (Lu et al., 2024) | 7B | 73.2 | 36.6 | - | 456 | - | 64.7 |
| Monkey (Li et al., 2024d) | 7B | 72.4 | 40.7 | - | 534 | 68.5 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 8B | 72.1 | 41.7 | 78.2 | 531 | 71.6 | - |
| Bunny-LLama3 (He et al., 2024) | 8B | 77.2 | 43.3 | - | 444 | 69.4 | - |
| Cambrian-1 (Tong et al., 2024) | 8B | 75.9 | 42.7 | 77.8 | 624 | 73.6 | 71.7 |
| VILA-1.5 (Lin et al., 2023b) | 8B | 75.3 | 38.6 | - | - | - | 68.5 |
| Idefics2 (Laurençon et al., 2024) | 8B | 76.7 | 43.0 | - | - | - | 73.0 |
| Yi-VL (Young et al., 2024) | 34B | - | 45.1 | - | 290 | 65.9 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 34B | 79.3 | 49.7 | 84.0 | 574 | 74.9 | - |
| Cambrian-1 (Tong et al., 2024) | 34B | 81.4 | 49.7 | 75.5 | 600 | 79.7 | 76.7 |
| VILA-1.5 (Lin et al., 2023b) | 40B | 82.4 | 51.9 | - | - | - | 73.4 |
| Oryx | 7B | 81.4 | 43.9 | 89.0 | 672 | 78.5 | 75.0 |
| Oryx | 34B | 84.5 | 50.3 | 91.4 | 743 | 81.0 | 77.8 |

Conclusion

Personal thoughts...