558k image captions from BLIP
Only the connector is unfrozen; everything else is frozen
max 768x768 pixels (not a resize to a square; only the total pixel count is capped at this value)
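The pixel cap described above (limit the total number of pixels rather than forcing a square) can be sketched as follows. This is a minimal illustration, not the Oryx implementation: the function name `fit_to_pixel_budget` and the floor rounding are assumptions, and any patch-size alignment the real preprocessing may do is not modeled.

```python
import math


def fit_to_pixel_budget(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Scale (width, height) down so that width * height <= max_side ** 2.

    The aspect ratio is preserved; images already inside the budget are
    returned unchanged. (Hypothetical helper, for illustration only.)
    """
    budget = max_side * max_side
    area = width * height
    if area <= budget:
        return width, height  # already within budget: keep native resolution

    # Shrink both sides by the same factor so the area lands on the budget.
    scale = math.sqrt(budget / area)
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h


if __name__ == "__main__":
    # A 16:9 frame stays 16:9; it is not squashed into 768x768.
    print(fit_to_pixel_budget(1920, 1080, max_side=768))
    # A small image is left untouched.
    print(fit_to_pixel_budget(640, 480, max_side=768))
```

The same helper covers the later stages by passing `max_side=1280` or `max_side=1536`.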
first stage (1.2)
4.1M (image, text) pairs from LLaVA-NeXT, Cauldron, and Cambrian-1
The connector and the LLM are unfrozen; the visual encoder is frozen
max 1280x1280 pixels (not a resize to a square; only the total pixel count is capped at this value)
Second stage
image: 600k images randomly sampled from stage 1
video: 650k videos from a variety of open-source sources
max 1536x1536 pixels (not a resize to a square; only the total pixel count is capped at this value)
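To see the schedule at a glance, the stages noted here can be written down as plain data. This is only a summary sketch of these notes: the `StageConfig` class and its field names are made up for illustration, and the trainable modules of the second stage are left as `None` because the notes do not state them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    """One training stage as summarized in these notes (hypothetical structure)."""
    name: str
    data: str
    trainable: Optional[Tuple[str, ...]]  # None = not stated in the notes
    max_side: int                         # the cap is max_side * max_side pixels

    @property
    def pixel_budget(self) -> int:
        return self.max_side * self.max_side


STAGES = (
    StageConfig("first stage (1.1)", "558k image captions", ("connector",), 768),
    StageConfig("first stage (1.2)", "4.1M (image, text) pairs", ("connector", "llm"), 1280),
    StageConfig("second stage", "600k images + 650k videos", None, 1536),
)

if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: trainable={stage.trainable}, budget={stage.pixel_budget} px")
```

The pixel budget grows by roughly 2.8x from stage 1.1 to 1.2 and by another 1.44x into the second stage.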
Constraints
image: max 1536 pixels
video: resolution between 288 and 480
video: 1 FPS (frame per second), max 64 frames
If a video is too long, frames are sampled uniformly so that the total fits within 64
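The video rule above (1 FPS, at most 64 frames, uniform sampling for long clips) might look like this in code. It is a sketch under assumptions: the function name `sample_frame_times` is hypothetical, timestamps are taken at whole-second offsets, and the exact index selection used by Oryx is not given in these notes.

```python
def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 64) -> list[float]:
    """Return the timestamps (in seconds) of the frames to keep.

    Short videos keep every frame at `fps`; longer ones are thinned out
    uniformly so that exactly `max_frames` timestamps remain.
    """
    n_candidates = max(1, int(duration_s * fps))
    if n_candidates <= max_frames:
        return [i / fps for i in range(n_candidates)]

    # Too long: spread max_frames picks evenly over the candidate frames.
    step = n_candidates / max_frames
    return [int(i * step) / fps for i in range(max_frames)]


if __name__ == "__main__":
    print(len(sample_frame_times(10)))    # short clip: one frame per second
    print(len(sample_frame_times(3600)))  # one-hour video: capped at max_frames
```

With these defaults, a one-hour video ends up with one frame roughly every 56 seconds, which is worth keeping in mind when reading the long-video benchmark numbers.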
General Video Understanding
| Model | Size | VideoMME | NextQA | MVBench | PercepTest | MMB-Video | VCG | VDC |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 59.9/63.3 | - | 43.7 | - | 1.53 | 4.06 | 4.00 |
| GPT-4o (OpenAI, 2024) | - | 71.9/77.2 | - | - | - | 1.63 | - | - |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | 75.0/81.3 | - | - | - | 1.30 | - | - |
| Open-Sourced Video MLLMs | | | | | | | | |
| VideoChat2-HD (Li et al., 2024c) | 7B | 45.3/55.7 | 79.5 | 62.3 | 47.3 | 1.18 | 3.10 | - |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 47.9/50.3 | - | 54.6 | 51.4 | - | 3.13 | - |
| LLaVA-OneVision (Li et al., 2024) | 7B | 58.2/61.5 | 79.4 | 56.7 | 49.7 | - | 3.51 | 3.75 |
| Kangaroo (Liu et al., 2024e) | 8B | 56.0/57.6 | - | 61.1 | - | 1.44 | - | - |
| VideoCCAM (Fei et al., 2024) | 9B | 53.9/56.1 | - | 64.6 | - | - | - | - |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | 52.0/54.9 | 70.2 | - | 51.6 | - | 3.34 | 3.48 |
| PLLaVA (Xu et al., 2024a) | 34B | - | - | 58.1 | - | - | 3.48 | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 60.1/61.1 | 67.9 | - | 54.0 | - | 3.36 | 3.37 |
| VideoLLaMA2 (Cheng et al., 2024) | 72B | 61.4/63.1 | - | 62.0 | 57.5 | - | 3.16 | - |
| LLaVA-OneVision (Li et al., 2024) | 72B | 66.2/69.5 | 80.2 | 59.4 | 66.9 | - | 3.62 | 3.60 |
| Oryx | 7B | 58.3/62.6 | 81.9 | 63.9 | 68.6 | 1.47 | 3.53 | 3.76 |
| Oryx | 34B | 63.2/67.4 | 83.5 | 64.7 | 71.4 | 1.49 | 3.51 | 3.66 |
General Video Understanding (Long)
| Model | Size | MLVU | LongVideoBench | VideoMME-Long (w/o subs) | VideoMME-Long (w subs) |
|---|---|---|---|---|---|
| Proprietary Models | | | | | |
| GPT-4V (OpenAI, 2023b) | - | 49.2 | 60.7 | 53.5 | 56.9 |
| GPT-4o (OpenAI, 2024) | - | 64.6 | 66.7 | 65.3 | 72.1 |
| Gemini-1.5-Pro (GeminiTeam, 2024) | - | - | 64.4 | 67.4 | 77.4 |
| Open-Sourced Video MLLMs | | | | | |
| VideoLLaMA2 (Cheng et al., 2024) | 7B | 48.5 | - | 42.1 | 43.8 |
| LongVA (Zhang et al., 2024a) | 7B | 56.3 | - | 46.2 | 47.6 |
| LLaVA-OneVision (Li et al., 2024a) | 7B | 64.7 | - | - | - |
| Kangaroo (Liu et al., 2024e) | 8B | 61.0 | 54.8 | 46.6 | 49.3 |
| LongVILA (Xue et al., 2024) | 8B | - | - | 39.7 | - |
| VideoCCAM (Fei et al., 2024) | 14B | 63.1 | - | 46.7 | 49.9 |
| LLaVA-Next-Video (Zhang et al., 2024c) | 34B | - | 50.5 | - | - |
| PLLaVA (Xu et al., 2024a) | 34B | - | 53.2 | - | - |
| VILA-1.5 (Lin et al., 2023b) | 40B | 56.7 | - | 53.8 | 55.7 |
| LLaVA-OneVision (Li et al., 2024a) | 72B | 66.4 | 61.3 | 60.0 | 62.4 |
| Oryx | 7B | 67.5 | 55.3 | 50.3 | 55.8 |
| Oryx | 34B | 70.8 | 62.2 | 53.9 | 58.0 |
General Image Understanding
Wow, DocVQA goes above 90... and TextVQA reaches 77.8 too
| Model | Size | MMBench | MMMU | DocVQA | OCRBench | AI2D | TextVQA |
|---|---|---|---|---|---|---|---|
| Deepseek-VL (Lu et al., 2024) | 7B | 73.2 | 36.6 | - | 456 | - | 64.7 |
| Monkey (Li et al., 2024d) | 7B | 72.4 | 40.7 | - | 534 | 68.5 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 8B | 72.1 | 41.7 | 78.2 | 531 | 71.6 | - |
| Bunny-LLama3 (He et al., 2024) | 8B | 77.2 | 43.3 | - | 444 | 69.4 | - |
| Cambrian-1 (Tong et al., 2024) | 8B | 75.9 | 42.7 | 77.8 | 624 | 73.6 | 71.7 |
| VILA-1.5 (Lin et al., 2023b) | 8B | 75.3 | 38.6 | - | - | - | 68.5 |
| Idefics2 (Laurençon et al., 2024) | 8B | 76.7 | 43.0 | - | - | - | 73.0 |
| Yi-VL (Young et al., 2024) | 34B | - | 45.1 | - | 290 | 65.9 | - |
| LLaVA-NeXT (Liu et al., 2024c) | 34B | 79.3 | 49.7 | 84.0 | 574 | 74.9 | - |
| Cambrian-1 (Tong et al., 2024) | 34B | 81.4 | 49.7 | 75.5 | 600 | 79.7 | 76.7 |
| VILA-1.5 (Lin et al., 2023b) | 40B | 82.4 | 51.9 | - | - | - | 73.4 |
| Oryx | 7B | 81.4 | 43.9 | 89.0 | 672 | 78.5 | 75.0 |
| Oryx | 34B | 84.5 | 50.3 | 91.4 | 743 | 81.0 | 77.8 |
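Conclusion
Once the SFT data is released, it would be worth checking whether the results are reproducible.
If training and evaluation can be verified... could this not serve as an architecture for image and video MLLMs?
Personal thoughts...
The pace of progress is too fast... (a pity, and yet a relief, that it supports English and Chinese)
The MLLM performance is clear, but what about LLM performance when there is no image or video input? Is it preserved?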
Some Links
Model Zoo
| Model | Link | Size | Visual Encoder | LLM-Type | Intermediate Model |
|---|---|---|---|---|---|
| Oryx-7B | Huggingface | 7B | Oryx-ViT | Qwen-2-7B | Oryx-7B-Image |
| Oryx-34B | Huggingface | 34B | Oryx-ViT | Yi-1.5-34B | Oryx-34B-Image |
Ge