Videos: 60M (60 million)
Images: 500M (500 million)
Audio: 500K (500 thousand), non-verbal sounds and music
Evaluation
baseline
Data Filtering Network-H/14-378 (Fang et al., Apple & University of Washington, 2023.09): This open-source image foundation model is based on the CLIP training objective. It was trained on 5 billion image-text pairs at a 378x378 image resolution.
LanguageBind-H (Zhu et al., Peking University, 2024.02): This open-source video foundation model processes both audio and visual information and was reportedly trained on 10 million video-text pairs (the VIDAL-10M dataset).
VideoPrism-G (Zhao et al., Google, 2024.02): This video foundation model processes visual information and was reportedly trained on 618 million video-text pairs.
(Commercial) Google Gemini (GenAI) Multimodal Embedding API
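For readers unfamiliar with the CLIP training objective mentioned for the DFN baseline, here is a minimal pure-Python sketch of the symmetric contrastive loss. It is an illustration under stated assumptions, not the DFN or Marengo training code: the function names, the toy embeddings, and the temperature of 0.07 are all chosen here for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct item."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to match text i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batch (made-up values): matched pairs point in the same direction.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
# Swapping the captions turns every pair into a mismatch.
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image is most similar to its own caption and high otherwise, which is what pushes paired embeddings together during training.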
Zero Shot Video Retrieval (ZS-T2V)
| Model | MSR-VTT R@1 | MSR-VTT R@5 | ActivityNet R@1 | ActivityNet R@5 |
|---|---|---|---|---|
| Peking Univ. LanguageBind-H (2024.01) | 44.8% | 70.0% | 41.0% | 68.4% |
| Google VideoPrism-G (2024.02) | 39.7% | 63.7% | 52.7% | 79.4% |
| Gemini Multimodal Embedding API (2024.02) | 39.4% | 63.1% | 26.3% | 49.8% |
| Ours (Marengo-2.6) | 49.35% (+4.6%) | 73.47% (+3.5%) | 55.36% (+2.7%) | 82.55% (+3.2%) |
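The R@1 and R@5 columns above are Recall@K scores: the share of text queries whose correct video appears among the top K retrieved results. A minimal sketch of the metric follows, assuming one ground-truth video per query (query i matches video i). The similarity values are made up for illustration; in practice they would come from the text and video embeddings.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose ground-truth video is in the top-k results.

    similarity[i][j] is the score between text query i and video j;
    the ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Video indices sorted from most to least similar to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores (illustrative only).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.6],  # query 1: correct video ranked 3rd
    [0.2, 0.7, 0.4],  # query 2: correct video ranked 2nd
]
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
print(f"R@1 = {r1:.2%}, R@2 = {r2:.2%}")
```

"Zero shot" here means the model is scored on these benchmarks without being fine-tuned on their training splits.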
Zero Shot Image Retrieval (ZS-T2I)
| Model | MS-COCO R@1 | MS-COCO R@5 | Flickr30K R@1 | Flickr30K R@5 |
|---|---|---|---|---|
| Apple DFN-H/378 (2024.01) | 55.6% | 79.2% | 82.1% | 96.0% |
| Gemini Multimodal Embedding API (2024.02) | 52.73% | 75.80% | 80.26% | 94.28% |
| Ours (Marengo-2.6) | 55.65% (-) | 80.31% (+1.1%) | 84.95% (+2.9%) | 96.7% (+0.7%) |
Zero Shot Audio Retrieval (ZS-T2A)
| Model | Clotho R@1 | Clotho R@10 | AudioCaps R@1 | AudioCaps R@10 |
|---|---|---|---|---|
| Peking Univ. LanguageBind-H (2024.01) | 16.7% | 52.0% | 19.7% | 67.6% |
| Ours (Marengo-2.6) | 17.61% (+0.9%) | 52.25% (+0.3%) | 23.01% (+3.3%) | 69.43% (+1.8%) |
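The values in parentheses next to the Marengo-2.6 scores appear to be percentage-point gains over the best baseline in each column. The notes do not state this; it is inferred from the numbers, which the short check below reproduces for the audio results. The function and variable names are made up for this example.

```python
# R@K values (%) copied from the audio retrieval results above.
columns = ["Clotho R@1", "Clotho R@10", "AudioCaps R@1", "AudioCaps R@10"]
baselines = {"LanguageBind-H": [16.7, 52.0, 19.7, 67.6]}
ours = [17.61, 52.25, 23.01, 69.43]

def gains_over_best(baselines, ours):
    """Percentage-point gain of `ours` over the best baseline in each column."""
    gains = []
    for col, score in enumerate(ours):
        best = max(scores[col] for scores in baselines.values())
        gains.append(round(score - best, 2))
    return gains

gains = gains_over_best(baselines, ours)
for name, gain in zip(columns, gains):
    print(f"{name}: {gain:+.2f} pp")
```

To one decimal place these match the reported +0.9, +0.3, +3.3, and +1.8. The same check works for the video and image results, where the best baseline differs from column to column.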
OK, suppose we have built a good encoder. What comes next?
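On closer inspection, what matters is the quality and granularity of the video captions (how much detail they express).
100K high-quality pairs vs. 10M low-quality pairs
Catastrophic forgetting (the model keeps forgetting what it learned earlier as it learns new things) was severe, so only specific parts were trained while the rest was frozen, or the learning rate was tuned very carefully.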
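The two mitigations mentioned above, freezing part of the model and using carefully chosen learning rates, can be sketched with a toy SGD step. This is only an illustration: the notes do not say which parts were frozen or which learning rates were used, so the module names (`video_encoder`, `alignment`, `decoder`), weights, gradients, and rates below are all hypothetical.

```python
def sgd_step(params, grads, learning_rates):
    """Apply one SGD step; groups without a learning rate are treated as frozen."""
    updated = {}
    for name, values in params.items():
        lr = learning_rates.get(name)
        if lr is None:
            # Frozen group: keep the pretrained weights untouched.
            updated[name] = list(values)
        else:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
    return updated

# Hypothetical module names and values, for illustration only.
params = {"video_encoder": [0.5, -0.2], "alignment": [0.1, 0.3], "decoder": [1.0, 2.0]}
grads = {"video_encoder": [1.0, 1.0], "alignment": [1.0, -1.0], "decoder": [0.5, 0.5]}
# video_encoder has no entry, so it stays frozen; decoder gets a much smaller rate.
learning_rates = {"alignment": 1e-2, "decoder": 1e-4}

new_params = sgd_step(params, grads, learning_rates)
print(new_params)
```

Frozen weights cannot drift away from what they learned earlier, and a small learning rate on the remaining parts limits how far a new training stage can overwrite existing knowledge.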
Some Links
Pegasus-1 17B
Video LLM (with video encoder Marengo 2.6)
So what is Marengo?
Marengo 2.6
Multimodal Foundation Model for any-to-any search
Spec
Training
Evaluation
baseline
Zero Shot Video Retrieval (ZS-T2V)
Zero Shot Image Retrieval (ZS-T2I)
Zero Shot Audio Retrieval (ZS-T2A)
OK, suppose we have built a good encoder. What comes next?
Evaluation
Video Question Answering
Video Conversations
Video Summarization