Pegasus-v1 Technical Report

Some Links

https://arxiv.org/abs/2404.14687

Pegasus-1 17B

Video LLM (with video encoder Marengo 2.6)

Video Encoder Model: inputs are video frames and video ASR (text)
Video-Language Alignment Model: aligns the video embeddings to the LLM
LLM: Transformer decoder
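
The three components above can be read as a simple data flow. The sketch below is only a toy illustration: the dimensions, the use of a linear projection for alignment, and every function name are assumptions made for this example, not details taken from the report.

```python
import random

VIDEO_DIM = 4  # hypothetical video-embedding size
LLM_DIM = 6    # hypothetical LLM hidden size


def encode_video(num_segments, seed=0):
    """Stand-in for the video encoder: one embedding per video segment."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(VIDEO_DIM)]
            for _ in range(num_segments)]


def make_projection(seed=1):
    """Stand-in alignment weights (LLM_DIM x VIDEO_DIM).

    A plain linear projection is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(VIDEO_DIM)]
            for _ in range(LLM_DIM)]


def align(video_embeddings, weights):
    """Map each video embedding into the LLM's embedding space."""
    return [[sum(w * x for w, x in zip(row, emb)) for row in weights]
            for emb in video_embeddings]


def build_decoder_input(aligned_video, text_token_embeddings):
    """Place the aligned video tokens in front of the text tokens."""
    return aligned_video + text_token_embeddings


video = encode_video(num_segments=3)
aligned = align(video, make_projection())
prompt = [[0.0] * LLM_DIM for _ in range(2)]  # two placeholder text tokens
decoder_input = build_decoder_input(aligned, prompt)
print(len(decoder_input), len(decoder_input[0]))
```

The point of the sketch is the shape change: video embeddings leave the encoder in one space and reach the decoder in the LLM's space, side by side with the text tokens.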

So what is Marengo?

Marengo 2.6

Multimodal Foundation Model for any-to-any search

A single embedding model that can embed video, image, and audio alike?
It is the first step toward the goal of making video as easy to process as text.

Text-To-Video, Text-To-Image, Text-To-Audio, Audio-To-Video, Image-To-Video
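
Any-to-any search with one embedding model usually comes down to nearest-neighbor lookup in a shared vector space. The following is a minimal sketch under that assumption; the three-dimensional vectors, the item ids, and the `search` helper are invented for illustration and are not Marengo outputs or APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_embedding, index, top_k=2):
    """Rank every indexed item, whatever its modality, against the query."""
    scored = [(cosine(query_embedding, emb), item_id)
              for item_id, emb in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical embeddings that all live in the same space.
index = {
    "video:cooking": [0.9, 0.1, 0.0],
    "image:dog": [0.1, 0.9, 0.1],
    "audio:rain": [0.0, 0.2, 0.9],
}
text_query = [0.8, 0.2, 0.1]  # hypothetical embedding of a text query
results = search(text_query, index)
print(results)
```

Because the query and the items share one space, the same `search` call would serve an audio or image query as well; only the embedding passed in changes.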

Spec

Minimum input: 4 seconds
Maximum input: 20 minutes
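
A client could check these limits before submitting a video. The helper below is a hypothetical sketch; treating both bounds as inclusive is an assumption, since the notes do not say.

```python
MIN_SECONDS = 4        # minimum input length from the spec above
MAX_SECONDS = 20 * 60  # maximum input length (20 minutes) in seconds


def is_supported_duration(seconds):
    """Return True if a clip length falls inside the stated range.

    Inclusive bounds are an assumption of this sketch.
    """
    return MIN_SECONDS <= seconds <= MAX_SECONDS


for length in (3, 4, 600, 1200, 1201):
    print(length, is_supported_duration(length))
```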

Training

Videos: 60M (60 million)
Images: 500M (500 million)
Audio: 500K (non-verbal sounds and music)

Evaluation

Baselines

Data Filtering Network-H/14-378 (Fang et al., Apple & University of Washington, 2023.09): This open-source image foundation model is based on the CLIP training objective. It was trained on 5 billion image-text pairs with a 378x378 image resolution.
LanguageBind-H (Zhu et al., Peking University, 2024.02): This open-source video foundation model processes both audio and visual information and was reportedly trained on 10 million video-text pairs (VIDAL-10M dataset).
VideoPrism-G (Zhao et al., Google, 2024.02): This video foundation model processes visual information and was reportedly trained on 618 million video-text pairs.

(Commercial) Google Gemini(GenAI) Multimodal Embedding API

Zero Shot Video Retrieval (ZS-T2V)

Zero Shot Video Retrieval	MSR-VTT R@1	MSR-VTT R@5	ActivityNet R@1	ActivityNet R@5
Peking Univ. LanguageBind-H (2024.01)	44.8%	70.0%	41.0%	68.4%
Google VideoPrism-G (2024.02)	39.7%	63.7%	52.7%	79.4%
Gemini Multimodal Embedding API (2024.02)	39.4%	63.1%	26.3%	49.8%
Ours (Marengo-2.6)	49.35% (+4.6%)	73.47% (+3.5%)	55.36% (+2.7%)	82.55% (+3.2%)
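
The gains in parentheses appear to be percentage-point differences against the best baseline in each column. The short check below recomputes them from the table values; the half-up rounding to one decimal is an assumption that happens to reproduce the printed figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Baseline scores copied from the table, in row order:
# LanguageBind-H, VideoPrism-G, Gemini Multimodal Embedding API.
baselines = {
    "MSR-VTT R@1": ["44.8", "39.7", "39.4"],
    "MSR-VTT R@5": ["70.0", "63.7", "63.1"],
    "ActivityNet R@1": ["41.0", "52.7", "26.3"],
    "ActivityNet R@5": ["68.4", "79.4", "49.8"],
}
marengo = {
    "MSR-VTT R@1": "49.35",
    "MSR-VTT R@5": "73.47",
    "ActivityNet R@1": "55.36",
    "ActivityNet R@5": "82.55",
}


def gain(metric):
    """Percentage-point gap between Marengo-2.6 and the best baseline."""
    best = max(Decimal(score) for score in baselines[metric])
    diff = Decimal(marengo[metric]) - best
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


gains = {metric: gain(metric) for metric in marengo}
for metric, value in gains.items():
    print(f"{metric}: +{value}")
```

Note that the best baseline differs by benchmark: LanguageBind-H on MSR-VTT and VideoPrism-G on ActivityNet.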

Zero Shot Image Retrieval (ZS-T2I)

Zero Shot Image Retrieval	MS-COCO Recall@1	MS-COCO Recall@5	Flickr30K R@1	Flickr30K R@5
Apple DFN-H/378 (2024.01)	55.6%	79.2%	82.1%	96.0%
Gemini Multimodal Embedding API (2024.02)	52.73%	75.80%	80.26%	94.28%
Ours (Marengo-2.6)	55.65% (-)	80.31% (+1.1%)	84.95% (+2.9%)	96.7% (+0.7%)

Zero Shot Audio Retrieval (ZS-T2A)

Zero Shot Audio Retrieval	Clotho R@1	Clotho R@10	AudioCaps R@1	AudioCaps R@10
Peking Univ. LanguageBind-H (2024.01)	16.7%	52.0%	19.7%	67.6%
Ours (Marengo-2.6)	17.61% (+0.9%)	52.25% (+0.3%)	23.01% (+3.3%)	69.43% (+1.8%)
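
All three retrieval tables report Recall@K (R@1, R@5, R@10): the fraction of queries whose correct item appears among the top K results. A minimal sketch follows, assuming exactly one relevant item per query; the rankings and ids are made up for illustration.

```python
def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant item is in the top-k results.

    Assumes exactly one relevant item per query.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_id_per_query)


# Hypothetical ranked results for four text queries.
rankings = [
    ["v1", "v7", "v3"],
    ["v4", "v2", "v9"],
    ["v5", "v6", "v3"],
    ["v8", "v1", "v4"],
]
ground_truth = ["v1", "v2", "v3", "v0"]

r_at_1 = recall_at_k(rankings, ground_truth, k=1)
r_at_3 = recall_at_k(rankings, ground_truth, k=3)
print(r_at_1, r_at_3)
```

Recall@K can only grow as K grows, which is why the R@5 and R@10 columns are always higher than R@1.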

OK, suppose the encoder is well built. What comes next?

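On inspection, what matters is the quality and granularity of the video captions (how much detail they capture).
100K high-quality pairs vs. 10M low-quality pairs
Catastrophic forgetting (the model losing what it learned earlier as it keeps learning new things) was severe, so only certain parts were trained while the rest were frozen, or the learning rate was tuned very carefully.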
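
Freezing some parts and giving others a carefully chosen learning rate can be sketched as per-group updates. The example below is a toy gradient step in plain Python; which components are frozen, the learning rates, and the weights are all hypothetical, since the notes do not say which parts were actually trained.

```python
def sgd_step(param_groups, grads):
    """Apply one SGD update, skipping frozen groups entirely."""
    for name, group in param_groups.items():
        if group["frozen"]:
            continue  # frozen weights keep their previously learned values
        group["weights"] = [
            w - group["lr"] * g
            for w, g in zip(group["weights"], grads[name])
        ]


# Hypothetical split: one frozen group, one fast group, one slow group.
param_groups = {
    "video_encoder": {"weights": [1.0, 2.0], "lr": 0.0, "frozen": True},
    "alignment": {"weights": [0.5, -0.5], "lr": 1e-2, "frozen": False},
    "llm": {"weights": [0.3, 0.7], "lr": 1e-4, "frozen": False},
}
grads = {name: [1.0, 1.0] for name in param_groups}

sgd_step(param_groups, grads)
for name, group in param_groups.items():
    print(name, group["weights"])
```

With the same gradient, the frozen group does not move at all and the low-learning-rate group moves a hundred times less than the other, which is the mechanism that limits how much earlier knowledge gets overwritten.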

Evaluation

Video Question Answering

	ActivityNet-QA Test Split (%)	NExT-QA Test Split (%)
Video-ChatGPT	35.2	-
VideoChat2	49.1	61.7
Gemini 1.0 Pro	49.8	28.0
Gemini 1.0 Ultra	52.2	29.9
Gemini 1.5 Pro	56.7	-
Pegasus-1	59.9	71.1

Video Conversations

	Correctness of Information	Detailed Orientation	Contextual Understanding	Temporal Understanding	Consistency	Average
Video-ChatGPT	2.40	2.52	2.62	1.98	2.37	2.38
VideoChat2	3.02	2.88	3.51	2.66	2.81	2.98
Gemini 1.0 Pro	2.98	2.99	3.44	2.32	2.32	2.81
Pegasus-1	3.79	3.76	4.29	3.34	4.03	3.84
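
The Average column appears to be the unweighted mean of the five dimension scores. The check below recomputes it from the table; rounding to two decimals is an assumption that matches the printed values.

```python
from math import fsum

# Per-dimension scores copied from the Video Conversations table, in order:
# correctness, detail, context, temporal, consistency.
scores = {
    "Video-ChatGPT": [2.40, 2.52, 2.62, 1.98, 2.37],
    "VideoChat2": [3.02, 2.88, 3.51, 2.66, 2.81],
    "Gemini 1.0 Pro": [2.98, 2.99, 3.44, 2.32, 2.32],
    "Pegasus-1": [3.79, 3.76, 4.29, 3.34, 4.03],
}

averages = {
    model: round(fsum(values) / len(values), 2)
    for model, values in scores.items()
}
for model, average in averages.items():
    print(f"{model}: {average:.2f}")
```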

Video Summarization

	Correctness of Information	Detailed Orientation	Contextual Understanding	Average
Vendor A	0.73	0.80	0.91	0.81
Whisper + ChatGPT-3.5	0.49	0.79	0.68	0.65
Video-ChatGPT	1.19	1.33	1.42	1.31
VideoChat2	1.78	1.52	1.98	1.76
Gemini 1.0 Pro	1.65	1.69	1.94	1.76
Pegasus-1	2.30	2.58	2.75	2.54
