What matters when building vision-language models?

The title is unusual, but this is the Idefics2 paper.

Insight

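Covers the process behind building Idefics2 (exploring the design space).
One characteristic of MLLM modeling papers is that few of them define a problem and then describe the process of solving it.
Many instead take an ablation-driven approach aimed at benchmark scores, assembling whichever combination performs well, plus a dataset.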

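Interest in vision-language models (VLMs) is growing thanks to advances in large language models and vision transformers, yet important VLM design decisions are often left unjustified.
To address this, the authors run extensive experiments on pre-trained models, architecture choices, data, and training methods, and the result is Idefics2, an efficient foundational VLM with 8 billion parameters.
Idefics2 reaches state-of-the-art performance in its size class across a range of multimodal benchmarks and is often on par with models four times its size. The base, instructed, and chat versions are released together with the training datasets.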

Which affects VLM performance more, the LLM or the vision encoder? At a similar parameter count, the quality of the LLM has a larger effect on the final VLM than the quality of the vision encoder.

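How should visual features be used to make the model multimodal? Cross-attention, or a fully autoregressive architecture (input concatenation)?
- When the LLM and vision encoder are kept frozen, cross-attention performs better
- Once the LLM and vision encoder are trained, even PEFT alone improves performance substantially
- In the fully autoregressive architecture, full fine-tuning of the LLM and vision encoder can cause divergence
- In the fully autoregressive architecture, LoRA can be used to stabilize training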
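
A minimal sketch of the LoRA idea referred to in the last bullet, using the standard LoRA formulation rather than Idefics2's actual training code. The dimensions, `alpha`, and the helper names `matvec` and `lora_forward` are illustrative assumptions. The pre-trained weight `W` stays frozen and only the small matrices `A` and `B` are trained; because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=16.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen pre-trained weight.
    A (r x d_in) and B (d_out x r) are the trainable low-rank factors.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r = 4, 3, 2  # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: no change at step 0
x = [1.0, -2.0, 0.5, 3.0]
y = lora_forward(W, A, B, x)
```

With `B` at zero, `y` equals `matvec(W, x)`, so training starts from the pre-trained behaviour and moves away from it only through a low-rank update.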

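Is it OK to reduce the number of visual tokens? Reducing the number of visual tokens with learned pooling significantly improves compute efficiency at both training and inference time, and also improves downstream task performance.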
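
A rough sketch of what learned pooling can look like: a fixed number of learned query vectors attend over all patch features, so the output length equals the number of queries regardless of how many patches come in. This is a simplified single-head, single-layer version with no key/value projections, written as an assumption-laden illustration and not as the paper's pooling module; the sizes and the function name `learned_pooling` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def learned_pooling(patch_feats, latent_queries):
    """Pool N patch features into K vectors, one per learned query (K < N)."""
    d = len(latent_queries[0])
    pooled = []
    for q in latent_queries:
        # Scaled dot-product attention scores of this query over all patches.
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(d)
                  for p in patch_feats]
        weights = softmax(scores)
        # Weighted average of the patch features.
        pooled.append([sum(w * p[j] for w, p in zip(weights, patch_feats))
                       for j in range(d)])
    return pooled


random.seed(0)
n_patches, n_latents, dim = 16, 4, 8  # illustrative sizes only
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
# In a real model these queries would be trained parameters.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_latents)]
visual_tokens = learned_pooling(patches, queries)
```

Here 16 patch features are reduced to 4 visual tokens, which is where the saving in LLM sequence length comes from.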

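Should the aspect ratio of the input image be preserved? Keeping the original aspect ratio and resolution, instead of resizing to a fixed square size, does not degrade performance (and speed/memory efficiency improves) ... though is an average drop of about 1% really small enough to call "no degradation"?
What does splitting the input image do? Image splitting lets you trade compute efficiency for performance, and the gains are most noticeable on text-related tasks.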
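
To make the compute trade-off concrete, here is a small sketch that assumes a 2x2 grid of crops plus the full image as an extra view. The grid size, the example resolution, the per-view token count, and all function names are illustrative assumptions, not values taken from these notes.

```python
def split_boxes(width, height, rows=2, cols=2):
    """Return (left, top, right, bottom) boxes tiling the image in a grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def views_with_splitting(width, height, rows=2, cols=2):
    """Grid crops plus the whole image, kept as a global-context view."""
    return split_boxes(width, height, rows, cols) + [(0, 0, width, height)]


def visual_token_budget(num_views, tokens_per_view):
    """Total visual tokens fed to the LLM when each view is encoded separately."""
    return num_views * tokens_per_view


views = views_with_splitting(980, 700)
tokens_split = visual_token_budget(len(views), tokens_per_view=64)
tokens_single = visual_token_budget(1, tokens_per_view=64)
```

Under these assumptions splitting multiplies the visual token count by the number of views, which is the cost paid for seeing fine details such as small text at higher effective resolution.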

Idefics2 starts from SigLIP-SO400M and Mistral-7B-v0.1 and is pre-trained on 3 types of data.

Interleaved image-text documents
- They use OBELICS.... must be nice to have that..
Image-text pairs
- a combination of high-quality human-annotated image-text pairs from PMD and higher-noise web-scale image-text pairs from LAION-5B.
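- Alt-text is too noisy, so they trained on captions produced by a captioning model instead, which improved performance.
PDF Documents
- Many VLMs do not have particularly good OCR performance
- PDF data is used to improve OCR performance
  - 19 million industry documents from OCR-IDL and 18 million pages from PDFA
  - A Rendered Text dataset is used for robustness to varied colors, fonts, and backgrounds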

Performance on text-related tasks

They create and release The Cauldron, a massive collection of 50 vision-language datasets.

They instruction-tune the base model using DoRA (Liu et al., 2024), a variant of LoRA.
To lower the risk of overfitting:
- They add noise to the embeddings with the NEFTune (Jain et al., 2024) technique
- They randomly scale up the resolution of the images during training
- They shuffle the multiple user/assistant turns randomly before feeding the example to the model
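
Two of the regularizers above are easy to sketch. The noise function follows the published NEFTune recipe (uniform noise in [-1, 1] scaled by alpha / sqrt(L * d), applied to the embeddings during training only). The turn shuffling assumes each user/assistant pair is shuffled as a unit so a question stays with its answer; that reading, the `alpha` value, and the function names are assumptions, not details confirmed by these notes.

```python
import math
import random


def neftune_noise(embeddings, alpha=5.0, rng=random):
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(L * d).

    embeddings is an L x d list of lists (sequence length x hidden size).
    Intended for training time only; inference uses clean embeddings.
    """
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[v + scale * rng.uniform(-1.0, 1.0) for v in row]
            for row in embeddings]


def shuffle_turns(turns, rng=random):
    """Return a shuffled copy of (user, assistant) pairs, keeping pairs intact."""
    shuffled = list(turns)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)
emb = [[0.0] * 8 for _ in range(4)]  # toy 4 x 8 embedding matrix
noisy = neftune_noise(emb, alpha=5.0, rng=rng)

turns = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")]
mixed = shuffle_turns(turns, rng=rng)
```

Shuffling only makes sense when the turns are independent questions about the same image; a conversation whose later turns depend on earlier ones would need its order preserved.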

Performance