[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

TL;DR

I read this because.. : very recent VLM
task : VLM + LLM
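problem : many multi-modal approaches freeze the LLM and in practice only try to do V+L well; the goal here is a model that does well on both V and L
idea : overall BLIP-2 style. The difference is that, inside the LLM, $W_K$, $W_V$, and Norm are separate per modality. The LLM is also tuned jointly.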
input/output : text + image -> text
architecture : CLIP ViT-L/14 + vision abstractor(=Q-former) + LLaMA-2 w/ Modality-Adaptive Module(MAM)
objective : ce loss
baseline : models built on 7B LLMs. BLIP-2, MiniGPT-4, LLaVA, mPLUG-Owl, InstructBLIP, Otter, Qwen-VL-Chat, LLaVA-1.5
data : 400M samples from {CC3M/12M, COCO, COYO, LAION-en, DataComp} for pretraining / {captioning (TextCaps, COCO), VQA (VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA), region-aware (RefCOCO, VisualGenome), multi-modal instruction (LLaVA-Instruct-150K), text-only instruction data (ShareGPT-80K, SlimOrca)} for instruction tuning
evaluation : caption / vqa / multimodal benchmark(MME, MMBench, MM-Vet, SEED-Bench, Q-Bench) / text benchmark(MMLU, BBH, AGIEval, ARC-c, ARC-e)
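result : SOTA on almost everything among 7B models. Because textual instruction data is also used, together with MAM, it improves over LLaMA2 even on pure text benchmarks
contribution : probably the first VLM that also improves text performance?
etc. : Alibaba seems to have a lot of money..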

The Vision Abstractor is essentially a Q-former.
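As a rough illustration of what "essentially a Q-former" means, the sketch below lets a fixed set of learnable queries cross-attend to the ViT patch features, so the number of visual tokens handed to the LLM does not depend on the number of patches. This is a simplified assumption, not the paper's implementation: it uses a single head and a single cross-attention layer (no self-attention or FFN), and `QueryAbstractor`, the query count, and the dimensions are hypothetical names and values chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class QueryAbstractor:
    """Hypothetical, minimal Q-former-style abstractor (single cross-attention)."""

    def __init__(self, num_queries, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # Learnable query tokens; randomly initialized here, trained in practice.
        self.queries = rng.normal(0.0, 1.0, size=(num_queries, dim))
        self.w_q = rng.normal(0.0, scale, size=(dim, dim))
        self.w_k = rng.normal(0.0, scale, size=(dim, dim))
        self.w_v = rng.normal(0.0, scale, size=(dim, dim))

    def __call__(self, patch_features):
        # patch_features: (num_patches, dim) output of the vision encoder.
        q = self.queries @ self.w_q
        k = patch_features @ self.w_k
        v = patch_features @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (num_queries, num_patches)
        return attn @ v  # (num_queries, dim), independent of num_patches


abstractor = QueryAbstractor(num_queries=4, dim=16)
small = abstractor(np.random.default_rng(1).normal(size=(257, 16)))
large = abstractor(np.random.default_rng(2).normal(size=(1025, 16)))
print(small.shape, large.shape)
```

Both calls return the same number of visual tokens, which is the property that keeps the LLM's input length bounded regardless of image resolution.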
The Modality-Adaptive Module essentially uses different weights / norms depending on the modality of the input. The query weight is shared, though. The W for images is newly initialized, so it is the part that gets trained during step-1 pretraining.
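A minimal sketch of the routing described above, assuming a single attention head and omitting the output projection, residual connection, and positional encoding. RMSNorm is used because LLaMA-2 uses it; `ModalityAdaptiveAttention` and the 0 = text / 1 = image convention are hypothetical names for this example, not the paper's code.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    # RMSNorm as in LLaMA-style decoders.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps) * gain


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ModalityAdaptiveAttention:
    """Hypothetical single-head sketch: shared W_Q, per-modality Norm / W_K / W_V."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda: rng.normal(0.0, dim ** -0.5, size=(dim, dim))
        self.w_q = init()             # shared across modalities
        self.w_k = [init(), init()]   # index 0 = text, 1 = image
        self.w_v = [init(), init()]
        self.gain = [np.ones(dim), np.ones(dim)]  # per-modality norm parameters

    def __call__(self, h, modality):
        # h: (seq, dim) hidden states; modality: (seq,) ints, 0 = text, 1 = image.
        normed, k, v = np.zeros_like(h), np.zeros_like(h), np.zeros_like(h)
        for m in (0, 1):
            mask = modality == m
            x = rms_norm(h[mask], self.gain[m])
            normed[mask] = x
            k[mask] = x @ self.w_k[m]
            v[mask] = x @ self.w_v[m]
        q = normed @ self.w_q  # the same query projection for every token
        scores = q @ k.T / np.sqrt(h.shape[-1])
        causal = np.tril(np.ones((len(h), len(h)), dtype=bool))
        scores = np.where(causal, scores, -np.inf)  # decoder-style causal mask
        return softmax(scores) @ v


attn = ModalityAdaptiveAttention(dim=8)
h = np.random.default_rng(1).normal(size=(5, 8))
text_only = np.zeros(5, dtype=int)
mixed = np.array([1, 1, 0, 0, 0])  # two image tokens followed by three text tokens
out_text = attn(h, text_only)
out_mixed = attn(h, mixed)
```

With a text-only input, the image-side weights are never touched, so in this sketch the text path is unaffected by how the image-side parameters are initialized or trained.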
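Training has two stages.
1) Pre-training: {CC3M/12M, COCO, COYO, LAION-en, DataComp} are used to train the vision encoder, the Q-former, and the newly initialized parts of the language decoder. The comparison with BLIP-2 is interesting: BLIP-2 takes a CLIP ViT and freezes the vision encoder, and its images come from similar sources but with re-captioned text (CapFilt). Here the vision encoder is not frozen, and relatively noisy alt-text-style captions are used as-is! In a sense, the kind of data CLIP already saw is being learned again in generation form.
2) Joint instruction tuning: everything is unfrozen and training uses only instruction data. What differs here is that text-only instruction data is included as well.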
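The two-stage schedule can be summarized as a table of which parameter groups are trainable in each stage. This is a sketch based only on the description in these notes; the group names (`vision_encoder`, `vision_abstractor`, `llm_visual_mam`, `llm_text_weights`) and the stage keys are hypothetical labels, not identifiers from the released code.

```python
# Hypothetical parameter groups, named after the components in these notes.
PARAM_GROUPS = [
    "vision_encoder",     # CLIP ViT-L/14
    "vision_abstractor",  # Q-former-style module
    "llm_visual_mam",     # newly initialized image-side Norm / W_K / W_V in the LLM
    "llm_text_weights",   # original LLaMA-2 weights
]

TRAINABLE = {
    # Stage 1: vision side plus the newly initialized parts of the decoder.
    "pretrain": {"vision_encoder", "vision_abstractor", "llm_visual_mam"},
    # Stage 2: everything is unfrozen.
    "instruction_tuning": set(PARAM_GROUPS),
}


def trainable_flags(stage):
    """Return {group_name: is_trainable} for the given training stage."""
    if stage not in TRAINABLE:
        raise ValueError(f"unknown stage: {stage!r}")
    return {name: name in TRAINABLE[stage] for name in PARAM_GROUPS}


for stage in ("pretrain", "instruction_tuning"):
    print(stage, trainable_flags(stage))
```

In a real training loop these flags would typically be applied by toggling gradient computation per parameter group; that wiring is left out here.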