[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

TL;DR

I read this because.. : a model trained on data generated with GPT4-V
task : VLM
problem : instruction data is too noisy
idea : collect captions with GPT4-V, then train a captioner on them and use the captioner's outputs for alignment
input/output : image - (api call) -> GPT4V caption => trained LLaVA-1.5 style
architecture : LLaVA-1.5
objective : ce loss
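baseline : to measure the effect of the data, it is added to the training of LLaVA-7B / LLaVA-1.5-7B(13B) / Qwen-VL-Chat-7B. Separately, the LLaVA-1.5 architecture is taken as-is with slightly changed training details and run through pretraining and finetuning; SOTA is reported in all cases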
data : image={LAION-400M, COCO, SBU, SAM, TextCaps}, text={GPT4-V call}
evaluation : SEED, VizWiz, VQA-v2, SQA, QBench, MM-Vet, MMBench-CN, MMBench, MME_cog, MME_per, LLaVA-Bench
result : SOTA
contribution : data released, model released. Emphasizes that data matters more than architecture
etc. :

etc: SAM, TextCaps, WikiArt + 1K images from web-crawled data (split evenly between images of landmarks and images of celebrities). (These appear to have been crawled additionally.)

A different prompt was reportedly used for each type of data source.

100K captions were collected this way.
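
The per-source prompting mentioned above can be sketched roughly as follows. This is only an illustration: the prompt wording, the category names, and the dataset-to-category mapping are all hypothetical and are not the paper's actual prompts.

```python
# Hypothetical prompt templates keyed by image-source category.
# The wording is illustrative only; it is NOT the paper's prompt text.
BASE_PROMPT = "Describe this image in detail."

SOURCE_HINTS = {
    "general": "",
    "text": "Pay attention to any text visible in the image.",
    "art": "Mention the style and subject of the artwork.",
    "landmark": "Name the landmark if you recognize it.",
    "celebrity": "Name the person if you recognize them.",
}

# Assumed mapping from dataset name to category (not from the paper).
DATASET_CATEGORY = {
    "LAION-400M": "general",
    "COCO": "general",
    "SBU": "general",
    "SAM": "general",
    "TextCaps": "text",
    "WikiArt": "art",
    "web-landmark": "landmark",
    "web-celebrity": "celebrity",
}


def build_prompt(dataset: str) -> str:
    """Return the captioning prompt for an image from the given dataset."""
    # Unknown datasets fall back to the general prompt.
    category = DATASET_CATEGORY.get(dataset, "general")
    hint = SOURCE_HINTS[category]
    return f"{BASE_PROMPT} {hint}".strip()
```

The resulting string would then be sent together with the image in the GPT4-V API call; that call is left out here because its details are not given in these notes.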

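ShareGPT4V-PT: a separate model called ShareCaptioner was built and used to produce a 1.2M-caption dataset.
This reportedly took 44 A100 GPU days. Since no details about the model are given, could it be the same as the ShareGPT4V-7B model? There is also no information on things like whether the captions were further filtered or cleaned.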
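
For a rough sense of scale, the two figures above can be turned into a throughput estimate. This assumes the full 44 A100 GPU days went into generating the 1.2M captions, which is how the note reads but is not broken down further.

```python
# Figures taken from the note above.
NUM_CAPTIONS = 1_200_000
GPU_DAYS = 44

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: all GPU time was spent on caption generation.
captions_per_gpu_day = NUM_CAPTIONS / GPU_DAYS
seconds_per_caption = GPU_DAYS * SECONDS_PER_DAY / NUM_CAPTIONS

print(f"{captions_per_gpu_day:,.0f} captions per GPU day")   # 27,273
print(f"{seconds_per_caption:.2f} GPU-seconds per caption")  # 3.17
```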

Data used for this

Human evaluation across the three

more analysis

For a fair comparison, the 100K samples corresponding to 'detailed caption' in each baseline's original training data recipe were removed and this data was put in instead.
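
A minimal sketch of that swap, assuming each SFT sample is a dict with a `type` field and that the removed detailed-caption samples are replaced one-for-one. The field name, the `detailed_caption` tag, and the function name are hypothetical.

```python
import random


def swap_detailed_captions(sft_samples, new_captions, seed=0):
    """Drop 'detailed_caption' samples and add the same number of new ones.

    Assumption: replacement is one-for-one, so the total dataset size is
    unchanged whenever enough new captions are available.
    """
    kept = [s for s in sft_samples if s["type"] != "detailed_caption"]
    n_removed = len(sft_samples) - len(kept)

    # Sample without replacement, with a fixed seed for reproducibility.
    rng = random.Random(seed)
    n_new = min(n_removed, len(new_captions))
    replacement = rng.sample(new_captions, n_new)

    return kept + replacement


# Toy usage: 5 VQA samples + 3 detailed captions, 10 candidate new captions.
sft = [{"type": "vqa", "id": i} for i in range(5)]
sft += [{"type": "detailed_caption", "id": i} for i in range(3)]
new = [{"type": "sharegpt4v", "id": i} for i in range(10)]

mixed = swap_detailed_captions(sft, new)
```

Keeping the sample count fixed means any change in benchmark scores can be attributed to caption quality rather than to a larger training set.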

Effect of adding each dataset to training

Effect of training only the latter half of the vision encoder
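
A small sketch of what "only the latter half" could mean in practice, assuming it refers to the transformer blocks of the vision encoder. The helper name and the 24-block example are illustrative assumptions, not taken from the notes.

```python
def trainable_block_mask(num_blocks: int, trainable_fraction: float = 0.5):
    """Return one bool per block: True if that block should be trained.

    Only the last `trainable_fraction` of the blocks are marked trainable;
    the earlier blocks stay frozen.
    """
    num_trainable = int(num_blocks * trainable_fraction)
    first_trainable = num_blocks - num_trainable
    return [i >= first_trainable for i in range(num_blocks)]


# Example with a hypothetical 24-block encoder: blocks 12..23 are trained.
mask = trainable_block_mask(24)
```

In a framework such as PyTorch, a mask like this would typically be applied by setting `requires_grad` on each block's parameters.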