[119] Visual Instruction Tuning

paper

TL;DR

I read this because.. : llava 1.5를 읽기 위해
task : chatting VLM
problem : chatGPT처럼 multi-modal에서도 instruction-following하게 해보자
idea : language only GPT에 bbox와 caption을 넣고 QA를 만들게 함
input/output : image + Q -> A
architecture : LLaMA 13B + CLIP + projection
objective : ce loss
baseline : GPT-4, BLIP-2, OpenFlamingo
data : (feature alignment) CC3M 중에 filtering한 것 (e2e learning) COCO 이미지에 대해 캡션 및 bbox를 넣고 GPT4 혹은 chatGPT로 만든 insturction data or SicenceQA
evaluation : coco에서 sampling 하여 question을 작성하고 GPT-4가 bbox랑 caption이랑 question 받았을 때 내뽑은 answer를 GPT-4한테 다시 평가하라고 함.
result : Science QA에서 좋은 성능, BLIP-2 / OpenFlamingo / GPT-4가 못하는 상위 reasoning(유머 해석 등)을 잘함
contribution : assist하기 위해 instruct 데이터를 만든 아마 최초의 work. 오픈소스화를 잘해서 널리 쓰임
etc. :

Details

Instruction following data

COCO 이미지에 대해 caption과 bbox넣고 만듦
converstaion(58K) / detailed description(23K) / complex reasoning(77K)

이에 대한 ablation. detailed caption을 넣으면 chatbot 쪽 성능이 오른다. reasoning에 도움을 주는 듯 하다.

Training

input sequence

첫번째 question은 이미지가 먼저 나올 수도 있고 question이 먼저 나올 수도 있고 순서는 랜덤

pre-training feature alignment CC3M에서 595K image text만 Filtering + linear projection만 학습 caption을 그대로 사용하되 간단하게 instruction following 포맷으로 맞춤(single turn, image를 briefly 설명해달라고 요구) 이때 filtering 방법은 아래와 같음 (noun frequency별로 uniform하게 맞춤)

finetuning end-to-end vision encoder만 freeze 시키고 나머지 projection + LM을 학습

Ability

complex reasoning에 대해 만드는 방법이 궁금한데 아래와 같이 system prompt 넣었다고 하네

You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.

The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.

Create complex questions beyond describing the scene.
To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request.  Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.

Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.  

When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box.  Always answer as if you are directly looking at the image.

Ablations

ViT 마지막 레이어 vs 이전 레이어 -> 이전 레이어가 더 좋음
CoT 적용 즉 answer 다음 reasoning / Reasoning 다음 answer -> 수렴은 reasoning - answer가 더 빨랐으나 최종적인 성능은 그렇지 않았음
alignment learning 단계 없이 바로 학습 -> 성능 악화
LLM 13B에서 7B -> 성능 악화