long8v / PTIR

Paper Today I Read

[119] Visual Instruction Tuning #128

Open long8v opened 1 year ago

long8v commented 1 year ago
image

paper

TL;DR

Details

Instruction following data

image

Ablation on this data: adding detailed captions improves performance on the chatbot side. It seems to help reasoning.

Training

input sequence

image image

For the first question, either the image or the question can come first; the order is random.
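The random ordering of the first turn can be sketched as below. This is a minimal illustration, not the authors' code: the `<image>` placeholder string, the newline separator, and the function names `build_first_turn` and `build_instructions` are assumptions made for the example.

```python
import random

IMAGE_TOKEN = "<image>"  # assumed placeholder standing in for the image features


def build_first_turn(question, rng=random):
    """Return the first-turn instruction with image and question in random order."""
    parts = [IMAGE_TOKEN, question]
    if rng.random() < 0.5:
        parts.reverse()  # question first, image second
    return "\n".join(parts)


def build_instructions(questions, rng=random):
    """Only the first turn carries the image; later turns are the question alone."""
    if not questions:
        return []
    return [build_first_turn(questions[0], rng)] + list(questions[1:])
```

Passing a seeded `random.Random` as `rng` makes the ordering reproducible when building a dataset.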

image image image image

Ability

image

I was curious how the complex reasoning data was created; apparently they used a system prompt like the one below.

You are an AI visual assistant that can analyze a single image. You receive five sentences, each describing the same image you are observing. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y.

The task is to use the provided caption and bounding box information, create a plausible question about the image, and provide the answer in detail.

Create complex questions beyond describing the scene.
To answer such questions, one should require first understanding the visual content, then based on the background knowledge or reasoning, either explain why the things are happening that way, or provide guides and help to user's request.  Make the question challenging by not including the visual content details in the question so that the user needs to reason about that first.

Instead of directly mentioning the bounding box coordinates, utilize this data to explain the scene using natural language. Include details like object counts, position of the objects, relative position between the objects.  

When using the information from the caption and coordinates, directly explain the scene, and do not mention that the information source is the caption or the bounding box.  Always answer as if you are directly looking at the image.
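To make the coordinate convention in the prompt concrete, here is a small sketch that turns a pixel-space box into normalized (x1, y1, x2, y2) corners and renders the text context (captions plus boxes) that would accompany the system prompt. The input format (top-left x, y, width, height in pixels), the function names, and the `category: [x1, y1, x2, y2]` line layout are assumptions for illustration; the prompt itself only specifies normalized top-left and bottom-right corners.

```python
def normalize_box(x, y, w, h, img_w, img_h):
    """Convert a pixel box (top-left x, y, width, height) to corners in [0, 1]."""
    x1, y1 = x / img_w, y / img_h
    x2, y2 = (x + w) / img_w, (y + h) / img_h
    # Clamp so boxes that touch or cross the border stay inside [0, 1].
    return tuple(round(min(max(v, 0.0), 1.0), 3) for v in (x1, y1, x2, y2))


def build_context(captions, objects, img_w, img_h):
    """Join the captions and one 'category: [x1, y1, x2, y2]' line per object."""
    lines = list(captions)
    lines.append("")  # blank line between captions and boxes
    for category, (x, y, w, h) in objects:
        box = normalize_box(x, y, w, h, img_w, img_h)
        coords = ", ".join(f"{v:.3f}" for v in box)
        lines.append(f"{category}: [{coords}]")
    return "\n".join(lines)
```

For example, `build_context(["A dog on grass."], [("dog", (64, 48, 128, 96))], 640, 480)` yields the caption, a blank line, and `dog: [0.100, 0.100, 0.300, 0.300]`.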

Ablations

image

Play with demo

I played around with the demo at https://llava.hliu.cc/

image

It handles general descriptions well.

I asked it to do scene graph generation.

image image image

The predicates are no longer verbs..

image

It starts making things up..

Even when I give it a wrong example, it just includes it in the answer. Still, it produces triplets that roughly make sense.
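Checking triplets like these by hand gets tedious, so here is a rough sketch of how the responses could be parsed and screened. It assumes the model writes triplets as `(subject, predicate, object)` in parentheses, which is a guess about the output format rather than something the demo guarantees; `parse_triplets` and `unknown_entities` are hypothetical helper names.

```python
import re

# Matches "(subject, predicate, object)" with optional spaces around each part.
TRIPLET_RE = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")


def parse_triplets(response):
    """Extract (subject, predicate, object) tuples from free-form model output."""
    return [tuple(part.strip() for part in m.groups())
            for m in TRIPLET_RE.finditer(response)]


def unknown_entities(triplets, known_objects):
    """List subjects/objects that are not among the objects known to be in the image."""
    known = {name.lower() for name in known_objects}
    flagged = []
    for subj, _, obj in triplets:
        for entity in (subj, obj):
            if entity.lower() not in known and entity not in flagged:
                flagged.append(entity)
    return flagged
```

Entities flagged by `unknown_entities` are only candidates for hallucination; exact string matching will also flag harmless paraphrases.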

image image

Hallucination here too.. Visual Genome was probably in the training data, so let's bring in different data: this photo I took in Taiwan..

image image

I have no idea where the child is, but it is roughly right.

image image

This is a fairly clear sample, though.

image

It gradually starts to fall apart :(

image

After changing the prompt, it suddenly starts giving correct answers again...

With a good example it works fairly well. But where is the baby?

image

Next I upload a photo of a child taken in Busan, with very distinct relations, haha; it wouldn't look out of place in a benchmark.

image image

Perfect.

image

Slightly repetitive, but it doesn't say anything wrong.

image

It even makes sentimental remarks..

image image

Even when I drag the bit on way past the point of overkill, it plays along..

long8v commented 7 months ago

ViT 10th-layer features > 12th-layer features. Oversmoothing?!
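One simple way to probe the oversmoothing guess is to compare how similar the patch tokens are to each other at each layer. The sketch below uses mean pairwise cosine similarity as a proxy (values near 1 mean the tokens have collapsed toward the same direction). This metric and the function name are my own choice for illustration, not something measured in the paper or in these notes, and the token features are assumed to be available as plain lists of floats.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all pairs of token feature vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    pairs = list(combinations(tokens, 2))
    if not pairs:
        raise ValueError("need at least two tokens")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

If the 12th layer scored noticeably higher than the 10th on the same images, that would be consistent with oversmoothing, though it would not prove it is the reason the earlier layer works better.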