haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Discussion] Could you please provide the script to generate the dataset used in stage2? #268

Open LetsGoFir opened 1 year ago

LetsGoFir commented 1 year ago

Discussion

I cannot understand how the "fewshot_samples" in Table 10 are constructed. Could you please help me?

LetsGoFir commented 1 year ago
  1. Is "fewshot_samples" the output of ChatGPT/GPT-4? If so, how do you put the information for each image (captions, bounding boxes, etc.) into "context"?

You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image.

Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers.

Include questions asking about the visual content of the image, including the object types, counting the objects, object actions, object locations, relative positions between objects, etc. Only include questions that have definite answers:

(1) one can see the content in the image that the question asks about and can answer confidently;

(2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently.

Also include complex questions that are relevant to the content in the image, for example, asking about background knowledge of the objects in the image, asking to discuss about events happening in the image, etc. Again, do not ask about uncertain details. Provide detailed answers when answering complex questions. For example, give detailed examples or reasoning steps to make the content more convincing and well-organized. You can include multiple paragraphs if necessary.

  2. Why is the "query" appended to "messages" last? messages.append({"role":"user", "content":'\n'.join(query)})
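
For readers following the question, here is a minimal sketch of the few-shot chat pattern being asked about. It is not the authors' script. It assumes each few-shot sample is a dict with "context" and "response" text fields, and the helper names build_context and build_messages, as well as the caption-plus-box text layout, are hypothetical. In a chat-style API the model replies to the final user turn, so the earlier user/assistant pairs act as demonstrations and the new query has to come last.

```python
def build_context(captions, boxes):
    """Join captions and box lines into one text block.

    The layout (one caption per line, then "name: [x1, y1, x2, y2]")
    is an assumption made for illustration only.
    """
    lines = list(captions)
    for name, (x1, y1, x2, y2) in boxes:
        lines.append(f"{name}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]")
    return "\n".join(lines)


def build_messages(system_prompt, fewshot_samples, query):
    """Assemble a chat message list: system, few-shot pairs, then the query."""
    messages = [{"role": "system", "content": system_prompt}]
    for sample in fewshot_samples:
        # Each demonstration is a (user context, assistant response) pair.
        messages.append({"role": "user", "content": sample["context"]})
        messages.append({"role": "assistant", "content": sample["response"]})
    # The new image's description goes last so the model answers it.
    messages.append({"role": "user", "content": "\n".join(query)})
    return messages


if __name__ == "__main__":
    sample = {
        "context": build_context(["A dog on grass."], [("dog", (0.1, 0.2, 0.5, 0.9))]),
        "response": "Question: What animal is shown?\n===\nAnswer: A dog.",
    }
    msgs = build_messages("You are an AI visual assistant.", [sample], ["A cat on a sofa."])
    for m in msgs:
        print(m["role"], "->", m["content"][:40])
```

Under these assumptions, "response" would hold a hand-written or previously generated example conversation, and "context" would hold the text description of the matching image.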