[Discussion] Could you please provide the script to generate the dataset used in stage2?

Is "fewshot_samples" the output of ChatGPT/GPT-4? If so, how do you put each image's information (captions, bounding boxes, etc.) into "context"?
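To make the question concrete, here is a minimal sketch of one way the per-image information could be serialized into a "context" string. This is only a guess: the function name build_context, the one-caption-per-line layout, and the "category: [x1, y1, x2, y2]" box format with normalized coordinates are all assumptions of mine, not something confirmed by the repo.

```python
def build_context(captions, boxes):
    """Serialize captions and bounding boxes into one text block.

    Hypothetical helper (not from the LLaVA repo).
    captions: list of caption strings for one image.
    boxes: list of (category, [x1, y1, x2, y2]) pairs, coordinates
           assumed to be normalized to the range 0..1.
    """
    lines = list(captions)  # one caption per line
    if boxes:
        lines.append("")  # blank line separating captions from boxes
        for category, (x1, y1, x2, y2) in boxes:
            lines.append(f"{category}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
context = build_context(
    ["A man rides a bicycle down a street."],
    [("person", [0.681, 0.242, 0.774, 0.694])],
)
print(context)
```

If the actual script uses a different layout (for example, captions only for the conversation data), that is exactly the detail I am asking about.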

You are an AI visual assistant, and you are seeing a single image. What you see are provided with five sentences, describing the same image you are looking at. Answer all questions as you are seeing the image.

Design a conversation between you and a person asking about this photo. The answers should be in a tone that a visual AI assistant is seeing the image and answering the question. Ask diverse questions and give corresponding answers.

Include questions asking about the visual content of the image, including the object types, counting the objects, object actions, object locations, relative positions between objects, etc. Only include questions that have definite answers:

(1) one can see the content in the image that the question asks about and can answer confidently;

(2) one can determine confidently from the image that it is not in the image. Do not ask any question that cannot be answered confidently.

Also include complex questions that are relevant to the content in the image, for example, asking about background knowledge of the objects in the image, asking to discuss about events happening in the image, etc. Again, do not ask about uncertain details. Provide detailed answers when answering complex questions. For example, give detailed examples or reasoning steps to make the content more convincing and well-organized. You can include multiple paragraphs if necessary.

Why is "query" appended to "messages" last? messages.append({"role": "user", "content": '\n'.join(query)})
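For reference, this is a minimal sketch of how I understand the message list to be assembled. The wrapper function build_messages, the "response" key, and the sample data are assumptions of mine; only "messages", "fewshot_samples", "context", and "query" come from the snippet I am asking about. My current guess is that chat-style models reply to the final turn of the conversation, so the few-shot pairs act as worked examples and the new image's query has to come last, but I would like this confirmed.

```python
def build_messages(system_prompt, fewshot_samples, query):
    """Assemble a chat-style message list (hypothetical helper).

    system_prompt: the instruction text, e.g. the prompt quoted above.
    fewshot_samples: list of dicts with "context" and "response" keys
                     (the "response" key name is an assumption).
    query: list of text lines describing the new image.
    """
    messages = [{"role": "system", "content": system_prompt}]
    # Each few-shot sample becomes a user turn followed by an assistant turn.
    for sample in fewshot_samples:
        messages.append({"role": "user", "content": sample["context"]})
        messages.append({"role": "assistant", "content": sample["response"]})
    # The new query goes last so that it is the turn the model replies to.
    messages.append({"role": "user", "content": "\n".join(query)})
    return messages


# Made-up sample data, for illustration only.
messages = build_messages(
    "You are an AI visual assistant...",
    [{"context": "A dog runs on grass.", "response": "What animal is shown? A dog."}],
    ["A cat sits on a sofa.", "A cat is sleeping indoors."],
)
roles = [m["role"] for m in messages]
print(roles)
```

With one few-shot sample, the roles come out as system, user, assistant, user, with the query as the final user message.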

haotian-liu / LLaVA #268