Closed lucasjinreal closed 5 months ago
Hi @lucasjinreal,
Thank you for your interest in our work. Almost 30% of the VideoInstruct data is human annotated, and the rest is generated using an automated data generation pipeline. In the dataset creation pipeline, the key frames are first fed to different SOTA vLLM models in order to capture the frame-level information. Then, all the frame-level information is fed to GPT-3.5 to generate detailed captions and conversation-style question answers.
Here, GPT-3.5 is not used to generate the information but to summarize the information collected from the different vLLMs. Although this may cause some discrepancies in the generated descriptions, the context and content of the video can be captured accurately. The quality of the generation therefore largely depends on the information from the vLLM models, not on GPT-3.5. Using better (newer) vLLMs may therefore further improve the data quality. I hope this clarifies your question. Thank you.
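The two-stage pipeline described above can be sketched roughly as follows. This is an illustrative outline only, not the authors' actual implementation: the function names, the uniform key-frame sampling, and the captioner/summarizer interfaces are all assumptions.

```python
# Rough sketch of the two-stage annotation pipeline:
#   stage 1 - several vision-language models caption each key frame;
#   stage 2 - a text LLM (GPT-3.5 in the paper) fuses those captions.
# All names and the sampling strategy here are hypothetical.

def extract_key_frames(video_frames, num_keys=4):
    """Uniformly sample key frames from a list of decoded frames."""
    if len(video_frames) <= num_keys:
        return list(video_frames)
    step = len(video_frames) / num_keys
    return [video_frames[int(i * step)] for i in range(num_keys)]

def caption_with_vlms(frame, vlm_captioners):
    """Stage 1: collect frame-level descriptions from several VLMs."""
    return [captioner(frame) for captioner in vlm_captioners]

def summarize_with_llm(all_frame_captions, summarizer):
    """Stage 2: the text LLM only summarizes the frame-level
    descriptions; it never sees the video itself."""
    flat = [cap for frame_caps in all_frame_captions for cap in frame_caps]
    return summarizer(flat)

def build_annotation(video_frames, vlm_captioners, summarizer, num_keys=4):
    keys = extract_key_frames(video_frames, num_keys)
    frame_caps = [caption_with_vlms(f, vlm_captioners) for f in keys]
    return summarize_with_llm(frame_caps, summarizer)
```

The point of the structure is the one made above: the summarizer only ever sees text produced by the VLMs, so the final caption quality is bounded by the frame-level models, not by GPT-3.5.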
@mmaaz60 Thank you for the answer. From the repo materials, I couldn't find any clue about which SOTA vLLMs were used here. Could you specify which ones were used for the rough generation?
The dataset is all generated from ChatGPT, which doesn't actually look at the video. How do you make sure the dataset is right?