mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

About Video Instruction Data Generation #30

Closed jhj7905 closed 11 months ago

jhj7905 commented 11 months ago

@mmaaz60 @hanoonaR Thank you for sharing your great work. I have a question about video instruction data generation. As mentioned in your paper, you built the video instruction dataset using both human-assisted and semi-automatic annotation methods. What is the ratio of each method in the entire dataset? My guess is that more than 70 percent of the dataset was created with semi-automatic annotation, since the human-assisted method is costly. Thank you in advance.

hanoonaR commented 11 months ago

Hi @jhj7905,

Thanks for your interest in our work and for your question about our data generation methods.

You're correct in your estimation. The dataset indeed comprises 30% human-annotated data and 70% semi-automatic annotations. As you pointed out, building a fully human-annotated dataset is costly and time-consuming, so we complemented it with semi-automatic methods.

For the semi-automatic portion of the dataset, we developed a comprehensive method that combines predictions from SOTA models to extract relevant cues, and we used dedicated models to eliminate noisy or irrelevant context from the data. This rigorous process kept the data accurate and relevant despite not being fully human-annotated.
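To illustrate the general idea, here is a minimal sketch of such a pipeline: cue predictions from several off-the-shelf models are merged, and low-confidence cues are dropped as likely noise. The model names, dictionary keys, and confidence threshold below are purely illustrative assumptions, not the authors' actual models or settings.

```python
# Hypothetical sketch: merge cue predictions from multiple models, then
# filter out low-confidence cues before composing instruction data.
# Model names and the 0.5 threshold are illustrative assumptions.

def merge_cues(predictions, min_confidence=0.5):
    """Combine per-model (cue, score) lists, keeping each cue's best score,
    then discard cues below the confidence threshold."""
    best = {}
    for model_name, cues in predictions.items():
        for cue, score in cues:
            if score > best.get(cue, 0.0):
                best[cue] = score
    # Drop noisy, low-confidence cues.
    return {cue: s for cue, s in best.items() if s >= min_confidence}

# Example predictions from three hypothetical off-the-shelf models.
predictions = {
    "captioner": [("a man cooking pasta", 0.92)],
    "object_detector": [("stove", 0.81), ("cat", 0.31)],
    "action_recognizer": [("cooking", 0.88)],
}
kept = merge_cues(predictions)
# The spurious "cat" detection (0.31) falls below the threshold and is removed.
```

The surviving cues would then be handed to an LLM (or a human reviewer, for the human-assisted portion) to compose the final question-answer pairs about the video.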

I hope this clarifies the methodology we used for data generation. Please feel free to reach out if you have any more questions.