mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

About Video Instruction Data Generation #30

Closed jhj7905 closed 11 months ago

jhj7905 commented 11 months ago

@mmaaz60 @hanoonaR Thank you for sharing your great work. I have a question about video instruction data generation. As mentioned in your paper, you built the video instruction dataset using both human-assisted and semi-automatic annotation methods. What is the ratio of each method in the entire dataset? My guess is that more than 70 percent of the dataset was created with semi-automatic annotation, since the human-assisted method is costly. Thank you in advance.

hanoonaR commented 11 months ago

Hi @jhj7905,

Thanks for your interest in our work and for your question about our data generation methods.

You're correct in your estimation. The dataset indeed comprises 30% human-annotated data and 70% semi-automatic annotations. As you pointed out, building a fully human-annotated dataset is costly and time-consuming, so we complemented it with semi-automatic methods.

For the semi-automatic portion of the dataset, we developed a comprehensive method that combines predictions from SOTA models to extract relevant cues, and we used dedicated models to eliminate noisy or irrelevant context from the data. This rigorous process kept the data accurate and relevant despite not being fully human-annotated.
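To illustrate the general idea, here is a minimal sketch of such a pipeline: cue predictions from several off-the-shelf models are merged, and low-confidence cues are dropped as likely noise. The model names, dictionary keys, and confidence threshold below are purely illustrative assumptions, not the authors' actual models or settings.

```python
# Hypothetical sketch: merge cue predictions from multiple models, then
# filter out low-confidence cues before composing instruction data.
# Model names and the 0.5 threshold are illustrative assumptions.

def merge_cues(predictions, min_confidence=0.5):
    """Combine per-model (cue, score) lists, keeping each cue's best score,
    then discard cues below the confidence threshold."""
    best = {}
    for model_name, cues in predictions.items():
        for cue, score in cues:
            if score > best.get(cue, 0.0):
                best[cue] = score
    # Drop noisy, low-confidence cues.
    return {cue: s for cue, s in best.items() if s >= min_confidence}

# Example predictions from three hypothetical off-the-shelf models.
predictions = {
    "captioner": [("a man cooking pasta", 0.92)],
    "object_detector": [("stove", 0.81), ("cat", 0.31)],
    "action_recognizer": [("cooking", 0.88)],
}
kept = merge_cues(predictions)
# The spurious "cat" detection (0.31) falls below the threshold and is removed.
```

The surviving cues would then be handed to an LLM (or a human reviewer, for the human-assisted portion) to compose the final question-answer pairs about the video.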

I hope this clarifies the methodology we used for data generation. Please feel free to reach out if you have any more questions.