Hi @jhj7905,
Thanks for your interest in our work and for your question about our data generation methods.
You're correct in your estimation. The dataset is composed of 30% human-annotated data and 70% semi-automatic annotations. As you pointed out, creating a fully human-annotated dataset is costly and time-consuming, so we complemented the human annotations with semi-automatic methods.
For the semi-automatic portion of our dataset, we developed a method that combines predictions from state-of-the-art (SOTA) models to extract relevant cues. We then used specific models to remove noisy or irrelevant context from the data. This process ensured that the data remained accurate and relevant despite not being fully human-annotated.
I hope this clarifies the methodology we used for data generation. Please feel free to reach out if you have any more questions.
@mmaaz60 @hanoonaR Thank you for sharing your great work. I have a question about video instruction data generation. As you mentioned in your paper, you built the video instruction dataset using both human-assisted and semi-automatic annotation methods. What share of the entire dataset does each method account for? I assume more than 70 percent of the dataset was created with semi-automatic annotation, since the human-assisted method is costly. Thank you in advance.