Open: YYX666660 opened this issue 1 year ago
Dear authors: After reading the paper, I really appreciate your great work and the open-source code, but I have a question about the data pre-processing. How were the files
Valid_Videos_Vis_Text.pickle
and Vision_Text_Labels.csv
generated? If I want to apply AVSGS to another dataset (the Fair-Play dataset), what should I do about the data pre-processing?

Hi @YYX666660,
Thanks for your interest in our work. The file Valid_Videos_Vis_Text.pickle lists the videos that remain after our pre-processing, which discards videos whose audio does not agree with the visuals, such as on-screen graphics playing while a baby cries in the background. You should feel free to design or customize such protocols for your own dataset. Each row of Vision_Text_Labels.csv records the label of the audio class, the label of the principal object in the Visual Genome dataset, the index of the frame where the most confident detection of this object was found, and which of the up to 20 objects detected in that frame corresponds to this principal object. Hope this helps!
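For concreteness, the per-row format described in the reply could be parsed with a minimal sketch like the one below. This assumes a plain four-column CSV in exactly the order described (audio class label, Visual Genome object label, frame index, object index within the frame's detections); the column order, any header row, and any extra columns (e.g. a video id) in the actual file are not confirmed by this thread.

```python
import csv
import io

# Hypothetical rows in the described format:
# audio_label, visual_genome_label, frame_index, object_index (0..19).
# The real Vision_Text_Labels.csv may differ in column order or count.
sample = io.StringIO(
    "dog_barking,dog,42,3\n"
    "guitar,guitar,17,0\n"
)

records = []
for audio_label, vg_label, frame_idx, obj_idx in csv.reader(sample):
    frame_idx, obj_idx = int(frame_idx), int(obj_idx)
    # At most 20 objects are detected per frame, so the index must be 0..19.
    assert 0 <= obj_idx < 20
    records.append((audio_label, vg_label, frame_idx, obj_idx))

print(records[0])  # → ('dog_barking', 'dog', 42, 3)
```

A similar loop over your own dataset's detections (keeping only videos whose audio matches the visuals, then recording the most confident detection per kept video) would reproduce the spirit of the authors' protocol on Fair-Play.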