Open JackHenry1992 opened 3 years ago
For the AVE dataset, we extract frames at 1 fps and use each image together with its corresponding 1-second audio clip as a pair for training and evaluation. The script for generating image pseudo-labels is generate_labelv.py; you can add the path to your PyTorch pretrained ResNet in that script for inference.
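The pairing step itself is not in the repo, but the pipeline described above (frames at 1 fps, each paired with its 1-second audio clip) could be sketched as follows. This is only an illustration under stated assumptions: the function name, the in-memory waveform representation, and the clip-dropping policy are mine, not the authors' code.

```python
import numpy as np

def make_pairs(waveform, sr, num_frames):
    """Slice audio into 1-second clips, one per frame extracted at 1 fps.

    waveform: 1-D array of audio samples (assumed mono)
    sr: sample rate in Hz
    num_frames: number of frames extracted at 1 fps
    Returns a list of (frame_index, audio_clip) pairs, where frame t is
    paired with the audio in [t, t+1) seconds.
    """
    pairs = []
    for t in range(num_frames):
        clip = waveform[t * sr:(t + 1) * sr]
        if len(clip) < sr:  # drop an incomplete trailing clip
            break
        pairs.append((t, clip))
    return pairs

# Toy example: a 10-second mono track at 16 kHz yields 10 pairs.
sr = 16000
audio = np.zeros(10 * sr, dtype=np.float32)
pairs = make_pairs(audio, sr, num_frames=10)
print(len(pairs))  # 10
```

In practice the frames would come from something like ffmpeg at 1 fps and the waveform from a standard audio loader; the slicing logic stays the same.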
Hello, thank you for providing generate_labelv.py.
Since each AVE video is 10 seconds long, one video should yield 10 audio/image pairs, right? I still don't know how to create the .pkl file of audio-image pairs. Could you share the script that builds the pkl file from the raw dataset?
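While waiting for the official script, serializing pairs to a .pkl file is straightforward with the standard pickle module. A minimal sketch follows; note that the record schema (dict keys like "image", "audio", "label") is a guess for illustration, since the repo does not document the pkl layout:

```python
import os
import pickle
import tempfile

def build_pkl(pair_records, out_path):
    """Serialize a list of audio-image pair records to a .pkl file.

    Each record is assumed (not confirmed by the repo) to be a dict like
    {"image": "frames/v0/f00.jpg", "audio": "audio/v0/c00.wav", "label": 0}.
    """
    with open(out_path, "wb") as f:
        pickle.dump(pair_records, f)

# Toy usage: two hypothetical pair records round-tripped through a temp file.
records = [
    {"image": "frames/v0/f00.jpg", "audio": "audio/v0/c00.wav", "label": 0},
    {"image": "frames/v0/f01.jpg", "audio": "audio/v0/c01.wav", "label": 0},
]
path = os.path.join(tempfile.gettempdir(), "pairs.pkl")
build_pkl(records, path)
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(len(loaded))  # 2
```

A Dataset class would then load this list once and index into it per sample; the real field names have to be confirmed against the repo's data loader.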
I also want to know how to get the audio/image pairs.
In your code, you feed a single image to the network, but the AVE dataset consists of videos; do you extract only one frame per video? I also cannot find the pkl file of audio-image pairs. Furthermore, could you provide the detailed scripts for generate_vlabel and the gt-data?