rhgao / co-separation

Co-Separating Sounds of Visual Objects (ICCV 2019)

How to pre-process the music dataset? #6

Closed · avis-ma closed this 4 years ago

avis-ma commented 4 years ago

Hi, thank you very much for sharing the source code. From the code and the paper, I understand that each video in the dataset should be processed into 10s clips, and that the Faster R-CNN model should be run to produce a .npy file storing the detection results for 10 frames in each 10s clip. But I am still confused about how to process the audio and how to number the frames (1.png–10.png per clip, or continuously across all clips, e.g. 1.png ... 500.png ... 1000.png). About the audio: is it downsampled to 11025 Hz when cutting into 10s clips? It would be great if there were a script for preprocessing the dataset.

rhgao commented 4 years ago

Hi,

For MUSIC, each video is first divided into 10s clips. This can be done as below (just advance start_time in 10s steps until the end of the video is reached): `ffmpeg -i /videoPath -ss start_time -t 10 -c copy /video_clip_path`
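For reference, here is a minimal sketch of that loop in Python, calling ffmpeg through subprocess; the paths, file naming, and the ffprobe duration helper are illustrative assumptions, not part of the released code:

```python
import subprocess

def video_duration(video_path):
    """Return the video duration in seconds via ffprobe."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error", "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1", video_path,
    ])
    return float(out)

def split_into_clips(video_path, clip_dir, clip_len=10):
    """Cut a video into consecutive clip_len-second clips using stream copy."""
    duration = video_duration(video_path)
    start_time, clip_idx = 0, 0
    while start_time + clip_len <= duration:
        clip_path = f"{clip_dir}/clip_{clip_idx:04d}.mp4"
        subprocess.run([
            "ffmpeg", "-i", video_path, "-ss", str(start_time),
            "-t", str(clip_len), "-c", "copy", clip_path,
        ], check=True)
        start_time += clip_len
        clip_idx += 1
```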

To extract frames, I use 8 fps. It can be done as follows: `ffmpeg -i /video_clip_path -ss 0 -t 00:00:10 -vf "fps=8" /clipFrameDir/%06d.png` (therefore, the frames are numbered from 000001.png to 000080.png for a 10s video clip).
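A corresponding Python sketch for frame extraction (again with assumed paths; only the ffmpeg flags come from the command above):

```python
import os
import subprocess

def extract_frames(clip_path, frame_dir, fps=8):
    """Extract frames from a 10s clip at the given fps, zero-padded to 6 digits."""
    os.makedirs(frame_dir, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", clip_path, "-ss", "0", "-t", "00:00:10",
        "-vf", f"fps={fps}", os.path.join(frame_dir, "%06d.png"),
    ], check=True)
```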

All 10s audio clips are downsampled to an 11025 Hz sampling rate, and you can use the following script to preprocess the audio: https://github.com/facebookresearch/2.5D-Visual-Sound/blob/master/reEncodeAudio.py
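If you prefer not to use that script, the same effect can be approximated with plain ffmpeg; the sketch below (illustrative only, not the linked reEncodeAudio.py) extracts a clip's audio track as 11025 Hz mono WAV:

```python
import subprocess

def reencode_audio(clip_path, audio_path, sr=11025):
    """Strip the video stream and re-encode the audio as mono WAV at sr Hz."""
    subprocess.run([
        "ffmpeg", "-i", clip_path,
        "-vn",              # drop the video stream
        "-ac", "1",         # mono
        "-ar", str(sr),     # target sampling rate (11025 Hz)
        audio_path,
    ], check=True)
```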

The AudioSet data already come in the form of 10s clips.

avis-ma commented 4 years ago

@rhgao Thank you very much!

avis-ma commented 3 years ago

@rhgao Hi, from the paper I see your results on multisource videos. Does that mean 1 solo + 1 duet forms a training sample? And what is the test setting for the multisource results? Is testing also done in the 2-mix way? Thank you very much.

rhgao commented 3 years ago

@avis-ma, if you are referring to Table 1 in the paper, by multisource we mean the model is trained on multisource videos. Testing is still performed on the same single-source videos.

avis-ma commented 3 years ago

@rhgao Thank you very much for the response. I am a bit confused about the multisource training part. When training on solo+duet, for each selected solo and duet sample pair, must the duet's two categories include the solo's category? If so, should categories with no such intersection, such as accordion and erhu (which appear in the solo set but not in the duet set), be excluded from the test categories?