Codebase for the submission "Language-Guided Audio-Visual Source Separation via Trimodal Consistency".
The code is developed under the following configurations.
(change [--num_gpus NUM_GPUS] accordingly)
Prepare video dataset.
a. Download the MUSIC dataset from: https://github.com/roudimit/MUSIC_dataset
b. Download videos.
Preprocess videos. You can do this in your own way as long as the resulting index files follow the same format.
a. Extract frames at 8 fps and waveforms at 11025 Hz from the videos. We use the following directory structure:
data
├── audio
│   ├── acoustic_guitar
│   │   ├── M3dekVSwNjY.mp3
│   │   ├── ...
│   ├── trumpet
│   │   ├── STKXyBGSGyE.mp3
│   │   ├── ...
│   ├── ...
│
└── frames
    ├── acoustic_guitar
    │   ├── M3dekVSwNjY.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── trumpet
    │   ├── STKXyBGSGyE.mp4
    │   │   ├── 000001.jpg
    │   │   ├── ...
    │   ├── ...
    ├── ...
b. Make training/validation index files by running:
python scripts/create_index_files.py
It will create the index files train.csv/val.csv with the following format:
./data/audio/acoustic_guitar/M3dekVSwNjY.mp3,./data/frames/acoustic_guitar/M3dekVSwNjY.mp4,1580
./data/audio/trumpet/STKXyBGSGyE.mp3,./data/frames/trumpet/STKXyBGSGyE.mp4,493
For each row, it stores the information: AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES
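The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/create_index_files.py: the helper names (ffmpeg_commands, build_index_rows, write_index) are hypothetical, and the ffmpeg options are one plausible way to get 8 fps frames and 11025 Hz audio. Paths are written relative to whatever data_root you pass in, so they may differ slightly from the ./data/... prefix shown above.

```python
import csv
from pathlib import Path


def ffmpeg_commands(video, audio_out, frames_dir, fps=8, sample_rate=11025):
    """Build (but do not run) two ffmpeg commands for one video.

    Assumption: ffmpeg is installed; frames are named 000001.jpg, 000002.jpg, ...
    """
    frames_cmd = [
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps}",                  # resample video to `fps` frames per second
        str(Path(frames_dir) / "%06d.jpg"),   # zero-padded frame names, numbered from 1
    ]
    audio_cmd = [
        "ffmpeg", "-i", str(video),
        "-vn",                                # drop the video stream
        "-ar", str(sample_rate),              # resample audio to `sample_rate` Hz
        str(audio_out),
    ]
    return frames_cmd, audio_cmd


def build_index_rows(data_root):
    """Return (AUDIO_PATH, FRAMES_PATH, NUMBER_FRAMES) for clips that have both."""
    data_root = Path(data_root)
    rows = []
    for audio in sorted((data_root / "audio").glob("*/*.mp3")):
        # frames live in data/frames/<category>/<video_id>.mp4/ (a directory)
        frames = data_root / "frames" / audio.parent.name / (audio.stem + ".mp4")
        if not frames.is_dir():
            continue  # skip audio files without extracted frames
        n_frames = len(list(frames.glob("*.jpg")))
        rows.append((audio.as_posix(), frames.as_posix(), n_frames))
    return rows


def write_index(rows, csv_path):
    """Write rows in the AUDIO_PATH,FRAMES_PATH,NUMBER_FRAMES format."""
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

To run the extraction, create the frames directory first and pass each command to subprocess.run(cmd, check=True). How the rows are split into train.csv and val.csv is left open here, since the split used by the actual script is not described in this README.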
Train the default model.
./scripts/train_bimodal_cyclic_losses_solos_music.sh
During training, visualizations are saved in HTML format under ckpt/MODEL_ID/visualization/.
We have observed that fine-tuning the separation model with the latent captions at a very low learning rate further improves performance. More details will come soon.