This is the official page of "MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation," presented at ICASSP 2023.
The camera-ready version of the paper has been uploaded to arXiv. Please check the icons below.
MedleyVox has now been uploaded to Zenodo! Please check this page: https://zenodo.org/record/7984549.
Since we provide the MedleyVox metadata in this repository, you can easily reconstruct MedleyVox with our code and existing copies of MedleyDB v1 and v2. Note that you have to manually check some directory parameters in the testset/testset_save code.
python -m testset.testset_save
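For intuition, the reconstruction step essentially cuts time segments out of MedleyDB stems according to the metadata. The sketch below shows that idea only; the metadata fields and file names here are hypothetical, and the real format and directory handling are defined in the testset/testset_save code.

```python
import numpy as np

def extract_segment(audio, sr, start_sec, end_sec):
    """Slice a mono waveform between start/end times given in seconds."""
    start = int(round(start_sec * sr))
    end = int(round(end_sec * sr))
    return audio[start:end]

# Hypothetical metadata entry: which stem to cut and where (illustration only).
entry = {"stem": "SongA_vocal1.wav", "start": 12.0, "end": 15.0}

sr = 44100
stem = np.zeros(sr * 30)  # stand-in for a loaded 30-second MedleyDB stem
seg = extract_segment(stem, sr, entry["start"], entry["end"])
print(len(seg))  # 3 seconds * 44100 Hz = 132300 samples
```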
Our code is heavily based on asteroid. You first have to install asteroid as a Python package:
pip install git+https://github.com/asteroid-team/asteroid
Then install the remaining packages in 'requirements.txt'. The fairseq package is not needed for training, but it is required for chunk-wise processing based on wav2vec representations, which is introduced in the last section of this page.
In the svs/preprocess folder, you can find a number of preprocessing scripts. For preparing the training data, most of them are simple downsample-and-save processes. For the validation data, you can skip this step because we already provide the JSON metadata files.
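A minimal sketch of such a downsample-and-save step, assuming polyphase resampling and a 24 kHz target rate (the actual target rate and I/O handling are set inside the preprocessing scripts):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def downsample(x, orig_sr=44100, target_sr=24000):
    """Polyphase resampling from orig_sr to target_sr."""
    g = gcd(orig_sr, target_sr)
    return resample_poly(x, target_sr // g, orig_sr // g)

x = np.random.randn(44100)  # 1 second of audio at 44.1 kHz
y = downsample(x)
print(len(y))  # 24000 samples, i.e. 1 second at 24 kHz
```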
For the mixture construction strategy, svs/main.py provides a total of 5 arguments covering 6 training-input construction strategies.
We first train the standard Conv-TasNet (for 200 epochs).
python -m svs.main --exp_name=your_exp_name --patience=50\
--use_wandb=True --mixture_consistency=mixture_consistency\
--train_loss_func pit_snr multi_spectral_l1
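For reference, --mixture_consistency=mixture_consistency constrains the estimated sources to sum exactly to the input mixture. A minimal numpy sketch of the uniform-weight version of that projection (an illustration of the idea, not the actual training code) is:

```python
import numpy as np

def mixture_consistency(est_sources, mixture):
    """Distribute the residual equally so the estimates sum to the mixture."""
    est_sources = np.asarray(est_sources)            # shape: (n_src, time)
    residual = mixture - est_sources.sum(axis=0)
    return est_sources + residual / est_sources.shape[0]

s1, s2 = np.random.randn(8), np.random.randn(8)
mix = s1 + s2 + 0.1                                  # estimates don't sum to mix
proj = mixture_consistency([s1, s2], mix)
print(np.allclose(proj.sum(axis=0), mix))            # True
```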
Then, we start joint training of the pre-trained Conv-TasNet and the cascaded iSRNet (for 30 epochs, with --reduced_training_data_ratio=0.1 for more frequent validation-loss checking).
python -m svs.main --exp_name=your_exp_name_iSRNet\
--start_from_best=True --reduced_training_data_ratio=0.1\
--gradient_clip=5 --lr=3e-5 --batch_size=8 --above_freq=3000\
--epochs=230 --lr_decay_patience=6 --patience=15\
--use_wandb=True --mixture_consistency=sfsrnet --srnet=convnext\
--sr_input_res=False --train_loss_func pit_snr multi_spectral_l1 snr\
--continual_train=True --resume=/path/to/your_exp_name
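For intuition on --above_freq=3000: the super-resolution stage refines the high-frequency band while the separator's output is kept below the cutoff. The following is only a schematic numpy illustration of merging two signals at a frequency cutoff, not the actual iSRNet/SFSRNet computation:

```python
import numpy as np

def merge_bands(low_est, high_est, sr, cutoff_hz=3000):
    """Keep frequencies below cutoff_hz from low_est and the rest from high_est."""
    n = len(low_est)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    L, H = np.fft.rfft(low_est), np.fft.rfft(high_est)
    merged = np.where(freqs < cutoff_hz, L, H)
    return np.fft.irfft(merged, n=n)

sr = 24000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 440 * t)    # stand-in for the separator output
high = np.sin(2 * np.pi * 5000 * t)  # stand-in for the SR-network output
out = merge_bands(low, high, sr)     # 440 Hz kept from `low`, 5 kHz from `high`
```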
Similar to the duet and unison separation model, we first train the standard Conv-TasNet (for 200 epochs). Note that you have to set a different --dataset argument.
python -m svs.main --exp_name=your_exp_name --patience=50\
--use_wandb=True --mixture_consistency=mixture_consistency\
--train_loss_func pit_snr multi_spectral_l1\
--dataset=multi_singing_librispeech
After that, also similar to the duet and unison separation model, we start joint training of the pre-trained Conv-TasNet and the cascaded iSRNet (for 30 epochs, with --reduced_training_data_ratio=0.1 for more frequent validation-loss checking).
python -m svs.main --exp_name=your_exp_name_iSRNet\
--start_from_best=True --reduced_training_data_ratio=0.1\
--gradient_clip=5 --lr=3e-5 --batch_size=8 --above_freq=3000\
--epochs=230 --lr_decay_patience=6 --patience=15\
--use_wandb=True --mixture_consistency=sfsrnet --srnet=convnext\
--sr_input_res=False --train_loss_func pit_snr multi_spectral_l1 snr\
--continual_train=True --resume=/path/to/your_exp_name\
--dataset=multi_singing_librispeech
We use a total of 13 different singing datasets (about 400 hours) and 460 hours of LibriSpeech data for training.
Dataset | Labels (same song (segment) by different singers) | Labels (different songs by the same singer) | Length [hours] | Notes |
---|---|---|---|---|
Children’s song dataset (CSD) | _ | ✓ | 4.9 | _ |
NUS | _ | ✓ | 1.9 | _ |
TONAS | _ | _ | 0.3 | _ |
VocalSet | _ | ✓ | 8.8 | _ |
Jsut-song | _ | ✓ | 0.4 | _ |
Jvs_music | _ | ✓ | 2.3 | _ |
Tohoku Kiritan | _ | ✓ | 1.1 | _ |
vocadito | _ | _ | 0.2 | _ |
Musdb-hq (train subset) | _ | ✓ | 2.0 | Single-singing regions were extracted using the annotations in the musdb-lyrics extension |
OpenSinger | _ | ✓ | 51.9 | _ |
MedleyDB v1 | _ | _ | 3.8 | For training, we only used the songs that are included in the musdb18 dataset. |
K_multisinger | ✓ | ✓ | 169.6 | _ |
K_multitimbre | ✓ | ✓ | 150.8 | _ |
LibriSpeech_train-clean-360 | _ | ✓ | 360 | _ |
LibriSpeech_train-clean-100 | _ | ✓ | 100 | _ |
We use the musdb-hq test subset and LibriSpeech_dev-clean as validation data.
Case | Description | Notes |
---|---|---|
1) | Different singing + singing | — |
2) | One singing + its unison | — |
3) | Different songs of the same singer | — |
4) | Different speech + speech | — |
5) | One speech + its unison | — |
6) | Different speech of the same speaker | — |
7) | Different speech + singing | — |
Currently, we have no plan to upload the pre-trained weights of our models.
python -m svs.test --singing_task=duet --exp_name=your_exp_name
Separate every audio file (.mp3, .flac, .wav) in --inference_data_dir:
python -m svs.inference --exp_name=your_exp_name\
--model_dir=/path/where/your/checkpoint/is\
--inference_data_dir=/path/where/the/input/data/is\
--results_save_dir=/path/to/save/output
If the input is too long, inference may be impossible due to a lack of VRAM, or separation performance may degrade. In that case, use --use_overlapadd. Among the --use_overlapadd options, "ola", "ola_norm", and "w2v" all work similarly to LambdaOverlapAdd in asteroid.
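For intuition, plain overlap-add chunks the input, runs the separator on each windowed chunk, and cross-fades the chunks back together. A toy numpy sketch (using an identity function in place of the separator and toy sizes; the real asteroid implementation handles multi-source outputs and normalization options) is:

```python
import numpy as np

def overlap_add_process(x, process, win_len=8, hop=4):
    """Chunk x with 50% overlap, run `process` on each windowed chunk,
    then overlap-add the results and renormalize by the window sum."""
    window = np.hanning(win_len)
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - win_len + 1, hop):
        chunk = x[start:start + win_len] * window
        out[start:start + win_len] += process(chunk)
        norm[start:start + win_len] += window
    return out / np.maximum(norm, 1e-8)

x = np.linspace(-1.0, 1.0, 64)
y = overlap_add_process(x, lambda c: c)  # identity stands in for the separator
```

With an identity "model", the output reconstructs the input everywhere the window sum is nonzero, which is the consistency property overlap-add relies on.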
In our paper, we analyzed several failure cases that standard OLA methods cannot handle. To handle these, we implemented some useful inference methods for chunk-wise processing based on voice activity detection (VAD).
--vad_method can be chosen between spectrogram-energy-based (spec) and py-webrtcvad-based (webrtc).
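The energy-based idea behind the spec option can be sketched as thresholding framewise energy; the snippet below is a deliberately simplified, hypothetical version (the repository's frame size, features, and threshold differ):

```python
import numpy as np

def energy_vad(x, frame_len=400, threshold=0.01):
    """Return a boolean per frame: True if mean energy exceeds the threshold."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
sig = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])  # silence, voice, silence
active = energy_vad(sig)  # middle frames flagged active, edges inactive
```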