In this example, we demonstrate how to use PaddleAudio to train two types of networks for speaker verification: ResnetSE34 [1] and ECAPA-TDNN [2].
Install the requirements via
```bash
# install paddleaudio
git clone https://github.com/PaddlePaddle/models.git
cd models/PaddleAudio
pip install -e .
```
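You can sanity-check the installation with a bare import; no output means it succeeded:

```bash
python -c "import paddleaudio"
```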
Then clone this project,
```bash
git clone https://github.com/ranchlai/speaker-verification.git
cd speaker-verification
pip install -r requirements.txt
```
PyTorch is supported for inference only. Install librosa, torch, and torchaudio, download the checkpoint here, then run
```bash
python test_torch.py
```
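The inference dependencies can be installed with pip, e.g.:

```bash
pip install librosa torch torchaudio
```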
Following this example and this example, we use the dev split of VoxCeleb 1, which consists of 1,211 speakers, and the dev split of VoxCeleb 2, which consists of 5,994 speakers, for training. Thus there are 7,205 speakers in total in our training set.
Please download the two datasets from the official website and unzip all audio files into a folder, e.g., ./data/voxceleb/. Make sure there are 7,205 speaker subfolders (names prefixed with id) under the folder. You don't need to process the data any further, because all data processing, such as adding noise, reverberation, and speed perturbation, is done on the fly. However, to speed up audio decoding, you can manually convert the m4a files in VoxCeleb 2 to wav format, at the expense of using more storage.
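A minimal conversion sketch, assuming ffmpeg is available (VoxCeleb audio is 16 kHz mono, so the wav output keeps that format):

```bash
# Convert every m4a under the dataset folder to wav, keeping the original files
find ./data/voxceleb/ -name "*.m4a" | while read -r f; do
    ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.m4a}.wav"
done
```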
Finally, create a txt file that contains the list of audio files for training by
```bash
cd ./data/voxceleb/
find `pwd`/ -type f > vox_files.txt
```
The following datasets are required for data augmentation: the RIR dataset (RIRS_NOISES) and the MUSAN corpus.
For the RIR dataset, you must list all audio files under the folder RIRS_NOISES/simulated_rirs/ in a text file, e.g., data/rir.list, and set it as rir_path in the config.yaml file.
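Such a list can be generated with find, for example:

```bash
# List all simulated RIR wav files into data/rir.list
find RIRS_NOISES/simulated_rirs/ -type f -name "*.wav" > data/rir.list
```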
Likewise, you have to configure the following fields in the config file for noise augmentation:
```yaml
muse_speech: <musan_split/speech.list> # replace with your actual path
muse_speech_srn_high: 15.0
muse_speech_srn_low: 12.0
muse_music: <musan_split/music.list> # replace with your actual path
muse_music_srn_high: 15.0
muse_music_srn_low: 5.0
muse_noise: <musan_split/noise.list> # replace with your actual path
muse_noise_srn_high: 15.0
muse_noise_srn_low: 5.0
```
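The MUSAN lists can be built the same way; a minimal sketch, assuming the corpus has been split into per-category folders musan_split/{speech,music,noise} (this layout is an assumption, so adjust to your actual paths):

```bash
# Build one file list per MUSAN category
for cat in speech music noise; do
    find musan_split/$cat -type f -name "*.wav" > musan_split/$cat.list
done
```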
The test split of VoxCeleb 1 is used to measure speaker verification performance during training and after training completes. You will need to download the data and unzip it into a folder, e.g., ./data/voxceleb/test/.
Then download the text files that list the utterance pairs to compare, together with the true labels indicating whether the two utterances come from the same speaker. There are multiple trial lists; we will use veri_test2.txt.
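For example, the cleaned trial list can be fetched directly (the URL below is the published VoxCeleb metadata location; verify it against the official site before use):

```bash
# Download the cleaned VoxCeleb1 trial list into ./data/
wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt -P ./data/
```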
To train your model from scratch, first create a folder (workspace) by
```bash
cd egs
mkdir <your_example>
cd <your_example>
cp ../resnet/config.yaml .  # copy an example config to your workspace
```
Then change the config file accordingly to make sure all audio files can be correctly located (including the files used for data augmentation). You can also change the training and model hyper-parameters to suit your needs.
Finally, start your training by

```bash
python ../../train.py -c config.yaml -d gpu:0
```
First download the checkpoints for resnet or ecapa-tdnn:
checkpoint | size | EER |
---|---|---|
ResnetSE34 + SAP + CMSoftmax | 26MB | 0.93% |
ecapa-tdnn + AAMSoftmax | 80MB | 1.10% |
Then prepare the test dataset as described in Testing datasets, and set the following paths in the config file:
```yaml
mean_std_file: ../../data/stat.pd
test_list: ../../data/veri_test2.txt
test_folder: ../../data/voxceleb1/
```
To compute the EER using resnet, run:
```bash
cd egs/resnet/
python ../../test.py -w <checkpoint path> -c config.yaml -d gpu:0
```
which will result in an EER of 0.00931.
For ecapa-tdnn, run:
```bash
cd egs/ecapa-tdnn/
python ../../test.py -w <checkpoint path> -c config.yaml -d gpu:0
```
which gives an EER of 0.0105.
We compare our results with voxceleb_trainer.
The test list is veri_test2.txt, which can be downloaded from here: VoxCeleb1 (cleaned).
Results from voxceleb_trainer:

model | config | checkpoint | eval frames | EER |
---|---|---|---|---|
ResnetSE34 + ASP + softmaxproto | - | baseline_v2_ap | 400 | 1.06% |
ResnetSE34 + ASP + softmaxproto | - | baseline_v2_ap | all | 1.18% |
Our results:

model | config | checkpoint | eval frames | EER |
---|---|---|---|---|
ResnetSE34 + SAP + CMSoftmax | config.yaml | checkpoint | all | 0.93% |
ECAPA-TDNN + AAMSoftmax | config.yaml | checkpoint | all | 1.10% |
[1] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141.
[2] Desplanques B, Thienpondt J, Demuynck K. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143, 2020.