seongmin-kye / meta-SR

PyTorch implementation of Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs (Interspeech, 2020)

Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs

PyTorch code for the paper named above, presented at Interspeech 2020.

Abstract

In practical settings, a speaker recognition system needs to identify a speaker given a short utterance, while the enrollment utterance may be relatively long. However, existing speaker recognition models perform poorly with such short utterances. To solve this problem, we introduce a meta-learning framework for imbalance length pairs. Specifically, we use a Prototypical Network and train it with a support set of long utterances and a query set of short utterances of varying lengths. Further, since optimizing only for the classes in the given episode may be insufficient for learning discriminative embeddings for unseen classes, we additionally force the model to classify both the support and the query set against the entire set of classes in the training set. By combining these two learning schemes, our model outperforms existing state-of-the-art speaker verification models trained with a standard supervised learning framework on short utterances (1-2 seconds) from the VoxCeleb datasets. We also validate our proposed model on unseen speaker identification, where it also achieves significant performance gains over existing approaches.
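The two learning schemes described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code in this repository: the cosine scoring, the `scale` factor, the unweighted sum of the two losses, and all function and variable names are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def prototypes(support):
    """Average the support embeddings of each episode class into one prototype."""
    protos = {}
    for cls, embs in support.items():
        dim = len(embs[0])
        protos[cls] = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return protos

def episodic_loss(support, queries, scale=10.0):
    """Mean cross-entropy of the queries against the episode's prototypes."""
    protos = prototypes(support)
    classes = sorted(protos)
    total = 0.0
    for emb, cls in queries:
        logits = [scale * cosine(emb, protos[c]) for c in classes]
        total += cross_entropy(logits, classes.index(cls))
    return total / len(queries)

def global_loss(samples, class_weights, scale=10.0):
    """Mean cross-entropy against all training classes (one weight vector each)."""
    total = 0.0
    for emb, cls in samples:
        logits = [scale * cosine(emb, w) for w in class_weights]
        total += cross_entropy(logits, cls)
    return total / len(samples)

# Toy 2-way episode. In the paper's setting the support embeddings would come
# from long utterances and the query embeddings from short ones.
support = {0: [[1.0, 0.0], [0.9, 0.1]], 1: [[0.0, 1.0], [0.1, 0.9]]}
queries = [([0.8, 0.2], 0), ([0.2, 0.8], 1)]

# Hypothetical weights for 3 training speakers; the episode's classes 0 and 1
# are assumed to be global speakers 0 and 2.
class_weights = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
episode_to_global = {0: 0, 1: 2}

all_samples = [(e, episode_to_global[c]) for c, embs in support.items() for e in embs]
all_samples += [(e, episode_to_global[c]) for e, c in queries]

# Assumed combination: a plain sum of the episodic and global terms.
loss = episodic_loss(support, queries) + global_loss(all_samples, class_weights)
print(loss)
```

The episodic term only separates the speakers present in the episode, while the global term scores both support and query embeddings against every training speaker.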

Requirements

Data preparation

The following script downloads and prepares the VoxCeleb dataset for training. The preparation code is based on VoxCeleb_trainer, with minor changes.

python dataprep.py --save_path /root/home/voxceleb --download --user USERNAME --password PASSWORD 
python dataprep.py --save_path /root/home/voxceleb --extract
python dataprep.py --save_path /root/home/voxceleb --convert

In addition to the Python dependencies, wget and ffmpeg must be installed on the system.
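Before starting the download, it may help to confirm that both external tools are on the PATH. The helper below is a small sketch that is not part of this repository; the function name `missing_tools` is made up for the example.

```python
import shutil

def missing_tools(tools=("wget", "ffmpeg")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Missing required tools: " + ", ".join(missing))
else:
    print("wget and ffmpeg found.")
```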

Feature extraction

In configure.py, set the path to the dataset directory. For example, in meta-SR/configure.py line 2:

save_path = '/root/home/voxceleb'

Then extract the acoustic features (40-dimensional mel filterbank).

python feat_extract/feature_extraction.py
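For readers unfamiliar with the feature, the sketch below shows how a 40-band log mel filterbank can be computed for a single frame. It does not reproduce feat_extract/feature_extraction.py: the FFT size of 512, the 16 kHz sample rate, the triangular filter shape, and all function names are assumptions made for the example.

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000):
    """Return n_mels triangular filters, each a list of n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 points equally spaced on the mel scale (filter edges and centres).
    points_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    filters = []
    for m in range(1, n_mels + 1):
        left, centre, right = points_hz[m - 1], points_hz[m], points_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            # Triangle: rises from left to centre, falls to right, zero outside.
            row.append(max(0.0, min(rising, falling)))
        filters.append(row)
    return filters

def log_mel(power_spectrum, filters, floor=1e-10):
    """Apply the filterbank to one frame's power spectrum and take the log."""
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_spectrum))))
        for row in filters
    ]

filters = mel_filterbank()
frame = [1.0] * 257          # flat power spectrum, used only as a placeholder
features = log_mel(frame, filters)
print(len(features))         # one value per mel band
```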

Training examples

Evaluation

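To evaluate the k-th checkpoint in the n-th folder, pass n and k through the --n_folder and --cp_num options, as in the pretrained-model command below.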

Pretrained model

A pretrained model can be downloaded from here. Place it in meta-SR/saved_model/baseline_000 and run the following script to obtain an EER of 2.08.

python EER_full.py --n_folder 0 --cp_num 100 --data_type vox2
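The equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from trial scores. It is not taken from EER_full.py: the function name, the threshold sweep, and the handling of tied scores are assumptions made for the example.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from similarity scores (higher = more likely same speaker).

    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    # Visit trials from the highest score down; each step lowers the
    # acceptance threshold just enough to accept one more trial.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    accepted_targets = 0
    accepted_nontargets = 0
    # Threshold above every score: nothing accepted, so FAR = 0 and FRR = 1.
    best_gap, eer = 1.0, 0.5
    for i in order:
        if labels[i] == 1:
            accepted_targets += 1
        else:
            accepted_nontargets += 1
        far = accepted_nontargets / n_nontarget      # false acceptance rate
        frr = 1.0 - accepted_targets / n_target      # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy trials: two target and two non-target pairs.
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(equal_error_rate([0.9, 0.6, 0.7, 0.1], [1, 1, 0, 0]))
```

With only a handful of trials the estimate is coarse; on a full trial list the sweep approaches the usual interpolated value.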

Citation

Please cite the following if you make use of the code.

@inproceedings{kye2020meta,
  title={Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs},
  author={Kye, Seong Min and Jung, Youngmoon and Lee, Hae Beom and Hwang, Sung Ju and Kim, Hoirin},
  booktitle={Interspeech},
  year={2020}
}

Acknowledgments

This code is based on the implementations of SR_tutorial and VoxCeleb_trainer. I would like to thank Youngmoon Jung, Joon Son Chung and Sung Ju Hwang for helpful discussions.