Due to copyright issues, I can only publish the single-GPU version, which was developed before Jan. 2019. Some implementations could also be improved, e.g., the GPU memory allocation. The library can still be used as a framework for speaker verification, and multi-GPU support and other approaches could be added with little effort.
Note: When you extract speaker embeddings using extract.sh, make sure your TensorFlow is compiled WITHOUT MKL. As far as I know, some versions of TF installed via Anaconda are compiled with MKL, which uses multiple threads when TF runs on CPUs. This is harmful if you run many processes at once (say 40): the thread contention makes the extraction extremely slow. In my case, TF 1.12 installed via pip works fine.
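If rebuilding or reinstalling TF is not an option, a common workaround is to pin the MKL/OpenMP thread counts before launching the extraction jobs. This is only a sketch; whether it helps depends on how your TF build was compiled:

```shell
# Limit MKL/OpenMP to one thread per process so that many parallel
# extraction jobs do not fight over cores.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
sh extract.sh ...   # your usual extraction command
```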
tf-kaldi-speaker implements a neural-network-based speaker verification system using Kaldi and TensorFlow.
The main idea is that Kaldi is used for the pre- and post-processing, while TF is a better choice for building the neural network. Compared with Kaldi nnet3, modifying the network (e.g., adding attention or using different loss functions) costs less effort in TF. Adding other features to support text-dependent speaker verification is also possible.
The purpose of the project is to make research on neural-network-based speaker verification easier. I also try to reproduce some results from my papers.
Python: 2.7 (Updating to 3.6/3.7 should be easy.)
Kaldi: >5.5
Since Kaldi is only used for the pre- and post-processing, most versions >5.2 work. Though I'm not 100% sure, I believe any Kaldi with x-vector support (e.g., egs/sre16/v2) is enough. If you want to run egs/voxceleb, make sure your Kaldi also contains these examples.
Tensorflow: >1.4.0
I wrote the code with TF 1.4.0 at the very beginning and later updated to v1.12.0. Future versions will support TF >1.12, but I will try to keep the API compatible with lower versions. Due to API changes (e.g., keep_dims was renamed to keepdims in some functions), you may see errors about unexpected keyword arguments. In that case, simply checking the parameter names should fix the problem.
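If you need one codebase to run on both sides of the rename, a small dispatch helper is one way to cope. This is a hypothetical sketch, not part of the library; `call_with_keepdims` is an illustrative name:

```python
def call_with_keepdims(fn, x, axis=None, keepdims=False):
    """Call a reduction function (e.g. tf.reduce_sum) with whichever
    keep-dims keyword the installed version accepts.

    Tries the new 'keepdims' name first (TF >= 1.5) and falls back to
    the old 'keep_dims' name (TF < 1.5) on a TypeError.
    """
    try:
        return fn(x, axis=axis, keepdims=keepdims)
    except TypeError:
        return fn(x, axis=axis, keep_dims=keepdims)
```

For example, `call_with_keepdims(tf.reduce_sum, x, axis=1, keepdims=True)` would work under both TF 1.4 and TF 1.12.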
The general pipeline of our framework is:
For training:
For test:
Evaluate the performance:
In our framework, the speaker embedding can be trained and extracted using different network architectures. Again, the backend classifier is integrated using Kaldi.
run.sh
to go through the code.

Performance
I've tested the code on three datasets, and the results are better than those of the standard Kaldi recipes. (Of course, you can achieve better performance with Kaldi by carefully tuning the parameters.)
See RESULTS for details.
Speed
Since the code only supports a single GPU, training is not very fast, but it is acceptable on medium-scale datasets. For VoxCeleb, training takes about 2.5 days on an Nvidia P100, and about 4 days for SRE.
VoxCeleb
Training data: VoxCeleb1 dev set and VoxCeleb2
Google Drive and
BaiduYunDisk (extraction code: xwu6)
NIST SRE
Training data: NIST SRE04-08, SWBD
Only the models trained with large margin softmax are released at this moment.
Google Drive and
BaiduYunDisk (extraction code: rt9p)
Advantages
Disadvantages
In this code, I provide two methods to tune the learning rate when SGD is used: using a validation set, or using a fixed schedule file. The first method works well, but it may take longer to train the network.
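The validation-based method can be sketched as follows. This is a minimal illustration of the idea, not the library's actual implementation; the function name, the decay factor, and the stopping criterion are all illustrative assumptions:

```python
def update_learning_rate(lr, valid_losses, factor=0.5, tolerance=0.0):
    """Decay the learning rate when the validation loss stops improving.

    lr            -- current learning rate
    valid_losses  -- validation losses after each epoch, oldest first
    factor        -- multiplicative decay applied on a plateau
    tolerance     -- minimum improvement required to keep the current lr
    """
    if len(valid_losses) >= 2 and valid_losses[-1] > valid_losses[-2] - tolerance:
        # No sufficient improvement on the validation set: decay the lr.
        return lr * factor
    return lr
```

Training would then stop once the learning rate falls below some threshold (e.g. 1e-6), which is why this schedule can take longer than a fixed one: it keeps training as long as the validation loss improves.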
More complicated network architectures can be implemented in a similar way to the TDNN in model/tdnn.py. Deeper networks are worth trying since we have enough training data, and they may result in better performance.
Apache License, Version 2.0 (Refer to LICENCE)
The computational resources were initially provided by Prof. Mark Gales at the Cambridge University Engineering Department (CUED). After my visit to Cambridge, the resources have mainly been provided by Dr. Liang He at the Department of Electronic Engineering, Tsinghua University (THUEE).
Unfortunately, the code was developed under Windows, so the executable permission of the shell scripts is not preserved. After downloading the code, simply run:
find ./ -name "*.sh" | awk '{print "chmod +x "$1}' | sh
to make the .sh files executable.
For a cluster setup, please refer to the Kaldi documentation. In my case, the program runs locally; modify cmd.sh and path.sh according to the standard Kaldi setup.
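For a purely local run, the cmd.sh settings follow the usual Kaldi convention. This is a sketch of a minimal local configuration, not a file shipped with this repo; the variable names are the standard Kaldi ones:

```shell
# cmd.sh -- run all jobs on the local machine (standard Kaldi convention)
export train_cmd="run.pl"
export decode_cmd="run.pl"

# For a grid setup, switch to queue.pl with your queue's options, e.g.:
# export train_cmd="queue.pl --mem 4G"
```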
Contact:
Website: http://yiliu.org.cn
E-mail: liu-yi15 (at) tsinghua {dot} org {dot}cn
@inproceedings{liu2019speaker,
  author = {Yi Liu and Liang He and Jia Liu},
  title = {Large Margin Softmax Loss for Speaker Verification},
  booktitle = {Proc. INTERSPEECH},
  year = {2019}
}