mycrazycracy / tf-kaldi-speaker

Neural speaker recognition/verification system based on Kaldi and Tensorflow
Apache License 2.0

Why do you use PLDA for AMSoftmax? #3

Closed i7p9h9 closed 5 years ago

i7p9h9 commented 5 years ago

Why do you use PLDA for AMSoftmax/ArcSoftmax/ASoftmax? Did you try using simple cosine similarity?

i7p9h9 commented 5 years ago

Sorry, I noticed that you compute cosine similarity in the code, but PLDA is computed as well. Which EER is reported in your table: PLDA scoring or cosine scoring?

mycrazycracy commented 5 years ago

Hi,

I use both cosine and PLDA in my experiments. I report the PLDA results in the performance table.

Generally, PLDA performs better on the SRE and VoxCeleb datasets, where you can get lots of in-domain data to train the PLDA model.

On VoxCeleb, you can get similar performance using cosine if a large-margin softmax is used. If vanilla softmax is used, cosine performs worse than PLDA. I also noticed that in some papers, cosine beats PLDA in some conditions (using large-margin softmax, of course).

On SRE, however, PLDA significantly outperforms cosine whether you use large-margin softmax or not. Also, in the NIST speaker recognition evaluations (SRE), the best systems used PLDA as the backend.

I suggest you try both cosine and PLDA scoring on your own dataset and figure out which one is better.
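For reference, cosine scoring of extracted embeddings is only a few lines. A minimal numpy sketch, assuming the x-vectors have already been extracted (the array names and dimensions here are illustrative, not from this repo):

```python
import numpy as np

def length_norm(x):
    """Project embeddings onto the unit sphere, a common step before scoring."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical enrollment/test x-vectors; in practice these come from the network.
enroll = length_norm(np.random.randn(5, 512))  # 5 enrollment speakers, 512-dim
test = length_norm(np.random.randn(3, 512))    # 3 test utterances

# After length normalization, the dot product is exactly the cosine similarity.
scores = enroll @ test.T                       # shape (5, 3): one score per trial
```

PLDA scoring additionally models within- and between-speaker variability, which is why it usually needs in-domain data to train well.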

Hope this helps.

i7p9h9 commented 5 years ago

Thank you for your answer. In my experiments I noticed that PLDA is indeed significantly better than cosine, which is an interesting result, because in face recognition tasks cosine works well. One more thing: Kaldi has a modified version of PLDA, and this version works better than the original.

What is your typical accuracy on the training set at the end of training? When I train VoxCeleb using Kaldi (and its natural gradient optimizer) I get about 96% accuracy on the training set, but when I use TensorFlow with Adam or SGD+Nesterov I get about 86-88% accuracy. Does this match your results?

mycrazycracy commented 5 years ago

Yes, it is very interesting to investigate why PLDA outperforms cosine in speaker verification. Actually, I think it is one of the research questions still to be answered.

As you can see in the code, I didn't monitor the accuracy on the training set. When you find that your accuracy on the training set is lower than Kaldi's, what about the final performance on the evaluation set? I think lots of factors could impact the training procedure, e.g. the way you choose your training examples and the learning rate decay.
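If someone does want to monitor training accuracy, a minimal TensorFlow 1.x sketch would look roughly like this; the tensor names `logits` and `labels` are placeholders standing in for whatever the classification head produces, not identifiers from this repo:

```python
import tensorflow as tf

# Stand-ins for the network's output and the ground-truth speaker labels.
logits = tf.placeholder(tf.float32, [None, 1000], name="logits")  # [batch, n_speakers]
labels = tf.placeholder(tf.int32, [None], name="labels")

predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))
tf.summary.scalar("train_accuracy", accuracy)  # shows up in TensorBoard
```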

i7p9h9 commented 5 years ago

"When you found your accuracy on the training set is lower than Kaldi": in different datasets - different results. Kaldi result more robust for noise, one can be 2 times better on SITW for instance and a little bit worse on librispeech or VCTK (actually not relevant comparison due to very low error for both system on this dataset).

Taking noise resistance into account, I tried increasing the SNR for augmentation to get a more robust system, but convergence was very slow (I tried Adam, AdamW, Adam with an SGDR scheduler, SGD with SGDR, one-cycle learning rates, weight averaging, and some more tricks). The Kaldi optimizer always showed better results. I cannot explain this Kaldi magic yet.
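As a point of comparison, the SGD+Nesterov setup being tried here can be expressed in TensorFlow 1.x roughly as follows; the loss is a dummy so the snippet runs standalone, and the hyperparameters are illustrative, not the repo's defaults:

```python
import tensorflow as tf

# Dummy loss so the snippet is self-contained; replace with the real training loss.
w = tf.get_variable("w", shape=[10], initializer=tf.random_normal_initializer())
loss = tf.reduce_mean(tf.square(w))

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    0.01, global_step, decay_steps=50000, decay_rate=0.5, staircase=True)

optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9,
                                       use_nesterov=True)
train_op = optimizer.minimize(loss, global_step=global_step)
```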

P.S.: I have some hope that adding clustering-based pooling (VLAD/GhostVLAD) to the TDNN architecture can make it possible to avoid PLDA.

mycrazycracy commented 5 years ago

You mean your system performs worse than Kaldi? For me, I observed better performance using my toolkit.

The main difference should be the optimizer, since Kaldi uses natural gradient. Do you use multiple GPUs? If so, note that Kaldi uses model averaging rather than data and gradient averaging.
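A toy numpy illustration of the structural difference (this ignores Kaldi's natural gradient and its averaging weights; it only contrasts per-step gradient averaging with periodic model averaging on a made-up quadratic objective):

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])

def noisy_grad(w):
    # Gradient of ||w - target||^2 plus noise, standing in for minibatch gradients.
    return 2.0 * (w - target) + 0.1 * rng.standard_normal(3)

def gradient_averaging(steps=200, lr=0.05, n_gpus=2):
    # Synchronous data parallelism: all GPUs share one model, gradients are
    # averaged at every step.
    w = np.zeros(3)
    for _ in range(steps):
        w -= lr * np.mean([noisy_grad(w) for _ in range(n_gpus)], axis=0)
    return w

def model_averaging(steps=200, lr=0.05, n_gpus=2, sync_every=20):
    # Kaldi-style: each GPU trains its own copy of the model; the copies are
    # averaged only periodically.
    models = [np.zeros(3) for _ in range(n_gpus)]
    for t in range(steps):
        models = [w - lr * noisy_grad(w) for w in models]
        if (t + 1) % sync_every == 0:
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in range(n_gpus)]
    return np.mean(models, axis=0)

print(gradient_averaging(), model_averaging())
```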

i7p9h9 commented 5 years ago

Yes, I trained on 2 GPUs. "You mean your system performs worse than Kaldi?" That's correct, but I am comparing against my own Kaldi training, not the out-of-the-box Kaldi recipe. On the SRE10 core-core test we get 0.67% EER, but our training set was larger than yours.