microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

Speaker verification result #46

Open pierfale opened 1 year ago

pierfale commented 1 year ago

Hello,

Thank you for your work on WavLM. I tried to reproduce the results, but I have run into some difficulties.

First of all, I don't understand exactly the difference between the scores reported in different places. For instance, on Vox1-O:

Moreover, I tried to reproduce the result from the fine-tuned checkpoint linked from this repository (https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view?usp=sharing).

I get the following results on Vox1-O:

  • Without normalisation, I get EER = 0.558%
  • With s-norm, I get EER = 0.542%
  • With as-norm (cohort size = 600), I get EER = 0.505%
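
For reference, this is roughly how I score the trial list and compute EER (a minimal sketch with placeholder names and file layout, not code from this repository):

```python
# Minimal sketch of cosine-similarity trial scoring and EER computation.
# `embeddings` maps utterance IDs to extracted speaker embeddings; the trial
# file is assumed to have one "label enroll test" triple per line.
import numpy as np
from sklearn.metrics import roc_curve

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trials(trial_path, embeddings):
    labels, scores = [], []
    with open(trial_path) as f:
        for line in f:
            label, enroll, test = line.split()
            labels.append(int(label))
            scores.append(cosine(embeddings[enroll], embeddings[test]))
    return np.array(labels), np.array(scores)

def compute_eer(labels, scores):
    # EER is the operating point where false-acceptance and false-rejection rates cross.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```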

Do you have any more details to provide?

Thank you

gozsoy commented 9 months ago

I can confirm that I also obtain an EER of 0.558% on Vox1-O with the fine-tuned WavLM Large checkpoint.

gancx commented 7 months ago

> Hello,
>
> Thank you for your work on WavLM. I tried to reproduce the results, but I have run into some difficulties.
>
> First of all, I don't understand exactly the difference between the scores reported in different places. For instance, on Vox1-O:
>
> Moreover, I tried to reproduce the result from the fine-tuned checkpoint linked from this repository (https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view?usp=sharing).
>
> I get the following results on Vox1-O:
>
> • Without normalisation, I get EER = 0.558%
> • With s-norm, I get EER = 0.542%
> • With as-norm (cohort size = 600), I get EER = 0.505%
>
> Do you have any more details to provide?
>
> Thank you

I also observe these differences. Have you figured out where they come from?

RegulusBai commented 6 months ago

Same here, 0.558%. Also waiting for a reply.

tcourat commented 2 months ago

I have the same question.

I did not test it myself, but according to the original WavLM paper:

> In the evaluation stage, the whole utterance is fed into the system to extract speaker embedding. We use cosine similarity to score the evaluation trial list. We also use the adaptive s-norm [59], [60] to normalize the trial scores. The imposter cohort is estimated from the VoxCeleb2 dev set by speaker-wise averaging all the extracted speaker embeddings. We set the imposter cohort size to 600 in our experiment. To further push the performance, we also introduce the quality-aware score calibration [58] for our best systems, where we randomly generate 30k trials based on the VoxCeleb2 test set to train the calibration model.
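
For anyone trying to match that setup, here is a minimal sketch of how I read the adaptive s-norm step (cohort of speaker-wise averaged VoxCeleb2 dev embeddings, top-600 selection); the function and variable names are my own, not from this repository:

```python
# Hedged sketch of adaptive s-norm as described in the quoted paragraph.
# cohort_embs: one averaged embedding per VoxCeleb2 dev speaker, shape (n_speakers, dim).
import numpy as np

def as_norm(raw_score, enroll_emb, test_emb, cohort_embs, top_k=600):
    """Adaptively normalise a single trial's cosine score."""
    def norm_stats(emb):
        # Cosine scores of this utterance against every cohort speaker.
        emb = emb / np.linalg.norm(emb)
        cohort = cohort_embs / np.linalg.norm(cohort_embs, axis=1, keepdims=True)
        scores = cohort @ emb
        # Adaptive part: keep only the top_k closest imposters.
        top = np.sort(scores)[-top_k:]
        return top.mean(), top.std()

    mu_e, sigma_e = norm_stats(enroll_emb)
    mu_t, sigma_t = norm_stats(test_emb)
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)
```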

Maybe the reported results use their calibration model, but that calibration model was not shared. Without the quality-aware score calibration, the EER on Vox1-O increases from 0.383% to 0.617%, which may explain the gap.
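
If someone wants to try the calibration step themselves, my understanding of [58] is roughly a logistic-regression calibrator over the raw score plus quality measures such as utterance durations. The sketch below is entirely an assumption on my part, since the actual calibration model, quality measures, and 30k-trial list were never released:

```python
# Very rough sketch of quality-aware score calibration in the spirit of [58]:
# a logistic-regression calibrator trained on (score, quality-feature) inputs
# from held-out calibration trials. All names and feature choices are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def quality_features(score, enroll_dur, test_dur):
    # Example quality measures: raw score plus log durations of both sides.
    return [score, np.log(enroll_dur), np.log(test_dur)]

def train_calibrator(X, y):
    # X: one quality_features(...) row per calibration trial; y: 0/1 target labels.
    return LogisticRegression().fit(X, y)

def calibrate(calibrator, X):
    # Use the class-1 log-odds as the calibrated score.
    return calibrator.decision_function(X)
```

Even with something like this, a home-grown calibrator trained on different trials would not necessarily reproduce the paper's 0.383%.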