Open pierfale opened 1 year ago
I can confirm that I obtained EER 0.558% for Vox1-O using WavLM large finetuned.
Hello,
Thank you for your work on WavLM. I am trying to reproduce the results but I am having some difficulties.
First of all, I don't understand exactly the differences between the scores reported in different places. For instance, on Vox1-O:
- In WavLM paper (https://arxiv.org/pdf/2110.13900.pdf) the EER is 0.383%.
- On the README of this repository (https://github.com/microsoft/UniSpeech#speaker-verification) the EER is 0.33%.
- On the README of the downstream tasks (https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) the EER is 0.431%.
Moreover, I tried to reproduce the results from the fine-tuned checkpoint available in this repository (https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view?usp=sharing).
I get the following results on Vox1-O (a sketch of the scoring I use is below):
- Without normalisation, I get EER = 0.558%
- With s-norm, I get EER = 0.542%
- With as-norm (cohort size = 600), I get EER = 0.505%
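
For reference, here is roughly the scoring I use: a minimal numpy sketch of cosine scoring followed by adaptive s-norm over a top-600 cohort, not the repository's exact implementation. `enroll`, `test` and `cohort` are assumed to be matrices of already-extracted embeddings, with the cohort built from speaker-wise averaged VoxCeleb2 dev embeddings.

```python
import numpy as np

def cosine_scores(enroll, test):
    """Cosine similarity between paired rows of two embedding matrices."""
    e = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    t = test / np.linalg.norm(test, axis=1, keepdims=True)
    return np.sum(e * t, axis=1)

def adaptive_snorm(raw_scores, enroll, test, cohort, topk=600):
    """Adaptive s-norm: normalise each trial score with the mean/std of the
    top-k most similar cohort entries on the enrolment and test sides."""
    e = enroll / np.linalg.norm(enroll, axis=1, keepdims=True)
    t = test / np.linalg.norm(test, axis=1, keepdims=True)
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)

    s_e = e @ c.T  # (n_trials, n_cohort) enrolment-vs-cohort scores
    s_t = t @ c.T  # (n_trials, n_cohort) test-vs-cohort scores

    # keep only the top-k closest cohort entries for each side of the trial
    s_e = np.sort(s_e, axis=1)[:, -topk:]
    s_t = np.sort(s_t, axis=1)[:, -topk:]

    z = (raw_scores - s_e.mean(axis=1)) / s_e.std(axis=1)
    s = (raw_scores - s_t.mean(axis=1)) / s_t.std(axis=1)
    return 0.5 * (z + s)
```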
Do you have any more details to provide?
Thank you
I also observed these differences. Have you figured out the cause?
Same here, 0.558%, and still waiting for a reply.
I have the same question.
I did not test this myself, but according to the original WavLM paper:
> In the evaluation stage, the whole utterance is fed into the system to extract speaker embedding. We use cosine similarity to score the evaluation trial list. We also use the adaptive snorm [59], [60] to normalize the trial scores. The imposter cohort is estimated from the VoxCeleb2 dev set by speakerwise averaging all the extracted speaker embeddings. We set the imposter cohort size to 600 in our experiment. To further push the performance, we also introduce the quality-aware score calibration [58] for our best systems, where we randomly generate 30k trials based on the VoxCeleb2 test set to train the calibration model.
Maybe the reported results use their calibration model, but this calibration model was not shared. Without this quality-aware score calibration, the EER on Vox1-O degrades from 0.383% to 0.617%, which may explain the gap.
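
In case it helps, here is a rough sketch of what such a quality-aware calibration could look like: a logistic-regression calibrator over the raw score plus duration-based quality measures. This is only my assumption based on the calibration reference cited in the paper, not the model behind the reported numbers; the feature choice and all function/variable names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def quality_features(scores, dur_enroll, dur_test):
    """Raw score plus simple duration-based quality measures
    (one plausible choice; the exact features in the paper may differ)."""
    return np.stack([scores,
                     np.log(np.minimum(dur_enroll, dur_test)),
                     np.log(np.maximum(dur_enroll, dur_test))], axis=1)

def train_calibration(cal_scores, cal_dur_enroll, cal_dur_test, cal_labels):
    """Fit a logistic-regression calibration model on labelled calibration
    trials (e.g. trials generated from the VoxCeleb2 test set)."""
    X = quality_features(cal_scores, cal_dur_enroll, cal_dur_test)
    return LogisticRegression().fit(X, cal_labels)

def apply_calibration(model, scores, dur_enroll, dur_test):
    """Return calibrated log-odds scores for the evaluation trials."""
    X = quality_features(scores, dur_enroll, dur_test)
    return model.decision_function(X)
```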