Hi @aliencaocao, I am sorry for the late reply. I was interested, but running the evaluation took a long time.
I was testing our M2D variant for speech (M2D-S), for which we trained M2D on LibriSpeech (https://arxiv.org/abs/2305.14079). The Speaker Verification result (EER%) with M2D-S was 5.74% on the SUPERB benchmark; it turns out that ours is better than wav2vec 2.0, while WavLM and HuBERT are better than ours.
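(For readers less familiar with the metric, here is a minimal sketch of how EER is typically computed from per-trial verification scores. The labels and scores below are made-up illustrative values, and scikit-learn's `roc_curve` is assumed to be available; this is not the exact SUPERB evaluation code.)

```python
import numpy as np
from sklearn.metrics import roc_curve

# labels: 1 = same-speaker trial, 0 = different-speaker trial (illustrative values)
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
# scores: similarity score the model assigns to each trial pair (illustrative values)
scores = np.array([0.82, 0.31, 0.75, 0.64, 0.48, 0.22, 0.91, 0.55])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
# EER is the operating point where false acceptance (fpr) equals false rejection (fnr)
idx = np.nanargmin(np.abs(fpr - fnr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer * 100:.2f}%")  # reported in %, e.g. 5.74% (M2D-S) vs 0.66% (TitaNet-Large)
```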
However, I do not know how to compare it with the Speaker Verification (EER%) result in your link, https://huggingface.co/nvidia/speakerverification_en_titanet_large:
| Version | Model | Model Size | VoxCeleb1 EER % (cleaned trial file) |
|---|---|---|---|
| 1.10.0 | TitaNet-Large | 23M | 0.66 |
Hi, thanks so much for this. I have checked the paper, and they report the EER in %, so it actually got 0.66% EER on VoxCeleb1, which is far lower than the results you listed. I think the test set is different, as Nvidia used a 'cleaned trial file'. Paper: https://arxiv.org/abs/2110.04410
@aliencaocao Thanks for the info. I was trying to understand the difference, but I could find little information about the cleaned trial file, even in the VoxCeleb paper (https://www.sciencedirect.com/science/article/pii/S0885230819302712).
So I guess the biggest difference would be the evaluation protocol. TitaNet would have been fine-tuned on various speech corpora, while SUPERB evaluates embeddings from frozen models (and these models are mostly SSL models).
Ours are SSL models, so we follow the SUPERB protocol.
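For illustration only, a rough sketch of the frozen-upstream scoring idea (the `encoder` here is a placeholder for any frozen SSL model; SUPERB's actual ASV recipe trains a lightweight downstream model on the frozen features, so this simple cosine-scoring version is only an approximation):

```python
import torch
import torch.nn.functional as F

def utterance_embedding(encoder: torch.nn.Module, waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool frame-level features from a frozen encoder into one utterance vector."""
    encoder.eval()
    with torch.no_grad():            # frozen upstream: no gradient updates to the SSL model
        frames = encoder(waveform)   # assumed output shape: (time, feature_dim)
    return frames.mean(dim=0)

def trial_score(encoder: torch.nn.Module, wav_enroll: torch.Tensor, wav_test: torch.Tensor) -> float:
    """Cosine similarity between two utterance embeddings; EER is then computed over all trial pairs."""
    e1 = utterance_embedding(encoder, wav_enroll)
    e2 = utterance_embedding(encoder, wav_test)
    return F.cosine_similarity(e1, e2, dim=0).item()
```

In contrast, a system like TitaNet is trained end-to-end for speaker verification, which is a large part of why the numbers are not directly comparable.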
Thanks for raising this issue. We summarized the ASV results with our models in the recently submitted paper. Closing this issue for now. Please feel free to re-open.