nttcslab / m2d

Masked Modeling Duo: Towards a Universal Audio Pre-training Framework
https://ieeexplore.ieee.org/document/10502167

Any idea on the EER performance of this model in SV task? #3

Closed aliencaocao closed 6 months ago

aliencaocao commented 1 year ago

Trying to compare it with https://huggingface.co/nvidia/speakerverification_en_titanet_large

daisukelab commented 1 year ago

Hi @aliencaocao, sorry for the late reply. I was interested, but running the evaluation took a long time.

I tested our M2D variant for speech (M2D-S), in which we trained M2D on LibriSpeech (https://arxiv.org/abs/2305.14079). Its Speaker Verification result (EER%) on the SUPERB benchmark was 5.74%; ours is better than wav2vec 2.0, while WavLM and HuBERT are better than ours.

[Image: SUPERB speaker verification (EER%) comparison table]

However, I am not sure how to compare that with the Speaker Verification (EER%) result in your link, https://huggingface.co/nvidia/speakerverification_en_titanet_large:

| Version | Model | Model Size | VoxCeleb1 (Cleaned trial file) |
|---|---|---|---|
| 1.10.0 | TitaNet-Large | 23M | 0.66 |

aliencaocao commented 1 year ago

Hi, thanks so much for this. I checked the paper: they report EER in %, so TitaNet actually got 0.66% EER on VoxCeleb1, which is far smaller than the results you listed. I think the test set is different, since Nvidia used a 'cleaned trial file'. Paper: https://arxiv.org/abs/2110.04410

[Image: EER results table from the TitaNet paper]
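For clarity on the metric being compared here: EER is the operating point where the false-accept rate equals the false-reject rate over a set of verification trials. A minimal sketch of how it can be computed from trial scores (not the evaluation code used by SUPERB or NeMo, just an illustration):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate: point where false-accept rate == false-reject rate.

    scores: similarity score per trial (higher = more likely same speaker)
    labels: 1 for target (same-speaker) trials, 0 for non-target trials
    """
    order = np.argsort(scores)[::-1]          # sort trials by descending score
    labels = np.asarray(labels)[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # Sweep the threshold down the sorted scores, tracking both error rates.
    far = np.cumsum(1 - labels) / n_nontarget  # non-targets wrongly accepted
    frr = 1 - np.cumsum(labels) / n_target     # targets wrongly rejected
    idx = np.argmin(np.abs(far - frr))         # closest crossing point
    return (far[idx] + frr[idx]) / 2
```

With perfectly separated scores the EER is 0; interleaved scores push it up toward 0.5 (chance level for balanced trials).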

daisukelab commented 1 year ago

@aliencaocao Thanks for the info. I was trying to understand the difference, but I found little information about the cleaned trial file, even in the VoxCeleb paper (https://www.sciencedirect.com/science/article/pii/S0885230819302712).

So I guess the biggest difference is the evaluation protocol: TitaNet is fine-tuned using various speech corpora, while SUPERB evaluates embeddings from frozen models (which are mostly SSL models).

Our models are SSL models, so we follow SUPERB.
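To illustrate the frozen-embedding protocol discussed above: under SUPERB-style evaluation, the pre-trained encoder is not updated, and a trial is scored by comparing the embeddings it produces, e.g. with cosine similarity. A minimal sketch (the `frozen_encoder` call stands in for any frozen upstream model and is hypothetical):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trial(frozen_encoder, wav_a, wav_b):
    """Score one verification trial using embeddings from a frozen model.

    frozen_encoder is a hypothetical callable mapping a waveform to an
    embedding vector; its weights are never updated during evaluation.
    """
    return cosine_score(frozen_encoder(wav_a), frozen_encoder(wav_b))
```

By contrast, an end-to-end system like TitaNet trains the encoder itself on the verification objective, so the two EER numbers reflect different amounts of task-specific training, not just different models.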

daisukelab commented 6 months ago

Thanks for raising this issue. We summarized the ASV results with our models in a recently submitted paper. Closing this issue for now; please feel free to re-open it.