microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

How to use the WavLM model to extract speaker embeddings for the speaker verification task? #802

Open fatemeshiravand opened 1 year ago

fatemeshiravand commented 1 year ago

Hi, I want to use the WavLM model to extract speaker embeddings for a speaker verification task. The paper mentions that for speaker verification, a weighted sum of the representations from the transformer layers should be used. I've tried both the mean of all layers' representations and the last layer's representation as the speaker embedding, but I haven't gotten a reasonable cosine similarity between two different embeddings belonging to the same speaker. Could you provide the learned weights of the transformer layers so that I can extract robust speaker embeddings from the model?
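For readers unfamiliar with the idea, here is a minimal sketch of what a weighted sum over layer representations looks like. It assumes the weights are softmax-normalized, one scalar per layer, which is a common setup but is not confirmed in this thread. The random array is a hypothetical stand-in for real model outputs, and `weighted_layer_sum` and `mean_pool` are illustrative names, not part of the WavLM code.

```python
import numpy as np

def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer features using softmax-normalized layer weights.

    hidden_states: array of shape (num_layers, num_frames, dim)
    layer_logits:  array of shape (num_layers,), one learnable scalar per layer
    """
    logits = np.asarray(layer_logits, dtype=np.float64)
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    # Contract the layer axis; the result has shape (num_frames, dim).
    return np.tensordot(weights, np.asarray(hidden_states, dtype=np.float64), axes=1)

def mean_pool(frames):
    """Average over time to get one utterance-level vector (a simple assumption)."""
    return frames.mean(axis=0)

# Hypothetical stand-in for model outputs: 4 layers, 50 frames, 8 dimensions.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 50, 8))

# All-zero logits give uniform weights, i.e. the plain mean over layers.
uniform_logits = np.zeros(4)
combined = weighted_layer_sum(hidden_states, uniform_logits)
embedding = mean_pool(combined)
```

With uniform weights this reduces to the mean of all layers described above. In a setup like this the weights are typically trained together with a downstream speaker model, so they are only meaningful alongside that model.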

Sanyuan-Chen commented 1 year ago

Hi @fatemeshiravand, we have released the pre-trained speaker verification models here. Please refer to the README instructions and scripts for the speaker representation extraction.

fatemeshiravand commented 1 year ago

Thank you for your reply, @Sanyuan-Chen. I've read the repo you pointed me to and used the WavLM Large model to compare audio clips from the same speaker and audio clips from two different speakers. In both cases I got a cosine similarity near 1 (about 0.99), and I'm not sure whether the model just doesn't work or I'm doing something wrong.
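As a sanity check on the scoring step, the sketch below computes cosine similarity and shows one generic way scores can saturate near 1: when every embedding shares a large common component. This is only an illustration on synthetic vectors, not a diagnosis of the result reported above. `shared_offset` and the dimensions are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 256

# Hypothetical component common to every embedding, much larger than
# the part that differs between the two vectors.
shared_offset = np.full(dim, 10.0)
emb_a = shared_offset + rng.normal(size=dim)
emb_b = shared_offset + rng.normal(size=dim)

# Dominated by the shared component, so the score is close to 1.
raw_score = cosine_similarity(emb_a, emb_b)

# After removing the shared component, the two independent random
# vectors are nearly orthogonal.
centered_score = cosine_similarity(emb_a - shared_offset, emb_b - shared_offset)

print(f"raw: {raw_score:.3f}, centered: {centered_score:.3f}")
```

If scores for same-speaker and different-speaker pairs are both near 1, comparing them again after subtracting the mean embedding of a held-out set is one quick check of whether a shared component is involved.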