pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Wespeaker embeddings question #1590

Open picheny-nyu opened 6 months ago

picheny-nyu commented 6 months ago

If I wanted to use one of the larger wespeaker models - say 293 - would I just download the .pt file and point to it in the config.yaml?

hbredin commented 6 months ago

It is a tiny bit more complex than that.

See this script, which does most of the work.

If you work on this, it would be nice to share the converted models on Hugging Face, taking this repo as an example.
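
Once the converted checkpoint is on the Hub, something along these lines (an untested sketch, with a placeholder repo id) should let you sanity-check that it loads through pyannote's pretrained embedding wrapper:

```python
import torch
from pyannote.audio import Audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.core import Segment

# placeholder repo id -- replace with wherever the converted model ends up
embedding = PretrainedSpeakerEmbedding(
    "your-username/wespeaker-voxceleb-resnet293-LM",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# extract an embedding from the first 5 seconds of a file as a quick smoke test
audio = Audio(sample_rate=16000, mono="downmix")
waveform, sample_rate = audio.crop("audio.wav", Segment(0.0, 5.0))
vector = embedding(waveform[None])  # numpy array of shape (1, dimension)
print(vector.shape)
```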

picheny-nyu commented 6 months ago

I have never uploaded a model to Hugging Face before. Is there some way I can give it a similar name, like pyannote/wespeaker-voxceleb-resnet293-LM? If I understand correctly, the code first scans for the keyword "pyannote" in the model name, so another option, I assume, would be to call it "picheny/pyannote-wespeaker-voxceleb-resnet293-LM". Another concern is that these embeddings are not giving me any improvement on my task relative to the ResNet34 version. That could just be life, or I might have messed something up.

hbredin commented 6 months ago

It should be fine with picheny/wespeaker-voxceleb-resnet293-LM since it will end up using this branch of the code:

https://github.com/pyannote/pyannote-audio/blob/66dd72bb2b807aaf6d011c89678d85b51fb3b859/pyannote/audio/pipelines/speaker_verification.py#L764-L768

Also, in my speaker diarization experiments, larger models did not bring any significant (or consistent) improvement either. That is why I stuck with the ResNet34 version for pyannote/speaker-diarization-3.1.

Larger models might help for speaker verification, though.
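
For reference, here is a rough sketch of plugging such a repo id into the diarization pipeline (untested; the repo id is the one discussed above, and the hyperparameter values simply mirror pyannote/speaker-diarization-3.1 as a starting point, not tuned for ResNet293):

```python
import torch
from pyannote.audio.pipelines import SpeakerDiarization

# the "wespeaker" substring in the repo id routes it to the WeSpeaker loading branch linked above
pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    embedding="picheny/wespeaker-voxceleb-resnet293-LM",
    embedding_exclude_overlap=True,
    clustering="AgglomerativeClustering",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN",
)

# hyperparameters below were tuned for the ResNet34 embedding and should be
# re-optimized for a different embedding model (see the discussion below)
pipeline.instantiate({
    "segmentation": {"min_duration_off": 0.0},
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.70},
})
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

diarization = pipeline("audio.wav")
```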

picheny-nyu commented 6 months ago

OK I will try to put it out there then :-).

akmalmasud96 commented 1 month ago

@hbredin Thanks for the awesome work!

Do I need to change the clustering threshold when using wespeaker-voxceleb-resnet293-LM? If so, could you please share the threshold you used when experimenting with the ResNet293-LM?

hbredin commented 1 month ago

Yes, you would need to optimize thresholds for each version of the embedding network. However, I did not keep track of the optimized thresholds, sorry.

akmalmasud96 commented 1 month ago

@hbredin Thanks for the quick response. Can you please point me to some documentation or guidelines for setting the threshold value? And which dataset should I use for benchmarking?

hbredin commented 1 month ago

Grid search should be fine. For benchmarking, I guess you’d have to use data similar to the expected test/production data.
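
Something like this minimal grid search would do (a sketch only: `dev_files`, the threshold range, and the other hyperparameter values are assumptions, and it needs a small annotated dev set):

```python
import numpy as np
import torch
from pyannote.audio import Pipeline
from pyannote.metrics.diarization import DiarizationErrorRate

# starting pipeline (or the SpeakerDiarization object from the sketch above)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN"
)
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

# dev_files: list of {"audio": path, "annotation": reference pyannote.core.Annotation},
# built from whatever annotated data best matches your production conditions
dev_files = [...]

best_threshold, best_der = None, 1.0
for threshold in np.arange(0.50, 0.91, 0.05):
    pipeline.instantiate({
        "segmentation": {"min_duration_off": 0.0},
        "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": float(threshold)},
    })
    metric = DiarizationErrorRate()
    for file in dev_files:
        hypothesis = pipeline(file["audio"])
        metric(file["annotation"], hypothesis)
    der = abs(metric)  # aggregate DER over the dev set
    if der < best_der:
        best_threshold, best_der = float(threshold), der

print(f"best threshold = {best_threshold:.2f} (DER = {100 * best_der:.1f}%)")
```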

akmalmasud96 commented 1 month ago

@hbredin Currently, I don't have any annotated data. Can you tell me which dataset the existing setup was tuned on? The current configuration is working fine on my data.