pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Does speaker embedding training require a different dataset than speaker change detection? #466

Closed · kan-cloud closed this issue 3 years ago

kan-cloud commented 3 years ago

In the tutorial, the AMI dataset is used to train speech activity detection and speaker change detection, while the VoxCeleb dataset is used to train speaker embedding. Does the speaker embedding model necessarily require a different dataset than speech activity and speaker change detection?

I have trained all parts of my diarization pipeline (SAD, SCD, embedding) on the same dataset (split into train, development, and test subsets) and I am getting very poor results. I was wondering whether this is because I do not use a separate dataset for speaker embedding.
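For reference, a minimal sketch of how such a train/development/test split is exposed through pyannote.database, assuming a custom protocol has been configured as in the tutorials (the protocol name below is a hypothetical placeholder for your own dataset):

```python
# Sanity check: confirm the custom protocol is wired up and see how many
# files each subset contains. The protocol name is a hypothetical
# placeholder, not an actual pyannote protocol.
from pyannote.database import get_protocol

protocol = get_protocol('MyDatabase.SpeakerDiarization.MyProtocol')

# Each subset yields file dicts with 'uri', 'annotation' (reference
# speaker labels) and 'annotated' (regions to evaluate).
for subset in (protocol.train, protocol.development, protocol.test):
    print(subset.__name__, len(list(subset())))
```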

mogwai commented 3 years ago

As long as you have labelled waveforms of non-overlapping speakers, you could use the AMI dataset for speaker embedding too. However, it contains a limited number of speakers, so a model trained from scratch on such a dataset isn't going to be very good. It's generally better to train speaker embeddings on another dataset like VoxCeleb.
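The limitation is easy to quantify, since embedding quality is largely bounded by the number of distinct speakers seen during training. A minimal sketch, assuming the AMI protocol from the tutorials (`AMI.SpeakerDiarization.MixHeadset`) is configured in pyannote.database:

```python
# Count unique speakers in the AMI training subset; this is the quantity
# that limits a speaker embedding trained from scratch on AMI alone.
from pyannote.database import get_protocol

protocol = get_protocol('AMI.SpeakerDiarization.MixHeadset')

speakers = set()
for current_file in protocol.train():
    # 'annotation' is a pyannote.core.Annotation; labels() lists its speakers
    speakers.update(current_file['annotation'].labels())

print(f'{len(speakers)} unique training speakers')
```

AMI yields on the order of a few hundred speakers at most, whereas VoxCeleb provides thousands, which is why embeddings are usually trained on the latter.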

kan-cloud commented 3 years ago

> As long as you have labelled waveforms of non-overlapping speakers, you could use the AMI dataset for speaker embedding too. However, it contains a limited number of speakers, so a model trained from scratch on such a dataset isn't going to be very good. It's generally better to train speaker embeddings on another dataset like VoxCeleb.

Oh I see why it was trained on voxceleb now. Thank you for the prompt reply!