pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Speakers with similar pitch are difficult to distinguish #1712

Open ChristianNSchmitz opened 1 month ago

ChristianNSchmitz commented 1 month ago

Tested versions

3.1

System information

Ubuntu 24.04, pyannote 3.1.1

Issue description

Dear pyannote team, we are using pyannote speaker segmentation 3.1.1 to distinguish speakers in dialogues of 2-3 people for further analysis. However, when speakers have a similar pitch (e.g. two men with deep voices), pyannote often confuses them. To the human ear the two speakers are easy to tell apart, so differences exist, even if the difference in pitch is only slight. Could you share any tips on settings or preprocessing that would improve the classification? Thanks a lot!

Minimal reproduction example (MRE)

Can be provided if necessary

metal3d commented 1 week ago

I have exactly the same problem. Three men are speaking and I force "num_speakers" to 3, but the model assigns a single voice about 90% of the time.

That means that I cannot, at this time, use pyannote for what I want to do 😢
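One way to put a number on the "one voice for 90% of the time" observation is to compute each label's share of the total speech time. The following is only a minimal sketch: `speaker_time_share` is a hypothetical helper, not part of pyannote.audio, and it assumes the diarization output has already been flattened to `(start, end, label)` tuples. The segments shown are made up for illustration.

```python
from collections import defaultdict


def speaker_time_share(segments):
    """Return each label's fraction of the total labelled speech time.

    `segments` is an iterable of (start_seconds, end_seconds, label) tuples.
    Hypothetical helper for illustration; not a pyannote.audio function.
    """
    totals = defaultdict(float)
    for start, end, label in segments:
        if end < start:
            raise ValueError(f"segment ends before it starts: {start} > {end}")
        totals[label] += end - start

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Overlapping speech is counted once per label, so shares are relative
    # to the summed per-label durations, not to wall-clock time.
    return {label: duration / grand_total for label, duration in totals.items()}


# Made-up segments: one label dominates, as described in the comment above.
example_segments = [
    (0.0, 9.0, "SPEAKER_00"),
    (9.0, 9.5, "SPEAKER_01"),
    (9.5, 10.0, "SPEAKER_02"),
]

shares = speaker_time_share(example_segments)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

With real pyannote output, the tuples could be built from `diarization.itertracks(yield_label=True)`, taking `start` and `end` from each yielded segment. A heavily skewed distribution on a recording where all three people speak at length would be a concrete symptom to attach to a bug report.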

hbredin commented 1 week ago

As with any machine learning approach, train/test domain mismatch is usually the culprit.

Fine-tuning the internal models and pipelines on data from your use case is usually the best solution.

Did you try alternative speaker diarization tools? Do they perform better? I’d love to have a look at those files where it does not work.