pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

How to set hyperparameters for speaker diarization pipeline? #1579

Open sunraymoonbeam opened 7 months ago

sunraymoonbeam commented 7 months ago

I am currently working on a speaker diarization task for classroom discussions without labeled data. To assess the pipeline's performance, I rely on two methods: listening manually and using intrinsic measures such as clustering metrics. The pipeline, when used out of the box, doesn't perform well. Some segments contain background noise, while others are very short (e.g., 0.1 seconds).

I want to improve the diarization pipeline's performance by tweaking hyperparameters. I know about the hyperparameters for segmentation (threshold, min_duration_off, and min_duration_on) and clustering (method, min_cluster_size, and threshold). However, I'm having trouble instantiating hyperparameters for the segmentation model.

Here's my attempt:

```python
from pyannote.audio import Pipeline

pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=True
)
default_hyperparameters = pretrained_pipeline.parameters(instantiated=True)
for param, value in default_hyperparameters.items():
    print(f"{param}: {value}")
```

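In case it helps others reading this, here is a hedged sketch of the other direction: `Pipeline.instantiate` accepts a nested dict mirroring the structure that `parameters()` prints. The values below are illustrative placeholders, not tuned recommendations, and the pipeline calls themselves are shown as comments since they require the downloaded model:

```python
# Illustrative override dict; keys mirror the hyperparameters listed above
# for pyannote/speaker-diarization-3.1. Values are placeholders, not tuned.
params = {
    "segmentation": {"min_duration_off": 0.0},
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7,
    },
}

# Applying them (requires the pretrained pipeline, so commented out here):
# pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=True)
# pipeline.instantiate(params)
```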
(screenshot: printed pipeline parameters)

The output only shows one tunable hyperparameter for segmentation (min_duration_off). After some investigation, I discovered that using pyannote/segmentation-3.0 as the segmentation model results in only min_duration_off being exposed. However, when using the pyannote.audio.pipelines.SpeakerDiarization pipeline with the default segmentation model pyannote/segmentation@2022.07, the threshold activation parameter is available.

(screenshot: parameters printed for the pyannote/segmentation@2022.07 pipeline)

I'm curious about the difference between these segmentation models. Additionally, I noticed that the VAD pipeline has min_duration_on, but the speaker diarization pipeline does not (I would like to use it to remove those short speech segments). Initially, I performed each task separately (VAD -> Embedding of speech segments -> Clustering) instead of using the pipeline. My understanding is that the VAD pipeline doesn't account for speaker changes and only detects regions of speech, which is why I switched back to the pipeline for easy inference and testing.

### Questions

  1. What is the difference between the segmentation models and why do they have different settable hyper-parameters?
  2. Is it possible to set "min_duration_on" for the speaker diarization pipeline?
  3. What is the difference between the VAD pipeline and how the segmentation model works in the speaker diarization pipeline? My understanding is that the VAD pipeline takes the maximum over the speaker axis for each frame, while the segmentation model performs speaker change detection by taking the absolute value of the first derivative over the time axis and then the maximum over the speaker axis.

Can you provide insights into these issues and advise me on how to proceed with tuning the hyperparameters to improve performance?

Warm regards, Zack

hbredin commented 7 months ago
  • What is the difference between the segmentation models and why do they have different settable hyper-parameters?

pyannote/segmentation is described in this paper. pyannote/segmentation-3.0 is described in that paper.

This should help you understand why the latter does not need onset and offset thresholds.

  • Is it possible to set "min_duration_on" for the speaker diarization pipeline?

Not out of the box. You would have to post-process the output yourself. Pipelines output pyannote.core.Annotation instances, so that API might help you do just that.
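For instance, a minimal post-processing sketch in pure Python, operating on (start, end, label) tuples such as you might collect by iterating over the pipeline's Annotation output (the helper name and default threshold are made up for illustration):

```python
def drop_short_segments(segments, min_duration_on=0.5):
    """Keep only segments at least `min_duration_on` seconds long.

    `segments` is a list of (start, end, label) tuples, e.g. gathered
    from iterating over a diarization pipeline's Annotation output.
    """
    return [(s, e, lbl) for (s, e, lbl) in segments if (e - s) >= min_duration_on]

segments = [(0.0, 0.08, "A"), (1.0, 3.2, "B"), (3.5, 3.55, "A"), (4.0, 6.0, "A")]
print(drop_short_segments(segments))  # keeps only the two long segments
```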

  • What is the difference between VAD pipeline and how the segmentation model works in the Speaker Diarization Pipeline?
  • My understanding is that the VAD pipeline uses the maximum over the speaker axis for each frame

Correct. Plus thresholding.
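A toy illustration of that reduction (the scores and the 0.5 threshold are fabricated for the example):

```python
# Fake per-frame speaker activations: rows are frames, columns are speakers.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.0],
    [0.1, 0.8, 0.1],
]

onset = 0.5  # illustrative threshold
# Max over the speaker axis for each frame, then threshold.
speech = [max(frame) > onset for frame in scores]
print(speech)  # [True, False, True]
```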

  • but the Segmentation model performs speaker change detection by taking the absolute value of the first derivative over the time axis, and take the maximum value over the speaker axis.

Incorrect. See this paper explaining how the segmentation model is used.

picheny-nyu commented 7 months ago

Does this mean there is no need to tune the segmentation hyperparameters for 3.0, and that this step should be omitted?

Also, I think the tutorial needs some changes. Even though the segmentation model has no apparent hyperparameters, min_duration_off still needs to be set, after updating the clustering parameters, when instantiating the new pipeline with the modified parameters.

hbredin commented 7 months ago

Yes indeed, the tutorial is outdated. Feel free to open a PR!

avion23 commented 6 months ago

pyannote/speaker-diarization-3.1.1 has more parameters, but it turns out they are read-only :(

Probably they only document how the model was trained.

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.