Open sunraymoonbeam opened 7 months ago
- What is the difference between the segmentation models and why do they have different settable hyper-parameters?
`pyannote/segmentation` is described in this paper. `pyannote/segmentation-3.0` is described in that paper.
This should help you understand why the latter does not need `onset` and `offset` thresholds.
- Is it possible to set "min_duration_on" for the speaker diarization pipeline?
Not out of the box. You would have to post-process the output yourself.
Pipelines output `pyannote.core.Annotation` instances, so this API might help you do just that.
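As a rough illustration of the post-processing suggested above, here is a minimal sketch that drops speaker turns shorter than a chosen duration. It works on plain `(start, end, speaker)` tuples so it runs without pyannote installed; `drop_short_turns` and the `min_duration_on` argument are hypothetical names for this example, not part of the pipeline's API.

```python
from typing import List, Tuple

# One speaker turn: (start in seconds, end in seconds, speaker label).
Turn = Tuple[float, float, str]


def drop_short_turns(turns: List[Turn], min_duration_on: float) -> List[Turn]:
    """Keep only the turns that last at least `min_duration_on` seconds."""
    return [turn for turn in turns if turn[1] - turn[0] >= min_duration_on]


# With pyannote installed, the turns could be collected from a pipeline
# output along these lines (assumed usage, not executed here):
#   turns = [(segment.start, segment.end, label)
#            for segment, _, label in diarization.itertracks(yield_label=True)]
turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.6, 2.7, "SPEAKER_01"),  # about 0.1 s long, removed below
    (3.0, 5.0, "SPEAKER_01"),
]
filtered = drop_short_turns(turns, min_duration_on=0.5)
```

Dropping a short turn leaves a gap where it used to be. Whether to leave that gap or merge it into a neighbouring turn is a separate choice that this sketch does not make.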
- What is the difference between VAD pipeline and how the segmentation model works in the Speaker Diarization Pipeline?
- My understanding is that the VAD pipeline uses the maximum over the speaker axis for each frame
Correct. Plus thresholding.
- but the segmentation model performs speaker change detection by taking the absolute value of the first derivative over the time axis, and then taking the maximum value over the speaker axis.
Incorrect. See this paper explaining how the segmentation model is used.
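The VAD part that was confirmed above (per-frame maximum over the speaker axis, followed by thresholding) can be sketched as follows. This is a simplified illustration with a single threshold, not pyannote's implementation: `vad_from_speaker_scores`, the `scores` layout, and `frame_step` are assumptions made for the example.

```python
from typing import List, Tuple


def vad_from_speaker_scores(
    scores: List[List[float]], threshold: float, frame_step: float
) -> List[Tuple[float, float]]:
    """Turn per-speaker frame scores into speech regions.

    scores[t][k] is the activation of speaker k at frame t.
    A frame counts as speech when its maximum over speakers exceeds `threshold`.
    """
    speech = [max(frame) > threshold for frame in scores]

    regions = []
    start = None
    for t, active in enumerate(speech):
        if active and start is None:
            start = t  # a speech region begins
        elif not active and start is not None:
            regions.append((start * frame_step, t * frame_step))
            start = None  # the region ended at frame t
    if start is not None:  # speech runs until the last frame
        regions.append((start * frame_step, len(speech) * frame_step))
    return regions


scores = [
    [0.1, 0.0],
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.2],
    [0.7, 0.6],
]
regions = vad_from_speaker_scores(scores, threshold=0.5, frame_step=1.0)
```

Because the maximum collapses the speaker axis, frames 1 and 2 end up in the same speech region even though different speakers are active, which is why this step alone says nothing about who is speaking.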
Does this mean that there is no need to tune the segmentation hyperparameters for 3.0 so this step should be omitted?
Also I think the tutorial needs some changes. Even though the segmentation system has no apparent hyperparameters, the min_duration_off parameter still needs to be set after updating the clustering parameter when instantiating the new pipeline with the modified parameters.
Yes indeed, the tutorial is outdated. Feel free to open a PR!
pyannote/speaker-diarization-3.1.1 has more parameters, but it turns out they are read only :(
They are probably only there as documentation of how the model was trained.
I am currently working on a speaker diarization task for classroom discussions without labeled data. To assess the pipeline's performance, I rely on two methods: listening manually and using intrinsic measures such as clustering metrics. The pipeline, when used out of the box, doesn't perform well. Some segments contain background noise, while others are very short (e.g., 0.1 seconds).
I want to improve the diarization pipeline's performance by tweaking hyperparameters. I know about the hyperparameters for segmentation (`threshold`, `min_duration_off`, and `min_duration_on`) and clustering (`method`, `min_cluster_size`, and `threshold`). However, I'm having trouble instantiating hyperparameters for the segmentation model.
Here's my attempt:
The output only shows one tunable hyperparameter for segmentation (`min_duration_off`). After some investigation, I discovered that using `pyannote/segmentation-3.0` for the segmentation model results in only `min_duration_off` being visible. However, when using the `pyannote.audio.pipelines.SpeakerDiarization` pipeline with the default segmentation model `pyannote/segmentation@2022.07`, the `threshold` activation parameter is available.
I'm curious about the difference between these segmentation models. Additionally, I noticed that the VAD pipeline has `min_duration_on`, but the speaker diarization pipeline does not (I would like to use it to remove those short speech segments). Initially, I performed each task separately (VAD -> embedding of speech segments -> clustering) instead of using the pipeline. My understanding is that the VAD pipeline doesn't account for speaker change detection and only detects regions of speech, which is why I switched back to the pipeline for easy inference and testing.
### Questions
Can you provide insights into these issues and advise me on how to tune the hyperparameters to improve performance?
Warm regards, Zack