pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License
6.38k stars, 784 forks

Mismatch between DiscreteDiarizationErrorRate and DiarizationErrorRate #1733

Closed hhd52859 closed 4 months ago

hhd52859 commented 4 months ago

Tested versions

3.3.1

System information

Ubuntu 20.04

Issue description

When fine-tuning the PyanNet model, I noticed that training reports the DiscreteDiarizationErrorRate (DiscreteDER) on the validation set. This metric is computed only at the chunk level, so the local DiscreteDER can differ from the overall DiarizationErrorRate (DER). For instance, with a chunk duration of 5 seconds, the local DiscreteDER on the AMI development set was 16%, whereas the global DER was 17%. Increasing the duration to 20 seconds gave a DiscreteDER of 20% and a DER of 15.5%.

This observation raises several questions:

  1. What is the relationship between DiscreteDiarizationErrorRate and DiarizationErrorRate? Which metric aligns more closely with the conventional DER?
  2. How can we monitor the global DiarizationErrorRate during the training process?

Minimal reproduction example (MRE)

Can provide if needed

hbredin commented 4 months ago
  1. There is no direct relationship between the two as the second one needs an extra clustering step used to stitch chunks together.

  2. You cannot. That being said, DER is the sum of three terms: false alarm (FA), missed detection (MD), and speaker confusion (SC). Better local FA or MD will directly translate into better global FA or MD. Global SC, on the other hand, cannot be inferred from local SC (because of the extra clustering step mentioned above).
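To make the three-term decomposition concrete, here is a minimal frame-level sketch in plain Python. It is not the pyannote implementation: the function name `frame_level_der`, the 0/1 frame matrices, and the brute-force search over speaker mappings are all assumptions made for illustration. It shows that false alarm and missed detection only depend on how many speakers are active in each frame, while speaker confusion depends on which hypothesis speaker is mapped to which reference speaker.

```python
from itertools import permutations


def frame_level_der(reference, hypothesis):
    """Toy frame-level DER (illustrative only, not pyannote's code).

    reference, hypothesis: lists of frames; each frame is a list of 0/1
    activity flags, one per speaker (same number of speakers in both).
    """
    num_speakers = len(reference[0])
    # Total amount of reference speech, counted in speaker-frames.
    total = sum(sum(ref) for ref in reference)

    # FA and MD compare speaker counts only, so they do not depend
    # on the speaker mapping.
    false_alarm = sum(max(0, sum(hyp) - sum(ref))
                      for ref, hyp in zip(reference, hypothesis))
    missed = sum(max(0, sum(ref) - sum(hyp))
                 for ref, hyp in zip(reference, hypothesis))

    # Confusion depends on the mapping: keep the best one-to-one mapping.
    confusion = None
    for perm in permutations(range(num_speakers)):
        errors = 0
        for ref, hyp in zip(reference, hypothesis):
            correct = sum(ref[s] * hyp[perm[s]] for s in range(num_speakers))
            errors += min(sum(ref), sum(hyp)) - correct
        confusion = errors if confusion is None else min(confusion, errors)

    return {
        "false alarm": false_alarm,
        "missed detection": missed,
        "confusion": confusion,
        "total": total,
        "diarization error rate": (false_alarm + missed + confusion) / total,
    }


# Hypothetical 10-frame example with two speakers.
reference = [[1, 0]] * 4 + [[0, 1]] * 4 + [[0, 0]] * 2
hypothesis = ([[0, 1]] * 3 + [[0, 0]] + [[1, 0]] * 3
              + [[0, 1]] + [[1, 0]] + [[0, 0]])
result = frame_level_der(reference, hypothesis)
print(result)
```

In this toy example the hypothesis uses swapped speaker labels, which the best mapping absorbs, leaving one frame each of false alarm, missed detection, and confusion out of eight reference speaker-frames.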

This paper might give you a better understanding of the whole pipeline.

hhd52859 commented 4 months ago

@hbredin, thank you for your reply!

After reading the pyannote diarization pipeline paper, I have a better understanding of the problem, but one thing is still unclear to me. If I set the duration parameter to a value greater than the maximum duration of all files in the dataset, and omit the second clustering stage, will the resulting DiscreteDER be equivalent to the standard DER? In this case clustering isn't necessary to stitch chunks together anymore.

Additionally, I came across a different DER calculation method on page 13 of this paper, which diverges from both DiscreteDER and the DER used in pyannote. This has led to some confusion on my end about how to achieve comparable DER metrics across different methods. Could you shed some light on this?

hbredin commented 4 months ago

> If I set the duration parameter to a value greater than the maximum duration of all files in the dataset, and omit the second clustering stage, will the resulting DiscreteDER be equivalent to the standard DER?

Yes.
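A toy sketch of why the chunk size matters, under the assumption that a chunk-level metric is free to pick a different speaker mapping for every chunk. The helper names `confusion` and `chunked_confusion` are hypothetical, and the code does not model the pipeline's clustering step or reproduce the numbers reported above. It only shows that per-chunk scoring can hide label inconsistencies across chunks, and that the two views coincide once a single chunk covers the whole file.

```python
from itertools import permutations


def confusion(reference, hypothesis):
    """Smallest number of confused speaker-frames over all one-to-one
    speaker mappings (brute force, illustrative only)."""
    num_speakers = len(reference[0])
    best = None
    for perm in permutations(range(num_speakers)):
        errors = 0
        for ref, hyp in zip(reference, hypothesis):
            correct = sum(ref[s] * hyp[perm[s]] for s in range(num_speakers))
            errors += min(sum(ref), sum(hyp)) - correct
        best = errors if best is None else min(best, errors)
    return best


def chunked_confusion(reference, hypothesis, chunk_size):
    """Sum of per-chunk confusion, each chunk using its own best mapping."""
    return sum(
        confusion(reference[i:i + chunk_size], hypothesis[i:i + chunk_size])
        for i in range(0, len(reference), chunk_size)
    )


# Hypothetical file: speaker A talks for 4 frames, then speaker B for 4.
reference = [[1, 0]] * 4 + [[0, 1]] * 4
# The hypothesis puts whoever is talking in column 0 in both halves.
hypothesis = [[1, 0]] * 8

local = chunked_confusion(reference, hypothesis, chunk_size=4)
whole = chunked_confusion(reference, hypothesis, chunk_size=8)
print(local, whole, confusion(reference, hypothesis))
```

With 4-frame chunks each chunk looks perfect on its own, because the mapping is re-chosen per chunk. With one 8-frame chunk the mapping must be consistent across the file, and the result matches the whole-file computation.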

> Additionally, I came across a different DER calculation method on page 13 of this paper, which diverges from both DiscreteDER and the DER used in pyannote. This has led to some confusion on my end about how to achieve comparable DER metrics across different methods. Could you shed some light on this?

This is very similar (if not identical) to DiscreteDER.