pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License

Collar calculation is not standard? #63

Closed liutaocode closed 2 years ago

liutaocode commented 2 years ago

I recently evaluated a system with pyannote.metrics using a collar of 0.25 s (250 ms), but the result differs from the one given by dscore, another widely used evaluation tool.

I think the difference comes from how each tool interprets the collar. In pyannote, half of the collar is removed on each side of a segment boundary, so the excluded region around each boundary is one collar wide in total. In dscore, the full collar is removed on each side, so the excluded region is two collars wide.
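To make the two conventions concrete, here is a minimal sketch in plain Python. It does not call pyannote or dscore; `no_score_zones` and `total_duration` are hypothetical helpers written only for this illustration, and the reference segments are made up.

```python
def no_score_zones(segments, half_width):
    """Return merged (start, end) zones excluded around every segment boundary.

    `half_width` is the time removed on EACH side of a boundary, in seconds.
    """
    zones = []
    for start, end in segments:
        for boundary in (start, end):
            zones.append((max(0.0, boundary - half_width), boundary + half_width))
    zones.sort()

    # Merge zones that overlap or touch so that no time is counted twice.
    merged = []
    for start, end in zones:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_duration(zones):
    """Sum the durations of a list of (start, end) zones."""
    return sum(end - start for start, end in zones)


collar = 0.25  # seconds
reference = [(1.0, 4.0), (6.0, 9.0)]  # made-up reference speech segments

# Convention described above for pyannote: half of the collar on each side.
half_convention = no_score_zones(reference, collar / 2)
# Convention described above for dscore: the full collar on each side.
full_convention = no_score_zones(reference, collar)

print("half on each side:", total_duration(half_convention), "s excluded")
print("full on each side:", total_duration(full_convention), "s excluded")
```

With the same nominal collar, the second convention excludes twice as much time around each boundary, which is enough to change the reported error rate.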

So I would like to ask why pyannote applies the collar differently; I find this quite confusing.

hbredin commented 2 years ago

Not sure which one is standard, given that pyannote.metrics was released before dscore. Both were released after NIST's md-eval.pl. Also, here is a nice post by @desh2608 comparing those tools and introducing yet another one (spyder): https://desh2608.github.io/2021-03-05-spyder/

Regarding the collar convention used in pyannote.metrics, this is documented here: https://pyannote.github.io/pyannote-metrics/reference.html#evaluation-metrics

hbredin commented 2 years ago

That being said, I recommend you use collar = 0.0 when reporting results.

For very dynamic conversations with many short speech turns, using a 250 ms collar may actually remove more than half of the conversation (and usually the more difficult half), leading to over-optimistic reported diarization error rates.
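A rough back-of-the-envelope sketch of this effect, under a deliberately simplified assumption: every speech turn has the same duration and loses `half_width` seconds at its start and at its end. `fraction_unscored` is a hypothetical helper, not part of any evaluation tool.

```python
def fraction_unscored(turn_duration, half_width):
    """Fraction of a speech turn that falls inside the no-score zones.

    Assumes the turn loses `half_width` seconds at its start and at its end
    (simplified model: identical turns, boundaries only at turn edges).
    """
    if turn_duration <= 0:
        raise ValueError("turn_duration must be positive")
    return min(1.0, 2.0 * half_width / turn_duration)


# 250 ms removed on each side of every boundary.
for turn in (0.5, 0.8, 1.0, 3.0):
    removed = fraction_unscored(turn, half_width=0.25)
    print(f"turn of {turn:.1f} s: {removed:.0%} of the speech is not scored")
```

In this model, any turn shorter than one second loses more than half of its duration when 250 ms is removed on each side, and those short turns are typically where diarization systems make the most errors.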

liutaocode commented 2 years ago

Thanks for your quick reply, I have learnt a lot from it. Since pyannote interprets the collar differently from dscore, we can set the pyannote collar to twice the dscore collar (2 * collar) to keep the two tools consistent.
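A minimal sketch of that workaround, assuming the only difference between the two tools is the collar convention discussed in this thread. `pyannote_collar_from_dscore` and `build_metric` are hypothetical helper names; `DiarizationErrorRate` and its `collar` argument come from pyannote.metrics.

```python
def pyannote_collar_from_dscore(dscore_collar):
    """Convert a per-side collar (dscore style) into the total collar
    width expected by pyannote.metrics, which splits it in two halves."""
    if dscore_collar < 0:
        raise ValueError("collar must be non-negative")
    return 2.0 * dscore_collar


def build_metric(dscore_collar=0.25):
    """Build a pyannote DER metric matching a given dscore collar."""
    # Imported lazily so the conversion helper above can be used
    # even when pyannote.metrics is not installed.
    from pyannote.metrics.diarization import DiarizationErrorRate

    return DiarizationErrorRate(collar=pyannote_collar_from_dscore(dscore_collar))


print(pyannote_collar_from_dscore(0.25))  # collar value to pass to pyannote
```

With pyannote.metrics installed, `metric = build_metric(0.25)` followed by `metric(reference, hypothesis)` on two `pyannote.core.Annotation` objects would then compute the diarization error rate. Other settings, such as how overlapped speech is handled, may still need to be aligned between the tools.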