Different Total values from Detection Error Rate and Diarization Error Rate when there is overlap segment

pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems

http://pyannote.github.io/pyannote-metrics

MIT License


Different Total values from Detection Error Rate and Diarization Error Rate when there is overlap segment #49

Closed fayejf closed 3 years ago

fayejf commented 3 years ago

Description

DetectionErrorRate and DiarizationErrorRate report different total values when the reference contains overlapping segments.

Steps/Code to Reproduce

from pyannote.core import Annotation, Segment
from pyannote.metrics import detection, diarization

a = Annotation('hello')
a[Segment(0, 1)] = 'B'
a[Segment(0.5, 2)] = 'B'  # overlap

metric_det = detection.DetectionErrorRate(collar=0, skip_overlap=False)
metric_det(a, a)
print(metric_det['total'])  # 2

metric_dia = diarization.DiarizationErrorRate(collar=0, skip_overlap=False)
metric_dia(a, a)
print(metric_dia['total'])  # 2.5

Expected Results

Quotes from the documentation:

In Detection: "total is the total duration of speech in the reference."

In Diarization: "total is the total duration of speech in the reference"

Given these identical definitions, we expect the total to be the same for both metrics.

Actual Results

metric_det['total'] = 2
metric_dia['total'] = 2.5

Versions

pyannote.core==4.1
pyannote.metrics==3.0.1

fayejf commented 3 years ago

@nithinraok

hbredin commented 3 years ago

Thanks! This is an imprecision in the documentation.

For diarization, it should read: "total is the total duration of speech turns in the reference". Can you please provide a link to where exactly you found this error? Or, even better, make a PR to fix it?

But, maybe there is also some misunderstanding on your side. In your example above, the same speaker "B" seems to speak twice during the [0.5, 1] time range. When can this happen?

nithinraok commented 3 years ago

We found the discrepancy while comparing the false alarm and missed detection values from DetectionErrorRate and DiarizationErrorRate. In theory they should match, but they did not when the RTTM file contained overlapping segments.

Even if we replace a[Segment(0.5, 2)]='B' with a[Segment(0.5, 2)]='A', we see the same discrepancy: the two total values differ.

hbredin commented 3 years ago

As stated in my previous comment, this behavior is actually expected. I agree that the documentation is misleading, though. I'd gladly merge a PR fixing the documentation.

For detection, total is the total duration of speech activity (i.e. where at least one person speaks). This is meant as an evaluation metric for the binary classification (a.k.a. detection) task.

For diarization, total is the total duration of speech turns (i.e. the sum of speech turn durations over all speakers). Hence, a region where two speakers overlap is counted twice. This is what md-eval.pl and dscore do as well, and it is the way the community does speaker diarization evaluation.
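To make the two definitions above concrete, here is a minimal pure-Python sketch that computes both totals for the segments from the original report. It is not the pyannote.metrics implementation; the helper names detection_total and diarization_total are hypothetical, and speech turns are assumed to be plain (start, end, speaker) tuples.

```python
def detection_total(turns):
    """Duration of the union of all turns (where at least one person speaks)."""
    total = 0.0
    cur_start = cur_end = None
    for start, end, _speaker in sorted(turns):
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged region and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged region.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def diarization_total(turns):
    """Sum of turn durations; overlapping regions are counted once per turn."""
    return sum(end - start for start, end, _speaker in turns)


# Same segments as in the report: [0, 1] and [0.5, 2].
turns = [(0.0, 1.0, "B"), (0.5, 2.0, "B")]
print(detection_total(turns))    # 2.0
print(diarization_total(turns))  # 2.5
```

The 0.5 difference is exactly the duration of the overlapping region [0.5, 1], which the second total counts twice. Relabeling the second turn as "A" does not change either number in this sketch, since neither total depends on the label.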

nithinraok commented 3 years ago

Thanks, @hbredin for the clear explanation and reference. Submitted PR for doc correction: https://github.com/pyannote/pyannote-metrics/pull/50

hbredin commented 3 years ago

PR merged. Thanks!