pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License

DER differs from the one obtained with NIST SCTK `md-eval.pl` #33

Closed · dev0x13 closed this issue 4 years ago

dev0x13 commented 4 years ago

Description

I ran some tests on the NIST SRE 2000 CALLHOME dataset, and the diarization error rate I get with pyannote is about 6% (absolute) higher than the one computed with the canonical SCTK md-eval.pl. I just want to know whether I am doing something wrong or whether this is expected behavior caused by differences between the implementations. Thank you!

Steps/Code to Reproduce

Example:

pyannote:

from pyannote.metrics.diarization import DiarizationErrorRate

# collar is intended to match md-eval.pl's -c 0.25
metric = DiarizationErrorRate(collar=0.25, skip_overlap=True)

# references and hypotheses are lists of pyannote.core.Annotation objects
# (one pair per file); calling the metric accumulates it over all files
for reference, hypothesis in zip(references, hypotheses):
    metric(reference, hypothesis)

# report the aggregated DER as a percentage
print(100 * abs(metric))
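For completeness, here is one way the references and hypotheses lists above could be built from the same RTTM files passed to md-eval.pl. This is only a sketch: it assumes pyannote.database's load_rttm helper, which returns a dict mapping each file URI to its Annotation.

# Sketch: build `references` and `hypotheses` from the RTTM files
# (assumes pyannote.database.util.load_rttm is available).
from pyannote.database.util import load_rttm

ref_by_uri = load_rttm("ref.rttm")
hyp_by_uri = load_rttm("hyp.rttm")

# pair reference and hypothesis annotations by file URI
uris = sorted(ref_by_uri)
references = [ref_by_uri[uri] for uri in uris]
hypotheses = [hyp_by_uri[uri] for uri in uris]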

md-eval.pl:

$ perl md-eval.pl -1 -c 0.25 -r ref.rttm -s hyp.rttm

Expected Results

pyannote: 13.94%
md-eval.pl: 13.94%

Actual Results

pyannote: 19.36%
md-eval.pl: 13.94%

Versions

pyannote.core==3.0
pyannote.database==2.3
pyannote.metrics==2.1

hbredin commented 4 years ago

The semantics of “collar” differ between md-eval's -c option and pyannote's collar argument: md-eval applies the collar on each side of every speech turn boundary, whereas pyannote interprets it as the overall collar centered on the boundary. You should therefore use collar=0.5 in pyannote to be equivalent to -c 0.25.
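In other words, a minimal sketch of the equivalent setting (the 0.5 value is simply twice the md-eval collar):

from pyannote.metrics.diarization import DiarizationErrorRate

# md-eval.pl: -c 0.25 ignores 0.25 s on EACH side of a boundary
#             (i.e. a 0.5 s no-score window in total)
# pyannote:   collar is the TOTAL no-score window centered on the boundary
metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)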

dev0x13 commented 4 years ago

Thank you, that helped! It's great to be able to use a pure Python DER implementation instead of that monstrous Perl script.