pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License

Input type to Segmentation Metrics #25

Closed Rachine closed 5 years ago

Rachine commented 5 years ago

Hello, this is not really an issue; it's just for my own understanding.

Description

The segmentation metrics do not all take the same input type: a Timeline for segmentation precision/recall, but an Annotation for purity/coverage.

https://github.com/pyannote/pyannote-metrics/blob/master/pyannote/metrics/segmentation.py#L94

Steps/Code to Reproduce

Example:

from pyannote.core import Timeline, Segment
reference = Timeline()
reference.add(Segment(0, 10))
reference.add(Segment(12, 20)) 

hypothesis = Timeline()
hypothesis.add(Segment(2, 13))
hypothesis.add(Segment(28, 35))

from pyannote.metrics.segmentation import SegmentationRecall, SegmentationPurity
recall = SegmentationRecall()
purity = SegmentationPurity()
print("Recall = {0:.3f}".format(recall(reference, hypothesis)))
print("Purity = {0:.3f}".format(purity(reference, hypothesis)))

Actual Results

Recall = 0.000 
[...]
TypeError: reference must be an instance of `Annotation`

Questions

Why is an Annotation necessary for purity and coverage? What labels should be injected into the Annotation?

It seems related to the remark in the docs (http://pyannote.github.io/pyannote-metrics/reference.html#segmentation) that these metrics directly relate to the cluster-wise purity and coverage used for evaluating speaker diarization.

Thank you very much!

hbredin commented 5 years ago

First, let me get this out of the way: I am not happy with the way segmentation metrics are implemented. It is the result of a trial and error process that led to this inconsistency...

Precision/recall metrics evaluate segmentation as a discrete detection task: "find timestamps when changes happen" -- the actual duration of each segment is not taken into account.

Purity/coverage metrics actually use the duration of each segment. In practice, labels are not used and you can simply use Timeline.to_annotation() to add a unique fake label to each segment of your hypothesis.

That being said, labels could be used in a variant of this metric (and I believe this is why I originally asked for an Annotation instead of a Timeline, though that variant is not implemented).

Currently AABAA is equivalent to AABCC but we could actually consider the actual labels when computing purity and coverage (in the same way we do it for diarization purity/coverage).

Does that answer your questions?

Rachine commented 5 years ago

This answers my question! Thank you.

> Currently AABAA is equivalent to AABCC but we could actually consider the actual labels when computing purity and coverage (in the same way we do it for diarization purity/coverage).

Would taking the labels into account at this stage be equivalent to the diarization metrics, or is there a difference?

hbredin commented 5 years ago

That would be equivalent to using DiarizationPurity and DiarizationCoverage indeed.

The only difference I see is that the code could be optimized to account for the fact that each hypothesized segment has its own label. But the actual error rates would be the same (just faster to compute).

Rachine commented 5 years ago

Ok thank you very much!