When Hypothesis is a Single Segment, segmentation.py seems to return 1.0 for both coverage and purity

pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems

http://pyannote.github.io/pyannote-metrics

MIT License


When Hypothesis is a Single Segment, segmentation.py seems to return 1.0 for both coverage and purity #40

Closed picheny-nyu closed 3 years ago

picheny-nyu commented 4 years ago

Description

I have a situation in which the entire hypothesis is returned as a single segment. This results in both a purity of 1.0 and a coverage of 1.0, which does not seem right. If I understand the code correctly, in segmentation.py, when the method _partition(self, timeline, coverage) is executed, "coverage" is essentially the reference labelling. So if timeline is a single segment, the call "return partition.crop(coverage, mode='intersection').relabel_tracks()" crops the timeline to exactly the reference segmentation, resulting in a purity of 1.0 and a coverage of 1.0. Maybe my understanding is faulty, but I could really use some help here.

Thanks, Michael Picheny

hbredin commented 4 years ago

This might be a duplicate of #20.

If not, can you please provide a minimal reproducible example?

picheny-nyu commented 4 years ago

Yes, it is the same issue. I tried to fix it by adding the lines:

        # hypothesis processing
        filled = hypothesis
        coverage = filled.support()

between the lines

        reference_partition = self._partition(filled, coverage)
        hypothesis_partition = self._partition(hypothesis, coverage)

in the SegmentationCoverage class. I am not sure that is right, but at least it gave me better-looking values for purity. If I am wrong, can you explain why?

hbredin commented 4 years ago

I find it difficult to help you without a minimal reproducible example.

Can you please provide one?

reference = Annotation()
reference[...] = ...
...

hypothesis = Annotation()
hypothesis[...] = ...
...

print(purity(reference, hypothesis))
print(coverage(reference, hypothesis))

picheny-nyu commented 4 years ago

I will, sorry. I have just been busy with a couple of other things the last few days. I should be able to provide this by Monday.

cathyeee77 commented 3 years ago

I have a similar problem here, and it would be great if you could help fix it:

Here is an example:

from pyannote.core import Annotation, Segment
from pyannote.metrics.segmentation import SegmentationPurity, SegmentationCoverage

ref = Annotation()
ref[Segment(0, 3)] = 'A'

hyp = Annotation()
hyp[Segment(2, 4)] = 'a'

purity = SegmentationPurity()
coverage = SegmentationCoverage()

print(coverage(ref, hyp))  # returns 1.0 but I expected 0.33
print(purity(ref, hyp))  # returns 1.0 but I expected 0.5

Thanks, Cathy

hbredin commented 3 years ago

Did you have a look at issue #20?

There is a whole discussion there trying to explain the behavior of those metrics.

cathyeee77 commented 3 years ago

Thanks for your quick reply! Now I understand that segmentation purity and coverage apply to full partitions of the file. I'm still kind of confused about how purity and coverage are calculated.

Let's take this for example:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationPurity, DiarizationCoverage

reference = Annotation()
reference[Segment(0, 3)] = 'A'
reference[Segment(5, 7)] = 'B'

hypothesis = Annotation()
hypothesis[Segment(2, 4)] = 'a'
hypothesis[Segment(4, 7)] = 'b'

diarizationPurity = DiarizationPurity()
diarizationCoverage = DiarizationCoverage()
print(diarizationPurity(reference, hypothesis))
print(diarizationCoverage(reference, hypothesis))

I would expect the purity to be 3/5 = 0.6, where 3 is the duration of the intersection between ref and hyp, and 5 is the total speech duration in hyp. I would expect the coverage to also be 3/5 = 0.6, where 3 is the intersection and 5 is the total speech duration in ref. But it returns 1.0 for both purity and coverage. Is this expected?
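The arithmetic behind these expected values can be written out in plain Python. This is only a sketch of the calculation described in this comment, not of anything in pyannote.metrics: the helper names (`overlap`, `total_duration`, `shared_duration`) are made up for illustration, segments are plain `(start, end)` tuples, and it assumes segments on the same side do not overlap each other. It also ignores labels entirely, which a real purity or coverage measure would have to take into account.

```python
def overlap(a, b):
    """Duration shared by two (start, end) intervals, 0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def total_duration(segments):
    """Sum of segment durations (assumes the segments do not overlap)."""
    return sum(end - start for start, end in segments)


def shared_duration(ref_segments, hyp_segments):
    """Total time covered by both the reference and the hypothesis."""
    return sum(overlap(r, h) for r in ref_segments for h in hyp_segments)


# Same segments as in the example above, labels dropped.
ref_segments = [(0, 3), (5, 7)]
hyp_segments = [(2, 4), (4, 7)]

shared = shared_duration(ref_segments, hyp_segments)        # [2, 3] and [5, 7]
expected_purity = shared / total_duration(hyp_segments)     # normalized by hypothesis speech
expected_coverage = shared / total_duration(ref_segments)   # normalized by reference speech

print(expected_purity, expected_coverage)
```

Both values come out to 0.6 here because the reference and the hypothesis each contain 5 seconds of speech, 3 of which are shared.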

hbredin commented 3 years ago

The way diarization purity and coverage are implemented in pyannote.metrics makes them focus only on the "speech" regions common to both reference and hypothesis.

Therefore, it starts by removing the following regions from the evaluation...

[0 --> 2] because there is no speech in hypothesis
[3 --> 5] because there is no speech in reference

... and then only compute purity and coverage.

The main motivation is to not mix speech detection errors (for which pyannote.metrics.detection metrics should be used) and speaker confusion errors. I agree that there are other ways to compute purity and coverage and I'd likely consider a PR adding these alternative implementations to pyannote.metrics.
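To make the effect of that restriction concrete, here is a minimal plain-Python sketch of the behavior described above. It is not the pyannote.metrics implementation: `overlap`, `cooccurrence` and `purity_and_coverage` are hypothetical helpers, annotations are plain lists of `((start, end), label)` pairs, and it assumes no overlapping speech within the reference or within the hypothesis. Only time labelled in both annotations contributes, so the regions [0 --> 2] and [3 --> 5] drop out automatically.

```python
from collections import defaultdict


def overlap(a, b):
    """Duration shared by two (start, end) intervals, 0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def cooccurrence(reference, hypothesis):
    """Map (ref_label, hyp_label) to the time both labels are active together."""
    table = defaultdict(float)
    for ref_seg, ref_label in reference:
        for hyp_seg, hyp_label in hypothesis:
            shared = overlap(ref_seg, hyp_seg)
            if shared > 0:
                table[ref_label, hyp_label] += shared
    return table


def purity_and_coverage(reference, hypothesis):
    """Purity and coverage restricted to speech common to both annotations."""
    table = cooccurrence(reference, hypothesis)
    total = sum(table.values())  # duration of the common speech regions
    if total == 0:
        raise ValueError("reference and hypothesis share no speech")
    best_per_hyp = defaultdict(float)  # dominant reference label per hypothesis label
    best_per_ref = defaultdict(float)  # dominant hypothesis label per reference label
    for (ref_label, hyp_label), shared in table.items():
        best_per_hyp[hyp_label] = max(best_per_hyp[hyp_label], shared)
        best_per_ref[ref_label] = max(best_per_ref[ref_label], shared)
    purity = sum(best_per_hyp.values()) / total
    coverage = sum(best_per_ref.values()) / total
    return purity, coverage


reference = [((0, 3), "A"), ((5, 7), "B")]
hypothesis = [((2, 4), "a"), ((4, 7), "b")]
print(purity_and_coverage(reference, hypothesis))  # (1.0, 1.0)
```

In the common regions [2 --> 3] and [5 --> 7], 'a' only ever meets 'A' and 'b' only ever meets 'B', so there is no speaker confusion left to penalize and both values are 1.0.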

cathyeee77 commented 3 years ago

Got it. Thanks for your reply! Great project👍