pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License

Segmentation Purity is always 100% #20

Closed: neozhangthe1 closed this issue 6 years ago

neozhangthe1 commented 6 years ago

Description

Segmentation Purity always outputs 1.0

Example

hypothesis = <Timeline(uri=None, segments=[<Segment(-0.0125, 4.52)>, <Segment(4.52, 8.66)>, <Segment(8.66, 15.63)>, <Segment(15.63, 17.22)>, <Segment(17.22, 19.2)>, <Segment(19.2, 25.51)>, <Segment(25.51, 36.39)>, <Segment(36.39, 38.12)>, <Segment(38.12, 39.59)>, <Segment(39.59, 46.23)>, <Segment(46.23, 54.05)>, <Segment(54.05, 69.84)>, <Segment(69.84, 71.03)>, <Segment(71.03, 91.51)>, <Segment(91.51, 93.81)>, <Segment(93.81, 101.95)>, <Segment(101.95, 103.75)>, <Segment(103.75, 105.96)>, <Segment(105.96, 115.82)>, <Segment(115.82, 128.33)>, <Segment(128.33, 166.473)>])>

the timeline of reference = <Timeline(uri=None, segments=[<Segment(3.11, 3.97)>, <Segment(4.61, 8.02)>, <Segment(8.71, 15.57)>, <Segment(17.25, 17.95)>, <Segment(19.21, 20.11)>, <Segment(20.12, 20.71)>, <Segment(20.72, 25.46)>, <Segment(26.86, 27.46)>, <Segment(27.47, 29.86)>, <Segment(29.87, 31.66)>, <Segment(31.67, 32.56)>, <Segment(32.57, 33.46)>, <Segment(33.47, 34.06)>, <Segment(36.35, 37.53)>, <Segment(38.17, 39.53)>, <Segment(42.72, 44.81)>, <Segment(44.82, 46.15)>, <Segment(46.85, 47.75)>, <Segment(47.76, 49.85)>, <Segment(49.86, 52.82)>, <Segment(54.08, 54.79)>, <Segment(55.44, 56.04)>, <Segment(56.05, 57.84)>, <Segment(57.85, 59.27)>, <Segment(59.92, 64.42)>, <Segment(64.43, 67.12)>, <Segment(67.13, 69.75)>, <Segment(71.07, 71.97)>, <Segment(71.98, 72.57)>, <Segment(72.58, 73.47)>, <Segment(73.48, 76.77)>, <Segment(76.78, 84.66)>, <Segment(85.31, 86.51)>, <Segment(86.52, 88.61)>, <Segment(88.62, 90.71)>, <Segment(90.72, 91.43)>, <Segment(93.93, 100.53)>, <Segment(100.54, 101.13)>, <Segment(101.14, 101.87)>, <Segment(103.78, 105.89)>, <Segment(106.53, 107.43)>, <Segment(107.44, 107.85)>, <Segment(108.49, 109.99)>, <Segment(110, 111.19)>, <Segment(111.8, 113.59)>, <Segment(113.6, 114.7)>, <Segment(115.9, 128.27)>, <Segment(128.91, 133.41)>, <Segment(133.42, 135.15)>, <Segment(135.79, 139.09)>, <Segment(139.21, 141.79)>, <Segment(141.8, 145.99)>, <Segment(146, 146.89)>, <Segment(146.9, 147.79)>, <Segment(148.4, 149.89)>, <Segment(149.9, 151.39)>, <Segment(151.4, 152.29)>, <Segment(152.3, 156.19)>, <Segment(156.2, 156.79)>, <Segment(156.8, 159.49)>, <Segment(159.5, 160.29)>, <Segment(160.94, 162.74)>, <Segment(162.75, 164.49)>])>

after self._partition(self, timeline, coverage) the hypothesis becomes <Timeline(uri=None, segments=[<Segment(3.11, 3.97)>, <Segment(4.61, 8.02)>, <Segment(8.71, 15.57)>, <Segment(17.25, 17.95)>, <Segment(19.21, 20.11)>, <Segment(20.12, 20.71)>, <Segment(20.72, 25.46)>, <Segment(26.86, 27.46)>, <Segment(27.47, 29.86)>, <Segment(29.87, 31.66)>, <Segment(31.67, 32.56)>, <Segment(32.57, 33.46)>, <Segment(33.47, 34.06)>, <Segment(36.35, 37.53)>, <Segment(38.17, 39.53)>, <Segment(42.72, 44.81)>, <Segment(44.82, 46.15)>, <Segment(46.85, 47.75)>, <Segment(47.76, 49.85)>, <Segment(49.86, 52.82)>, <Segment(54.08, 54.79)>, <Segment(55.44, 56.04)>, <Segment(56.05, 57.84)>, <Segment(57.85, 59.27)>, <Segment(59.92, 64.42)>, <Segment(64.43, 67.12)>, <Segment(67.13, 69.75)>, <Segment(71.07, 71.97)>, <Segment(71.98, 72.57)>, <Segment(72.58, 73.47)>, <Segment(73.48, 76.77)>, <Segment(76.78, 84.66)>, <Segment(85.31, 86.51)>, <Segment(86.52, 88.61)>, <Segment(88.62, 90.71)>, <Segment(90.72, 91.43)>, <Segment(93.93, 100.53)>, <Segment(100.54, 101.13)>, <Segment(101.14, 101.87)>, <Segment(103.78, 105.89)>, <Segment(106.53, 107.43)>, <Segment(107.44, 107.85)>, <Segment(108.49, 109.99)>, <Segment(110, 111.19)>, <Segment(111.8, 113.59)>, <Segment(113.6, 114.7)>, <Segment(115.9, 128.27)>, <Segment(128.91, 133.41)>, <Segment(133.42, 135.15)>, <Segment(135.79, 139.09)>, <Segment(139.21, 141.79)>, <Segment(141.8, 145.99)>, <Segment(146, 146.89)>, <Segment(146.9, 147.79)>, <Segment(148.4, 149.89)>, <Segment(149.9, 151.39)>, <Segment(151.4, 152.29)>, <Segment(152.3, 156.19)>, <Segment(156.2, 156.79)>, <Segment(156.8, 159.49)>, <Segment(159.5, 160.29)>, <Segment(160.94, 162.74)>, <Segment(162.75, 164.49)>])>

I don't think the purity of this example is 1.0, since <Segment(128.33, 166.473)> contains two different speakers.

hbredin commented 6 years ago

Can you please provide me with a simple self-contained script that I could run to reproduce the error?

neozhangthe1 commented 6 years ago

Below is a simple example. Both hypothesis and hypothesis1 achieve 100% purity. If we set the whole audio file as a single Segment, we get 100% for both purity and coverage.

    from pyannote.core import Annotation, Timeline, Segment
    from pyannote.metrics.segmentation import SegmentationPurity, SegmentationCoverage

    # hypothesis: a single segment covering the whole file
    hypothesis = Timeline(segments=[Segment(0, 10)])
    # hypothesis1: three segments that stop before the last reference segment
    hypothesis1 = Timeline(segments=[Segment(0, 3), Segment(3, 4), Segment(4, 5)])

    # reference annotation with two speakers
    reference = Annotation()
    reference[Segment(1, 2)] = "a"
    reference[Segment(3, 5)] = "b"
    reference[Segment(7, 8)] = "a"

    purity = SegmentationPurity()(reference, hypothesis)
    coverage = SegmentationCoverage()(reference, hypothesis)
    purity1 = SegmentationPurity()(reference, hypothesis1)
    coverage1 = SegmentationCoverage()(reference, hypothesis1)
    print(purity, coverage)
    print(purity1, coverage1)

hbredin commented 6 years ago

This is the expected behavior, but I agree that the documentation is not clear.

SegmentationPurity and SegmentationCoverage assume that the supports of reference and hypothesis are the same.

If not, it silently extrudes the hypothesis so that its support matches that of the reference. This is indeed bad design -- it should probably raise an error instead... Is this something you would like to contribute? I'd love to merge a pull request on the develop branch :)
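
Something along these lines could be a starting point. This is only a sketch, and check_support is not an existing pyannote API:

    from pyannote.core import Annotation, Timeline

    def check_support(reference: Annotation, hypothesis: Timeline):
        # raise instead of silently cropping when the two supports differ (sketch only)
        ref_support = reference.get_timeline().support()
        hyp_support = hypothesis.support()
        if list(ref_support) != list(hyp_support):
            raise ValueError('reference and hypothesis supports differ')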

What you are looking for is DiarizationPurity and DiarizationCoverage. (Note that it also starts by focusing on the intersection of reference and hypothesis supports)

>>> from pyannote.core import Annotation, Segment
>>> from pyannote.metrics.diarization import DiarizationPurity, DiarizationCoverage
>>> purity = DiarizationPurity()
>>> coverage = DiarizationCoverage()
>>> reference = Annotation()
>>> reference[Segment(1, 2)] = "a"
>>> reference[Segment(3, 5)] = "b"
>>> reference[Segment(7, 8)] = "a"
>>> hypothesis = Annotation()
>>> hypothesis[Segment(0, 10)] = "A"
>>> purity(reference, hypothesis)
# 0.5 
>>> coverage(reference, hypothesis)
# 1.0
neozhangthe1 commented 6 years ago

Thanks for the quick response, but I'm still a little confused. I'm currently working on a speaker change detection task. If I get a prediction with no change detected, the resulting segmentation purity and segmentation coverage are both 100%. Is this expected?

hbredin commented 6 years ago

SegmentationPurity and SegmentationCoverage can be applied to full partitions of the file, not just speech regions:

>>> reference = Annotation()
>>> reference[Segment(0, 1)] = 'non_speech'
>>> reference[Segment(1, 2)] = 'a'
>>> reference[Segment(2, 4)] = 'b'
>>> reference[Segment(4, 5)] = 'non_speech'
>>> reference[Segment(5, 10)] = 'a'
>>> hypothesis = Annotation()
>>> hypothesis[Segment(0, 10)] = 'A'
>>> SegmentationPurity()(reference, hypothesis)
# 0.5
>>> SegmentationCoverage()(reference, hypothesis)
# 1.0

For speaker change detection, if you only want to evaluate speech regions, DiarizationPurity and DiarizationCoverage are the way to go: just make sure each segment in the hypothesis has its own label. Does it make sense?
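
As a rough sketch of that last point (the timeline_to_annotation helper and its labels are illustrative, not part of pyannote), turning a Timeline into an Annotation with one label per segment could look like this:

    from pyannote.core import Annotation, Timeline, Segment

    def timeline_to_annotation(timeline, uri=None):
        # give every segment its own label so DiarizationPurity / DiarizationCoverage
        # treat each one as a separate "speaker" (illustrative helper)
        annotation = Annotation(uri=uri)
        for i, segment in enumerate(timeline):
            annotation[segment] = 'segment_{:d}'.format(i)
        return annotation

    hypothesis = Timeline(segments=[Segment(0, 4), Segment(4, 10)])
    hypothesis_annotation = timeline_to_annotation(hypothesis)

DiarizationPurity()(reference, hypothesis_annotation) then only scores the regions where the reference and hypothesis supports intersect, as noted above.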

neozhangthe1 commented 6 years ago

I'm experimenting with speaker change detection using the implementation in pyannote-audio: https://github.com/pyannote/pyannote-audio/blob/master/pyannote/audio/applications/change_detection.py#L216

    alphas = np.linspace(0, 1, 20)

    purity = [SegmentationPurity(parallel=False) for alpha in alphas]
    coverage = [SegmentationCoverage(parallel=False) for alpha in alphas]

    # -- SAVE RESULTS --
    for i, alpha in enumerate(alphas):
        # initialize peak detection algorithm
        peak = Peak(alpha=alpha, min_duration=min_duration)
        for uri, reference in groundtruth.items():
            # apply peak detection
            hypothesis = peak.apply(predictions[uri])
            # compute purity and coverage
            purity[i](reference, hypothesis)
            coverage[i](reference, hypothesis)

The hypothesis generated by peak.apply(predictions[uri]) is a Timeline object, and the segmentation purity stays fixed at 1.0 regardless of the value of alpha.

This is where I got confused.

neozhangthe1 commented 6 years ago

Oh, I think I got the point. I need to fill the gaps with a non_speech label. Great project, and thanks for your help!
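
For completeness, a minimal sketch of that gap-filling step, assuming a reference Annotation and a Segment giving the file extent (the fill_gaps helper is illustrative, not part of pyannote):

    from pyannote.core import Annotation, Segment

    def fill_gaps(reference, support, label='non_speech'):
        # label every gap of the reference within `support` as non-speech (illustrative helper)
        filled = reference.copy()
        for gap in reference.get_timeline().gaps(support):
            filled[gap] = label
        return filled

    reference = Annotation()
    reference[Segment(1, 2)] = 'a'
    reference[Segment(3, 5)] = 'b'
    reference[Segment(7, 8)] = 'a'
    reference = fill_gaps(reference, Segment(0, 10))

With the gaps labeled, a hypothesis segment that straddles a speech/non-speech boundary is no longer counted as pure, matching the full-partition example above.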