pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Speaker Diarization pipeline.get_segmentations produces integer ascending start/ends instead of something useful #1685

Open bschreck opened 3 months ago

bschreck commented 3 months ago

Tested versions

3.1

System information

macOs 13.6 - pyannote 3.1 - M2 air

Issue description

I'm running:

```python
self.pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_API_KEY"],
)
segmentations = self.pipeline.get_segmentations(
    {"waveform": torch.from_numpy(waveform), "sample_rate": sample_rate}
)
splits = [(segment, data) for segment, data in segmentations]
```

Each segment has start/end times that ascend by one, e.g. (0, 10), (1, 11), ... (5, 15).
These roughly match the length of the waveform (14.7 seconds), but clearly don't represent anything useful: the waveform is real speech. When I run the full diarization pipeline instead, it does diarize correctly; the results are:
```python
[(<Segment(1.16159, 2.41034)>, 'SPEAKER_00'),
 (<Segment(4.21597, 5.43097)>, 'SPEAKER_01'),
 (<Segment(5.76847, 6.39284)>, 'SPEAKER_00'),
 (<Segment(8.18159, 10.2741)>, 'SPEAKER_01'),
 (<Segment(11.3372, 12.9741)>, 'SPEAKER_00'),
 (<Segment(13.2947, 14.4591)>, 'SPEAKER_00')]
```
And in both cases there are 6 segments.
Where do these latter segments get constructed?
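For what it's worth, the (0, 10), (1, 11), ... (5, 15) pattern looks consistent with the sliding chunk windows the segmentation model is applied to (a fixed-duration window stepped along the audio), rather than detected speech turns. A minimal sketch of that arithmetic, assuming 10 s chunks with a 1 s step (assumed values, not read from the pipeline config):

```python
# Assumed chunking parameters: 10 s windows stepped by 1 s.
# Six such windows reproduce the (0, 10) ... (5, 15) pairs above.
duration, step, n_chunks = 10.0, 1.0, 6
chunk_windows = [(i * step, i * step + duration) for i in range(n_chunks)]
print(chunk_windows)  # [(0.0, 10.0), (1.0, 11.0), ..., (5.0, 15.0)]
```

That would also explain why there happen to be 6 of them for a 14.7 s file.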

My use case is:
1. run diarization on a concatenation of many different audio files. Save speaker to centroid mapping
2. user submits a new audio file (audio_new)
3. get embedding for each segment of audio_new
4. find closest speaker centroid by cosine distance for each segment
5. save diarization of each segment of audio_new

There doesn't appear to be a great documented workflow for this.
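Assuming one saved centroid vector per speaker, steps 3–5 above can be sketched with plain NumPy (the function name `match_speakers` and the toy data are mine, not a pyannote API):

```python
import numpy as np

def match_speakers(embeddings, centroids):
    """Assign each segment embedding (one row of `embeddings`)
    to the closest saved centroid by cosine distance."""
    names = list(centroids)
    C = np.stack([centroids[n] for n in names])   # (num_speakers, dim)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distances = 1.0 - E @ C.T                     # (num_segments, num_speakers)
    return [names[i] for i in distances.argmin(axis=1)]

# toy check with 2-d centroids
centroids = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
segments = np.array([[0.9, 0.1], [0.2, 0.8]])
print(match_speakers(segments, centroids))  # ['alice', 'bob']
```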
It's odd to me that get_embeddings returns arrays with num_local_speakers as a dimension, which doesn't even correspond exactly to the existing number of speakers from the original diarization. What does this actually mean? Relative confidence of the mapping to some threshold-gated speakers?
To reduce this dimension and find the closest centroid, I'm doing:

```python
import numpy as np
from scipy.spatial.distance import cdist

embeddings = self.pipeline.get_embeddings(audio, segmentations)
for (segment, _), segment_embedding in zip(splits, embeddings):
    min_distance_idx = np.argmin(
        [
            np.min(
                cdist(
                    segment_embedding,
                    center[np.newaxis, :],
                    metric="cosine",
                )
            )
            for center in self.speaker_to_centroids.values()
        ]
    )
    speaker = list(self.speaker_to_centroids.keys())[min_distance_idx]
```

Not sure if this works as intended, especially since the segmentations aren't yet showing useful start/end times.
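On the `num_local_speakers` dimension: the 3.1 segmentation model works per chunk with a fixed number of local speaker slots (up to 3), which is why it doesn't match the global speaker count. If I understand correctly, slots with no active speaker in a chunk may come back as NaN embeddings (worth verifying on your data), so a NaN-aware reduction is safer than a plain `np.min`. A hedged variant of the loop above (pure NumPy; `closest_centroid` is my own name):

```python
import numpy as np

def closest_centroid(segment_embedding, speaker_to_centroids):
    """Pick the nearest saved centroid for one chunk's embeddings.

    segment_embedding: (num_local_speakers, dim); rows for inactive
    local-speaker slots may be NaN and are skipped here.
    """
    valid = segment_embedding[~np.isnan(segment_embedding).any(axis=1)]
    if len(valid) == 0:
        return None  # no active local speaker in this chunk
    V = valid / np.linalg.norm(valid, axis=1, keepdims=True)
    best_name, best_dist = None, np.inf
    for name, center in speaker_to_centroids.items():
        c = center / np.linalg.norm(center)
        # cosine distance = 1 - cosine similarity; keep the best slot
        d = float(np.min(1.0 - V @ c))
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name
```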

### Minimal reproduction example (MRE)

see above
bschreck commented 3 months ago

Okay, I dug through the code and see that the actual start/end times are created later, in to_diarization or to_annotation.

However, diarizing the new audio file this way against the existing clusters (with the same speaker: me) results in totally different (and very bad) annotations compared to just running the pretrained pipeline on the file directly. Running it by itself produces this set of segments:

```python
DiarizationSegment(
    speaker="SPEAKER_04", start=1.1370997453310672, end=2.461378183361628
),
DiarizationSegment(
    speaker="SPEAKER_00", start=4.193126910016975, end=5.466471561969438
),
DiarizationSegment(
    speaker="SPEAKER_04", start=5.755096349745333, end=6.4172355687606135
),
DiarizationSegment(
    speaker="SPEAKER_00", start=8.182940152801354, end=10.271225382003397
),
DiarizationSegment(
    speaker="SPEAKER_04", start=11.35781281833616, end=12.953738115449912
),
DiarizationSegment(
    speaker="SPEAKER_04", start=13.344230475382002, end=14.51570755517827
),
```

While the method I described above with existing clusters gives me:

```python
[DiarizationSegment(speaker='SPEAKER_00', start=5.00909375, end=5.75159375),
 DiarizationSegment(speaker='SPEAKER_01', start=5.75159375, end=6.443468750000001),
 DiarizationSegment(speaker='SPEAKER_00', start=6.443468750000001, end=6.59534375)]
```

This is totally different.