pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License

Ideation: Exploring New Metrics for Speaker Diarization in Unlabeled Audio Contexts #68

Open kimdwkimdw opened 3 months ago

kimdwkimdw commented 3 months ago

Description

Hello @hbredin and the pyannote community,

I hope this is the appropriate forum for this discussion; I approach with a bit of caution but much enthusiasm.

I've been extensively using pyannote, including pyannote-core, and have even contributed PRs focused on enhancing performance. Today, I'd like to share an idea for a new metric that I've been using in practice, hoping to spark some discussion and gather your thoughts.

While DER and JER are incredibly valuable metrics, they assume the presence of labeled data for speaker diarization. My recent work involves handling large volumes of unlabeled audio, prompting me to think about how we can leverage this data more effectively for speaker diarization.

An insight struck me: by using the "number of speakers" present in an audio file as a simple label and comparing this reference count to the number of speakers predicted through clustering, we can arrive at a fairly robust metric (e.g., an MSE over speaker counts). This metric could serve as a valuable tool, especially in scenarios where detailed labeling is infeasible or unavailable.
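To make this concrete, here is a minimal sketch of the kind of computation I have in mind (the function name and returned aggregates are just placeholders, not an existing pyannote API):

```python
from pyannote.core import Annotation

def speaker_count_error(reference_num_speakers: int, hypothesis: Annotation) -> dict:
    """Compare a known speaker count with the number of clusters in a diarization output."""
    predicted = len(hypothesis.labels())  # distinct speaker labels in the hypothesis
    diff = predicted - reference_num_speakers
    return {"absolute": abs(diff), "squared": diff ** 2}
```

Averaging these per-file values over a large unlabeled corpus only requires knowing how many speakers are in each file, not when they speak.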

I'm considering crafting a pull request to introduce this concept into pyannote and would love to hear your thoughts, feedback, and any insights you might have on this ideation. Would this be of interest to the community? How might we refine or expand upon this idea to make it even more useful?

hbredin commented 3 months ago

This sounds like an interesting idea to investigate.

Do you have experiments backing the claim that this new metric is a good indicator of the actual performance of a diarization system? I would love to know more...

kimdwkimdw commented 3 months ago

I've expanded my testing to include VoxConverse_test (0.3) and three internal datasets, applying the "pyannote/speaker-diarization-3.1" model with and without the embedding_exclude_overlap feature. Interestingly, while the SND (Speaker Number Difference) metric shows minimal variation on VoxConverse_test when embedding_exclude_overlap is toggled, significant differences emerge across the internal datasets under the embedding_exclude_overlap=True condition. I plan to test further on widely recognized datasets such as AISHELL, AMI, and DIHARD and will share those findings soon.

Regarding the metric itself, I initially adapted a form of Mean Squared Error (MSE) for simplicity. The formulas are as follows:

For Speaker Number Difference, we have two versions:
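Roughly, with $K_n$ and $\hat{K}_n$ the reference and predicted speaker counts for file $n$ among $N$ evaluated files, the two versions amount to an absolute and a squared difference (the exact normalization, e.g. dividing by $K_n$ or scaling to a percentage, is open to discussion):

$$
\mathrm{SND}_1 = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{K}_n - K_n\right|,
\qquad
\mathrm{SND}_2 = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{K}_n - K_n\right)^2
$$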

I will share measured DER, JER, and SND numbers soon.

kimdwkimdw commented 3 months ago

| Benchmark (DER) | pyannote v2.1 | pyannote v3.1 | pyannote Premium | pyannote v3.1 (evaluation by me) (DER / JER / SND_1 / SND_2) |
|---|---|---|---|---|
| AISHELL-4 | 14.1 | 12.2 | 11.9 | 9.35 / 14.06 / 31.40 / 70.60 |
| AliMeeting (channel 1) | 27.4 | 24.4 | 22.5 | 15.42 / 22.62 / 49.17 / 140.00 |
| AMI (IHM) | 18.9 | 18.8 | 16.6 | 11.00 / 15.63 / 52.60 / 156.77 |
| AMI (SDM) | 27.1 | 22.4 | 20.9 | 14.79 / 20.65 / 54.94 / 165.88 |
| AVA-AVD | 66.3 | 50 | 39.8 | 47.77 / 69.23 / 35.89 / 158.47 |
| CALLHOME (part 2) | 31.6 | 28.4 | 22.2 | N/A |
| DIHARD 3 (full) | 26.9 | 21.7 | 17.2 | N/A |
| Earnings21 | 17 | 9.4 | 9 | 9.67 / 14.85 / 10.83 / 22.20 |
| Ego4D (dev.) | 61.5 | 51.2 | 43.8 | N/A |
| MSDWild | 32.8 | 25.3 | 19.8 | 27.04 / 57.42 / 32.79 / 72.40 |
| RAMC | 22.5 | 22.2 | 18.4 | 22.09 / 24.82 / 44.16 / 90.31 |
| REPERE (phase2) | 8.2 | 7.8 | 7.6 | N/A |
| VoxConverse (v0.3) | 11.2 | 11.3 | 9.4 | 9.85 / 32.86 / 20.92 / 58.22 |

Thanks to @upskyy (my co-worker), I've added the column "pyannote v3.1 (by Ours) (DER / JER / SND_1 / SND_2)".

Upon analyzing the metrics, we noticed that while the DER for AVA-AVD was unexpectedly high and needs further examination, other datasets such as AISHELL-4, AliMeeting, DIHARD 3, Earnings21, and MSDWild showed promising results. In general, when DER/JER is high, SND is high as well.

@hbredin

hbredin commented 2 months ago

Sorry for the delay in coming back to this.

Can you please explain the difference between v3.1 and v3.1 (ours)? I am not sure how this relates to the initial idea.

I thought the point was to show that the new proposed metric correlated with the legacy DER metric. But those numbers do not really show that. We would need to add SND for v3.1 (original) as well... Or did I miss something?

kimdwkimdw commented 2 months ago

Sorry for any confusion in my previous messages.

To clarify, my intention was to demonstrate that the DER metric is correlated with the newly proposed SND metric when using the original v3.1 version of pyannote. The numbers listed under "pyannote v3.1 (by Ours)" were intended to show the evaluation results of both DER and SND across all datasets, to ensure reproducibility and comprehensive understanding.

For clarity, I renamed the column from "by Ours" to "evaluation by me". I also fixed an error in AVA-AVD; those numbers are now similar to the original metrics.

hbredin commented 2 months ago

I am sorry but there is still something that I don’t understand. Why don’t 3.1 and 3.1 (yours) reach the same DER values?

kimdwkimdw commented 2 months ago

I'm not entirely sure why there is a discrepancy between the DER values of 3.1 and 3.1 (by me). However, I used the latest pyannote library and recently prepared datasets such as AISHELL-4 and AVA-AVD, downloaded from their official repositories.

While the DER trends of 3.1 and 3.1 (by me) seem generally similar, I too am curious about the reasons for these differences. There may be subtle variations between the two setups, but since the overall DER trends match, I felt that reporting SND and DER together would make the comparison clearer, hence my decision to share all metrics.

Here are the versions of the libraries in my environment:

torch                   2.3.0
torch-audiomentations   0.11.1
torch-pitch-shift       1.2.4
torchaudio              2.3.0
torchmetrics            1.3.2

pyannote.audio          3.1.1
pyannote.core           5.0.0
pyannote.database       5.1.0
pyannote.metrics        3.2
pyannote.pipeline       3.0.1

hbredin commented 2 months ago

The outputs I computed are available here: https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main/reproducible_research

Can you please compare with yours?

kimdwkimdw commented 2 months ago

Using reproducible_research, I found some errors in the table above: it was evaluated with "skip_overlap=True". I am sorry for the confusion.
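If I read the setup correctly, the difference boils down to how the metric object is instantiated in pyannote.metrics (a minimal sketch; data loading omitted):

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# What the earlier table mistakenly used: overlapped regions excluded from scoring.
der_skip_overlap = DiarizationErrorRate(collar=0.0, skip_overlap=True)

# What the reproducible_research setup appears to use: no collar, overlap included.
der_full = DiarizationErrorRate(collar=0.0, skip_overlap=False)

# Both are accumulated the same way, e.g.:
# value = der_full(reference, hypothesis, uem=uem)
```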

The table below is reproduced with the corrected setup.

| Benchmark (DER) | pyannote v2.1 | pyannote v3.1 | pyannote Premium | pyannote v3.1 (reproduced) (DER / JER / SND_1 / SND_2) |
|---|---|---|---|---|
| AISHELL-4 | 14.1 | 12.2 | 11.9 | 12.1 / 17.6 / 31.4 / 70.6 |
| AliMeeting (channel 1) | 27.4 | 24.4 | 22.5 | 24.2 / 28.8 / 54.6 / 160.4 |
| AMI (IHM) | 18.9 | 18.8 | 16.6 | 18.8 / 23.8 / 51.0 / 152.1 |
| AMI (SDM) | 27.1 | 22.4 | 20.9 | 22.7 / 27.9 / 65.6 / 221.9 |
| AVA-AVD | 66.3 | 50 | 39.8 | 49.6 / 69.9 / 34.8 / 155.8 |
| CALLHOME (part 2) | 31.6 | 28.4 | 22.2 | N/A |
| DIHARD 3 (full) | 26.9 | 21.7 | 17.2 | N/A |
| Earnings21 | 17 | 9.4 | 9 | 10.0 / 16.1 / 8.3 / 17.9 |
| Ego4D (dev.) | 61.5 | 51.2 | 43.8 | N/A |
| MSDWild | 32.8 | 25.3 | 19.8 | N/A |
| RAMC | 22.5 | 22.2 | 18.4 | 22.2 / 24.9 / 45.2 / 92.5 |
| REPERE (phase2) | 8.2 | 7.8 | 7.6 | N/A |
| VoxConverse (v0.3) | 11.2 | 11.3 | 9.4 | 11.2 / 34.8 / 20.7 / 56.4 |

By the way, the AVA-AVD RTTMs have a different number of segments. I downloaded AVA-AVD recently from https://github.com/zcxu-eric/AVA-AVD/tree/main/dataset using the provided script. For example, record id 1j20qq1JyX4_c_01 has 98 segment lines in the official repository, but the rttm from reproducible_research has only 81 lines for the same recording.

So, except for AVA-AVD, I think I'm finally on the same page with you.

hbredin commented 2 months ago

That's reassuring, thanks :)

I have created scatter plots with your numbers to get a better (visual) idea. They do seem quite correlated indeed, which is nice.

[scatter plots: SND_1 vs. DER and SND_2 vs. DER]

That being said, my understanding (tell me if I am wrong) is that you would like to use SND_x as a (cheaper than DER) way to tune hyper-parameters.

So, instead of scatter plots across datasets, one should rather do those plots across systems (e.g. pyannote 2.1 vs. pyannote 3.1 vs. SpeechBrain's vs. NeMo) and see if the correlation still holds.

Also, I suspect SND_x metrics are correlated to speaker confusion error rates rather than more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER?
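To check this numerically alongside the plots, something along these lines would give the per-component correlations directly (a rough sketch; the CSV file and column names are placeholders for the numbers in the tables above):

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per (dataset, system); columns hold the evaluated metrics.
df = pd.read_csv("metrics.csv")  # placeholder file with the per-dataset numbers

for component in ["confusion", "false_alarm", "missed_detection", "der"]:
    for snd in ["snd_1", "snd_2"]:
        r, p = pearsonr(df[component], df[snd])
        print(f"{snd} vs. {component}: r={r:.2f} (p={p:.3f})")
```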

This seems promising indeed.

kimdwkimdw commented 2 months ago

> That being said, my understanding (tell me if I am wrong) is that you would like to use SND_x as a (cheaper than DER) way to tune hyper-parameters.

Yes, that's correct. :)

Although I haven't worked with SpeechBrain, I've tested all metrics using my own model and NeMo. These metrics proved useful for developing models and monitoring their performance. They were particularly valuable when assessing model robustness across different domains beyond popular datasets.

| Metric | aishell_4 | alimeeting | AMI (IHM) | AMI (SDM) | AVA-AVD | Earnings21 | RAMC | VoxConverse (v0.3) |
|---|---|---|---|---|---|---|---|---|
| diarization error rate % | 12.11 | 24.21 | 18.81 | 22.64 | 49.59 | 10.02 | 22.19 | 11.19 |
| correct % | 91.8 | 80.16 | 84.76 | 81.16 | 60.1 | 92.29 | 87.03 | 92.89 |
| false alarm % | 3.92 | 4.37 | 3.57 | 3.8 | 9.69 | 2.31 | 9.22 | 4.08 |
| missed detection % | 3.9 | 10.17 | 9.53 | 11.18 | 16.9 | 2.91 | 5.72 | 3.39 |
| confusion % | 4.29 | 9.67 | 5.71 | 7.67 | 23.01 | 4.8 | 7.25 | 3.73 |
| JER | 17.6 | 28.8 | 23.8 | 27.9 | 69.9 | 16.1 | 24.9 | 34.8 |
| SND_1 | 31.4 | 54.6 | 51 | 65.6 | 34.8 | 8.3 | 45.2 | 20.7 |
| SND_2 | 70.6 | 160.4 | 152.1 | 221.9 | 155.8 | 17.9 | 92.5 | 56.4 |

> Also, I suspect SND_x metrics are correlated to speaker confusion error rates rather than more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER?

I attached scatter plots for "false alarm" / "missed detection" / "confusion" as well.

[scatter plots: SND vs. DER, confusion %, missed detection %, and false alarm %]

hbredin commented 2 months ago

Thanks. You said you attached false alarm and missed detection plots as well but I don’t see them.

kimdwkimdw commented 2 months ago

Sorry, I added the two missing images. 😹

hbredin commented 2 months ago

Thanks. So, as expected, SND correlates with speaker confusion, but not really with false alarm or missed detection.

Overall, it means that SND can be used for tuning clustering hyper-parameters, but not so much for training segmentation.
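To sketch what that could look like in practice, here is a library-agnostic illustration (the `diarize` callable and the candidate thresholds are placeholders, not an existing pyannote interface): pick the clustering threshold that minimizes SND over an unlabeled set for which only the speaker counts are known.

```python
from collections.abc import Callable, Mapping
from statistics import mean

def tune_threshold_with_snd(
    speaker_counts: Mapping[str, int],        # file -> known number of speakers
    diarize: Callable[[str, float], int],     # (file, threshold) -> predicted speaker count
    candidate_thresholds: list[float],
) -> float:
    """Return the clustering threshold with the lowest mean absolute speaker-count error."""
    def snd(threshold: float) -> float:
        return mean(
            abs(diarize(file, threshold) - count)
            for file, count in speaker_counts.items()
        )
    return min(candidate_thresholds, key=snd)
```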