kimdwkimdw opened this issue 3 months ago
This sounds like an interesting idea to investigate.
Do you have experiments backing the claim that this new metric is a good indicator of the actual performance of a diarization system? I would love to know more...
I've expanded my testing to include VoxConverse_test (v0.3) and three internal datasets, applying the `pyannote/speaker-diarization-3.1` model with and without the `embedding_exclude_overlap` feature. Interestingly, while the SND (Speaker Number Difference) metric shows minimal variation on VoxConverse_test when `embedding_exclude_overlap` is toggled, significant differences emerge across the internal datasets under the `embedding_exclude_overlap=True` condition. I plan to test further on widely recognized datasets such as AISHELL, AMI, and DIHARD, and will share those findings soon.
Regarding the metric itself, I initially adapted a form of Mean Squared Error (MSE) for simplicity. The formulas are as follows:
For Speaker Number Difference, we have two versions:
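In sketch form, with $n_i$ the reference and $\hat{n}_i$ the predicted speaker count for file $i$ of $N$ (the exact scaling shown here is illustrative):

```latex
\mathrm{SND}_1 = \frac{100}{N} \sum_{i=1}^{N} \left| \hat{n}_i - n_i \right|
\qquad
\mathrm{SND}_2 = \frac{100}{N} \sum_{i=1}^{N} \left( \hat{n}_i - n_i \right)^2
```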
I will share measured DER, JER, and SND numbers soon.
Benchmark (DER) | pyannote v2.1 | pyannote v3.1 | pyannote Premium | pyannote v3.1, evaluation by me (DER / JER / SND_1 / SND_2) |
---|---|---|---|---|
AISHELL-4 | 14.1 | 12.2 | 11.9 | 9.35 / 14.06 / 31.40 / 70.60 |
AliMeeting (channel 1) | 27.4 | 24.4 | 22.5 | 15.42 / 22.62 / 49.17 / 140.00 |
AMI (IHM) | 18.9 | 18.8 | 16.6 | 11.00 / 15.63 / 52.60 / 156.77 |
AMI (SDM) | 27.1 | 22.4 | 20.9 | 14.79 / 20.65 / 54.94 / 165.88 |
AVA-AVD | 66.3 | 50 | 39.8 | 47.77 / 69.23 / 35.89 / 158.47 |
CALLHOME (part 2) | 31.6 | 28.4 | 22.2 | N/A |
DIHARD 3 (full) | 26.9 | 21.7 | 17.2 | N/A |
Earnings21 | 17 | 9.4 | 9 | 9.67 / 14.85 / 10.83 / 22.20 |
Ego4D (dev.) | 61.5 | 51.2 | 43.8 | N/A |
MSDWild | 32.8 | 25.3 | 19.8 | 27.04 / 57.42 / 32.79 / 72.40 |
RAMC | 22.5 | 22.2 | 18.4 | 22.09 / 24.82 / 44.16 / 90.31 |
REPERE (phase2) | 8.2 | 7.8 | 7.6 | N/A |
VoxConverse (v0.3) | 11.2 | 11.3 | 9.4 | 9.85 / 32.86 / 20.92 / 58.22 |
Thanks to @upskyy (my co-worker), I've added the column pyannote v3.1 (by Ours) (DER / JER / SND_1 / SND_2).
Upon analyzing the metrics, we noticed that while the DER for AVA-AVD was unexpectedly high, indicating a need for further examination, other datasets such as AISHELL-4, AliMeeting, DIHARD 3, Earnings21, and MSDWild showed promising results. In general, when DER/JER is high, SND is high as well.
@hbredin
Sorry for the delay in coming back to this.
Can you please explain the difference between v3.1 and v3.1 (ours)? I am not sure how this relates to the initial idea.
I thought the point was to show that the new proposed metric correlated with the legacy DER metric. But those numbers do not really show that. We would need to add SND for v3.1 (original) as well... Or did I miss something?
Sorry for any confusion in my previous messages.
To clarify, my intention was to demonstrate that the DER metric is correlated with the newly proposed SND metric when using the original v3.1 version of pyannote. The numbers listed under "pyannote v3.1 (by Ours)" were intended to show the evaluation results of both DER and SND across all datasets, to ensure reproducibility and comprehensive understanding.
For clarity, I renamed the column from "by Ours" to "evaluation by me". I also fixed an error in the AVA-AVD numbers; they are now similar to the original metrics.
I am sorry but there is still something that I don’t understand. Why don’t 3.1 and 3.1 (yours) reach the same DER values?
I'm not entirely sure why there is a discrepancy between the DER values of 3.1 and 3.1 (by me). However, I used the latest pyannote library and recently prepared datasets such as AISHELL-4 and AVA-AVD, downloaded from their official repositories.
While the DER trends between 3.1 and 3.1 (by me) are generally similar, I too am curious about the reasons for these differences. There might be subtle variations between the two setups, but since the overall DER trends agree, I felt that reporting SND and DER together would make the comparison clearer, hence my decision to share all metrics.
Here are the versions of the libraries in my environment:
```
torch                  2.3.0
torch-audiomentations  0.11.1
torch-pitch-shift      1.2.4
torchaudio             2.3.0
torchmetrics           1.3.2
pyannote.audio         3.1.1
pyannote.core          5.0.0
pyannote.database      5.1.0
pyannote.metrics       3.2
pyannote.pipeline      3.0.1
```
The outputs I computed are available here: https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main/reproducible_research
Can you please compare with yours?
Using reproducible_research, I found some errors in the table above: it was evaluated with `skip_overlap=True`. I am sorry for confusing you.
The table below is reproduced with the correct settings.
Benchmark (DER) | pyannote v2.1 | pyannote v3.1 | pyannote Premium | pyannote v3.1 (reproduced) (DER / JER / SND_1 / SND_2) |
---|---|---|---|---|
AISHELL-4 | 14.1 | 12.2 | 11.9 | 12.1 / 17.6 / 31.4 / 70.6 |
AliMeeting (channel 1) | 27.4 | 24.4 | 22.5 | 24.2 / 28.8 / 54.6 / 160.4 |
AMI (IHM) | 18.9 | 18.8 | 16.6 | 18.8 / 23.8 / 51.0 / 152.1 |
AMI (SDM) | 27.1 | 22.4 | 20.9 | 22.7 / 27.9 / 65.6 / 221.9 |
AVA-AVD | 66.3 | 50 | 39.8 | 49.6 / 69.9 / 34.8 / 155.8 |
CALLHOME (part 2) | 31.6 | 28.4 | 22.2 | N/A |
DIHARD 3 (full) | 26.9 | 21.7 | 17.2 | N/A |
Earnings21 | 17 | 9.4 | 9 | 10.0 / 16.1 / 8.3 / 17.9 |
Ego4D (dev.) | 61.5 | 51.2 | 43.8 | N/A |
MSDWild | 32.8 | 25.3 | 19.8 | N/A |
RAMC | 22.5 | 22.2 | 18.4 | 22.2 / 24.9 / 45.2 / 92.5 |
REPERE (phase2) | 8.2 | 7.8 | 7.6 | N/A |
VoxConverse (v0.3) | 11.2 | 11.3 | 9.4 | 11.2 / 34.8 / 20.7 / 56.4 |
By the way, AVA-AVD's RTTM files have a different number of segments. I recently downloaded AVA-AVD from https://github.com/zcxu-eric/AVA-AVD/tree/main/dataset using the provided script. For example, record id `1j20qq1JyX4_c_01` has 98 segment lines in the official repository, but the corresponding RTTM in reproducible_research has only 81 lines.
So, except for AVA-AVD, I think I'm finally on the same page with you.
That's reassuring, thanks :)
I have created scatter plots with your numbers to get a better (visual) idea. They do seem quite correlated indeed, which is nice.
That being said, my understanding (tell me if I am wrong) is that you would like to use `SND_x` as a (cheaper than DER) way to tune hyper-parameters.
So, instead of scatter plots across datasets, one should rather do those plots across systems (e.g. pyannote 2.1 vs. pyannote 3.1 vs. SpeechBrain's vs. NeMo) and see if the correlation still holds.
Also, I suspect SND_x metrics are correlated to speaker confusion error rates rather than more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER?
This seems promising indeed.
> That being said, my understanding (tell me if I am wrong) is that you would like to use `SND_x` as a (cheaper than DER) way to tune hyper-parameters.
Yes, that's correct. :)
Although I haven't worked with SpeechBrain, I've tested all metrics using my own model and NeMo.
These metrics proved useful for developing models and monitoring their performance. They were particularly valuable when assessing model robustness across different domains beyond the popular benchmark datasets (see the tuning sketch after the table below).
Dataset | aishell_4 | alimeeting | AMI (IHM) | AMI (SDM) | AVA-AVD | Earnings21 | RAMC | VoxConverse (v0.3) |
---|---|---|---|---|---|---|---|---|
diarization error rate % | 12.11 | 24.21 | 18.81 | 22.64 | 49.59 | 10.02 | 22.19 | 11.19 |
correct % | 91.8 | 80.16 | 84.76 | 81.16 | 60.1 | 92.29 | 87.03 | 92.89 |
false alarm % | 3.92 | 4.37 | 3.57 | 3.8 | 9.69 | 2.31 | 9.22 | 4.08 |
missed detection % | 3.9 | 10.17 | 9.53 | 11.18 | 16.9 | 2.91 | 5.72 | 3.39 |
confusion % | 4.29 | 9.67 | 5.71 | 7.67 | 23.01 | 4.8 | 7.25 | 3.73 |
JER | 17.6 | 28.8 | 23.8 | 27.9 | 69.9 | 16.1 | 24.9 | 34.8 |
SND_1 | 31.4 | 54.6 | 51 | 65.6 | 34.8 | 8.3 | 45.2 | 20.7 |
SND_2 | 70.6 | 160.4 | 152.1 | 221.9 | 155.8 | 17.9 | 92.5 | 56.4 |
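Concretely, here is a rough sketch of the tuning loop I have in mind (the file list and helper are illustrative; the parameter names follow the released 3.1 config and may require the full parameter dict):

```python
# Sketch: tune the clustering threshold on unlabeled audio where only the
# per-file speaker count is known, scoring with an SND_1-style measure
# instead of DER.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

files = [("meeting1.wav", 4), ("call1.wav", 2)]  # (audio, known speaker count)

def mean_abs_speaker_diff(threshold: float) -> float:
    # parameter names taken from the released config; defaults kept elsewhere
    pipeline.instantiate({
        "segmentation": {"min_duration_off": 0.0},
        "clustering": {"method": "centroid", "min_cluster_size": 12,
                       "threshold": threshold},
    })
    diffs = [abs(len(pipeline(audio).labels()) - n_ref)
             for audio, n_ref in files]
    return sum(diffs) / len(diffs)

best_threshold = min([0.5, 0.6, 0.7, 0.8], key=mean_abs_speaker_diff)
```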
> Also, I suspect SND_x metrics are correlated to speaker confusion error rates rather than more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER?
I attached scatter plots for "false alarm" / "missed detection" / "confusion".
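For anyone who wants to redraw them, the confusion-vs-SND_1 plot can be reproduced from the table above (matplotlib sketch):

```python
# Scatter of SND_1 against speaker confusion, using the numbers from the table.
import matplotlib.pyplot as plt

datasets = ["aishell_4", "alimeeting", "AMI (IHM)", "AMI (SDM)",
            "AVA-AVD", "Earnings21", "RAMC", "VoxConverse (v0.3)"]
confusion = [4.29, 9.67, 5.71, 7.67, 23.01, 4.80, 7.25, 3.73]
snd_1 = [31.4, 54.6, 51.0, 65.6, 34.8, 8.3, 45.2, 20.7]

plt.scatter(confusion, snd_1)
for name, x, y in zip(datasets, confusion, snd_1):
    plt.annotate(name, (x, y))
plt.xlabel("speaker confusion (%)")
plt.ylabel("SND_1")
plt.show()
```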
Thanks. You said you attached false alarm and missed detection plots as well but I don’t see them.
Sorry, I added the two missing images. 😹
Thanks. So, as expected, SND correlates with speaker confusion, but not really with false alarm or missed detection.
Overall, it means that it can be used for tuning clustering hyperparameters but not so much to train segmentation.
Description
Hello @hbredin and the pyannote community,
I hope this is the appropriate forum for this discussion; I approach with a bit of caution but much enthusiasm.
I've been extensively using pyannote, including pyannote-core, and have even contributed PRs focused on enhancing performance. Today, I'd like to share an idea for a new metric that I've been using in practice, hoping to spark some discussion and gather your thoughts.
While DER and JER are incredibly valuable metrics, they assume the presence of labeled data for speaker diarization. My recent work involves handling large volumes of unlabeled audio, prompting me to think about how we can leverage this data more effectively for speaker diarization.
An insight struck me: by simply using the number of speakers present in an audio file as a lightweight label, and comparing this reference count to the number of speakers predicted through clustering, we can arrive at a fairly robust metric (an MSE-like measure over speaker counts). This metric could serve as a valuable tool, especially in scenarios where detailed labeling is unfeasible or unavailable.
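In code, the core of the idea is tiny (a sketch; the function name is illustrative, not an existing pyannote API):

```python
# Minimal sketch of the proposed metric: compare the number of speakers in the
# reference against the number predicted by clustering, MSE-style.
from pyannote.core import Annotation

def speaker_number_difference(reference: Annotation,
                              hypothesis: Annotation) -> float:
    n_ref = len(reference.labels())   # labeled (or simply known) speaker count
    n_hyp = len(hypothesis.labels())  # speaker count predicted by clustering
    return float((n_hyp - n_ref) ** 2)
```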
I'm considering crafting a pull request to introduce this concept into pyannote and would love to hear your thoughts, feedback, and any insights you might have on this idea. Would this be of interest to the community? How might we refine or expand upon this idea to make it even more useful?