huggingface / speechbox

Apache License 2.0

ASR Diarization Performance #21

Open venuv opened 1 year ago

venuv commented 1 year ago

This diarization doesn't compare favorably with Whisper-based alternatives. Wondering if I'm missing something in the call parameters or elsewhere.

On this two-minute video - https://www.youtube.com/watch?v=xbyEs7DJshw&ab_channel=HipronarySchool%23Callcenter , speechbox produces only 14 speaker transitions (vs. somewhere between 37 and 45 in the actual audio), as shown below. The code used is pretty trivial but attached for context.
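For readers without the attachment, the call is roughly the following sketch (the Whisper checkpoint name and audio path are my assumptions, not necessarily what the attached script uses); the speaker/text/timestamp rows come straight out of the pipeline:

```python
# Sketch of the speechbox call (assumed "openai/whisper-base" checkpoint;
# the attached script may differ). Heavy imports are kept inside the function.
def run_diarization(audio_path):
    from speechbox import ASRDiarizationPipeline  # needs speechbox + pyannote access
    pipe = ASRDiarizationPipeline.from_pretrained("openai/whisper-base")
    return pipe(audio_path)  # list of {"speaker", "text", "timestamp"} dicts

def to_rows(segments):
    """Flatten pipeline output into (index, speaker, text, timestamp) rows."""
    return [(i, s["speaker"], s["text"], s["timestamp"])
            for i, s in enumerate(segments)]

# Shape of the output, using one row from the run below:
sample = [{"speaker": "SPEAKER_00", "text": "Hello...", "timestamp": (0.0, 13.2)}]
print(to_rows(sample))  # → [(0, 'SPEAKER_00', 'Hello...', (0.0, 13.2))]
```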

    speaker      text
0   SPEAKER_00   Hello, can you take a picture for a spot in h...
1   SPEAKER_01   How don't I have any?
2   SPEAKER_00   Yes, it's ATK 0804949. Okay, just let me veri...
3   SPEAKER_01   You did? For the ninth time, the only is not ...
4   SPEAKER_00   Okay, sir. I totally understand your situatio...
5   SPEAKER_01   Okay, yeah, yeah, we usually then our brother...
6   SPEAKER_00   Could you take a look design the boss and ver...
7   SPEAKER_01   But they're not a cable in the dogs. Okay, le...
8   SPEAKER_00   Sores are my mistake current in system, the e...
9   SPEAKER_01   Okay, but if you have a lot of trouble going ...
10  SPEAKER_00   If you want anything already wrong from us, w...
11  SPEAKER_01   No, I don't need different deals, guys. Thank...
12  SPEAKER_00   For doing your doctor, I will just cut your f...
13  SPEAKER_01   Yeah, yeah, I'll write right now.
14  SPEAKER_00   How night is?

    timestamp
0   (0.0, 13.2)
1   (13.2, 14.7)
2   (14.7, 53.0)
3   (53.0, 57.0)
4   (57.0, 66.8)
5   (66.8, 71.16)
6   (72.0, 76.8)
7   (77.28, 80.72)
8   (82.56, 101.76)
9   (101.76, 107.0)
10  (107.0, 111.0)
11  (111.0, 114.0)
12  (114.0, 117.0)
13  (117.0, 119.0)
14  (119.0, 120.0)

sharpenspeechbrain.py.zip
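To make the gap concrete, the transition count can be computed directly from the pipeline output (segment dicts with a `"speaker"` key, as above). The 15 strictly alternating segments give exactly 14 transitions, versus the 37–45 expected:

```python
def count_transitions(segments):
    """Number of speaker changes between consecutive diarized segments."""
    speakers = [seg["speaker"] for seg in segments]
    return sum(1 for prev, cur in zip(speakers, speakers[1:]) if prev != cur)

# The 15 segments above strictly alternate SPEAKER_00 / SPEAKER_01:
segments = [{"speaker": f"SPEAKER_{i % 2:02d}"} for i in range(15)]
print(count_transitions(segments))  # → 14
```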

venuv commented 1 year ago

Was comparing with the performance of https://huggingface.co/spaces/vumichien/whisper-speaker-diarization on the same audio (.wav extracted from YT).

patrickvonplaten commented 1 year ago

cc @sanchit-gandhi here

sanchit-gandhi commented 1 year ago

Hey @venuv - the Pyannote Speaker Diarization model is trained on the AMI dataset, which is probably out-of-domain for your call-centre recordings.

The ECAPA-TDNN model from SpeechBrain is probably more in-domain. We'd likely get the best performance here using NVIDIA NeMo's TitaNet-L model, which is trained on a composite dataset that includes 3.6k hours of call-centre recordings (Switchboard + Fisher): https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/titanet_large
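As a rough sketch of how such a speaker-embedding model would be swapped in (the `titanet_large` model name is from the NGC link above; the loading code and the similarity threshold are assumptions on my part): embeddings are extracted per audio window and consecutive windows are compared by cosine similarity, with scores near 1.0 suggesting the same speaker.

```python
import math

def load_titanet():
    """Sketch: load TitaNet-L via NeMo (lazy import; triggers a large download)."""
    from nemo.collections.asr.models import EncDecSpeakerLabelModel
    return EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")

def cosine_similarity(a, b):
    """Score two speaker embeddings; values near 1.0 suggest the same speaker."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score 1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```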

Pyannote seems to generalise quite poorly to OOD data. NVIDIA's model is trained on a more diverse corpus and thus handles more downstream ASR + Diarization scenarios.