m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 4-Clause "Original" or "Old" License

Speaker recognition in diarization #323

Open mirix opened 1 year ago

mirix commented 1 year ago

I have started to play with whisperX for diarization and it looks very promising.

Except for one little issue: labelling the speakers.

Are there any parameters that one could tune in order to improve speaker recognition?

All the rest, both transcription and alignment, is almost flawless so far.

I have downloaded three videos and converted them to 16kHz wav files, then:

whisperx Michael_Jim_Dwight_epic_scene_qHrN5Mf5sgo.wav --hf_token "hf_MY_TOKEN" --diarize --language en --compute_type 'float32' --min_speakers 2 --max_speakers 3 --align_model "jonatasgrosman/wav2vec2-large-xlsr-53-english"
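The 16 kHz conversion step mentioned above is not shown in the post. One common way to do it is with ffmpeg, sketched below from Python. The use of ffmpeg, the mono downmix (`-ac 1`), and the file names `input.mp4` and `output.wav` are assumptions for illustration, not details taken from the original command.

```python
import os
import shutil
import subprocess


def ffmpeg_16k_cmd(src, dst):
    """Build an ffmpeg command that resamples src to a 16 kHz mono WAV at dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file (video or audio)
        "-ar", "16000",  # output sample rate: 16 kHz
        "-ac", "1",      # downmix to one channel (an assumption, see above)
        dst,
    ]


# Hypothetical file names, used only to show the resulting command.
cmd = ffmpeg_16k_cmd("input.mp4", "output.wav")
print(" ".join(cmd))

# Run the conversion only if ffmpeg is installed and the input file exists.
if shutil.which("ffmpeg") and os.path.exists("input.mp4"):
    subprocess.run(cmd, check=True)
```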

Here it works very well:

https://www.youtube.com/watch?v=Fyb2AiF1feI

But here it is a disaster:

https://www.youtube.com/watch?v=DxxAwDHgQhE

https://www.youtube.com/watch?v=qHrN5Mf5sgo

In the first one there are only two characters, a man and a woman, and therefore it seems like an easy scenario.

The second one has three characters with very different pitches and accents.

The third one has three males with similar pitches and accents.
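For context on why labelling can fail while transcription and alignment look fine: in pipelines of this kind, speaker labels usually come from a separate diarization pass and are attached to the transcript afterwards by time overlap. Below is a minimal sketch of that matching step on toy data. The function name `assign_speakers`, the dictionary layout, and the timestamps are hypothetical, and this is not whisperX's actual code.

```python
from collections import defaultdict


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose diarization
    turns overlap it the longest (in seconds); None if nothing overlaps."""
    labelled = []
    for seg in segments:
        overlap = defaultdict(float)
        for turn in turns:
            # Length of the time interval shared by the segment and the turn.
            shared = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if shared > 0:
                overlap[turn["speaker"]] += shared
        speaker = max(overlap, key=overlap.get) if overlap else None
        labelled.append({**seg, "speaker": speaker})
    return labelled


# Toy transcript segments and diarization turns (made-up values).
segments = [
    {"start": 0.0, "end": 2.0, "text": "Hello there."},
    {"start": 2.0, "end": 5.0, "text": "Hi, how are you?"},
    {"start": 9.0, "end": 10.0, "text": "Fine."},
]
turns = [
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
    {"start": 2.4, "end": 6.0, "speaker": "SPEAKER_01"},
]

labelled = assign_speakers(segments, turns)
for seg in labelled:
    print(seg["speaker"], seg["text"])
```

With a scheme like this, the transcript text is unaffected by diarization quality: if the diarizer merges two similar voices into one turn, every segment inside that turn gets the same label even though the words and timestamps are correct.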

mirix commented 1 year ago

I have made some (modest) progress on this, if anyone wishes to have a look:

https://github.com/mirix/approaches-to-diarisation/tree/main

mirix commented 1 year ago

I have updated the repo with samples that make it possible to compare the standard WhisperX procedure with mine.

mirix commented 1 year ago

My current approach:

https://github.com/mirix/approaches-to-diarisation

jun297 commented 12 months ago

Did you provide min_speakers and max_speakers for the diarization?

mirix commented 11 months ago

I did, but the idea is precisely not to have to do it. With my new approach it should not be necessary.