mirix opened 1 year ago
I have made some (modest) progress on this, if anyone wishes to have a look:
https://github.com/mirix/approaches-to-diarisation/tree/main
I have updated the repo to provide samples that let one compare the standard WhisperX procedure with mine.
My current approach:
Did you provide min_speaker and max_speaker for the diarization?
I did, but the idea is precisely not to have to do it. With my new approach it should not be necessary.
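For context, the min/max speaker flags are optional on the WhisperX command line; when they are omitted, the underlying pyannote diarization pipeline estimates the number of speakers on its own. A sketch of the invocation without those constraints, assuming the same file, token, and alignment model as the full command quoted later in the thread:

```shell
# Same run, but without --min_speakers/--max_speakers, so the
# diarization pipeline estimates the speaker count itself.
# Filename and "hf_MY_TOKEN" are placeholders from this thread.
whisperx Michael_Jim_Dwight_epic_scene_qHrN5Mf5sgo.wav \
    --hf_token "hf_MY_TOKEN" --diarize --language en \
    --compute_type float32 \
    --align_model "jonatasgrosman/wav2vec2-large-xlsr-53-english"
```

Whether the unconstrained estimate is reliable on clips like these is exactly what is in question here.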
I have started to play with whisperX for diarization and it looks very promising.
Except for one little issue: labelling the speakers.
Are there any parameters that one could tune in order to improve speaker recognition?
All the rest, both transcription and alignment, is almost flawless so far.
I have downloaded three videos and converted them to 16 kHz WAV files, then ran:
whisperx Michael_Jim_Dwight_epic_scene_qHrN5Mf5sgo.wav --hf_token "hf_MY_TOKEN" --diarize --language en --compute_type 'float32' --min_speakers 2 --max_speakers 3 --align_model "jonatasgrosman/wav2vec2-large-xlsr-53-english"
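For reference, the 16 kHz conversion step can be done with ffmpeg; a minimal sketch, assuming the download is an .mp4 (the input filename here is just an example):

```shell
# Convert a downloaded video to a 16 kHz mono PCM WAV,
# the format WhisperX expects as input.
ffmpeg -i Michael_Jim_Dwight_epic_scene_qHrN5Mf5sgo.mp4 \
    -ar 16000 -ac 1 -c:a pcm_s16le \
    Michael_Jim_Dwight_epic_scene_qHrN5Mf5sgo.wav
```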
Here it works very well:
https://www.youtube.com/watch?v=Fyb2AiF1feI
But here it is a disaster:
https://www.youtube.com/watch?v=DxxAwDHgQhE
https://www.youtube.com/watch?v=qHrN5Mf5sgo
In the first one there are only two speakers, a man and a woman, so it seems like an easy scenario.
The second one has three characters with very different pitches and accents.
The third one has three males with similar pitches and accents.