m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Train/Recognize speaker? #371

Open sam1am opened 1 year ago

sam1am commented 1 year ago

Given previously recorded and recognized speaker embeddings used for diarization, it seems like it would be possible to match any new voice to a previously recorded database of known voices with associated speakers/users/names/ids. Is there a way to do this currently?

Arche151 commented 1 year ago

This would be amazing!

I tried using WhisperX to transcribe a podcast with two speakers, but the diarization was really bad. I also thought: since I mainly want to use WhisperX to transcribe the same podcast with the same speakers, it would be great to be able to "teach" WhisperX these specific speakers so it can identify them, instead of starting from scratch for every single transcription.

caryknoop commented 1 year ago

pyannote-audio 3.0.x with the embeddings functionality should be able to do what you want: build a voiceprint database, extract the embedding for each new speaker segment, and find the best cosine-similarity match.

Short of waiting for WhisperX 4.0, is anyone taking a stab at exposing the embeddings in WhisperX so that we can start testing this in version 3?