m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Is it possible to include word-level confidence? #141

Open HalukMaestra opened 1 year ago

HalukMaestra commented 1 year ago

First of all, I would like to thank you for all the great work done on this application. It's a joy to use and way better than vanilla Whisper. One question I have: is it possible to include word-level confidence scores inside result_aligned["word_segments"]? Obtaining ["segments"] and parsing it to get to the score is a tedious process for me, since I have no need for any data except what is in word_segments. I was just wondering if this is viable to do?
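For reference, the parsing step described above could look roughly like the sketch below. It assumes each entry in ["segments"] carries a "words" list whose items have "word", "start", "end" and "score" keys; that layout is an assumption and may differ between whisperX versions. The helper name flatten_word_scores is hypothetical.

```python
from typing import Any, Dict, List


def flatten_word_scores(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect one flat row per word from an aligned result dict.

    Assumed layout (not confirmed for every whisperX version):
    result["segments"][i]["words"][j] -> {"word", "start", "end", "score"}
    """
    rows = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            rows.append(
                {
                    "word": word.get("word"),
                    "start": word.get("start"),
                    "end": word.get("end"),
                    # Missing scores come back as None instead of raising.
                    "score": word.get("score"),
                }
            )
    return rows
```

Words without a score (for example, tokens the aligner could not place) show up with score set to None, so they can be filtered out afterwards.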

cristiantg commented 1 year ago

Yes please, it would be nice, in a similar way to whisper_timestamped.

GBurg commented 1 year ago

OK, I did some digging and have some insights to share:

faster_whisper (which whisperx uses) can give word alignment probabilities via whisperx.load_model(model, device, language=lang, asr_options={"word_timestamps": True,}). However, since whisperx uses its own alignment, those probabilities are not relevant to its output.
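To see what those probabilities look like, here is a minimal sketch that calls faster_whisper directly rather than going through whisperx. It relies on faster_whisper's WhisperModel.transcribe(..., word_timestamps=True), whose segments expose a words list with word, start, end and probability fields. The helper names, the "small" model size and the CPU/int8 settings are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def collect_word_probabilities(segments: Iterable) -> List[Tuple[str, float, float, float]]:
    """Return (word, start, end, probability) for every word in the segments."""
    rows = []
    for segment in segments:
        # segment.words is None unless word_timestamps=True was requested.
        for word in segment.words or []:
            rows.append((word.word.strip(), word.start, word.end, word.probability))
    return rows


def transcribe_with_probabilities(audio_path: str, model_size: str = "small"):
    """Hypothetical wrapper: run faster_whisper and return per-word probabilities."""
    # Imported lazily so the helper above works without faster_whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return collect_word_probabilities(segments)
```

Note that these timestamps and probabilities come from Whisper's own word timing, not from the whisperx alignment step, so they will not line up exactly with result_aligned.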

So there are 2 paths to go:

1) get a 'forced alignment' prediction from wav2vec2.0 (or another alignment method), which is a reasonable way to get a probability score for the transcription

2) dig deeper into faster_whisper and see where and how the real Whisper probability scores were determined
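For path 1, one common way to turn forced-alignment output into a word-level score is to average the per-character scores, weighted by how many frames each character spans. The sketch below assumes the aligner has already produced per-character spans with a mean frame probability each; CharSpan, WordSpan and merge_chars_into_words are hypothetical names, and this is not necessarily how whisperx computes its scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharSpan:
    char: str     # aligned character; "|" is assumed to mark a word boundary
    start: int    # first frame (inclusive)
    end: int      # last frame (exclusive)
    score: float  # mean frame probability for this character


@dataclass
class WordSpan:
    word: str
    start: int
    end: int
    score: float


def merge_chars_into_words(chars: List[CharSpan], separator: str = "|") -> List[WordSpan]:
    """Group character spans into words with a duration-weighted mean score."""
    words: List[WordSpan] = []
    current: List[CharSpan] = []
    # A trailing separator acts as a sentinel that flushes the last word.
    for span in list(chars) + [CharSpan(separator, 0, 0, 0.0)]:
        if span.char != separator:
            current.append(span)
            continue
        if current:
            frames = sum(c.end - c.start for c in current)
            if frames > 0:
                score = sum(c.score * (c.end - c.start) for c in current) / frames
            else:
                # Fall back to a plain mean if every span has zero length.
                score = sum(c.score for c in current) / len(current)
            text = "".join(c.char for c in current)
            words.append(WordSpan(text, current[0].start, current[-1].end, score))
            current = []
    return words
```

Weighting by frame count keeps a long, confidently aligned vowel from being outvoted by several very short, uncertain consonants; a plain mean or a minimum over characters are alternative choices with different behaviour.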