linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Sentences that definitely should be getting picked up are not #118

Closed Thomasssb1 closed 1 year ago

Thomasssb1 commented 1 year ago

When running whisper_timestamped on the command line, I pass an mp3 file of around 25 seconds and also give the exact text that is spoken in the mp3 file through --initial_prompt.
Yet the output is either nothing or only the last sentence in the mp3 file.
I also tried once giving only the first sentence, and then it correctly picked up the sentences after it but not the one I gave (?).
I have no idea what is going on.

Jeronymous commented 1 year ago

--initial_prompt is not designed to give the expected transcription of the file, but rather the content that comes before (what was said before the extract) or some text that gives the style in which we want to transcribe (e.g. no punctuation, disfluencies, ...).

If you give what is said in the audio, it's not surprising that the model is "puzzled" and assumes that it has to transcribe what comes after that prompt.
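To illustrate the intended use, here is a minimal sketch that builds a command line where --initial_prompt carries the text spoken before the extract, not the transcript of the extract itself. The file name, the prompt text and the `build_command` helper are hypothetical; only the `whisper_timestamped` command and the `--initial_prompt` option come from this thread.

```python
import shlex


def build_command(audio_path, preceding_text=None):
    """Build a whisper_timestamped command line (hypothetical helper).

    preceding_text should be what was said BEFORE the audio extract,
    or a short sample of the desired transcription style. It should
    not be the expected transcript of the extract.
    """
    cmd = ["whisper_timestamped", audio_path]
    if preceding_text:
        cmd += ["--initial_prompt", preceding_text]
    return cmd


# Hypothetical file name and context text, for illustration only.
cmd = build_command(
    "extract.mp3",
    "Text that was spoken just before the extract starts.",
)
print(shlex.join(cmd))
```

If nothing precedes the extract and no particular style is needed, the simplest option is to leave the prompt out entirely.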

The feature you might want to request here is aligning a given transcription to the audio signal. As Whisper is not the most efficient model for that, this feature was not implemented in this repository. It could be, but it's not a priority given that wav2vec models are more convenient and efficient for aligning a transcription with an audio signal.
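For readers unfamiliar with what "alignment" means here, the following is a toy sketch of the idea, not the API of wav2vec or of any library. It assumes an acoustic model has already produced per-frame log-probabilities over labels (with label 0 as "blank"), and it uses a simplified Viterbi-style search in which each transcript token is emitted on exactly one frame and every other frame is treated as blank. All names and the synthetic numbers are assumptions made for illustration.

```python
import math


def forced_align(log_probs, tokens, blank=0):
    """Return the frame index at which each known token is emitted.

    log_probs: list of frames, each a list of per-label log-probabilities.
    tokens: label indices of the known transcript, in order.
    Simplification: one frame per token, all other frames are blank.
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    neg_inf = -math.inf
    # score[t][j]: best log-score after t frames with j tokens emitted.
    score = [[neg_inf] * (n_tokens + 1) for _ in range(n_frames + 1)]
    # emitted[t][j]: True if the best path emits token j-1 at frame t-1.
    emitted = [[False] * (n_tokens + 1) for _ in range(n_frames + 1)]
    score[0][0] = 0.0
    for t in range(1, n_frames + 1):
        frame = log_probs[t - 1]
        for j in range(min(t, n_tokens) + 1):
            stay = score[t - 1][j] + frame[blank]
            move = neg_inf
            if j > 0:
                move = score[t - 1][j - 1] + frame[tokens[j - 1]]
            if move > stay:
                score[t][j], emitted[t][j] = move, True
            else:
                score[t][j] = stay
    if score[n_frames][n_tokens] == neg_inf:
        raise ValueError("transcript cannot be aligned to this audio")
    # Walk back from the end to recover where each token was emitted.
    frames, j = [], n_tokens
    for t in range(n_frames, 0, -1):
        if emitted[t][j]:
            frames.append(t - 1)
            j -= 1
    frames.reverse()
    return frames


# Synthetic example: labels are 0 = blank, 1 = "a", 2 = "b".
probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
log_probs = [[math.log(p) for p in frame] for frame in probs]
alignment = forced_align(log_probs, tokens=[1, 2])
print(alignment)  # -> [1, 3]
```

Multiplying each frame index by the model's frame duration would give timestamps in seconds. A real aligner also lets a token span several frames, which this sketch leaves out.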