Sentences that definitely should be getting picked up are not

linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence

GNU Affero General Public License v3.0


--initial_prompt is not designed to hold the expected transcription of the file. It is meant to hold the content that comes before the extract (what was said just before it), or some text that sets the style in which we want to transcribe (e.g. no punctuation, disfluencies, ...).

If you give it what is said in the audio, it's not surprising that the model is "puzzled" and assumes that it has to transcribe what comes after that prompt.
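To illustrate the intended use, here is a minimal sketch that builds the prompt from the text preceding the extract rather than from the extract itself. The helper names (build_initial_prompt, transcribe_extract) and the 200-character budget are assumptions made for this example, not part of the library; the whisper_timestamped import is done lazily so the helper can be tried without the package installed.

```python
def build_initial_prompt(previous_text, max_chars=200):
    """Return the tail of the transcript that PRECEDES the extract.

    max_chars is an arbitrary budget chosen for this sketch: Whisper only
    keeps the end of a long prompt, so the most recent context matters most.
    """
    # Collapse whitespace and newlines into single spaces.
    text = " ".join(previous_text.split())
    if len(text) <= max_chars:
        return text
    tail = text[-max_chars:]
    # Drop the first (possibly partial) word of the truncated tail.
    first_space = tail.find(" ")
    return tail[first_space + 1:] if first_space != -1 else tail


def transcribe_extract(audio_path, previous_text, model_name="tiny"):
    """Transcribe an extract, prompting with what was said before it."""
    # Imported here so build_initial_prompt works without the package.
    import whisper_timestamped as whisper

    model = whisper.load_model(model_name, device="cpu")
    audio = whisper.load_audio(audio_path)
    return whisper.transcribe(
        model,
        audio,
        initial_prompt=build_initial_prompt(previous_text),
    )
```

If there is no preceding text, leaving the prompt empty is probably safer than passing the expected transcription of the extract.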

The feature you might want to request here is aligning a given transcription to the audio signal. As Whisper is not the most efficient model for that, this feature was not implemented in this repository. It could be, but it's not a priority, given that wav2vec models are more convenient and efficient for aligning a transcription with an audio signal.
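For readers unfamiliar with what "alignment" means here, the following is a toy sketch of the idea, not the method used by any particular wav2vec tool. It assumes we already have frame-wise log-probabilities over a small vocabulary (hypothetical values in the usage note) and a known token sequence, and it finds the best monotonic assignment of frames to tokens by dynamic programming. Real CTC-based aligners also handle a blank token, which is left out here for brevity.

```python
import math


def force_align(log_probs, tokens):
    """Align a known token sequence to frames (simplified, no CTC blank).

    log_probs[t][v] is the log-probability of vocabulary item v at frame t.
    Returns one (first_frame, last_frame) pair per token, both inclusive.
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    if n_tokens == 0 or n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    # back[t][j] is 1 if the best path reached token j at frame t by
    # advancing from token j - 1, and 0 if it stayed on token j.
    back = [[0] * n_tokens for _ in range(n_frames)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, n_frames):
        for j in range(min(t + 1, n_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best, back[t][j] = move, 1
            else:
                best, back[t][j] = stay, 0
            score[t][j] = best + log_probs[t][tokens[j]]

    # Backtrack from the last token at the last frame.
    frame_to_token = [0] * n_frames
    j = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        frame_to_token[t] = j
        if t > 0 and back[t][j] == 1:
            j -= 1

    spans = []
    for j in range(n_tokens):
        frames = [t for t, k in enumerate(frame_to_token) if k == j]
        spans.append((frames[0], frames[-1]))
    return spans
```

With made-up emissions for four frames and two tokens:

```python
emissions = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.8), math.log(0.2)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.1), math.log(0.9)],
]
print(force_align(emissions, [0, 1]))  # [(0, 1), (2, 3)]
```

Multiplying the frame indices by the acoustic model's frame duration would turn these spans into timestamps.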


Sentences that definitely should be getting picked up are not #118