m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Hallucination causes failure to align - uncleaned input in whisper dataset #230

Open werkamsus opened 1 year ago

werkamsus commented 1 year ago

Hey all!

I'm transcribing a 90-minute German audio file, and Whisper hallucinates the following patterns: "Untertitel der Amara.org-Community" and "Untertitel im Auftrag des ZDF für funk, 2017".

This causes the following error: Failed to align segment (" Untertitel der Amara.org-Community"): backtrack failed, resorting to original...

Here's a GitHub issue on Whisper that identifies this and more patterns.

Any idea how to fix this? I tried a couple of initial prompts and researched token suppression, but found no fix so far.

Would be awesome to make the alignment more robust to just skip segments it cannot align.

Thanks for putting this awesome library together 🙏

Best,

Nick

m-bain commented 1 year ago

Yes, this seems to be a common problem for non-English audio.

You could try prompting the model with: "Untertitel der Amara.org-Community" for every segment in the batch.

> Would be awesome to make the alignment more robust to just skip segments it cannot align.

Failed alignment doesn't always mean hallucination, so I keep those segments in the transcript. You could drop them by filtering out segments that have an empty "words": [] entry.
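A minimal sketch of that filtering step, assuming the aligned result is a dict with a "segments" list in which each segment carries a "words" list (empty when alignment failed). The helper name `drop_unaligned_segments` and the sample data are hypothetical and only illustrate the idea; they are not part of whisperX.

```python
from typing import Any


def drop_unaligned_segments(result: dict[str, Any]) -> dict[str, Any]:
    """Return a shallow copy of `result` without segments that have no aligned words.

    Assumption: a segment that failed to align has an empty (or missing)
    "words" list, as described in the comment above.
    """
    kept = [seg for seg in result.get("segments", []) if seg.get("words")]
    # Copy the top-level dict so the caller's original result is left untouched.
    return {**result, "segments": kept}


# Hypothetical sample mimicking an aligned transcript with one failed segment.
sample = {
    "segments": [
        {
            "start": 0.0,
            "end": 2.1,
            "text": " Guten Morgen.",
            "words": [
                {"word": "Guten", "start": 0.0, "end": 0.8},
                {"word": "Morgen.", "start": 0.9, "end": 2.1},
            ],
        },
        {
            "start": 2.1,
            "end": 4.0,
            "text": " Untertitel der Amara.org-Community",
            "words": [],
        },
    ]
}

cleaned = drop_unaligned_segments(sample)
print([seg["text"] for seg in cleaned["segments"]])
```

Since a failed alignment is not always a hallucination, this filter may also drop genuine speech that simply could not be aligned, so it is worth spot-checking the removed segments before discarding them.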