m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Model loses segment of audio, offsetting ensuing transcriptions #402

Open MyriadRivers opened 1 year ago

MyriadRivers commented 1 year ago

Hello!

I've been using the WhisperX large-v2 model in English on a project to transcribe vocals taken from songs, which I extract via source separation with Spleeter. If it matters, I've been running WhisperX in a Google Colab notebook, though I also hit this problem running the demo on Replicate with my audio samples.

Problem

Working with some audio, I encountered a situation where WhisperX transcribes some lines with correct timestamps, fails to detect some lines, and then transcribes the rest. However, a few of the transcribed lines after this gap have incorrect timestamps, starting right after the previous lines as if the time spanned by the gap did not exist. After a couple of lines, the timestamps resync and are correct again.

Steps to Reproduce

vocals.webm is the offending audio, originally a .wav file. The transcription fails from around 0:46 to 0:55. It resumes transcribing lines at 0:56, but places the timestamps of those lines starting at 0:46. At around 1:08 the transcription resyncs.

[Screenshot: transcription output from the Replicate demo]

"So can we let sleeping dogs lie" is marked as starting around 0:46 in this transcription from the Replicate demo, but is clearly heard at around ~0:56 of the vocal audio file.

I also tried running the transcription on only the offending segment.webm of audio; this too drops the first lines and offsets the timestamps.

The problem should be reproducible by running the large-v2 English model with the default batch size (16) and compute type (float16) to transcribe the attached audio files with word-level timestamps.
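
A minimal reproduction sketch, following the usage pattern from the WhisperX README (the device string and file path are placeholders for the attached audio):

```python
import whisperx

device = "cuda"            # or "cpu"
audio_file = "vocals.wav"  # the attached audio

# Load large-v2 with the defaults described above.
model = whisperx.load_model("large-v2", device, compute_type="float16", language="en")

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Forced alignment for word-level timestamps.
align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for seg in result["segments"]:
    print(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text']}")
```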

Attempts to Resolve

Running OpenAI's original large-v2 English Whisper model does capture audio from the offending segment. Changing the ASR or VAD options, such as raising no_speech_threshold or lowering vad_onset and vad_offset, has not helped either.
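
For reference, this is roughly how those options get passed to whisperx.load_model; the specific values here are just examples of the adjustments I tried, not recommendations:

```python
import whisperx

# Illustrative values only; these did not recover the dropped segment for me.
model = whisperx.load_model(
    "large-v2",
    "cuda",
    compute_type="float16",
    language="en",
    asr_options={"no_speech_threshold": 0.8},           # raised from the 0.6 default
    vad_options={"vad_onset": 0.4, "vad_offset": 0.3},  # lowered from the defaults
)
```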

Has anybody else experienced a similar issue, or can anyone shed light on why this might be happening and what I can do to work around it?

Besides this, WhisperX has been great and easy to use. Thanks for such a wonderful model!

crisprin17 commented 11 months ago

@MyriadRivers Did you ever solve this issue? I am having the same problem.

MyriadRivers commented 11 months ago

Unfortunately, no. It's been a while, but a possible workaround: run both WhisperX and the original Whisper on the audio file, match up the detected words (e.g. via a longest common subsequence of words between the two transcripts), and compare WhisperX's word-level timestamps against Whisper's phrase-level timestamps, falling back to Whisper's transcription and phrase-level timestamps wherever WhisperX's timestamps differ too much or words are missing. You could then heuristically estimate the remaining word-level timestamps by interpolating within each phrase by the number of syllables in each word; see the sketch below.

This is basically what I did for my use case, except instead of the original Whisper I had a different source that provided a known transcription of the words, but only sentence-level timestamps. It could perhaps work, to a lesser extent, using the two Whispers.
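
A minimal sketch of that matching-and-fallback idea, assuming both outputs have been flattened into word/segment dicts as shown; the TOLERANCE threshold and the word normalization are hypothetical choices, not anything WhisperX provides:

```python
import difflib

TOLERANCE = 2.0  # hypothetical: max allowed timestamp disagreement, in seconds

def merge_transcripts(wx_words, w_segments):
    """Merge WhisperX word timings with original-Whisper segment timings.

    wx_words:   [{"word": str, "start": float, "end": float}, ...] from WhisperX
    w_segments: [{"text": str, "start": float, "end": float}, ...] from Whisper
    """
    # Flatten Whisper's segments into words, each carrying its segment's span.
    w_words = []
    for seg in w_segments:
        for tok in seg["text"].split():
            w_words.append({"word": tok, "start": seg["start"], "end": seg["end"]})

    def norm(w):
        return w.lower().strip(".,!?\"'")

    matcher = difflib.SequenceMatcher(
        a=[norm(w["word"]) for w in wx_words],
        b=[norm(w["word"]) for w in w_words],
        autojunk=False,
    )

    merged = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(a1 - a0):
                wx, w = wx_words[a0 + i], w_words[b0 + i]
                # Trust WhisperX when its word start lands inside (or near)
                # the Whisper segment containing the same word.
                ok = (w["start"] - TOLERANCE) <= wx["start"] <= (w["end"] + TOLERANCE)
                merged.append(wx if ok else w)
        elif tag == "delete":
            merged.extend(wx_words[a0:a1])  # words only WhisperX heard: keep as-is
        else:  # "insert" or "replace": words WhisperX dropped or misheard
            # Fall back to Whisper's words with their coarse segment-level timing.
            # (Interpolating within the segment by syllable count could refine these.)
            merged.extend(w_words[b0:b1])
    return merged
```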

haiderasad commented 9 months ago

@m-bain is there a fix for this? The timestamps are missing the actual silence that occurred between dialogues.

heilrahc commented 2 months ago

I encountered the same issue; the timestamp alignment started to drift out of sync after a while.