nyrahealth / CrisperWhisper

Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

IndexError: list index out of range #4

Closed TechInterMezzo closed 2 months ago

TechInterMezzo commented 2 months ago

When I try to use the model with your python code for inference I get the following error:

  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 284, in __call__
    return super().__call__(inputs, **kwargs)
  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\pipelines\base.py", line 1246, in __call__
    return next(
  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 587, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\models\whisper\tokenization_whisper_fast.py", line 562, in _decode_asr
    return _decode_asr(
  File "G:\Python\CrisperWhisper\.venv\lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1052, in _decode_asr
    start_time = round(token_timestamps[i] + time_offset, 2)
IndexError: list index out of range

That was under Windows 10, but I also get the same error under Ubuntu 24.04. Since you didn't specify versions for the dependencies, maybe an update broke the tokenization for your model. Can you please give me the dependency versions that work for you?

TechInterMezzo commented 2 months ago

The code throwing the error is only executed when return_timestamps='word' is set, so without word timestamps it works for me. Of course, this doesn't explain the bug.
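For reference, a minimal sketch of the failing call, assuming the standard HF pipeline API and the nyrahealth/CrisperWhisper checkpoint (the audio file name is a placeholder):

import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="nyrahealth/CrisperWhisper",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Works fine without word-level timestamps ...
print(asr("sample.wav"))

# ... but raises the IndexError in _decode_asr when they are requested.
print(asr("sample.wav", return_timestamps="word"))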

LaurinmyReha commented 2 months ago

If you can run it with another return_timestamps configuration, I think your setup is fine. I have sometimes seen this when using a different language tag than that of the speaker in the audio sample you are trying to transcribe. Please also try installing our custom transformers fork, which improves some aspects of the DTW alignment running in the background. You can install it with:

pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper

Also, the model is only finetuned for English and German, so languages outside of these two will most likely not work well. Let me know if this solves your problem.
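In case it helps, a hedged sketch of pinning the language tag explicitly (reusing the asr pipeline from the sketch above; passing generate_kwargs at call time is standard transformers, but the file name and tag values here are illustrative):

# Pass the speaker's language explicitly so the tag cannot mismatch.
result = asr(
    "german_sample.wav",                      # placeholder file name
    return_timestamps="word",
    generate_kwargs={"language": "german"},   # or "english"
)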

TechInterMezzo commented 2 months ago

I am using German speech recordings, so that is not the issue. It now works with your fork. Maybe it would also have worked with an older transformers version, but I haven't tested that yet. Do your improvements to the DTW alignment also reduce GPU memory usage? Besides the word timestamps not working with the original transformers package, I had to choose a smaller batch size than with your fork.

And one side question: is your model only trained for word-level timestamps? When I try to use the general segment timestamps, I get "Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation." and the returned intervals are (0.00, None).

LaurinmyReha commented 2 months ago

I am glad this resolved your issue. I do not believe any of the DTW modifications have an impact on the GPU memory footprint of our model.

To your question: the Whisper model is originally trained to predict "segment timestamps". During finetuning we no longer train on these segment timestamps, so the model loses its ability to predict them and stops doing so consistently. I actually should add these timestamp tokens to the suppress token ids in the generation_config.json on the Huggingface repo, since we do not need them anymore. So the "segment timestamps" are simply not predicted anymore after finetuning. However, you should not need them either, since the word-level timestamps should be more precise anyway.
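A rough sketch of what that change could look like, assuming the standard Whisper layout where the timestamp tokens directly follow <|notimestamps|> (the token count and ids are assumptions, not verified against this checkpoint):

from transformers import GenerationConfig

gen = GenerationConfig.from_pretrained("nyrahealth/CrisperWhisper")

# First timestamp token (<|0.00|>) sits right after <|notimestamps|>.
first_ts = gen.no_timestamps_token_id + 1
# Standard Whisper uses 1501 timestamp tokens, <|0.00|> .. <|30.00|> in 0.02 s steps.
ts_ids = list(range(first_ts, first_ts + 1501))

gen.suppress_tokens = sorted(set((gen.suppress_tokens or []) + ts_ids))
gen.save_pretrained("./crisperwhisper_no_segment_ts")  # writes generation_config.json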

These segment timestamps are, however, used in Whisper's original longform transcription algorithm, but transformers uses a longform algorithm that resembles the CTC longform logic described here: https://huggingface.co/blog/asr-chunking

Therefore you can ignore this warning. I will look into the longform transcription algorithm used in transformers in detail soon; I think having these improved timestamps should enable some improvements there as well.
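Concretely, that chunked longform path is just the standard pipeline call (reusing asr from the sketch above; the chunk length and batch size are illustrative values, not a recommendation):

result = asr(
    "long_recording.wav",       # placeholder file name
    chunk_length_s=30,          # split audio into overlapping 30 s chunks
    batch_size=8,               # transcribe chunks in parallel
    return_timestamps="word",
)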

Side note: another way to get word-level timestamps would be to predict timestamp tokens before and after each word. By doing this, one would not necessarily need the retokenization proposed in our paper. The researchers on the Qwen team (https://arxiv.org/pdf/2311.07919) even found this to improve transcription performance. However, it would slow down inference a bit.
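Purely as an illustration of that alternative, a training target for the bracketed-timestamp scheme might look like this (the words and boundary times are made up):

# Each word is wrapped in a start and end timestamp token,
# Qwen-Audio style, instead of aligning via retokenization + DTW.
words = [("Hello", 0.52, 0.94), ("world", 1.02, 1.40)]
target = "".join(f"<|{s:.2f}|>{w}<|{e:.2f}|>" for w, s, e in words)
print(target)  # <|0.52|>Hello<|0.94|><|1.02|>world<|1.40|>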

Let me know if you need further clarification.