nyrahealth / CrisperWhisper

Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

IndexError in Whisper model: Index out of bounds during token timestamp extraction #12

Open GrahLnn opened 2 days ago

GrahLnn commented 2 days ago

I tried to transcribe an hour-long audio file, but I got this error. A two-minute test produced good results, so I wanted to try the longer audio. Is there any way to fix this? Thank you.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


def transcribe_audio(file_path):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"{device=}")
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "nyrahealth/CrisperWhisper"

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)

    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,   # process audio in 30 s windows
        stride_length_s=4,   # overlap between adjacent windows
        batch_size=1,
        return_timestamps="word",
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe(file_path)
    return result

and the error:

Traceback (most recent call last):
  File "C:\Users\grahlnn\test\CrisperWhisper.py", line 71, in <module>
    res = transcribe_audio(
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\CrisperWhisper.py", line 66, in transcribe_audio
    result = pipe(file_path)
             ^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 283, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1294, in __call__
    return next(
           ^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1209, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 515, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 684, in generate
    ) = self.generate_with_fallback(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 862, in generate_with_fallback
    seek_sequences, seek_outputs = self._postprocess_outputs(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 963, in _postprocess_outputs
    seek_outputs["token_timestamps"] = self._extract_token_timestamps(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 221, in _extract_token_timestamps
    [
  File "C:\Users\grahlnn\test\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 222, in <listcomp>
    torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])
                       ~~~~~~~^^^^^^^^^^^^
IndexError: index 447 is out of bounds for dimension 2 with size 447
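For context (my reading, not stated in the thread): the failing line slices the cross-attention weights tensor at token position `i`, and the message means `i` has run exactly one past the size of the token dimension. A minimal sketch reproducing the same class of error with a dummy tensor:

```python
import torch

# Dummy stand-in for the attention-weights tensor in generation_whisper.py;
# the token dimension (dim 2) has size 447, matching the error message.
weights = torch.zeros(1, 1, 447, 10)

last_valid = weights[:, :, 446, :]  # last valid token index works

try:
    _ = weights[:, :, 447, :]       # one past the end -> IndexError
    raised = False
except IndexError:
    raised = True
```

This is why the error appears only on long audio: the long-form generation loop produces more decoded tokens than the collected attention weights cover, so the timestamp-extraction index eventually overruns the tensor.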
LaurinmyReha commented 2 days ago

Hey,

The longform logic is something we will work on next, since the transformers implementation is not ideal for our model.

However, as a quick fix, you can try installing our custom fork and see if that fixes your problem: pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper

If that does not do it, let me know and we can look into it further together.

Best,

Laurin

GrahLnn commented 1 day ago

Thank you for your help. Now there is a new error:

Traceback (most recent call last):
  File "C:\Users\grahl\criwhisper\test.py", line 76, in <module>
    res = transcribe_audio(
          ^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\test.py", line 71, in transcribe_audio
    result = pipe(file_path)
             ^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 624, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\grahl\criwhisper\.venv\Lib\site-packages\transformers\models\whisper\generation_whisper.py", line 316, in _extract_token_timestamps
    timestamps[batch_idx, 1:] = torch.tensor(jump_times)
    ~~~~~~~~~~^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (4) must match the existing size (5) at non-singleton dimension 0.  Target sizes: [4].  Tensor sizes: [5]
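Until the longform path is fixed, one possible interim workaround (a sketch, not from this thread) is to slice the audio into fixed-length windows yourself, run each window through the pipeline separately, and shift the resulting word timestamps into global time. The helpers below only handle the splitting and offsetting; running each window through `pipe` and reading its `chunks` output is assumed to work as in the short-audio case.

```python
def split_into_segments(num_samples, sample_rate, segment_s=30.0):
    """Return (start_sample, end_sample, offset_s) windows covering the audio."""
    seg_len = int(segment_s * sample_rate)
    segments = []
    start = 0
    while start < num_samples:
        end = min(start + seg_len, num_samples)
        segments.append((start, end, start / sample_rate))
        start = end
    return segments


def offset_chunks(chunks, offset_s):
    """Shift one segment's word-level timestamps by its global offset."""
    return [
        {"text": c["text"],
         "timestamp": (c["timestamp"][0] + offset_s,
                       c["timestamp"][1] + offset_s)}
        for c in chunks
    ]
```

With `return_timestamps="word"`, the pipeline result carries word entries of the form `{"text": ..., "timestamp": (start, end)}` under the `chunks` key, so each segment's entries can be passed through `offset_chunks` with that segment's `offset_s` and concatenated. Words spanning a window boundary may be cut, so this trades some boundary accuracy for avoiding the longform code path entirely.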