linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Fatal Error: Got inconsistent text for segment 10 #47

Closed maptz closed 1 year ago

maptz commented 1 year ago

Hi,

I was trying this out for the first time and ran into the following error when using CUDA:

Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
Got start time outside of audio boundary
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36000/36000 [00:27<00:00, 1287.63frames/s]
Traceback (most recent call last):
  File "E:\test.py", line 5, in <module>
    result = whisper.transcribe(model, audio)
  File "C:\Users\Stephen\AppData\Local\Programs\Python\Python310\lib\site-packages\whisper_timestamped\transcribe.py", line 226, in transcribe_timestamped
    (transcription, words) = _transcribe_timestamped_efficient(model, audio, trust_whisper_timestamps=trust_whisper_timestamps, **alignment_options, **whisper_options, **other_options)
  File "C:\Users\Stephen\AppData\Local\Programs\Python\Python310\lib\site-packages\whisper_timestamped\transcribe.py", line 797, in _transcribe_timestamped_efficient
    assert len(timestamped_tokens) < len(whisper_tokens) and timestamped_tokens == whisper_tokens[:len(timestamped_tokens)], \
AssertionError: Fatal Error: Got inconsistent text for segment 10:

My script (test.py) is:

import whisper_timestamped as whisper
audio = whisper.load_audio("o.mp3")

model = whisper.load_model("tiny", device="cuda") 
result = whisper.transcribe(model, audio)

This works correctly when the device is specified as "cpu".

Any hints as to how I can solve this?

Jeronymous commented 1 year ago

Oh dear, that's an unfortunate first experience. Could you by any chance share the mp3 audio that's causing this failure?

Jeronymous commented 1 year ago

Actually, it could be a problem introduced by the latest version of whisper. I'm investigating such a problem with whisper version 20230306. What does the following give for you?

pip freeze | grep whisper

maptz commented 1 year ago

Hi @Jeronymous,

I'm afraid I'm not able to upload the audio file (licensing issues), but I'll try to find another file that reproduces the issue when using the CUDA device.

FYI, I'm on Windows, so the command above doesn't work verbatim, but I think the information you're after is:

openai-whisper @ git+https://github.com/openai/whisper.git@b80bcf610d89960bc658b61af9c333fc6d978d78
whisper-timestamped==1.10.1
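
For anyone else without grep: the same information can probably be obtained portably from Python itself. This is only a minimal sketch using the standard library's importlib.metadata; the helper name find_packages is made up for illustration, and the output format is similar to pip freeze but not identical for git installs.

```python
from importlib import metadata


def find_packages(substring):
    """Return sorted (name, version) pairs for installed distributions
    whose name contains `substring` (case-insensitive).

    `find_packages` is a hypothetical helper, not part of any library.
    """
    needle = substring.lower()
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing or broken metadata.
        if name and needle in name.lower():
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    # Rough equivalent of: pip freeze | grep whisper
    for name, version in find_packages("whisper"):
        print(f"{name}=={version}")
```

Run it with the same Python interpreter that runs the failing script, so that it inspects the same site-packages directory.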

BTW, I thought this was restricted to CUDA processing, but trying it with another audio file, it seems that the issue occurs with some files in cpu mode too.

Is there a version of whisper that you think works correctly?

Jeronymous commented 1 year ago

Indeed, it is a matter of using version 20230306. You can update whisper-timestamped; it should be fixed now. For the time being, I also recommend using

openai-whisper==20230124

(because version 20230306 of whisper is very recent and might still include bugs; see https://github.com/openai/whisper/discussions/1058. The whisper repo is also changing a lot right now.)
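
To catch this before transcribing, a script could check which whisper release is installed. The following is only a sketch under assumptions: it treats 20230306 as the suspect release purely on the basis of this thread, it assumes the distribution is registered under the name openai-whisper (as in the pip freeze output above), and the helper names are hypothetical.

```python
import warnings
from importlib import metadata

# Assumption taken from this thread: this release triggered the assertion.
SUSPECT_VERSION = "20230306"


def whisper_version():
    """Return the installed openai-whisper version string, or None if absent."""
    try:
        return metadata.version("openai-whisper")
    except metadata.PackageNotFoundError:
        return None


def is_suspect(version, suspect=SUSPECT_VERSION):
    """True if `version` looks like the release discussed in this thread."""
    return version is not None and version.startswith(suspect)


if __name__ == "__main__":
    installed = whisper_version()
    if is_suspect(installed):
        warnings.warn(
            f"openai-whisper {installed} is installed; consider updating "
            "whisper-timestamped or pinning openai-whisper==20230124."
        )
```

The check only warns rather than failing, since an updated whisper-timestamped is reported above to work with the newer release.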

maptz commented 1 year ago

Thanks.