linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Inconsistent number of segments: whisper_segments (462) != timestamped_word_segments (461) #24

Closed. ItakeLs closed this issue 1 year ago.

ItakeLs commented 1 year ago

Hello again, I have some more reproducible errors for you: "Got start time outside of audio boundary" and "Inconsistent number of segments: whisper_segments (462) != timestamped_word_segments (461)".

This YouTube video reproduces the error: youtube. I downloaded the mp4 file of the video from here: youtube downloader

"condition_on_previous_text" is true, and the rest of the parameters are at their default settings.

Jeronymous commented 1 year ago

Thanks a lot for reporting and providing the audio, @ItakeLs. I'll have a look asap.

Are you using the command line (CLI) or calling the Python function transcribe? (I think the default options differ.)

Also, to be sure, what is your version (whisper_timestamped -v in the CLI / whisper_timestamped.__version__ in Python)? And are you running on GPU or CPU? (It's unfortunate to have to ask this, but there seems to be a butterfly effect that makes results significantly different on different devices.)
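The details requested above (package version, Python version, device) can be gathered in one go. The following is a minimal sketch, not part of whisper-timestamped: the function name environment_report and the dictionary keys are invented for illustration, and the distribution name "whisper-timestamped" is assumed to match the installed package. Note that it only reports whether CUDA is available, which is not necessarily the device a given run actually used.

```python
import importlib.metadata
import platform


def environment_report(dist_name="whisper-timestamped"):
    """Collect version and device details for a bug report (illustrative helper)."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }

    # Installed version of the distribution, or None if it is not installed.
    try:
        report["version"] = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        report["version"] = None

    # torch is optional here; without it, CUDA availability is reported as None.
    try:
        import torch
        report["cuda_available"] = bool(torch.cuda.is_available())
    except ImportError:
        report["cuda_available"] = None

    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a report answers the version and device questions in a single step.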

Jeronymous commented 1 year ago

"condition_on_previous_text" is true

Actually it's already True by default. Sure about this?

ItakeLs commented 1 year ago

It's fine, "condition_on_previous_text" is True. In the last issue I sent it was false, so I mentioned it for clarification; sorry for the confusion. I am using the transcribe function, not the CLI. I am running the latest version: whisper_timestamped.__version__ reports "1.7.5". I am also running on GPU.

Edit: Just tried it again. Did you make any changes regarding this? I believe it is working now, at least on Google Colab. Let me try it where I initially got the error.

Jeronymous commented 1 year ago

Thanks for the feedback

Just tried it again. Did you make any changes regarding this? I believe it is working now, at least on Google Colab. Let me try it where I initially got the error.

I fixed some issues in the meantime, but nothing related to the issue you saw (I think).

But it's possible that this issue appears very rarely, under very specific conditions (e.g. hard to reproduce with a different GPU card...). So I'm interested to know whether you are able to reproduce it. For the moment I'm not.

ItakeLs commented 1 year ago

Yeah, I think you fixed the issue, or at least it is not giving an error with that audio. I'll continue testing with different audio to see if I can reproduce it again.

solismaa commented 1 year ago

I have the same problem, using commit 0c4e015510089a3e42081c7cddb82931d7b4b5dd on GPU. It seems that the problem occurs only with .en models, not with multilingual models. I ran transcription with the models tiny, tiny.en, medium and medium.en on a set of 1000 wavs and got the 'inconsistent length for segment' error once with each of the two .en models, but on different audio files.
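A sweep like the one described above (several models over a large set of files) is easier to report if failures are collected instead of stopping at the first one. The following is a minimal sketch under stated assumptions: find_failures and its transcribe_fn argument are hypothetical names, and transcribe_fn(model_name, audio_path) stands in for whatever transcription call is actually used; it is only expected to raise when a file fails.

```python
from collections import defaultdict


def find_failures(transcribe_fn, models, audio_paths):
    """Run every (model, file) pair and record the pairs that raise.

    transcribe_fn is a caller-supplied callable (hypothetical here) that
    takes a model name and an audio path and raises on failure.
    """
    failures = defaultdict(list)
    for model_name in models:
        for path in audio_paths:
            try:
                transcribe_fn(model_name, path)
            except Exception as exc:
                # Keep going so one bad file does not stop the whole sweep.
                failures[model_name].append((path, f"{type(exc).__name__}: {exc}"))
    return dict(failures)


if __name__ == "__main__":
    def fake_transcribe(model_name, path):
        # Stand-in for a real transcription call; fails on one invented pair.
        if model_name == "tiny.en" and path == "b.wav":
            raise AssertionError("Inconsistent number of segments")

    print(find_failures(fake_transcribe, ["tiny", "tiny.en"], ["a.wav", "b.wav"]))
```

The result maps each failing model to the files and messages involved, which is the kind of detail that helps narrow a one-in-a-thousand failure down to a single reproducible clip.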

Jeronymous commented 1 year ago

Wow! Thanks a lot for this thorough investigation @solismaa

I have an idea of how to fix this, but to be sure I have to find a way to reproduce it. One chance in 1000 with .en models; OK, it's going to be challenging...

a-rogalska commented 1 year ago

I have the same error with the latest version and multilingual models, running on GPU.

misutoneko commented 1 year ago

I've seen this too (with CPU/medium.en model). In my case --accurate helped, maybe that's worth a try.

Jeronymous commented 1 year ago

I fixed two possible errors that could occur (especially with *.en models), but I never saw the exact same error as reported in the title of this issue.

So @solismaa @a-rogalska @misutoneko @Mike327327 and others who encounter the same issue, can you please (if you have time) retry with the latest version, and if it fails:

Especially if you see the error on CPU, it will be easier for me to reproduce.

And really... it's impossible to fix an issue that I can't reproduce...

misutoneko commented 1 year ago

Yeah, it still fails for me (I'm at 219699c). Please note, though, that this may be a separate issue, as I haven't tested with that YouTube video, only with my own (mostly very short) clips. The error seems to happen if there is no actual speech in the wav file, and so far I've only seen it with medium.en (I think). With small.en and tiny.en everything seems to go smoothly without errors, with this sample at least.

Here's a sample and the corresponding log: clip_652.zip

$ /usr/local/bin/whisper_timestamped --threads 4 --language en --device cpu --output_format srt --model medium.en --output_dir . clip_652_132492_Inconsistent_number_of_segs.wav
/usr/local/lib/python3.8/dist-packages/whisper/transcribe.py:77: UserWarning: Performing inference on CPU when CUDA is available
  warnings.warn("Performing inference on CPU when CUDA is available")
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140/140 [01:23<00:00, 1.65frames/s]
WARNING:whisper_timestamped:Inconsistent number of segments: whisper_segments (2) != timestamped_word_segments (1)
Traceback (most recent call last):
  File "/usr/local/bin/whisper_timestamped", line 11, in <module>
    load_entry_point('whisper-timestamped==1.7.8', 'console_scripts', 'whisper_timestamped')()
  File "/usr/local/lib/python3.8/dist-packages/whisper_timestamped-1.7.8-py3.8.egg/whisper_timestamped/transcribe.py", line 1461, in cli
  File "/usr/local/lib/python3.8/dist-packages/whisper_timestamped-1.7.8-py3.8.egg/whisper_timestamped/transcribe.py", line 216, in transcribe_timestamped
  File "/usr/local/lib/python3.8/dist-packages/whisper_timestamped-1.7.8-py3.8.egg/whisper_timestamped/transcribe.py", line 635, in _transcribe_timestamped_efficient
AssertionError: Inconsistent number of segments: whisper_segments (2) != timestamped_word_segments (1)
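Since the failure above was observed on a clip with no actual speech, it may help to flag near-silent clips when triaging a batch. The following is a rough sketch using only the standard library; wav_rms, looks_silent and the 0.01 threshold are illustrative choices, not anything taken from whisper-timestamped, and it assumes 16-bit PCM WAV input. A low RMS only indicates a quiet file: a clip containing noise or music but no speech would not be flagged.

```python
import math
import struct
import wave


def wav_rms(path):
    """Return the RMS amplitude of a 16-bit PCM WAV file, scaled to 0..1."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    if count == 0:
        return 0.0
    # Little-endian signed 16-bit samples; channels are interleaved and pooled.
    samples = struct.unpack(f"<{count}h", frames[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count) / 32768.0


def looks_silent(path, threshold=0.01):
    # threshold is an arbitrary illustrative value; tune it on real clips.
    return wav_rms(path) < threshold
```

Running looks_silent over the files that triggered the assertion would show whether they share the "no speech" property, at least in the quiet-file sense.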

Jeronymous commented 1 year ago

Thank you so much, @misutoneko. I could reproduce it easily thanks to what you posted :) It's awesome that you could provide a short audio clip, which allows the error to be reproduced on both CPU and GPU.

This issue should be fixed now.