jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Error decoded_full[unicode_offset + decoded.index(replacement_char)] #112

Closed · kanjieater closed this issue 1 year ago

kanjieater commented 1 year ago

It seems that the latest 2.1 fixed most of the issues I was dealing with in the quality of the transcription's timestamps. Unfortunately, after running a few tests, I ran into this error. I'll see if I can reproduce it consistently.

  0%|                   | 0/1 [00:00<?, ?it/s]/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/torch/nn/modules/module.py:1194: UserWarning: operator() profile_node %668 : int[] = prim::profile_ivalue(%666)
 does not have profile information (Triggered internally at ../torch/csrc/jit/codegen/cuda/graph_fuser.cpp:105.)
  return forward_call(*input, **kwargs)
Predicted silence(s) with VAD       
 90%|▉| 44794.45/49901.59 [14:56<01:42, 49.94s
  0%|                   | 0/1 [23:23<?, ?it/s]
Traceback (most recent call last):
  File "/home/ke/code/subgen/split_run.py", line 186, in <module>
    run()
  File "/home/ke/code/subgen/split_run.py", line 172, in run
    generate_transcript_from_audio_wrapper(audio_path_dict)
  File "/home/ke/code/subgen/split_run.py", line 139, in generate_transcript_from_audio_wrapper
    generate_transcript_from_audio(audio_file, full_timings_path)
  File "/home/ke/code/subgen/split_run.py", line 35, in generate_transcript_from_audio
    run_stable_whisper(audio_file, full_timings_path)
  File "/home/ke/code/subgen/split_run.py", line 23, in run_stable_whisper
    result = model.transcribe(
  File "/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/stable_whisper/whisper_word_level.py", line 485, in transcribe_stable
    add_word_timestamps_stable(
  File "/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/stable_whisper/timing.py", line 175, in add_word_timestamps_stable
    text_tokens, token_split, seg_indices = split_word_tokens(segments, tokenizer, padding=' ...')
  File "/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/stable_whisper/timing.py", line 123, in split_word_tokens
    curr_words, curr_word_tokens = tokenizer.split_to_word_tokens([t for t in s['tokens'] if t < tokenizer.eot])
  File "/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/whisper/tokenizer.py", line 277, in split_to_word_tokens
    return self.split_tokens_on_unicode(tokens)
  File "/home/ke/.pyenv/versions/ats/lib/python3.9/site-packages/whisper/tokenizer.py", line 296, in split_tokens_on_unicode
    or decoded_full[unicode_offset + decoded.index(replacement_char)]
kanjieater commented 1 year ago

Unfortunately, it does not seem to happen every time I run the same file.

jianfch commented 1 year ago

You can try the new commit to see if it prevents this error.
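
For example, to install directly from the repo's latest commit (assuming pip and git are available):

pip install -U git+https://github.com/jianfch/stable-ts.git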

Keith-Hon commented 1 year ago

I'm experiencing the same issue. It happens with some specific fine-tuned models, but not all. I will also try the new commit and report back.

Keith-Hon commented 1 year ago

I used the latest commit and now get another error:

whisper_model.transcribe(filename, language="zh", regroup=True, demucs=True, vad=True)

/usr/local/lib/python3.9/dist-packages/stable_whisper/timing.py in _split_tokens(tokens, tokenizer)
    134         curr_tokens = []
    135
--> 136     assert len(text) == 0
    137
    138     return words, word_tokens

AssertionError:

Keith-Hon commented 1 year ago

Why do we need to have this line?

assert len(text) == 0

jianfch commented 1 year ago

> Why do we need to have this line?
>
> assert len(text) == 0

The line serves to ensure that every word/character has been paired with the tokens that make it up. If there is still text left at the end of the pairing process, that means there was a mismatch in an earlier pair.
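
Roughly, the pairing works like this (a minimal, hypothetical sketch of the idea, not the actual _split_tokens code; decode stands in for tokenizer.decode, and the toy usage below treats plain string fragments as tokens):

def split_tokens(tokens, decode):
    """Pair each decoded piece of text with the tokens that produced it."""
    text = decode(tokens)              # the full text all the tokens should decode to
    words, word_tokens = [], []
    curr_tokens = []
    for tok in tokens:
        curr_tokens.append(tok)
        piece = decode(curr_tokens)
        if '\ufffd' in piece:          # tokens end mid-character (UTF-8); keep accumulating
            continue
        if text.startswith(piece):
            words.append(piece)
            word_tokens.append(curr_tokens)
            text = text[len(piece):]   # consume the matched prefix
            curr_tokens = []
    # every character must have been claimed by some token group;
    # leftover text means an earlier pairing was misaligned
    assert len(text) == 0
    return words, word_tokens

# toy usage: "tokens" here are just string fragments
split_tokens(['he', 'llo', ' wo', 'rld'], lambda ts: ''.join(ts))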

Can you transcribe the same audio with word_timestamps=False, then save the result as JSON and share it?
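
For reference, a minimal way to produce that JSON (a sketch; it assumes your stable-ts version exposes save_as_json on the result, and the file names and model size are placeholders):

import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3', word_timestamps=False)  # skip word-level timing
result.save_as_json('audio.json')                              # the file to share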

Alternatively, you can try installing the previous version of whisper to see if you can replicate this error:

pip install openai-whisper==20230308

If you can't replicate the error with this older version of whisper, then it's likely an issue with the new tokenizer used in the newer version.
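
One quick way to confirm which whisper build is active after downgrading (assuming the package exposes __version__, as recent openai-whisper releases do):

import whisper
print(whisper.__version__)  # e.g. '20230308' after the downgrade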