jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License
1.6k stars, 177 forks

Subtitles appear way too soon before speech #196

Closed: peterstavrou closed this issue 10 months ago

peterstavrou commented 1 year ago

Some subtitles appear and stay on the screen 10-15 seconds before anyone even talks. Not every subtitle is affected, but it happens frequently. Some subtitles also disappear way too fast (not sure if it's related).

import os

import whisper
from stable_whisper import modify_model

# input_file is the path to the video/audio file (defined elsewhere)
model = whisper.load_model('large-v2')
modify_model(model)

result = model.transcribe(
    input_file,
    language="nl",
    task="translate",
    fp16=False,
    suppress_silence=True,
    ts_num=16,
    no_speech_threshold=0.6,
)

# Save as an SRT file
input_file_name, extension = os.path.splitext(input_file)
subtitle_file_name = input_file_name + '.srt'
result.to_srt_vtt(subtitle_file_name, word_level=False)
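
A rough way to find the affected cues without watching the whole video is to scan the generated SRT for cues that stay on screen far too long for the amount of text they carry, or that vanish almost immediately. The sketch below is only an illustration: suspicious_cues and its two thresholds (max_sec_per_word, min_dur) are made-up names and values, not part of stable-ts or Whisper, and it assumes ordinary SRT timing lines.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return timedelta(
        hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms)
    ).total_seconds()


def suspicious_cues(srt_text, max_sec_per_word=2.0, min_dur=0.3):
    """Return (start, end, text) for cues that look too long or too short.

    Both thresholds are arbitrary illustration values, not stable-ts defaults.
    """
    flagged = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line in this block, skip it
        fields = match.groups()
        start, end = _seconds(*fields[:4]), _seconds(*fields[4:])
        text = " ".join(lines[i + 1:]).strip()
        words = max(len(text.split()), 1)
        duration = end - start
        # Too long for the amount of text, or too short to read.
        if duration > max_sec_per_word * words or duration < min_dur:
            flagged.append((start, end, text))
    return flagged
```

Calling suspicious_cues(open(subtitle_file_name, encoding="utf-8").read()) on the file written above gives a short list of timestamps to spot-check against the audio.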
Hemangpandey commented 1 year ago

def translate(audio):
    options = dict(beam_size=5, best_of=5)
    translate_options = dict(task="translate", **options)
    result = model.transcribe(audio, **translate_options, demucs=True, vad=True)
    return result.to_dict()

Pass **translate_options only when you want to translate; if you are just transcribing, you can remove it from the function above.

demucs=True and vad=True are for video that contains songs or music; if there is none, kindly remove both parameters.
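
The options in the function above are meant to be passed with Python's ** unpacking, which expands a dictionary into keyword arguments. A minimal sketch of that pattern, assuming nothing about stable-ts itself: build_transcribe_kwargs is a hypothetical helper that only assembles the dictionary, using the parameter names mentioned in this thread.

```python
def build_transcribe_kwargs(translate=False, music=False):
    """Assemble keyword arguments for a transcribe call (hypothetical helper)."""
    # Decoding options suggested earlier in the thread.
    kwargs = dict(beam_size=5, best_of=5)
    if translate:
        # Only add the task when translating; omit it for plain transcription.
        kwargs["task"] = "translate"
    if music:
        # Suggested in the thread for audio that contains songs or music.
        kwargs.update(demucs=True, vad=True)
    return kwargs


print(build_transcribe_kwargs(translate=True, music=True))
```

The result would then be used as model.transcribe(audio, **build_transcribe_kwargs(translate=True, music=True)), which is equivalent to writing each entry out as keyword=value.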

peterstavrou commented 1 year ago

What is beam_size=5, best_of=5? It doesn't seem to work for me, I get an AssertionError. What exactly is wrong with what I'm doing?

jianfch commented 1 year ago

The recent update should generally prevent text from appearing too early. Which version are you using?

peterstavrou commented 1 year ago

> The recent update should generally prevent text from appearing too early. Which version are you using?

The latest. I did the below before logging this issue.

pip install -U git+https://github.com/jianfch/stable-ts.git
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
jianfch commented 1 year ago

Avoid using ts_num.

> What is beam_size=5, best_of=5? It doesn't seem to work for me, I get an AssertionError. What exactly is wrong with what I'm doing?

What was the error message?

peterstavrou commented 1 year ago

> Avoid using ts_num.

> > What is beam_size=5, best_of=5? It doesn't seem to work for me, I get an AssertionError. What exactly is wrong with what I'm doing?

> What was the error message?

I commented out ts_num=16, but it didn't make a difference.

Error:

Traceback (most recent call last):
  File "c:\Z\Programming\Python\OpenAI_Whisper\Video_Translated_Subtitles.py", line 13, in <module>
    result = model.transcribe(
             ^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\stable_whisper\whisper_word_level.py", line 458, in transcribe_stable
    result: DecodingResult = decode_with_fallback(mel_segment, ts_token_mask=ts_token_mask)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\stable_whisper\whisper_word_level.py", line 335, in decode_with_fallback
    decode_result, audio_features = model.decode(seg,
                                    ^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\stable_whisper\decode.py", line 112, in decode_stable
    result = task.run(mel)
             ^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\whisper\decoding.py", line 732, in run
    tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Z\Programming\Python\OpenAI_Whisper\venv\Lib\site-packages\stable_whisper\decode.py", line 36, in _main_loop
    assert audio_features.shape[0] == tokens.shape[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
jianfch commented 1 year ago

You might have installed Whisper from its repo, and that version is not compatible with Stable-ts. Try:

pip install --upgrade --no-deps --force-reinstall openai-whisper==20230314
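
To confirm which versions actually ended up in the environment after a reinstall, the installed package metadata can be read from Python. This is a small sketch using only the standard library; installed_version is a hypothetical helper, and it assumes the distributions are named openai-whisper and stable-ts, as they are on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumed to match the PyPI names.
for name in ("openai-whisper", "stable-ts"):
    print(name, installed_version(name) or "not installed")
```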
peterstavrou commented 1 year ago

> You might have installed Whisper from its repo, and that version is not compatible with Stable-ts. Try:

> pip install --upgrade --no-deps --force-reinstall openai-whisper==20230314

Unfortunately the issue still happens.

jianfch commented 1 year ago

> Unfortunately the issue still happens.

Try to install Stable-ts in a new environment and not install Whisper but just Stable-ts directly.

peterstavrou commented 1 year ago

> > Unfortunately the issue still happens.

> Try to install Stable-ts in a new environment and not install Whisper but just Stable-ts directly.

I completely deleted my venv and then installed Stable-ts using the latest commit. The AssertionError is gone, but the issue with some subtitles appearing way before any speech still happens when translating a video file (TV show).

result = model.transcribe(
    input_file,
    language="nl",
    task="translate",
    fp16=False,
    suppress_silence=True,
    no_speech_threshold=0.6,
    beam_size=5,
    best_of=5,
    )
Hemangpandey commented 1 year ago

What type of audio are you passing? If it also contains music, you can add parameters like: model.transcribe(audio_file, demucs=True, vad=True)

Hemangpandey commented 1 year ago

> > > Unfortunately the issue still happens.

> > Try to install Stable-ts in a new environment and not install Whisper but just Stable-ts directly.

> I deleted my venv completely and then installed Stable-ts using the latest commit but it's still happening.

What error are you getting? Can you share that as well?

peterstavrou commented 1 year ago

> > > > Unfortunately the issue still happens.

> > > Try to install Stable-ts in a new environment and not install Whisper but just Stable-ts directly.

> > I deleted my venv completely and then installed Stable-ts using the latest commit but it's still happening.

> What error are you getting? Can you share that as well?

Sorry, I have updated my reply. The original issue of subtitles appearing way before any speech still happens.

jianfch commented 1 year ago

> > > Unfortunately the issue still happens.

> > Try to install Stable-ts in a new environment and not install Whisper but just Stable-ts directly.

> I completely deleted my venv and then installed Stable-ts using the latest commit. The AssertionError is gone, but the issue with some subtitles appearing way before any speech still happens when translating a video file (TV show).

> result = model.transcribe(
>     input_file,
>     language="nl",
>     task="translate",
>     fp16=False,
>     suppress_silence=True,
>     no_speech_threshold=0.6,
>     beam_size=5,
>     best_of=5,
>     )

If it fails to detect the non-speech sections with vad=True and demucs=True, then try including min_word_dur=0 as well. You can also use a lower value of medium_factor, or even set a max_dur value, for clamp_max().
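
To illustrate what capping a duration does to a cue that starts too early, here is a standalone sketch. It is not how clamp_max() is implemented in Stable-ts and does not model medium_factor at all; clamp_start is a hypothetical function that works on plain (start, end, text) tuples and assumes the end time is the trustworthy one.

```python
def clamp_start(cues, max_dur):
    """Pull each cue's start toward its end so no cue lasts longer than max_dur seconds.

    Illustrative only: assumes the end timestamp is reliable and the start is early.
    """
    clamped = []
    for start, end, text in cues:
        if end - start > max_dur:
            # Keep the end and move the start later so the cue lasts exactly max_dur.
            start = end - max_dur
        clamped.append((start, end, text))
    return clamped


cues = [(0.0, 15.0, "Hello there."), (16.0, 18.0, "How are you today?")]
print(clamp_start(cues, max_dur=5.0))
```

With a 5 second cap, the first cue is shortened to its final 5 seconds and the second is left untouched, which is the kind of effect a max_dur limit is aiming for when text appears long before the speech.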

Hemangpandey commented 1 year ago

@jianfch can you explain in detail when to use each parameter of the transcribe function and what range of values it accepts? There are many parameters, and each one behaves differently.

jianfch commented 1 year ago

@Hemangpandey detailed documentation is on the roadmap, but for now there is only the docstring.