jhj0517 / Whisper-WebUI

A Web UI for easy subtitle using whisper model.
Apache License 2.0
1.38k stars, 193 forks

Subtitle timing and synchronization issue #396

Open borahanarslan opened 1 day ago

borahanarslan commented 1 day ago

Hello, I am experiencing some issues while generating subtitles for the video attached below. Despite trying various values in the Advanced Parameters and Voice Detection sections, I am not able to achieve the desired results.

For example, I keep testing, but the text either appears before or after the audio, or the words are too long. Sometimes, very simple two-word subtitles stay on the screen for 30 seconds. Occasionally, there are 2 or 3 different languages in the uploaded file, and in such cases, the behavior changes as well.

I have enabled background music removal, activated VAD, and tested with both the large-v2 and large-v3 models. I increased the Best of and Beam Size values up to 30. I tried many parameters with the sample file I provided, but I still didn’t get the results I wanted. What parameters should I use? Are there any settings you would recommend? Links to the sample file and the subtitles are below.

https://easyupload.io/d1w4fi (file) https://easyupload.io/0p558m (srt)
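A quick way to locate the cues described above (a couple of words that stay on screen for a long time) is to scan the .srt file for short text with a long duration. The following is only a minimal sketch: the function name `find_long_cues`, the thresholds `max_seconds` and `max_words`, and the `SAMPLE` text are all hypothetical and are not part of Whisper-WebUI.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:00:04,000 --> 00:00:34,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return timedelta(
        hours=int(h), minutes=int(m), seconds=int(s), milliseconds=int(ms)
    ).total_seconds()


def find_long_cues(srt_text, max_seconds=10.0, max_words=4):
    """Return (start, end, text) for short cues that stay on screen too long.

    The thresholds are illustrative defaults, not values recommended by the tool.
    """
    flagged = []
    # Cues in an SRT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match is None:
                continue
            fields = match.groups()
            start, end = _seconds(*fields[:4]), _seconds(*fields[4:])
            # Everything after the timing line is the cue text.
            text = " ".join(lines[i + 1:]).strip()
            if end - start > max_seconds and len(text.split()) <= max_words:
                flagged.append((start, end, text))
            break
    return flagged


# Made-up sample: the second cue shows two words for 30 seconds.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Where are we going?

2
00:00:04,000 --> 00:00:34,000
Go, go!
"""

if __name__ == "__main__":
    for start, end, text in find_long_cues(SAMPLE):
        print(f"{start:.1f}s -> {end:.1f}s ({end - start:.1f}s): {text}")
```

Running the same function on the real subtitle file (read with `open(path, encoding="utf-8").read()`) gives a list of time ranges to check by ear, which makes it easier to compare two parameter sets.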

jhj0517 commented 23 hours ago

Thanks for uploading the sample! I'll test & try to find out what the problem is, and what could be better.

+) The first hallucination part is 18:27 ~ 19:21

jhj0517 commented 23 hours ago

OK, based on the subtitle you posted, the first hallucination part is ( 00:18:27,870 --> 00:19:23,870 ).

Let's focus on removing this hallucination part. First of all, this part is very likely to make Whisper hallucinate, because it is full of the monster's growling, gunshots, and string instruments that add tension to the scene.

So it is recommended to turn on the VAD, and also the Background Music Separator if it gives a better result.

You can try this setting:

  1. Since the audio is full of noise, large-v2 is recommended rather than large-v3.

  2. Enable the Background Music Remover Filter.

  3. Enable VAD. I set Minimum Silence Duration (ms) to 250 specifically; all other values are the defaults.

I think I got a better result with this setting than the previous one:

At least I didn't observe repetitive phrases like "Go, go, go, go!"s.
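For anyone who wants to try the model and VAD settings from the list above outside the web UI, here is a rough sketch. It assumes the faster-whisper backend, a CUDA GPU, and float16 compute; the helper names `build_transcribe_kwargs` and `transcribe` are hypothetical. Background music separation is a separate preprocessing step in the web UI and is not shown here.

```python
def build_transcribe_kwargs(min_silence_ms=250):
    """Collect the VAD settings from the list above as faster-whisper keyword arguments.

    Only the values named in the comment are set; everything else is left at
    the library defaults.
    """
    return {
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": min_silence_ms},
    }


def transcribe(audio_path, model_size="large-v2"):
    """Transcribe one file and return a list of (start, end, text) tuples."""
    # Imported lazily so the helper above can be inspected without the package
    # or a GPU being available.
    from faster_whisper import WhisperModel

    # Assumption: CUDA device with float16 compute.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **build_transcribe_kwargs())
    # `segments` is a generator; iterating it runs the actual decoding.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```

Calling `transcribe("movie_audio.wav")` (a placeholder path) returns the segments with their start and end times in seconds, so the range around 18:27 can be inspected directly.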

borahanarslan commented 22 hours ago

I will try both v2 and v3 and get back to you, thanks.

borahanarslan commented 21 hours ago

Sorry, but the result is still disappointing; different settings may be necessary. The tool is ideal for short 5-6 minute content, but it is not ideal for movies or documentaries right now. I am attaching both files. The synchronization problem continues with both v2 and v3, and toward the end it starts to get really ridiculous :( Subtitle.zip

jhj0517 commented 2 hours ago

Subtitle.zip

That's a very different result from mine. Would you copy + paste this into default_parameters.yaml and try again? The app will automatically start with the settings below.

whisper:
  model_size: large-v2
  lang: Automatic Detection
  is_translate: false
  beam_size: 5
  log_prob_threshold: -1.0
  no_speech_threshold: 0.6
  compute_type: float16
  best_of: 5
  patience: 1.0
  condition_on_previous_text: true
  prompt_reset_on_temperature: 0.5
  initial_prompt: null
  temperature: 0.0
  compression_ratio_threshold: 2.4
  length_penalty: 1.0
  repetition_penalty: 1.0
  no_repeat_ngram_size: 0
  prefix: null
  suppress_blank: true
  suppress_tokens: '[-1]'
  max_initial_timestamp: 1.0
  word_timestamps: false
  prepend_punctuations: '"''“¿([{-'
  append_punctuations: '"''.。,,!!??::”)]}、'
  max_new_tokens: null
  chunk_length: 30
  hallucination_silence_threshold: null
  hotwords: null
  language_detection_threshold: null
  language_detection_segments: 1
  batch_size: 24
  add_timestamp: true
  file_format: SRT
vad:
  vad_filter: true
  threshold: 0.5
  min_speech_duration_ms: 250
  max_speech_duration_s: 9999
  min_silence_duration_ms: 250
  speech_pad_ms: 2000
diarization:
  is_diarize: false
  device: cuda
  hf_token: ''
bgm_separation:
  is_separate_bgm: true
  model_size: UVR-MDX-NET-Inst_HQ_4
  device: cuda
  segment_size: 256
  save_file: false
  enable_offload: true
translation:
  deepl:
    api_key: ''
    is_pro: false
    source_lang: Automatic Detection
    target_lang: English
  nllb:
    model_size: facebook/nllb-200-1.3B
    source_lang: null
    target_lang: null
    max_length: 200
  add_timestamp: true
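Copy-pasting YAML can silently break indentation, so it may be worth confirming that the file still parses and that the key settings have the intended values before restarting the app. This is only a sketch: the names `EXPECTED`, `find_mismatches`, and `check_file` are hypothetical, the path to default_parameters.yaml depends on your installation, and reading the file requires the third-party PyYAML package.

```python
# The settings that matter most in the config above, as (section, key) -> value.
EXPECTED = {
    ("whisper", "model_size"): "large-v2",
    ("vad", "vad_filter"): True,
    ("vad", "min_silence_duration_ms"): 250,
    ("bgm_separation", "is_separate_bgm"): True,
}


def find_mismatches(config, expected=EXPECTED):
    """Return {(section, key): (expected, actual)} for every setting that differs."""
    mismatches = {}
    for (section, key), want in expected.items():
        # A missing section or key is reported with an actual value of None.
        got = (config.get(section) or {}).get(key)
        if got != want:
            mismatches[(section, key)] = (want, got)
    return mismatches


def check_file(path):
    """Parse the YAML file at `path` and compare it with EXPECTED."""
    # PyYAML is a third-party package (pip install pyyaml).
    import yaml

    with open(path, encoding="utf-8") as fh:
        return find_mismatches(yaml.safe_load(fh) or {})
```

An empty dictionary from `check_file(...)` means the four settings were read as intended; any entry shows the expected value next to the value that was actually parsed.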