Open borahanarslan opened 1 day ago
Thanks for uploading the sample! I'll test & try to find out what the problem is, and what could be better.
+) The first hallucination part is 18:27 ~ 19:21
OK, based on the subtitle you posted, the first hallucination part is ( 00:18:27,870 --> 00:19:23,870 ).
Let's focus on removing this hallucination part. First of all, this part is very likely to make Whisper hallucinate, because it is full of the monster's growling, gunfire, and string instruments that add tension to the scene.
So I recommend turning on VAD, and also the Background Music Separator if it gives a better result.
You can try this setting:
Since the audio is full of noise, large-v2 is recommended rather than large-v3.
Enable the Background Music Remover Filter:
Enable VAD. I set Minimum Silence Duration (ms) to 250 specifically; all other values are just the defaults.
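For anyone who wants to reproduce these settings outside the UI, here is a rough sketch. It assumes the faster-whisper package is the transcription backend, which this thread does not confirm. The names `MODEL_SIZE`, `VAD_PARAMETERS`, `TRANSCRIBE_KWARGS`, and `transcribe` are made up for illustration, and the background music removal step is not covered here.

```python
# Sketch only: assumes the third-party faster-whisper package as the backend.
MODEL_SIZE = "large-v2"  # recommended above over large-v3 for noisy audio

# Only the minimum silence duration was changed; other VAD values stay default.
VAD_PARAMETERS = {"min_silence_duration_ms": 250}

TRANSCRIBE_KWARGS = {
    "beam_size": 5,                    # default value, listed for visibility
    "vad_filter": True,                # enable voice activity detection
    "vad_parameters": VAD_PARAMETERS,
}


def transcribe(audio_path, device="cuda", compute_type="float16"):
    """Return a list of (start, end, text) tuples for the given audio file."""
    # Imported lazily so the settings above can be inspected without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(MODEL_SIZE, device=device, compute_type=compute_type)
    segments, _info = model.transcribe(audio_path, **TRANSCRIBE_KWARGS)
    # `segments` is a generator; building the list runs the actual transcription.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```

Calling `transcribe("sample.mp4")` (a placeholder path) needs a CUDA-capable GPU with the defaults shown; passing `device="cpu", compute_type="int8"` is the usual fallback.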
I think I got a better result with this setting than the previous one:
At least I didn't observe repetitive phrases like "Go, go, go, go!".
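Repetition like this can also be checked automatically. As far as I know, Whisper's `compression_ratio_threshold` option (default 2.4) works by compressing the decoded text with zlib and treating a high ratio as a sign of looping output. A minimal sketch of that idea follows; `looks_repetitive` is a made-up helper name, not part of any library.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag text whose compression ratio exceeds the threshold."""
    return compression_ratio(text) > threshold
```

Very short strings never reach the threshold because of the fixed zlib overhead, so this is more useful on a whole segment or a run of consecutive cues than on a single short line.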
I will try both v2 and v3 and get back to you, thanks.
Sorry, but the result is still disappointing. Maybe it needs different settings. It is ideal for short 5-6 minute content, but it is not ideal for movies or documentaries right now. I am attaching both files. The synchronization problem continues with both v2 and v3, and it seems to get really ridiculous towards the end :( Subtitle.zip
Subtitle.zip
That's a very different result from mine. Would you copy and paste this into default_parameters.yaml and try again? The app will automatically start with the settings below.
whisper:
  model_size: large-v2
  lang: Automatic Detection
  is_translate: false
  beam_size: 5
  log_prob_threshold: -1.0
  no_speech_threshold: 0.6
  compute_type: float16
  best_of: 5
  patience: 1.0
  condition_on_previous_text: true
  prompt_reset_on_temperature: 0.5
  initial_prompt: null
  temperature: 0.0
  compression_ratio_threshold: 2.4
  length_penalty: 1.0
  repetition_penalty: 1.0
  no_repeat_ngram_size: 0
  prefix: null
  suppress_blank: true
  suppress_tokens: '[-1]'
  max_initial_timestamp: 1.0
  word_timestamps: false
  prepend_punctuations: '"''“¿([{-'
  append_punctuations: '"''.。,,!!??::”)]}、'
  max_new_tokens: null
  chunk_length: 30
  hallucination_silence_threshold: null
  hotwords: null
  language_detection_threshold: null
  language_detection_segments: 1
  batch_size: 24
  add_timestamp: true
  file_format: SRT
vad:
  vad_filter: true
  threshold: 0.5
  min_speech_duration_ms: 250
  max_speech_duration_s: 9999
  min_silence_duration_ms: 250
  speech_pad_ms: 2000
diarization:
  is_diarize: false
  device: cuda
  hf_token: ''
bgm_separation:
  is_separate_bgm: true
  model_size: UVR-MDX-NET-Inst_HQ_4
  device: cuda
  segment_size: 256
  save_file: false
  enable_offload: true
translation:
  deepl:
    api_key: ''
    is_pro: false
    source_lang: Automatic Detection
    target_lang: English
  nllb:
    model_size: facebook/nllb-200-1.3B
    source_lang: null
    target_lang: null
    max_length: 200
  add_timestamp: true
Hello, I am experiencing some issues while generating subtitles for the video attached below. Despite trying various values in the Advanced Parameters and Voice Detection sections, I am not able to achieve the desired results.
For example, I keep testing, but the text either appears before or after the audio, or the words are too long. Sometimes, very simple two-word subtitles stay on the screen for 30 seconds. Occasionally, there are 2 or 3 different languages in the uploaded file, and in such cases, the behavior changes as well.
I have enabled background music removal, activated VAD, and tested with the large v2 and v3 versions. I increased the Best of and Beam Size values up to 30. I tried many parameters with the sample file I provided, but I still didn’t get the exact results I wanted. What parameters should I use? There is a link to the sample file and subtitles. Are there any settings you would recommend?
https://easyupload.io/d1w4fi (file) https://easyupload.io/0p558m (srt)
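To compare runs without watching the whole video each time, the "two-word subtitle stays on screen for 30 seconds" symptom can be detected directly from the generated .srt. This is a minimal sketch using only the standard library. The function name `suspicious_cues` and the cutoffs (`max_seconds=10.0`, `max_words=3`) are arbitrary choices for illustration, not values taken from the app.

```python
import re

# Matches an SRT timing line such as "00:18:27,870 --> 00:19:23,870".
CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert the captured hour/minute/second/millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def suspicious_cues(srt_text, max_seconds=10.0, max_words=3):
    """Return (start, end, text) for cues shown a long time despite very short text."""
    found = []
    # Cues in an SRT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match is None:
                continue
            groups = match.groups()
            start, end = to_seconds(*groups[:4]), to_seconds(*groups[4:])
            # Everything after the timing line is the cue text.
            text = " ".join(lines[i + 1:]).strip()
            if end - start > max_seconds and len(text.split()) <= max_words:
                found.append((start, end, text))
            break
    return found
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the text in gives a list of cues to look at first. Fewer flagged cues after a settings change is a rough sign that the timing improved.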