Closed curiousbee2020 closed 1 year ago
Hi,
The first word would have a duration of ~0.07 seconds because the audio was marked as silent up to ~7.75 seconds. Since the new duration of the word was below min_word_dur (0.1 seconds), the change was not applied.
Lowering min_word_dur to 0.01 will fix it (note that changing min_word_dur only works in the recent commit/latest version, 2.7.2+):
result = model.transcribe("in1-vocals.wav", language="en", vad=True, min_word_dur=0.01)
The alternative is to use clamp_max():
result.clamp_max()
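To illustrate why the adjustment was skipped, here is a minimal sketch in plain Python of the threshold check described above. It is not stable-ts's actual implementation: the function name clip_start_to_silence and the example timestamps (a word running from 0.5 to 7.82 seconds) are assumptions chosen to match the ~0.07 second duration and the ~7.75 second silence boundary mentioned in this thread.

```python
def clip_start_to_silence(start, end, silence_end, min_word_dur=0.1):
    """Hypothetical helper: move a word's start past leading silence,
    unless that would make the word shorter than min_word_dur."""
    new_start = max(start, silence_end)
    if end - new_start < min_word_dur:
        # The clipped word would be too short, so the change is not applied.
        return start
    return new_start


# Illustrative values only: silence is detected up to 7.75 s and the word ends at 7.82 s.
word_start, word_end = 0.5, 7.82

# Default threshold (0.1 s): the clipped word would last ~0.07 s, so the start stays put.
print(clip_start_to_silence(word_start, word_end, 7.75))

# Lower threshold (0.01 s): ~0.07 s is long enough, so the start moves to the silence boundary.
print(clip_start_to_silence(word_start, word_end, 7.75, min_word_dur=0.01))
```

The sketch only shows the logic of the check; how the library applies it internally may differ.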
Thank you for clarifying! Appreciate your responses.
Changing min_word_dur to 0.01 does not seem to work for me on its own. Calling result.clamp_max() fixed the issue for me, though.
I'm trying to transcribe any audio that begins with silence (e.g., https://www.youtube.com/watch?v=QzWCrkvbJo0) as follows:
import stable_whisper
model = stable_whisper.load_model('base')
result = model.transcribe("in1-vocals.wav", language="en", suppress_silence=True, vad=True)
result.to_tsv("in2.out", segment_level=True, word_level=False)
in1-vocals.wav contains the vocals only (with any background music/noise removed). This file is silent for the first ~7 seconds.
in2.out.tsv always begins the first segment before the 1-second mark, but it should begin at ~7 seconds.
Anything I'm missing here?
Thanks!