jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Never get the silence at the start correctly #180

Closed curiousbee2020 closed 1 year ago

curiousbee2020 commented 1 year ago

I'm trying to transcribe any audio that begins with silence (e.g., https://www.youtube.com/watch?v=QzWCrkvbJo0) as follows:

import stable_whisper

model = stable_whisper.load_model('base')
result = model.transcribe("in1-vocals.wav", language="en", suppress_silence=True, vad=True)
result.to_tsv("in2.out", segment_level=True, word_level=False)

in1-vocals.wav is the vocals only (minus any music/noise background). This file has silence for the first ~7 seconds.

In in2.out.tsv, the first segment always begins before the 1-second mark, but it should begin at ~7 seconds.

Anything I'm missing here?

Thanks!

jianfch commented 1 year ago

Hi, the first word would have a duration of ~0.07 seconds because the audio was marked silent up to ~7.75 seconds. Since the word's new duration was below min_word_dur (0.1 seconds), the change was not applied.

Lowering min_word_dur to 0.01 will fix it (note that changing min_word_dur only works in the most recent commit/latest version, 2.7.2+):

result = model.transcribe("in1-vocals.wav", language="en", vad=True, min_word_dur=0.01)
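
To make the min_word_dur behavior described above concrete, here is a minimal plain-Python sketch of the rule. It is only an illustration and not the stable-ts implementation: snap_start_to_silence_end and its parameters are hypothetical names, and the word end time of 7.82 seconds is an assumed value chosen to match the ~0.07 second duration mentioned above.

```python
def snap_start_to_silence_end(start, end, silence_end, min_word_dur=0.1):
    """Hypothetical helper: move a word's start to the end of the leading
    silence, unless that would make the word shorter than min_word_dur."""
    if start >= silence_end:
        return start, end  # word does not overlap the silence; nothing to do
    new_dur = end - silence_end
    if new_dur < min_word_dur:
        return start, end  # adjustment rejected: word would become too short
    return silence_end, end


# Assumed timings: silence ends at 7.75 s, the first word ends at 7.82 s.
default = snap_start_to_silence_end(0.0, 7.82, silence_end=7.75)
lowered = snap_start_to_silence_end(0.0, 7.82, silence_end=7.75, min_word_dur=0.01)
print(default)  # start stays at 0.0 with the 0.1 s threshold
print(lowered)  # start moves to 7.75 with the 0.01 s threshold
```

With the default threshold of 0.1 seconds the adjusted word would be too short, so the original start is kept. With 0.01 the adjustment is accepted and the word starts where the silence ends.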

The alternative is to use clamp_max():

result.clamp_max()

curiousbee2020 commented 1 year ago

Thank you for clarifying! Appreciate your responses.

Changing the min_word_dur to 0.01 does not seem to work for me on its own. Calling result.clamp_max() fixed the issue for me though.