Try using a higher value for `no_speech_threshold` (default: 0.6), or set it to `None`
to disable all skipping triggered by this threshold (do this only when there are no non-speech gaps longer than 30 seconds in the audio, or it will hallucinate over those gaps).
```python
result = model.transcribe(file, no_speech_threshold=0.9)
```
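Or, to disable the check entirely (again, only if the audio has no non-speech gaps longer than 30 seconds):

```python
# Disables all skipping triggered by this threshold; Whisper may
# hallucinate over any non-speech gap longer than 30 seconds.
result = model.transcribe(file, no_speech_threshold=None)
```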
Hi, I really appreciate your feedback; however, it still does not work even when I set `no_speech_threshold` to `None`. For the other song that I'm working on, it skips from `00:00:01.740 people 00:00:02.160` to `00:00:31.000 Sometimes 00:00:31.500` when there are around 10 seconds of pure music and 20 seconds of music + vocals. Is there a way to work around that?
It generally does not perform well with music. Try `denoiser="demucs"` to transcribe only the isolated vocals.
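A minimal sketch of that suggestion (this assumes Demucs is installed, e.g. via `pip install demucs`, and `'song.mp3'` is a placeholder path):

```python
import stable_whisper

model = stable_whisper.load_model('small')
# Run the Demucs source separator first so Whisper transcribes
# only the isolated vocal track instead of the full mix.
result = model.transcribe('song.mp3', denoiser="demucs")
```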
That works! It also got better after I switched my model to `small.en`.
```python
import stable_whisper
import webvtt

model = stable_whisper.load_model('small')
result = model.transcribe(file)
# Write word-level captions (segment_level=False, word_level=True)
result.to_srt_vtt('audio.vtt', False, True)
for caption in webvtt.read('audio.vtt'):
    print(caption.start + " " + caption.text + " " + caption.end)
```
With the code above, the transcription skips different parts of the audio for different uploaded files. For example, it jumps from `00:00:30.920 yourself 00:00:32.440` to `00:01:00.000 too 00:01:00.200`. Is there any way to fix this?