jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Randomly skips part of the audio (usually around 30 seconds) during transcription #382

Open bylate opened 3 months ago

bylate commented 3 months ago

import stable_whisper
import webvtt

model = stable_whisper.load_model('small')
result = model.transcribe(file)  # `file`: the audio input, defined elsewhere
result.to_srt_vtt('audio.vtt', False, True)
for caption in webvtt.read('audio.vtt'):
    print(caption.start + " " + caption.text + " " + caption.end)

With the code above, the transcription skips a part of the audio, and the skipped part differs from file to file. For example, the output jumps from "00:00:30.920 yourself 00:00:32.440" straight to "00:01:00.000 too 00:01:00.200". Is there any way to fix it?
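One way to locate skipped stretches like the one in this example is to scan the caption timestamps for large jumps. The following is only an illustrative sketch: `to_seconds`, `find_gaps`, and the 10-second threshold are hypothetical and not part of stable-ts, and the timestamps are assumed to be in WebVTT `HH:MM:SS.mmm` form.

```python
def to_seconds(timestamp: str) -> float:
    """Convert a WebVTT 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_gaps(captions, min_gap=10.0):
    """Return (prev_end, next_start) pairs separated by at least min_gap seconds.

    captions: iterable of (start, end) timestamp strings in playback order.
    """
    gaps = []
    prev_end = None
    for start, end in captions:
        if prev_end is not None and to_seconds(start) - to_seconds(prev_end) >= min_gap:
            gaps.append((prev_end, start))
        prev_end = end
    return gaps


# Timestamps taken from the example in this comment.
captions = [("00:00:30.920", "00:00:32.440"), ("00:01:00.000", "00:01:00.200")]
print(find_gaps(captions))
```

With webvtt-py, the input could be built as `[(c.start, c.end) for c in webvtt.read('audio.vtt')]`, using the same `start` and `end` attributes as the loop in the original snippet.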

jianfch commented 3 months ago

Try using a higher value for no_speech_threshold (default: 0.6), or set it to None to disable all skipping triggered by this threshold. Only set it to None when the audio has no non-speech gaps longer than 30 seconds; otherwise the model will hallucinate text for those gaps.

result = model.transcribe(file, no_speech_threshold=0.9)

bylate commented 3 months ago

Hi, I really appreciate your feedback; however, it still does not work even when I set no_speech_threshold to None. For another song that I'm working on, it skips from "00:00:01.740 people 00:00:02.160" to "00:00:31.000 Sometimes 00:00:31.500", even though the skipped section contains around 10 seconds of pure music followed by about 20 seconds of music with vocals. Is there a way to handle that?

jianfch commented 3 months ago

It generally does not perform well with music. Try to use denoiser="demucs" to only transcribe the isolated vocals.
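A minimal sketch of how the two suggestions in this thread could be collected into keyword arguments for `model.transcribe`. The helper `transcribe_options` and its flags are hypothetical. Only `no_speech_threshold` and `denoiser="demucs"` come from the comments above, and whether the `denoiser` argument is available may depend on the installed stable-ts version.

```python
def transcribe_options(has_music=False, has_long_silence=True):
    """Build keyword arguments for model.transcribe (hypothetical helper)."""
    options = {}
    if has_music:
        # Isolate the vocals before transcription, as suggested above.
        options["denoiser"] = "demucs"
    if not has_long_silence:
        # Safe only without non-speech gaps longer than 30 seconds;
        # otherwise the model may hallucinate text for the gap.
        options["no_speech_threshold"] = None
    return options


options = transcribe_options(has_music=True)
print(options)
# Intended use (requires stable-ts, plus Demucs for the denoiser):
# result = model.transcribe(file, **options)
```

Keeping the default `has_long_silence=True` leaves `no_speech_threshold` at the library default, which matches the caution given earlier about disabling it.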

bylate commented 3 months ago

That works! It also got better after I switched my model to small.en.