jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License
1.57k stars, 174 forks

Always a bit delay and a bit early stops #343

Open terryops opened 7 months ago

terryops commented 7 months ago

I'm using stable-ts with faster-whisper's built-in VAD parameters, and I've noticed that when I run result = model.transcribe_stable(filename, regroup=False, k_size=9, vad_filter=True), the outcomes generally exhibit a slight delay and cease prematurely compared to the original faster-whisper output. I've adjusted several stable-ts parameters, including k_size and q_levels, without finding a setting that fixes this. In my previous workflow, all my audio files were pre-processed with demucs before being fed into faster-whisper, which typically yielded satisfactory results. However, when the audio contains considerable noise, particularly coughs and other disruptions, the timestamps are stretched too far, spanning from the cough to the actual content. That issue is what led me to try stable-ts, though it hasn't met my expectations so far. Could you offer any advice on this? Thanks in advance.

jianfch commented 7 months ago

If faster-whisper was yielding satisfactory results with vad_filter=True, you might get better results with vad=True instead of k_size and q_levels, which could be causing the "slight delay and cease prematurely", especially on audio preprocessed with demucs. Since vad_filter=True already filters the result, completely disabling silence suppression with suppress_silence=False is an option to consider if the issue persists even with vad=True.
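
A minimal sketch of the two options above. It assumes the model is created with stable_whisper.load_faster_whisper, which this thread does not show; the model name "base" and the helper names are placeholders. Only keyword arguments mentioned in this thread are passed.

```python
# Keyword arguments taken from this thread; helper names are illustrative only.
BASE_KWARGS = {"regroup": False, "vad_filter": True}


def suggested_kwargs(disable_suppression: bool = False) -> dict:
    """Build transcribe_stable() keyword arguments for the two options above."""
    kwargs = dict(BASE_KWARGS)
    if disable_suppression:
        # Option 2: rely on faster-whisper's vad_filter alone.
        kwargs["suppress_silence"] = False
    else:
        # Option 1: use VAD-based suppression instead of k_size / q_levels.
        kwargs["vad"] = True
    return kwargs


def transcribe(filename: str, disable_suppression: bool = False,
               model_name: str = "base"):
    """Run the transcription with one of the two suggested configurations."""
    # Imported lazily so suggested_kwargs() works without stable-ts installed.
    import stable_whisper

    # Assumption: a faster-whisper model wrapped by stable-ts.
    model = stable_whisper.load_faster_whisper(model_name)
    return model.transcribe_stable(filename, **suggested_kwargs(disable_suppression))
```

Trying transcribe(filename) first and transcribe(filename, disable_suppression=True) second follows the order suggested above.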

terryops commented 7 months ago

I tried vad=True as well, but I can't find a way to set min_silence_duration_ms=2000, which I found works best with faster-whisper's vad_filter.

terryops commented 7 months ago

I've noticed that setting vad=True doesn't improve outcomes compared to the built-in VAD filter in faster-whisper. Could it be that the inference process of the Silero VAD has been modified in your implementation? My review of your code revealed the absence of the min_silence_duration_ms feature, which might result in frequent brief silences interspersed between speech segments.
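
To illustrate the concern about brief silences, here is a rough sketch of what a minimum-silence threshold does conceptually. It is not code from stable-ts or faster-whisper; the function name and the (start, end) tuple format in seconds are assumptions made for the example.

```python
def drop_short_silences(silences, min_silence_duration_ms=2000):
    """Keep only silences (start, end), in seconds, lasting at least the minimum.

    Shorter gaps are treated as part of the surrounding speech.
    """
    min_len = min_silence_duration_ms / 1000
    return [(start, end) for start, end in silences if end - start >= min_len]


# Hypothetical detections: one long pause, one brief gap, one pause at the threshold.
detected = [(0.0, 2.5), (3.1, 3.3), (5.0, 7.0)]
print(drop_short_silences(detected))
```

With a 2000 ms threshold, the brief gap between 3.1 s and 3.3 s no longer splits the speech around it.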

jianfch commented 7 months ago

> I've noticed that setting vad=True doesn't improve outcomes compared to the built-in VAD filter in faster-whisper.

Likely due to the different approaches. Faster-Whisper uses the VAD predictions to trim the audio into chunks that meet the threshold and only transcribes those chunks. Stable-ts uses the VAD predictions to trim the timings after the transcription is completed (see https://github.com/jianfch/stable-ts?#silence-suppression). You can check whether the latter is working as intended by inspecting the nonspeech timings in the nonspeech_sections attribute of the transcription result object returned by transcribe_stable(). Any nonspeech_sections that do not satisfy the required conditions (determined by the parameters) are ignored.
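
The post-transcription approach can be pictured with a simplified sketch. This is not stable-ts's actual algorithm: the function names, the (start, end) tuple format in seconds, and the min_duration condition are assumptions chosen to show the idea of clipping timings against nonspeech sections and ignoring sections that fail a condition.

```python
def usable_sections(nonspeech, min_duration=0.1):
    """Ignore nonspeech sections shorter than a (hypothetical) minimum duration."""
    return [(s, e) for s, e in nonspeech if e - s >= min_duration]


def clip_word(start, end, nonspeech):
    """Move a word's boundaries out of nonspeech sections that overlap them."""
    for ns_start, ns_end in nonspeech:
        if ns_start <= start < ns_end < end:
            start = ns_end   # nonspeech covers the start: begin later
        if start < ns_start < end <= ns_end:
            end = ns_start   # nonspeech covers the end: stop earlier
    return start, end


# Hypothetical word timed from 1.0 s to 4.0 s, with nonspeech at both edges
# and one very short section that gets ignored.
sections = usable_sections([(0.0, 1.8), (0.95, 1.02), (3.5, 5.0)])
print(clip_word(1.0, 4.0, sections))
```

Comparing the entries in result.nonspeech_sections with where the timestamps visibly drift in the audio should show whether the sections themselves are off or whether they are being ignored by the parameters.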