terryops opened this issue 7 months ago
If faster-whisper was yielding satisfactory results with `vad_filter=True`, you might find better results with `vad=True` instead of `k_size` and `q_levels`, which could be causing the "slight delay and cease prematurely", especially for audio preprocessed with demucs. Since `vad_filter=True` already filters the result, completely disabling the silence suppression with `suppress_silence=False` is an option to consider if the issue persists even with `vad=True`.
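A minimal sketch of both suggestions (the model size and file name are placeholders):

```python
import stable_whisper

model = stable_whisper.load_faster_whisper('base')

# Suggestion 1: use Silero VAD (vad=True) for silence suppression instead of
# the default method controlled by k_size and q_levels.
result = model.transcribe_stable('audio.mp3', vad=True)

# Suggestion 2: if the issue persists, disable stable-ts's silence
# suppression entirely and rely on faster-whisper's own VAD filter.
result = model.transcribe_stable('audio.mp3', suppress_silence=False, vad_filter=True)
```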
I tried `vad=True` as well, but I can't find a way to set `min_silence_duration_ms=2000`, which I found works best with faster-whisper's `vad_filter`.
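For comparison, faster-whisper itself exposes that setting through its `vad_parameters` argument; whether it passes through `transcribe_stable()` unchanged is an assumption here:

```python
# Sketch: forwarding faster-whisper's VAD options through transcribe_stable().
# min_silence_duration_ms is a faster-whisper VadOptions field; its
# pass-through via stable-ts is assumed, not confirmed.
result = model.transcribe_stable(
    'audio.mp3',  # placeholder path
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=2000),
)
```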
I've noticed that setting `vad=True` doesn't improve outcomes compared to the built-in VAD filter in faster-whisper. Could it be that the inference process of the Silero VAD has been modified in your implementation? My review of your code revealed the absence of the `min_silence_duration_ms` feature, which might result in frequent brief silences interspersed between speech segments.
> I've noticed that setting `vad=True` doesn't improve outcomes compared to the built-in VAD filter in faster-whisper.
Likely due to the different approaches. Faster-Whisper uses the VAD predictions to trim the audio into chunks that meet the threshold and only transcribes those chunks. Stable-ts uses the VAD predictions to trim the timings after the transcription is completed (see https://github.com/jianfch/stable-ts?#silence-suppression).

You can check whether the latter is working as intended with the nonspeech timings in the `nonspeech_sections` attribute of the transcription result object returned by `transcribe_stable()`. Any of the `nonspeech_sections` that do not satisfy the required conditions (determined by the parameters) are ignored.
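A sketch of that check, assuming each entry in `nonspeech_sections` is a mapping with `start` and `end` keys:

```python
result = model.transcribe_stable('audio.mp3', vad=True)  # placeholder path

# Print the sections stable-ts treated as nonspeech; segment timings that
# were trimmed by silence suppression should border these ranges.
for section in result.nonspeech_sections:
    print(section['start'], section['end'])
```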
I'm utilizing stable-ts alongside faster-whisper's integrated VAD parameters, and I've noticed that when executing the following code snippet: `result = model.transcribe_stable(filename, regroup=False, k_size=9, vad_filter=True)`, the outcomes generally exhibit a slight delay and cease prematurely compared to the original faster-whisper performance. Despite tweaking several parameters within stable-ts, I haven't found a successful adjustment yet. In my previous workflow, all my audio files are pre-processed with demucs before being fed into faster-whisper, which typically yields satisfactory results. However, in scenarios where the audio contains considerable noise, particularly coughs and other disruptions, the timestamps are excessively extended, spanning from the cough to the actual content. This issue led me to experiment with stable-ts, though it hasn't met my expectations so far. Could you offer any advice on this matter? I've experimented with the `k_size` and `q_levels` settings without finding a viable solution. Thanks in advance.
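For completeness, a self-contained version of the snippet from this issue (model size and file path are placeholders):

```python
import stable_whisper

model = stable_whisper.load_faster_whisper('base')
result = model.transcribe_stable(
    'audio.mp3',      # pre-processed with demucs in the workflow described above
    regroup=False,    # disable stable-ts's regrouping of segments
    k_size=9,         # kernel size used by stable-ts's silence suppression
    vad_filter=True,  # faster-whisper's built-in Silero VAD filter
)
```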