reasv opened this issue 1 month ago
I have noticed a pattern: in a video that has this issue, there is a first part with clear speech, followed by loud noises. If I cut out the last part with the loud noises, the transcription works correctly.
So, for some reason, the later noise prevents the earlier speech from being transcribed? Still, OpenAI Whisper transcribes everything correctly, including the speech during the part with loud noises.
I have the same problem when using the new batched backend for faster_whisper, so perhaps batching is at fault here. This is despite the problematic videos being too short for batching to matter (e.g., 8 seconds).
Good delving into the issue, thanks.
As I mentioned in my comment on the faster_whisper PR (https://github.com/SYSTRAN/faster-whisper/pull/856), I have the same problem when enabling batching in faster_whisper, but the issue disappears when I don't use the batched pipeline.
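For anyone who wants to reproduce the batched vs. sequential difference directly in faster-whisper, a minimal comparison would look roughly like the sketch below. The audio path is a placeholder for one of the failing clips, and the `BatchedInferencePipeline` API is the one documented in recent faster-whisper releases; this is only an illustration of the comparison, not a tested repro script.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

audio_path = "clip_with_loud_noise.wav"  # placeholder for one of the failing clips

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Sequential pipeline: reportedly transcribes the whole clip.
seq_segments, _ = model.transcribe(audio_path)
print("sequential:", " ".join(s.text for s in seq_segments))

# Batched pipeline: reportedly drops the earlier speech on these clips.
bat_segments, _ = batched.transcribe(audio_path, batch_size=16)
print("batched:   ", " ".join(s.text for s in bat_segments))
```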
I found a video on the internet that replicates this problem. Audio: https://litter.catbox.moe/kyu2q8.wav
@reasv can you reupload the video to a permanent storage and share the link?
https://drive.google.com/file/d/1JKsYQZYQDrKuRFciFhh1aA5ftAGr-eud/view?usp=sharing
This video replicates the problem.
Hi, has anyone found a solution? I noticed that the old medium model works fine, but the newer medium and large-v3 models have this problem.
The problem is with the pyannote VAD model. https://github.com/SYSTRAN/faster-whisper/pull/936 is a possible solution, but you have to use faster-whisper for transcription.
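If using faster-whisper directly is acceptable, a rough workaround sketch is to rely on its built-in Silero VAD filter instead of the pyannote VAD blamed above. The file path and VAD parameters below are placeholders, not a tested fix:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter=True uses faster-whisper's bundled Silero VAD rather than the
# pyannote-based VAD that whisperx uses for segmentation.
segments, info = model.transcribe(
    "problem_clip.wav",  # placeholder path
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```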
I'm getting poor transcription results using whisperx; specifically, I sometimes get no transcription at all out of some short videos, whereas OpenAI's official whisper model transcribes them correctly.

On the OpenAI side, I am using their official HF Space (https://huggingface.co/spaces/openai/whisper), which employs `large-v3`. On the whisperx side, I am using `Systran/faster-whisper-large-v3` for comparison, with the latest whisperx from GitHub, and PyTorch with CUDA on Windows 11 (on an RTX 4090).

Here's the code for the simple gradio UI I use for testing whisperx: https://github.com/reasv/panoptikon/blob/master/src/ui/test_models/whisper.py
The transcription function is very simple:
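(The function body itself did not come through in this copy. As a rough sketch, a whisperx call of that shape would look something like the following; this is an approximation based on the standard whisperx API and the `audio_file` parameter mentioned below, not the exact code from the linked file.)

```python
import whisperx

def transcribe(audio_file: str, audio_tuple=None) -> str:
    # Rough approximation of the linked function, not the author's exact code.
    # Only the audio_file (path) branch is sketched here.
    device = "cuda"
    model = whisperx.load_model(
        "Systran/faster-whisper-large-v3", device, compute_type="float16"
    )
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=16)
    # Join the segment texts into a single transcript string.
    return " ".join(seg["text"].strip() for seg in result["segments"])
```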
I am only testing it with audio file paths at the moment, so assume `audio_file` is populated and not `audio_tuple`. The audio seems to be loaded correctly from the video file, since I can listen to the extracted audio output in the gradio Audio component.
Some videos work as expected, but for others the only output I get is `No active speech found`, even though the speech seems relatively clear (and in English).

At the moment I cannot give an example of a video that causes this problem, as it's happening with personal videos and I haven't found publicly available videos that reproduce the issue yet.
Any ideas why this is happening?