Closed: joelai0101 closed this issue 2 months ago
Without knowing your language, it's hard for me to tell if I'm having the same issue, but I'm doing a similar thing: streaming from a browser over WebRTC to whisper_online_server.py. It seems to work well when there is a continuous stream of speech, but as soon as there is a long pause, maybe 30 s or more, it starts to have issues. On the console I will see `DEBUG --- last segment not within commited area` and/or `DEBUG --- not enough segments to chunk`. Once this happens, the recognized speech becomes odd: often it will report the same word or phrase with increasing repetition. For example:
recognized: 2640 4080 I'm going to talk for a moment.
recognized: 4120 6220 to demonstrate
recognized: 6220 8040 that I have
recognized: 8040 9000 automatic speech
recognized: 9000 9560 recognition
recognized: 9560 11320 working, but then
recognized: 11320 12500 in order to
recognized: 13340 13580 what
recognized: 13580 15180 happens when there is a
recognized: 15180 16080 long gap
recognized: 16080 16820 in the audio
recognized: 16840 18120 I will stop
recognized: 18120 18720 talking
recognized: 18720 21060 right
recognized: 21060 21400 now.
recognized: 74020 76820 to demonstrate that I have automatic speech recognition working, but then in order to
recognized: 82480 83880 to demonstrate that I have automatic speech recognition working, but then in order to demonstrate
recognized: 88620 90020 demonstrate that I have automatic speech recognition working, but then in order to demonstrate that
recognized: 94760 96160 demonstrate that I have automatic speech recognition working, but then in order to demonstrate that
recognized: 98200 98200 I have automatic speech recognition working,
recognized: 100900 102300 demonstrate that I have a long gap in the
recognized: 105000 106400 demonstrate that I have a long gap in the audio.
recognized: 106800 110500 that I have automatic speech recognition working, but then in order to demonstrate that I have
recognized: 115240 116640 I have
recognized: 121380 122780 I have a long gap in the audio. that I have a long gap in the audio.
recognized: 125480 126880 I have a long gap in the
recognized: 129580 130980 I have a long gap in the
recognized: 132280 135080 I have a long gap in the
recognized: 135380 136780 have a long gap in the audio, but then in order to demonstrate that I have a long gap
recognized: 141860 143260 gap in the audio.
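The pattern in the tail of that log (the same committed text re-emitted under shifting timestamps) is easy to detect mechanically. A minimal sketch, assuming the `recognized: <beg_ms> <end_ms> <text>` line format shown above; the helper names are mine, not part of whisper_streaming:

```python
def parse_recognized(line: str):
    """Parse one 'recognized: <beg_ms> <end_ms> <text>' server output line."""
    body = line.split("recognized: ", 1)[1]
    beg, end, text = body.split(" ", 2)
    return int(beg), int(end), text

def looks_like_loop(lines, min_repeats: int = 3) -> bool:
    """Flag the degenerate state: identical text emitted min_repeats times in a row."""
    texts = [parse_recognized(line)[2].strip() for line in lines]
    run = 1
    for prev, cur in zip(texts, texts[1:]):
        run = run + 1 if prev == cur else 1
        if run >= min_repeats:
            return True
    return False
```

Feeding it the three consecutive "I have a long gap in the" lines above trips the heuristic, while the normal early output does not.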
I just got the same problem, but I don't know how to reproduce it.
Hi, lagging: check what the max packet size loaded in every processing iteration is. 65536 bytes is a bit over 2 seconds of audio. Make it much larger so it can catch up after the long pause. This bug is fixed in the vad_streaming branch.
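For scale, a back-of-the-envelope sketch of why 65536 is "2 seconds and a bit", assuming the server's 16 kHz, 16-bit mono PCM stream (the constant is named `PACKET_SIZE` in the copies of whisper_online_server.py I've seen, but check yours):

```python
# whisper_online_server.py streams raw 16 kHz, 16-bit (2-byte) mono PCM,
# so one second of audio is 16000 * 2 = 32000 bytes.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def packet_seconds(packet_bytes: int) -> float:
    """How many seconds of audio fit in one packet of raw PCM."""
    return packet_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)

def packet_bytes_for(seconds: float) -> int:
    """Packet size needed to swallow `seconds` of audio in a single read."""
    return int(seconds * SAMPLE_RATE * BYTES_PER_SAMPLE)

print(packet_seconds(65536))  # 2.048 -> "2 seconds and a bit"
print(packet_bytes_for(60))   # 1920000 bytes to cover a 60 s pause
```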
Hallucination: check whether the offline Whisper model with VAD hallucinates on your content. If yes, it's the model's problem; use another model.
Or check the audio quality, remove the noise, and have people speak fluently. Then the model works better.
@Gldkslfmsd Thanks for the suggestions. I stepped away from my project but I'm sure I'll appreciate the advice when I come back to it.
Then feel free to reopen if you follow up.
I set up a Flask web server and used a web microphone to record audio and perform real-time speech recognition successfully. The web server captures audio using the RecordRTC library and sends the audio data to whisper_online_server for processing.
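For anyone wiring up the same pipeline, the forwarding piece can be as small as the sketch below. This is my illustration, not the actual code from this setup; it assumes whisper_online_server.py is listening on its usual localhost:43007 (check your `--port`), and that the RecordRTC payload has already been decoded and resampled to raw 16 kHz, 16-bit mono PCM before forwarding:

```python
import socket

# Assumed defaults; adjust to your whisper_online_server.py invocation.
WHISPER_HOST, WHISPER_PORT = "localhost", 43007

def open_whisper_socket(host: str = WHISPER_HOST, port: int = WHISPER_PORT) -> socket.socket:
    """Connect to a running whisper_online_server.py instance."""
    return socket.create_connection((host, port))

def forward_chunk(sock: socket.socket, pcm_chunk: bytes) -> int:
    """Forward one chunk of raw 16 kHz, 16-bit mono PCM to the server.

    In a setup like the one described above, the Flask route receiving
    RecordRTC data would decode/resample it to raw PCM, then call this.
    Returns the number of bytes handed to the socket.
    """
    sock.sendall(pcm_chunk)
    return len(pcm_chunk)
```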
Here is part of my code. app.py:
static/js/app.js:
And I edited part of whisper_online_server.py and whisper_online.py: whisper_online.py:
whisper_online_server.py:
But I have encountered some issues. Initially the latency is acceptable (around 1-3 seconds), but sometimes it suddenly spikes. The model also occasionally generates hallucinated content during processing, and when hallucinations occur the latency tends to become even longer.
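A cheap way to see the spikes rather than just feel them is to log, per recognized segment, the gap between wall-clock emission time and the segment's audio end timestamp, and flag jumps above the recent baseline. A minimal sketch (all names hypothetical, both times in milliseconds from stream start):

```python
import statistics
from collections import deque

class LatencyTracker:
    """Track emission latency: wall-clock ms at emission minus segment end ms."""

    def __init__(self, window: int = 10):
        # Keep only the most recent `window` latency samples.
        self.samples = deque(maxlen=window)

    def add(self, wall_ms: int, seg_end_ms: int) -> None:
        """Record one segment: when it was emitted vs. where its audio ends."""
        self.samples.append(wall_ms - seg_end_ms)

    def spiking(self, factor: float = 3.0) -> bool:
        """True when the newest latency jumps well above the recent median."""
        if len(self.samples) < 2:
            return False
        baseline = statistics.median(list(self.samples)[:-1])
        return self.samples[-1] > factor * max(baseline, 1)
```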
My terminal:
and another terminal:
My environment:
Here is part of my terminal output:
My web screenshots: