I recommend checking the source sound quality. If it's bad, the latency is bad. Maybe you aren't speaking close enough to your mic, or the signal is too quiet or too noisy... Then check your hardware, and check e.g. whether faster-whisper in offline mode is also slow.
Reopen if you have more enquiries. Good luck!
Thank you for your kind suggestions. I found the reason: "the [30s,INF] part of the audio stream produces no output" happens because the function chunk_completed_segment() doesn't actually do anything:
```python
def chunk_completed_segment(self, res):
    if self.commited == []:
        return

    ends = self.asr.segments_end_ts(res)
    t = self.commited[-1][1]

    if len(ends) > 1:
        e = ends[-2] + self.buffer_time_offset
        while len(ends) > 2 and e > t:  # Here: sadly, I found 'e' was always larger than 't', so the loop never removes anything
            ends.pop(-1)
            e = ends[-2] + self.buffer_time_offset
        if e <= t:  # Here: sadly, 'e' was always larger than 't', so chunk_at() is never called
            logger.debug(f"--- segment chunked at {e:2.2f}")
            self.chunk_at(e)
        else:
            logger.debug(f"--- last segment not within commited area")
    else:
        logger.debug(f"--- not enough segments to chunk")
```
Only when the audio_buffer grows beyond 30 seconds does Whisper force a segment boundary, so that 'ends' yields e = 29.96 with t = 29.96 (ps: vanilla Whisper is trained on fixed-length 30-second windows). Only in that condition does chunk_completed_segment() actually chunk.
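To make the failure mode concrete (the numbers below are hypothetical, not taken from an actual log, and buffer_time_offset is assumed to be 0): the commited words agreed on by two consecutive updates lag behind the newest hypothesis, so even the second-to-last segment end tends to lie past the last commited word:

```python
# Hypothetical values only, to illustrate why e <= t never holds before the 30 s mark.
ends = [20.0, 25.0]   # segment end timestamps reported by Whisper for the current buffer
t = 18.0              # end time of the last commited word, which lags behind the hypothesis

e = ends[-2]          # 20.0 (buffer_time_offset assumed 0): e > t and len(ends) == 2,
                      # so the while loop pops nothing and chunk_at(e) is never reached
```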
I gave up on chunk_completed_segment() and, luckily, found your commented-out alternative (Lines 404 to 412):
```python
# alternative: on any word
#l = self.buffer_time_offset + len(self.audio_buffer)/self.SAMPLING_RATE - 10
# '10' is remaining length of audio buffer for next transcription.
# let's find commited word that is less
#k = len(self.commited)-1
#while k>0 and self.commited[k][1] > l:
#    k -= 1
#t = self.commited[k][1]
logger.debug("chunking segment")
#self.chunk_at(t)
```
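For reference, here is how that block reads once uncommented (this is only my de-commenting of the lines above, with the loop body indented; nothing else is changed):

```python
# alternative: chunk on any commited word, not only on a completed segment
l = self.buffer_time_offset + len(self.audio_buffer)/self.SAMPLING_RATE - 10
# '10' is the remaining length of audio buffer kept for the next transcription;
# find the last commited word that ends before l
k = len(self.commited)-1
while k > 0 and self.commited[k][1] > l:
    k -= 1
t = self.commited[k][1]
logger.debug("chunking segment")
self.chunk_at(t)
```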
No code change is needed there beyond uncommenting it, but transcription still becomes slower and slower. That's because of the function self.chunk_at(t) (Lines 458 to 464):
```python
def chunk_at(self, time):
    """trims the hypothesis and audio buffer at "time" """
    self.transcript_buffer.pop_commited(time)
    cut_seconds = time - self.buffer_time_offset
    self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
    self.buffer_time_offset = time
```
It pops the words in self.transcript_buffer.commited_in_buffer according to "time", but self.commited still keeps every word committed since the very beginning. That makes this code:

```python
prompt, non_prompt = self.prompt()  # prompt is built from self.commited
res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)  # prompt keeps growing
```

run with a longer and longer self.commited, so the init_prompt keeps growing and the transcription gets slower and slower.
Just like "self.transcript_buffer.pop_commited(time)", I added 'self.pop_commited(time)' to self.commited, now it will be same with self.transcript_buffer, and won't be longer and longer, and transcription quality decrease not much.
Best wishes. I hope this issue no longer bothers you, and I hope this repository becomes even more influential.
Thanks for the feedback and analysis. I'm not sure whether I understood: did you edit the code and observe better latency/quality? Can you share the edit and test results, or a description of how to reproduce the test, e.g. via a PR?
Thanks!
I'm facing the same issue when testing with a 4-minute audio file. After 30s, the output to the out.txt file becomes very slow.
@zjyellow can you share your code that fixes this issue?
@danghieuan -- can you share your audio and parameter setup that reproduces it? I also noticed the issue recently but it's hard for me to reproduce it
Sorry for my late reply.
I tried with my recorded audio (length ~5 minutes) using the parameters below: --min-chunk-size=1, --task=transcribe, --vad=True, --buffer_trimming_sec=10.
I noticed that the results in the output.txt file are OK for the first ~30 seconds, but after that, the transcription becomes very slow.
I'm only testing with my recorded audio, not real-time mic input.
Could you help me check whether the code handles very long audio properly?
Sure. Can you send me the audio? And what model do you use?
I am only using my own recording, which has good quality and no noise; I tried other audio files but faced the same issue. I don't think it's a problem with audio quality or settings, as I have tried many different configurations. I believe the issue may be that it cannot handle very long audio files. I am using a Whisper model fine-tuned on Vietnamese data.
I see the long-delay problem only very rarely on English and Czech, so it's hard for me to get a sample audio file to debug on. It would help me if you sent me yours.
Maybe the lag is caused by your specific model. Does it expect to be used with prompts? Maybe it's overfitted to shorter audio, so on longer audio it often hallucinates or fluctuates and two consecutive updates rarely agree.
@Gldkslfmsd Would there be any impact if my Whisper model was fine-tuned on very short audio clips, such as those ranging from 2 to 5 seconds?
Probably. It sounds like a train-test mismatch. Test the quality of your model on longer audio in offline mode and you will see.
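For example, a minimal offline sanity check with faster-whisper (the model path, audio file, language, and device below are placeholders; substitute your fine-tuned model converted to CTranslate2 format):

```python
from faster_whisper import WhisperModel

# Placeholder paths and settings: point these at your own model and a long test recording.
model = WhisperModel("path/to/your-fine-tuned-model", device="cuda", compute_type="float16")
segments, info = model.transcribe("long_test_audio.wav", language="vi")

for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
```

If the output is already unstable or hallucinated on long audio in offline mode, that points to the model rather than the streaming code.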
For real-time streaming, I took the Whisper model and fine-tuned it with my available Vietnamese data; each audio clip used in fine-tuning averaged between 2 and 5 seconds.
Do you think I should retrain the model or use any other approach to better utilize your code?
Thank you!
I don't know, you need to experiment. Retraining with other data sounds good.
@Gldkslfmsd Thanks, I will try to investigate the source code to improve it.
Hi, this is excellent work, and I'm using it for a speech translation task. But I find that whisper_online_server.py runs best for the first 30 seconds; from 30s onwards, as the buffer grows, the transcription output becomes very slow or even stops entirely.
I tried different settings:
- min-chunk-size: [1.0, 2.0, 3.0, 4.0, 5.0]
- buffer-trimming-second: [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
- vac: [True, False]
- vad: [True, False]
- buffer trimming: segment
The [0,30s] output is cool, but the [30s,INF] part is very slow or produces no output.