ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation
MIT License

Very slow after 30 seconds of buffered audio #120

Closed. zjyellow closed this issue 2 days ago

zjyellow commented 2 months ago

Hi, this is excellent work, and I am using it in a speech translation project. I find that whisper_online_server.py runs well for the first 30 seconds, but once the buffer grows past 30 s, the transcription output becomes very slow or stops entirely.

I tried different settings:

- min-chunk-size: [1.0, 2.0, 3.0, 4.0, 5.0]
- buffer-trimming-second: [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
- vac: [True, False]
- vad: [True, False]
- buffer trimming: segment

Output over [0 s, 30 s] is fine, but over [30 s, INF] it is very slow or absent.
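For context, a typical server invocation along these lines might look as follows (a sketch only: the model name, host, and port are placeholders, and the flag spellings follow the repository's argument parser rather than the shorthand above):

python3 whisper_online_server.py --model large-v2 --language en \
    --min-chunk-size 1.0 --buffer_trimming segment --buffer_trimming_sec 10 \
    --host localhost --port 43007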

Gldkslfmsd commented 2 months ago

I recommend checking the source sound quality; if it's bad, the latency is bad. Maybe you aren't speaking close to your mic, or the signal is too quiet or too noisy... Then check your hardware, e.g. check whether faster-whisper in offline mode is also slow.

Reopen if you have more enquiries. Good luck!

zjyellow commented 2 months ago

Thank you for your kind suggestions. I found the reason: the [30 s, INF] part of the audio stream produces no output because the function chunk_completed_segment() never actually trims anything:

def chunk_completed_segment(self, res):
    if self.commited == []: return
    ends = self.asr.segments_end_ts(res)
    t = self.commited[-1][1]
    if len(ends) > 1:
        e = ends[-2]+self.buffer_time_offset
        while len(ends) > 2 and e > t: # Here: sadly, 'e' always stayed larger than t, so popping never brought e down to t
            ends.pop(-1)
            e = ends[-2]+self.buffer_time_offset
        if e <= t: # Here: sadly, 'e' was still larger than t, so chunk_at() was never called
            logger.debug(f"--- segment chunked at {e:2.2f}")
            self.chunk_at(e)
        else:
            logger.debug(f"--- last segment not within commited area")
    else:
        logger.debug(f"--- not enough segments to chunk")

Only once audio_buffer grew past 30 seconds would Whisper be forced to emit a segment end such that e = 29.96 and t = 29.96 (note: vanilla Whisper is trained on fixed 30-second windows); only under that condition did chunk_completed_segment() do any work.
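To make the failure mode concrete, here is a toy walk-through with invented timestamps (not real model output); the point is that the committed words lag behind the segment ends, so e never drops to or below t:

# Toy illustration of the failure mode, with invented timestamps.
ends = [15.1, 28.7]        # segment end times returned by the model
buffer_time_offset = 0.0
t = 14.3                   # end of the last committed word, lagging behind

e = ends[-2] + buffer_time_offset     # e = 15.1
print(len(ends) > 2 and e > t)        # False: the pop-loop body never runs
print(e <= t)                         # False: 15.1 > 14.3, chunk_at(e) is skipped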

I gave up on chunk_completed_segment() but luckily found your commented-out alternative (lines 404 to 412):

# alternative: on any word
#l = self.buffer_time_offset + len(self.audio_buffer)/self.SAMPLING_RATE - 10 # '10' is the remaining length of audio buffer kept for the next transcription
# let's find the last commited word that ends before l
#k = len(self.commited)-1
#while k>0 and self.commited[k][1] > l:
#    k -= 1
#t = self.commited[k][1]
logger.debug("chunking segment")
#self.chunk_at(t)
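Uncommented and lightly reformatted, that alternative reads as follows (my sketch of the same logic, living inside the same method as the original comment; behavior is unchanged):

# chunk at the last committed word that ends before the final
# 10 seconds of the buffer ('10' = audio kept for the next pass)
l = self.buffer_time_offset + len(self.audio_buffer) / self.SAMPLING_RATE - 10
k = len(self.commited) - 1
while k > 0 and self.commited[k][1] > l:
    k -= 1
t = self.commited[k][1]
logger.debug("chunking segment")
self.chunk_at(t)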

No further change is needed there once it is uncommented, but transcription still became slower and slower. The cause is the function self.chunk_at(t) (lines 458 to 464):

def chunk_at(self, time):
    """trims the hypothesis and audio buffer at "time"
    """
    self.transcript_buffer.pop_commited(time)
    cut_seconds = time - self.buffer_time_offset
    self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
    self.buffer_time_offset = time

It pops the words in self.transcript_buffer.commited_in_buffer up to time, but self.commited still keeps every word committed since the beginning of the stream. That means in this code:

prompt, non_prompt = self.prompt()  # prompt is built from self.commited
res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)  # prompt grows longer and longer

the ever-growing self.commited makes each res take longer and longer to compute.

Mirroring self.transcript_buffer.pop_commited(time), I added an equivalent trim of self.commited inside chunk_at(), so it stays in sync with self.transcript_buffer and stops growing without bound; transcription quality decreases only slightly.
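A minimal sketch of that change, assuming (as elsewhere in the code) that self.commited holds (start, end, word) tuples and that pop_commited() drops entries ending at or before the given time:

def chunk_at(self, time):
    """trims the hypothesis and audio buffer at "time"
    """
    self.transcript_buffer.pop_commited(time)
    # sketch of the added line: trim self.commited the same way, so the
    # prompt built from it in self.prompt() stops growing without bound
    self.commited = [w for w in self.commited if w[1] > time]
    cut_seconds = time - self.buffer_time_offset
    self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
    self.buffer_time_offset = time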

Best wishes. I hope this issue hasn't been a bother, and I hope the repository keeps gaining influence.

Gldkslfmsd commented 2 months ago

Thanks for the feedback and analysis. I'm not sure I understood: did you edit the code and observe better latency/quality? Can you share the edit plus test results, or a description of how to reproduce the test, e.g. via a PR?

Thanks!

danghieuan commented 1 month ago

I'm facing the same issue when testing with a 4-minute audio file. After 30 s, output to the out.txt file becomes very slow.

danghieuan commented 1 month ago

@zjyellow, can you share your code fix for this issue?

Gldkslfmsd commented 1 month ago

@danghieuan, can you share the audio and parameter setup that reproduces it? I also noticed the issue recently, but it's hard for me to reproduce.

danghieuan commented 1 month ago

Sorry for my late reply.

I tried with my own recorded audio (~5 minutes long) using the parameters below: --min-chunk-size=1, --task=transcribe, --vad=True, --buffer_trimming_sec=10.

I noticed that the results in the output.txt file are OK for the first ~30 seconds, but after that, the transcription becomes very slow.

I'm only testing with my recorded audio, not real-time mic input.

Could you help me check whether the code handles very long audio properly?

Gldkslfmsd commented 1 month ago

Sure. Can you send me the audio? And what model do you use?

danghieuan commented 1 month ago

I am using only my own recording, which is good quality with no noise; I tried other audio files and hit the same issue. I don't think it's a problem with audio quality or settings, as I have tried many different configurations. I believe the issue is that the code cannot handle very long audio files. I am using a Whisper model fine-tuned on Vietnamese data.

Gldkslfmsd commented 1 month ago

I see the long-delay problem only very rarely on English and Czech, so it's hard for me to get a sample audio file to debug on. It would help if you sent me yours.

Maybe the lag is caused by your specific model. Was it trained to use prompts? It may be overfitted to short audio and not to long, so on longer input it often hallucinates or fluctuates, and two consecutive updates rarely agree.

danghieuan commented 1 month ago

@Gldkslfmsd Would there be any impact if my Whisper model were fine-tuned on very short audio clips, e.g. ranging from 2 to 5 seconds?

Gldkslfmsd commented 1 month ago

Probably. It sounds like a train-test mismatch. Test your model's quality on longer audio in offline mode and you will see.
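For example, a minimal offline sanity check with faster-whisper might look like this (a sketch: the model path, device, language code, and file name are placeholders to adapt):

# Offline transcription of a long file, to compare against streaming.
from faster_whisper import WhisperModel

model = WhisperModel("path/to/your-finetuned-model", device="cuda",
                     compute_type="float16")
segments, info = model.transcribe("long_audio.wav", language="vi")
print(f"audio duration: {info.duration:.1f} s")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")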

danghieuan commented 1 month ago

For my goal of real-time streaming, I took the Whisper model and fine-tuned it on my available Vietnamese data, with each training clip averaging 2 to 5 seconds.

Do you think I should retrain the model, or is there another approach that would make better use of your code?

Thank you!

Gldkslfmsd commented 1 month ago

I don't know; you'd need to experiment. Retraining with other data sounds good.

danghieuan commented 1 month ago

@Gldkslfmsd Thanks, I will try to investigate the source code to improve it.