ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Occasional Increasing Delay and Hallucination Issues #102

Closed: joelai0101 closed this issue 2 months ago

joelai0101 commented 3 months ago

I set up a Flask web server and used the browser microphone to record audio and perform real-time speech recognition successfully. The web server captures audio with the RecordRTC library and sends the audio data to whisper_online_server for processing.

Here is part of my code. app.py:

# Using web browser mic with RecordRTC

from flask import Flask, render_template, request, jsonify
import socket
import threading

app = Flask(__name__)

# Constants
SERVER_HOST = '140.117.169.202'
SERVER_PORT = 43007
BUFFER_SIZE = 1024

# Initialize variables
audio_streaming = False
response_buffer = []

# Initialize Socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (SERVER_HOST, SERVER_PORT)
sock.connect(server_address)

def is_allowed_char(char):
    return char.isalnum() or char in [' ', '.', ',', '!', '?', ':', ';', '-', "'", '"']

def receive_response():
    global audio_streaming
    while True:
        response = sock.recv(BUFFER_SIZE)
        if not response:
            break
        decoded_response = response.decode('utf-8')
        cleaned_response = ''.join(char for char in decoded_response if is_allowed_char(char))
        if cleaned_response:
            response_buffer.append(cleaned_response)

# Start background thread that receives transcription results from the server
receive_thread = threading.Thread(target=receive_response, daemon=True)
receive_thread.start()

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/start', methods=['POST'])
def start_streaming():
    global audio_streaming, response_buffer
    audio_streaming = True
    response_buffer = []
    return jsonify({"status": "started"})

@app.route('/stop', methods=['POST'])
def stop_streaming():
    global audio_streaming
    audio_streaming = False
    return jsonify({"status": "stopped"})

@app.route('/audio', methods=['POST'])
def receive_audio():
    if audio_streaming:
        audio_data = request.data
        sock.sendall(audio_data)
    return jsonify({"status": "received"})

@app.route('/get_transcription', methods=['GET'])
def get_transcription():
    response = "\n".join(response_buffer)
    return jsonify({"transcription": response})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

static/js/app.js:

// for app.py
document.addEventListener("DOMContentLoaded", function() {
    var startButton = document.getElementById('startButton');
    var stopButton = document.getElementById('stopButton');
    var transcription = document.getElementById('transcription');
    var intervalId = null;
    var mediaRecorder = null;

    startButton.addEventListener('click', function() {
        fetch('/start', { method: 'POST' })
            .then(response => response.json())
            .then(data => {
                console.log(data.status);
                startButton.disabled = true;
                stopButton.disabled = false;

                navigator.mediaDevices.getUserMedia({ audio: true })
                    .then(stream => {
                        mediaRecorder = RecordRTC(stream, {
                            type: 'audio',
                            mimeType: 'audio/webm',
                            sampleRate: 44100,
                            desiredSampRate: 16000,
                            recorderType: StereoAudioRecorder,
                            numberOfAudioChannels: 1,
                            timeSlice: 250, // send data every 250 ms
                            ondataavailable: function(blob) {
                                var reader = new FileReader();
                                reader.onload = function() {
                                    fetch('/audio', {
                                        method: 'POST',
                                        body: reader.result
                                    });
                                };
                                reader.readAsArrayBuffer(blob);
                            }
                        });
                        mediaRecorder.startRecording();
                    });

                intervalId = setInterval(fetchTranscription, 1000);
            });
    });

    stopButton.addEventListener('click', function() {
        fetch('/stop', { method: 'POST' })
            .then(response => response.json())
            .then(data => {
                console.log(data.status);
                startButton.disabled = false;
                stopButton.disabled = true;

                if (mediaRecorder) {
                    mediaRecorder.stopRecording(function() {
                        // MediaStream.stop() no longer exists in modern browsers;
                        // stop the individual tracks to release the microphone
                        mediaRecorder.stream.getTracks().forEach(function(track) {
                            track.stop();
                        });
                    });
                }

                if (intervalId) {
                    clearInterval(intervalId);
                    intervalId = null;
                }
            });
    });

    function fetchTranscription() {
        fetch('/get_transcription')
            .then(response => response.json())
            .then(data => {
                transcription.textContent = data.transcription;
            });
    }
});

And I edited parts of whisper_online.py and whisper_online_server.py.

whisper_online.py:

def transcribe(self, audio, init_prompt=""):

        # tested: beam_size=5 is faster and better than 1 (on one 200 second document from En ESIC, min chunk 0.01)
        segments, info = self.model.transcribe(
            audio, 
            language=self.original_language, 
            initial_prompt=init_prompt, 
            beam_size=5, 
            temperature=0, # add this to avoid "Compression ratio threshold is not met with temperature..."
            word_timestamps=True, 
            condition_on_previous_text=True, 
            **self.transcribe_kargs)
        #print(info)  # info contains language detection result

whisper_online_server.py:

def process(self):
    # handle one client connection
    self.online_asr_proc.init()
    beg = 0.0
    start = time.time() - beg
    end = 0
    while True:
        now = time.time() - start
        if now < end + self.min_chunk:
            time.sleep(self.min_chunk + end - now)
        end = time.time() - start
        a = self.receive_audio_chunk()
        if a is None:
            break
        beg = end
        self.online_asr_proc.insert_audio_chunk(a)
        try:
            o = online.process_iter()
            try:
                self.send_result(o)
            except BrokenPipeError:
                logger.info("broken pipe -- connection closed?")
                break
        except AssertionError as e:
            logger.error(f"assertion error: {e}")
            pass

        now = time.time() - start
        latency = now - end
        logger.debug(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {latency:.2f}")

But I have encountered some issues. Initially the latency is acceptable (around 1-3 seconds), but sometimes it suddenly spikes, and the model occasionally generates hallucinated content during processing. When hallucinations occur, the latency tends to become even longer.

My terminal:

python3 whisper_online_server.py --min-chunk-size 1

and another terminal:

python3 app.py

My environment:

Here is part of my terminal output:

DEBUG:__main__:Received raw bytes: 65536
DEBUG:__main__:Processed audio chunk of length: 32768
DEBUG:whisper_online:PROMPT: 辦法弄啊好啦主要就是他的型態32比較慢然後我的卡應該是這個問題不然就是可能就是多個人講啊講話沒有間隔可能更新比較慢會更新比較慢講話沒有間隔會比較慢他會他比較不會沒有間隔就比較不會那個去去切斷嗎對啊是因為這樣子是你那邊硬的比較慢還是這邊哇
DEBUG:whisper_online:CONTEXT: 這個
DEBUG:whisper_online:transcribing 9.32 seconds from 830.95
INFO:faster_whisper:Processing audio with duration 00:09.323
INFO:faster_whisper:VAD filter removed 00:01.760 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:04.240], [00:06.000 -> 00:09.323]
INFO:faster_whisper:Detected language 'zh' with probability 0.19
DEBUG:faster_whisper:Processing segment at 00:00.000
DEBUG:whisper_online:>>>>COMPLETE NOW: (None, None, '')
DEBUG:whisper_online:INCOMPLETE: (836.9499999999997, 839.3699999999997, '是你那邊硬的比較慢')
DEBUG:whisper_online:len of buffer now: 9.32
DEBUG:__main__:No text in this segment
DEBUG:__main__:## last processed 822.22 s, now is 823.37, the latency is 1.15
DEBUG:__main__:Received raw bytes: 65536
DEBUG:__main__:Processed audio chunk of length: 32768
DEBUG:whisper_online:PROMPT: 辦法弄啊好啦主要就是他的型態32比較慢然後我的卡應該是這個問題不然就是可能就是多個人講啊講話沒有間隔可能更新比較慢會更新比較慢講話沒有間隔會比較慢他會他比較不會沒有間隔就比較不會那個去去切斷嗎對啊是因為這樣子是你那邊硬的比較慢還是這邊哇
DEBUG:whisper_online:CONTEXT: 這個
DEBUG:whisper_online:transcribing 11.37 seconds from 830.95
INFO:faster_whisper:Processing audio with duration 00:11.371
INFO:faster_whisper:VAD filter removed 00:01.760 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:04.240], [00:06.000 -> 00:11.371]
INFO:faster_whisper:Detected language 'ja' with probability 0.27
DEBUG:faster_whisper:Processing segment at 00:00.000
DEBUG:whisper_online:>>>>COMPLETE NOW: (None, None, '')
DEBUG:whisper_online:INCOMPLETE: (836.9699999999997, 839.5499999999997, 'すぐになりにくくしちゃうもん')
DEBUG:whisper_online:len of buffer now: 11.37
DEBUG:__main__:No text in this segment
DEBUG:__main__:## last processed 823.37 s, now is 824.61, the latency is 1.24
DEBUG:__main__:Received raw bytes: 46272
DEBUG:__main__:Processed audio chunk of length: 23136
DEBUG:whisper_online:PROMPT: 辦法弄啊好啦主要就是他的型態32比較慢然後我的卡應該是這個問題不然就是可能就是多個人講啊講話沒有間隔可能更新比較慢會更新比較慢講話沒有間隔會比較慢他會他比較不會沒有間隔就比較不會那個去去切斷嗎對啊是因為這樣子是你那邊硬的比較慢還是這邊哇
DEBUG:whisper_online:CONTEXT: 這個
DEBUG:whisper_online:transcribing 12.82 seconds from 830.95
INFO:faster_whisper:Processing audio with duration 00:12.817
INFO:faster_whisper:VAD filter removed 00:01.760 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:04.240], [00:06.000 -> 00:12.817]
INFO:faster_whisper:Detected language 'ja' with probability 0.27
DEBUG:faster_whisper:Processing segment at 00:00.000
DEBUG:faster_whisper:Log probability threshold is not met with temperature 0.0 (-1.045780 < -1.000000)
DEBUG:whisper_online:>>>>COMPLETE NOW: (836.9699999999997, 839.5499999999997, 'すぐになりにくくしちゃうもん')
DEBUG:whisper_online:INCOMPLETE: (None, None, '')
DEBUG:whisper_online:len of buffer now: 12.82
836970 839550 すぐになりにくくしちゃうもん
DEBUG:__main__:## last processed 824.61 s, now is 825.93, the latency is 1.32
DEBUG:__main__:Received raw bytes: 47772
DEBUG:__main__:Processed audio chunk of length: 23886
DEBUG:whisper_online:PROMPT: 辦法弄啊好啦主要就是他的型態32比較慢然後我的卡應該是這個問題不然就是可能就是多個人講啊講話沒有間隔可能更新比較慢會更新比較慢講話沒有間隔會比較慢他會他比較不會沒有間隔就比較不會那個去去切斷嗎對啊是因為這樣子是你那邊硬的比較慢還是這邊哇
DEBUG:whisper_online:CONTEXT: 這個すぐになりにくくしちゃうもん
DEBUG:whisper_online:transcribing 14.31 seconds from 830.95
INFO:faster_whisper:Processing audio with duration 00:14.310
INFO:faster_whisper:VAD filter removed 00:01.760 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:04.240], [00:06.000 -> 00:14.310]
INFO:faster_whisper:Detected language 'ja' with probability 0.41
DEBUG:faster_whisper:Processing segment at 00:00.000
DEBUG:faster_whisper:Compression ratio threshold is not met with temperature 0.0 (30.285714 > 2.400000)
DEBUG:faster_whisper:Processing segment at 00:12.520
DEBUG:faster_whisper:Compression ratio threshold is not met with temperature 0.0 (2.470588 > 2.400000)
DEBUG:whisper_online:>>>>COMPLETE NOW: (None, None, '')
DEBUG:whisper_online:INCOMPLETE: (839.6699999999997, 845.2299999999997, 'っはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっはっ')
DEBUG:whisper_online:len of buffer now: 14.31
DEBUG:__main__:No text in this segment
DEBUG:__main__:## last processed 825.93 s, now is 833.56, the latency is 7.63
DEBUG:__main__:Received raw bytes: 65536
DEBUG:__main__:Processed audio chunk of length: 32768
DEBUG:whisper_online:PROMPT: 辦法弄啊好啦主要就是他的型態32比較慢然後我的卡應該是這個問題不然就是可能就是多個人講啊講話沒有間隔可能更新比較慢會更新比較慢講話沒有間隔會比較慢他會他比較不會沒有間隔就比較不會那個去去切斷嗎對啊是因為這樣子是你那邊硬的比較慢還是這邊哇
DEBUG:whisper_online:CONTEXT: 這個すぐになりにくくしちゃうもん
DEBUG:whisper_online:transcribing 16.36 seconds from 830.95
INFO:faster_whisper:Processing audio with duration 00:16.358
INFO:faster_whisper:VAD filter removed 00:01.760 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.000 -> 00:04.240], [00:06.000 -> 00:16.358]
INFO:faster_whisper:Detected language 'zh' with probability 0.88
DEBUG:faster_whisper:Processing segment at 00:00.000
DEBUG:faster_whisper:Compression ratio threshold is not met with temperature 0.0 (60.818182 > 2.400000)
DEBUG:faster_whisper:Processing segment at 00:14.560
DEBUG:whisper_online:>>>>COMPLETE NOW: (None, None, '')
DEBUG:whisper_online:INCOMPLETE: (839.6499999999997, 847.2699999999998, '哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈')
DEBUG:whisper_online:--- not enough segments to chunk
DEBUG:whisper_online:chunking segment
DEBUG:whisper_online:len of buffer now: 16.36
DEBUG:__main__:No text in this segment
DEBUG:__main__:## last processed 833.56 s, now is 841.20, the latency is 7.64

My web screenshots:

risacher commented 3 months ago

Without knowing your language, it's hard for me to tell if I'm having the same issue, but I'm doing a similar thing - streaming from a browser over WebRTC to whisper_online_server.py. It seems to work well when there is a continuous stream of speech, but as soon as there is a long pause - maybe 30 s or more - it starts to have issues. On the console I will see DEBUG --- last segment not within commited area and/or DEBUG --- not enough segments to chunk. Once this happens, the recognized speech becomes... odd. Often it will report the same word or phrase with increasing repetition. For example:

recognized: 2640 4080  I'm going to talk for a moment.
recognized: 4120 6220  to demonstrate
recognized: 6220 8040  that I have
recognized: 8040 9000  automatic speech
recognized: 9000 9560  recognition
recognized: 9560 11320  working, but then
recognized: 11320 12500  in order to
recognized: 13340 13580  what
recognized: 13580 15180  happens when there is a
recognized: 15180 16080  long gap
recognized: 16080 16820  in the audio
recognized: 16840 18120  I will stop
recognized: 18120 18720  talking
recognized: 18720 21060  right
recognized: 21060 21400  now.
recognized: 74020 76820  to demonstrate that I have automatic speech recognition working, but then in order to
recognized: 82480 83880  to demonstrate that I have automatic speech recognition working, but then in order to demonstrate
recognized: 88620 90020  demonstrate that I have automatic speech recognition working, but then in order to demonstrate that
recognized: 94760 96160  demonstrate that I have automatic speech recognition working, but then in order to demonstrate that
recognized: 98200 98200  I have automatic speech recognition working,
recognized: 100900 102300  demonstrate that I have a long gap in the
recognized: 105000 106400  demonstrate that I have a long gap in the audio.
recognized: 106800 110500  that I have automatic speech recognition working, but then in order to demonstrate that I have
recognized: 115240 116640  I have
recognized: 121380 122780  I have a long gap in the audio. that I have a long gap in the audio.
recognized: 125480 126880  I have a long gap in the
recognized: 129580 130980  I have a long gap in the
recognized: 132280 135080  I have a long gap in the
recognized: 135380 136780  have a long gap in the audio, but then in order to demonstrate that I have a long gap
recognized: 141860 143260  gap in the audio.
joelai0101 commented 3 months ago

Without knowing your language, it's hard for me to tell if I'm having the same issue, but I'm doing a similar thing - streaming from a browser over WebRTC to whisper_online_server.py. [...]

I just got the same problem, but I don't know how to reproduce it.

Screenshot 2024-07-02 at 10 40 48 AM
Gldkslfmsd commented 2 months ago

Hi. Lagging: check the maximum packet size loaded in each processing iteration. It seems that 65536 bytes is only a bit over 2 seconds of audio. Make it much larger so the server can catch up after a long pause (a sketch of this is below). This bug is fixed in the vad_streaming branch.
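
A minimal sketch of what that could look like, assuming the per-receive buffer in whisper_online_server.py is controlled by a constant such as PACKET_SIZE (the name and exact location may differ in your copy); 65536 bytes of 16-bit 16 kHz mono PCM is 65536 / 2 / 16000 ≈ 2.05 s per iteration:

# whisper_online_server.py -- sketch only; the constant name PACKET_SIZE is an
# assumption, check what your version actually uses as the per-recv() limit.
SAMPLING_RATE = 16000                 # 16-bit mono PCM => 2 bytes per sample
PACKET_SIZE = 2 * SAMPLING_RATE * 60  # allow up to ~60 s per receive, was 65536 (~2 s)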

Hallucination: check whether the offline Whisper model with VAD hallucinates on your content (a quick offline check is sketched below). If yes, it's a model problem; use another model.
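
For example, a minimal offline check, assuming faster-whisper is installed and test_clip.wav is a stretch of audio saved from a point where the stream hallucinated (model size, device and language are placeholders to adjust):

# Offline sanity check: does the same model + VAD hallucinate on this clip?
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "test_clip.wav",
    language="zh",                      # pin the language to avoid the zh/ja flip-flopping seen in the logs
    vad_filter=True,                    # same VAD filtering as the streaming setup
    condition_on_previous_text=False,   # rule out prompt-induced repetition
)
for s in segments:
    print(f"[{s.start:.2f} -> {s.end:.2f}] {s.text}")

If this offline run is clean but the streaming run still hallucinates, the problem is more likely in the buffering and prompting than in the model itself.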

Or check the audio quality, remove the noise, and have people speak fluently; the model works better then. One possible way to denoise a saved clip is sketched below.
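
A sketch of one denoising option, assuming the third-party noisereduce and soundfile packages are installed (any denoiser would do; noisy_clip.wav is a placeholder filename):

# One possible offline denoising pass before re-testing transcription.
import soundfile as sf
import noisereduce as nr

data, rate = sf.read("noisy_clip.wav")        # load the saved clip
cleaned = nr.reduce_noise(y=data, sr=rate)    # spectral-gating noise reduction
sf.write("cleaned_clip.wav", cleaned, rate)   # re-run the offline check on this file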

risacher commented 2 months ago

@Gldkslfmsd Thanks for the suggestions. I stepped away from my project but I'm sure I'll appreciate the advice when I come back to it.

Gldkslfmsd commented 2 months ago

Then feel free to reopen if you follow up.