ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Cannot handle multiple streams concurrently #138

Closed · MohammedShokr closed this issue 3 days ago

MohammedShokr commented 4 days ago

Issue

I implemented a WebSocket-based version of the whisper_online_server to handle audio streams from clients over WebSocket connections. The implementation works as expected when a single client is streaming; however, when two clients stream simultaneously, significant issues arise. Here is my implementation:

```python
import asyncio
import io
import sys
import threading

import librosa
import numpy as np
import soundfile as sf
import websockets

from whisper_online import FasterWhisperASR, VACOnlineASRProcessor

# Initialize the model once and share it among all clients
model = FasterWhisperASR(lan="ar", modelsize="large-v2", compute_type="float16", device="cuda")
model.use_vad()

# Lock to ensure thread-safe access to the model
model_lock = threading.Lock()

def process_audio_sync(online_asr_processor, audio):
    """Synchronous function to process audio using the ASR model."""
    online_asr_processor.insert_audio_chunk(audio)
    # Lock the model during the critical section
    with model_lock:
        output = online_asr_processor.process_iter()
    return output

async def process_audio(websocket: websockets.WebSocketServerProtocol, path):
    # Create a per-client processor using the shared model
    online_asr_processor = VACOnlineASRProcessor(
        online_chunk_size=1,
        asr=model,
        tokenizer=None,
        buffer_trimming=("segment", 15),
        logfile=sys.stderr,
    )

    loop = asyncio.get_running_loop()

    async for message in websocket:
        if not isinstance(message, bytes):
            print(message)
            continue

        # Decode the raw 16 kHz mono PCM_16 payload into float32 samples
        sound_file = sf.SoundFile(
            io.BytesIO(message),
            channels=1,
            endian="LITTLE",
            samplerate=16000,
            subtype="PCM_16",
            format="RAW",
        )
        audio, _ = librosa.load(sound_file, sr=16000, dtype=np.float32, mono=True)

        # Offload the blocking ASR processing to a thread pool executor
        output = await loop.run_in_executor(None, process_audio_sync, online_asr_processor, audio)

        if output[0] is None:
            continue
        output_formatted = online_asr_processor.to_flush([output])
        await websocket.send(output_formatted[2].encode())

async def main():
    # Disable the server's keepalive pings to prevent timeouts
    async with websockets.serve(process_audio, "localhost", 8765, ping_interval=None, ping_timeout=None):
        await asyncio.Future()

if __name__ == "__main__":
    asyncio.run(main())
```
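
For reference, here is a minimal client sketch that streams a file to this server in real time; the file name `sample_ar.wav` and the chunk/poll timings are assumptions, not from the original report. Running two such clients at once is the scenario described above.

```python
import asyncio

import soundfile as sf
import websockets

async def stream_file(path: str, chunk_seconds: float = 1.0):
    # Read the whole file as 16-bit PCM; the server expects 16 kHz mono audio.
    audio, sr = sf.read(path, dtype="int16")
    chunk = int(sr * chunk_seconds)
    async with websockets.connect("ws://localhost:8765") as ws:
        for start in range(0, len(audio), chunk):
            # Send raw PCM_16 bytes and pace the stream in real time
            await ws.send(audio[start:start + chunk].tobytes())
            await asyncio.sleep(chunk_seconds)
            # Drain any transcripts the server has produced so far
            try:
                while True:
                    reply = await asyncio.wait_for(ws.recv(), timeout=0.05)
                    print(reply.decode())
            except asyncio.TimeoutError:
                pass

if __name__ == "__main__":
    asyncio.run(stream_file("sample_ar.wav"))  # hypothetical test file
```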

Gldkslfmsd commented 4 days ago

Hi, it's a known limitation. GPUs are fastest with one process only, so sequential processing is faster than concurrent. The current Whisper-Streaming is intended for one client at a time. Batching (#42) could help, but there would still be a slowdown; refer to #42 for details.
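
As a concrete illustration of that one-client-at-a-time constraint, here is a hedged sketch (not part of whisper_streaming; `gpu_queue`, `gpu_worker`, and `transcribe` are illustrative names) that funnels all clients through a single worker so GPU inference never runs concurrently:

```python
import asyncio

# All client handlers funnel work through one queue, so the GPU runs
# at most one inference at a time (sequential, as noted above).
gpu_queue: asyncio.Queue = asyncio.Queue()

async def gpu_worker():
    # Sole consumer of the queue; start once with asyncio.create_task(gpu_worker())
    loop = asyncio.get_running_loop()
    while True:
        processor, audio, future = await gpu_queue.get()
        processor.insert_audio_chunk(audio)
        # The blocking GPU call runs off the event loop; having a single
        # worker still guarantees strictly sequential inference across clients.
        future.set_result(await loop.run_in_executor(None, processor.process_iter))
        gpu_queue.task_done()

async def transcribe(processor, audio):
    # Drop-in for the lock-based process_audio_sync call in the handler above:
    # each client awaits its turn on the GPU instead of contending for a lock.
    future = asyncio.get_running_loop().create_future()
    await gpu_queue.put((processor, audio, future))
    return await future
```

Even fully serialized like this, per-client latency grows with the number of connected clients, which is why batching (#42) is the suggested direction for real multi-user throughput.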

MohammedShokr commented 4 days ago

Hi @Gldkslfmsd, thanks for your reply. So the current Whisper-Streaming cannot be used in production applications with many users? Is it just a proof of concept for streaming?

Gldkslfmsd commented 3 days ago

Yes, it's a demo, a proof of concept. It is not meant for many users concurrently.