ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Cannot handle multiple streams concurrently #138

Closed · MohammedShokr closed this issue 3 days ago

MohammedShokr commented 4 days ago

Issue

I implemented a WebSocket-based version of the whisper_online_server to handle audio streams from clients over WebSocket connections. The implementation works as expected when a single client is streaming; however, when two clients stream simultaneously, significant issues arise. Here is my implementation:

```python
import asyncio
import io
import sys
import threading

import librosa
import numpy as np
import soundfile as sf
import websockets

from whisper_online import FasterWhisperASR, VACOnlineASRProcessor

# Initialize the model once and share it among all clients
model = FasterWhisperASR(lan="ar", modelsize="large-v2", compute_type="float16", device="cuda")
model.use_vad()

# Lock to ensure thread-safe access to the model
model_lock = threading.Lock()

def process_audio_sync(online_asr_processor, audio):
    """Synchronous function to process audio using the ASR model."""
    online_asr_processor.insert_audio_chunk(audio)
    # Lock the model during the critical section
    with model_lock:
        output = online_asr_processor.process_iter()
    return output

async def process_audio(websocket: websockets.WebSocketServerProtocol, path):
    # Create a per-client processor using the shared model
    online_asr_processor = VACOnlineASRProcessor(
        online_chunk_size=1,
        asr=model,
        tokenizer=None,
        buffer_trimming=("segment", 15),
        logfile=sys.stderr,
    )

    loop = asyncio.get_running_loop()

    async for message in websocket:
        if not isinstance(message, bytes):
            print(message)
            continue

        # Decode the raw 16 kHz mono PCM_16 payload into float32 samples
        sound_file = sf.SoundFile(
            io.BytesIO(message),
            channels=1,
            endian="LITTLE",
            samplerate=16000,
            subtype="PCM_16",
            format="RAW",
        )
        audio, _ = librosa.load(sound_file, sr=16000, dtype=np.float32, mono=True)

        # Offload the blocking ASR processing to a thread pool executor
        output = await loop.run_in_executor(None, process_audio_sync, online_asr_processor, audio)

        if output[0] is None:
            continue
        output_formatted = online_asr_processor.to_flush([output])
        await websocket.send(output_formatted[2].encode())

async def main():
    # Disable the server's keepalive pings to prevent timeouts
    async with websockets.serve(process_audio, "localhost", 8765, ping_interval=None, ping_timeout=None):
        await asyncio.Future()

if __name__ == "__main__":
    asyncio.run(main())
```
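
For reference, here is a minimal client sketch that streams a file to this server in real time; the file name `sample_ar.wav` and the chunk/poll timings are assumptions, not from the original report. Running two such clients at once is the scenario described above.

```python
import asyncio

import soundfile as sf
import websockets

async def stream_file(path: str, chunk_seconds: float = 1.0):
    # Read the whole file as 16-bit PCM; the server expects 16 kHz mono audio.
    audio, sr = sf.read(path, dtype="int16")
    chunk = int(sr * chunk_seconds)
    async with websockets.connect("ws://localhost:8765") as ws:
        for start in range(0, len(audio), chunk):
            # Send raw PCM_16 bytes and pace the stream in real time
            await ws.send(audio[start:start + chunk].tobytes())
            await asyncio.sleep(chunk_seconds)
            # Drain any transcripts the server has produced so far
            try:
                while True:
                    reply = await asyncio.wait_for(ws.recv(), timeout=0.05)
                    print(reply.decode())
            except asyncio.TimeoutError:
                pass

if __name__ == "__main__":
    asyncio.run(stream_file("sample_ar.wav"))  # hypothetical test file
```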

Gldkslfmsd commented 4 days ago

Hi, it's a known limitation. GPUs are fastest with one process only, so sequential processing is faster than concurrent. The current Whisper-Streaming is intended for one client at a time. Batching (#42) could help, but there would still be a slowdown; refer to #42 for details.
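
As a concrete illustration of that one-client-at-a-time constraint, here is a hedged sketch (not part of whisper_streaming; `gpu_queue`, `gpu_worker`, and `transcribe` are illustrative names) that funnels all clients through a single worker so GPU inference never runs concurrently:

```python
import asyncio

# All client handlers funnel work through one queue, so the GPU runs
# at most one inference at a time (sequential, as noted above).
gpu_queue: asyncio.Queue = asyncio.Queue()

async def gpu_worker():
    # Sole consumer of the queue; start once with asyncio.create_task(gpu_worker())
    loop = asyncio.get_running_loop()
    while True:
        processor, audio, future = await gpu_queue.get()
        processor.insert_audio_chunk(audio)
        # The blocking GPU call runs off the event loop; having a single
        # worker still guarantees strictly sequential inference across clients.
        future.set_result(await loop.run_in_executor(None, processor.process_iter))
        gpu_queue.task_done()

async def transcribe(processor, audio):
    # Drop-in for the lock-based process_audio_sync call in the handler above:
    # each client awaits its turn on the GPU instead of contending for a lock.
    future = asyncio.get_running_loop().create_future()
    await gpu_queue.put((processor, audio, future))
    return await future
```

Even fully serialized like this, per-client latency grows with the number of connected clients, which is why batching (#42) is the suggested direction for real multi-user throughput.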

MohammedShokr commented 4 days ago

Hi @Gldkslfmsd, thanks for your reply. So the current Whisper-Streaming cannot be used in production applications with many users? Is it just a proof of concept for streaming?

Gldkslfmsd commented 3 days ago

Yes, it's a demo, a proof of concept. It is not meant for many users concurrently.