kdcyberdude opened this issue 1 month ago
I have the same idea, and I think we should move the VAD part into the client and change the stream format to audio files. That would be much easier, right?
@NEALWE, this approach should work and be relatively straightforward to implement. However, there is a potential trade-off in latency.
Also, it's important to note that this won't follow a streaming model. Instead, it will work by making sequential calls to an endpoint for ASR -> LLM -> TTS.
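For illustration, the non-streaming flow I have in mind would look roughly like this on the client side; the endpoint paths, response fields, and base URL are placeholders, not the project's actual API:

```python
# Rough sketch of the non-streaming flow: three sequential HTTP calls.
# The endpoint paths ("/asr", "/llm", "/tts"), the response fields, and BASE
# are placeholders, not the project's actual API.
import requests

BASE = "http://localhost:8000"  # assumed server address

def voice_turn(audio_bytes: bytes) -> bytes:
    """One conversational turn: ASR -> LLM -> TTS, each as a blocking call."""
    text = requests.post(f"{BASE}/asr", files={"audio": audio_bytes}).json()["text"]
    reply = requests.post(f"{BASE}/llm", json={"prompt": text}).json()["reply"]
    speech = requests.post(f"{BASE}/tts", json={"text": reply}).content
    return speech  # synthesized audio for the client to play back
```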
I mean, you can add a user_id to the input_sequences and carry it through the pipeline, then use the user_id to route each response back to the correct client (roughly as sketched below). However, the latency won't be handled very well; I've tried this.
@kdcyberdude Of course, you can run another Python script and take over two more ports to serve another client, but that becomes risky when there are a lot of people online.
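To show what I mean by tagging requests, here is a rough sketch; `pipeline_queue`, `response_queues`, `submit`, and `pipeline_worker` are just illustrative names, not anything from the repo:

```python
# Illustrative only: pipeline_queue, response_queues, submit, and pipeline_worker
# are hypothetical names, not taken from the repository.
import queue

pipeline_queue = queue.Queue()   # shared input queue feeding the model pipeline
response_queues = {}             # user_id -> per-client output queue

def submit(user_id: str, audio_chunk: bytes) -> None:
    """Tag each request with the client's id before it enters the shared pipeline."""
    response_queues.setdefault(user_id, queue.Queue())
    pipeline_queue.put({"user_id": user_id, "audio": audio_chunk})

def pipeline_worker(process_fn) -> None:
    """Single shared worker: run the pipeline, then route the result back by user_id."""
    while True:
        item = pipeline_queue.get()
        result = process_fn(item["audio"])            # e.g. asr -> llm -> tts
        response_queues[item["user_id"]].put(result)  # deliver to the right client
```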
I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.
Often, GPUs are underutilized by a single client, especially when smaller models are in use (e.g., Wav2Vec 2.0 instead of Whisper), models are accessed via APIs (such as GPT-4), or clients remain idle for extended periods. In these cases, I believe it should be possible for multiple clients (at least 3-4) to connect simultaneously and more efficiently utilize the available GPU resources.
I want to discuss how to architect a system where a single model can handle inference requests from multiple clients concurrently, ensuring GPU resources are optimized.
My current thought is that each client should have its own dedicated `VAD` thread, while the `STT`, `LLM`, and `TTS` threads should be shared across clients. These shared threads could use a queue to handle pending requests, batching them together to process the next group efficiently.

I'd love to hear your thoughts on this approach or any potential improvements.
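To make the idea concrete, here is a rough sketch of one shared stage (STT) that batches requests from all clients; the queue names, batch limits, and `stt_model.transcribe_batch` are placeholders for whatever the real models and plumbing expose:

```python
# Rough sketch of one shared, batching stage (STT). The queue names, batch limits,
# and stt_model.transcribe_batch are placeholders/assumptions, not the real API.
import queue
import time

MAX_BATCH = 4       # upper bound on how many client requests to batch together
MAX_WAIT_S = 0.05   # how long to wait for more requests before flushing a batch

stt_queue = queue.Queue()   # per-client VAD threads push (user_id, audio) items here

def collect_batch(q: queue.Queue, max_batch: int, max_wait: float) -> list:
    """Block for the first item, then opportunistically gather more up to the limits."""
    batch = [q.get()]
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def stt_worker(stt_model, llm_queue: queue.Queue) -> None:
    """Shared STT thread: batch audio from all clients, keep each user_id attached."""
    while True:
        batch = collect_batch(stt_queue, MAX_BATCH, MAX_WAIT_S)
        user_ids, audios = zip(*batch)
        texts = stt_model.transcribe_batch(list(audios))  # hypothetical batched call
        for uid, text in zip(user_ids, texts):
            llm_queue.put((uid, text))                    # hand off to the shared LLM stage
```

The `LLM` and `TTS` stages would follow the same pattern, each running as one shared thread pulling from its own queue, while every client keeps only a lightweight `VAD` thread that pushes `(user_id, audio)` items into `stt_queue`.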