kdcyberdude opened this issue 1 month ago
I have the same idea, and I think we should move the VAD part into the client and change the stream format to audio files. That would be much easier, right?
@NEALWE, this approach should work and be relatively straightforward to implement. However, there is a potential trade-off in latency.
Also, it's important to note that this won't follow a streaming model. Instead, it will work by making sequential calls to an endpoint for ASR -> LLM -> TTS.
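For illustration, the non-streaming flow I have in mind would look roughly like this on the client side; the endpoint paths, response fields, and base URL are placeholders, not the project's actual API:

```python
# Rough sketch of the non-streaming flow: three sequential HTTP calls.
# The endpoint paths ("/asr", "/llm", "/tts"), the response fields, and BASE
# are placeholders, not the project's actual API.
import requests

BASE = "http://localhost:8000"  # assumed server address

def voice_turn(audio_bytes: bytes) -> bytes:
    """One conversational turn: ASR -> LLM -> TTS, each as a blocking call."""
    text = requests.post(f"{BASE}/asr", files={"audio": audio_bytes}).json()["text"]
    reply = requests.post(f"{BASE}/llm", json={"prompt": text}).json()["reply"]
    speech = requests.post(f"{BASE}/tts", json={"text": reply}).content
    return speech  # synthesized audio for the client to play back
```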
I mean, you can add a user_id to the input_sequences and carry it through the pipeline, then use the user_id to route each response back to the correct client (roughly as sketched below). However, the latency won't be handled very well; I've tried this.
@kdcyberdude Of course, you can run another Python script and take over two more ports to serve another client, but that becomes risky when there are a lot of people online.
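To show what I mean by tagging requests, here is a rough sketch; `pipeline_queue`, `response_queues`, `submit`, and `pipeline_worker` are just illustrative names, not anything from the repo:

```python
# Illustrative only: pipeline_queue, response_queues, submit, and pipeline_worker
# are hypothetical names, not taken from the repository.
import queue

pipeline_queue = queue.Queue()   # shared input queue feeding the model pipeline
response_queues = {}             # user_id -> per-client output queue

def submit(user_id: str, audio_chunk: bytes) -> None:
    """Tag each request with the client's id before it enters the shared pipeline."""
    response_queues.setdefault(user_id, queue.Queue())
    pipeline_queue.put({"user_id": user_id, "audio": audio_chunk})

def pipeline_worker(process_fn) -> None:
    """Single shared worker: run the pipeline, then route the result back by user_id."""
    while True:
        item = pipeline_queue.get()
        result = process_fn(item["audio"])            # e.g. asr -> llm -> tts
        response_queues[item["user_id"]].put(result)  # deliver to the right client
```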
I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.
Often, GPUs are underutilized by a single client, especially when smaller models are in use (e.g., Wav2Vec 2.0 instead of Whisper), models are accessed via APIs (such as GPT-4), or clients remain idle for extended periods. In these cases, I believe it should be possible for multiple clients (at least 3-4) to connect simultaneously and more efficiently utilize the available GPU resources.
I want to discuss how to architect a system where a single model can handle inference requests from multiple clients concurrently, ensuring GPU resources are optimized.
My current thought is that each client should have its own dedicated `VAD` thread, while the `STT`, `LLM`, and `TTS` threads should be shared across clients. These shared threads could use a queue to handle pending requests, batching them together to process the next group efficiently.

I'd love to hear your thoughts on this approach or any potential improvements.
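To make the idea concrete, here is a rough sketch of one shared stage (STT) that batches requests from all clients; the queue names, batch limits, and `stt_model.transcribe_batch` are placeholders for whatever the real models and plumbing expose:

```python
# Rough sketch of one shared, batching stage (STT). The queue names, batch limits,
# and stt_model.transcribe_batch are placeholders/assumptions, not the real API.
import queue
import time

MAX_BATCH = 4       # upper bound on how many client requests to batch together
MAX_WAIT_S = 0.05   # how long to wait for more requests before flushing a batch

stt_queue = queue.Queue()   # per-client VAD threads push (user_id, audio) items here

def collect_batch(q: queue.Queue, max_batch: int, max_wait: float) -> list:
    """Block for the first item, then opportunistically gather more up to the limits."""
    batch = [q.get()]
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def stt_worker(stt_model, llm_queue: queue.Queue) -> None:
    """Shared STT thread: batch audio from all clients, keep each user_id attached."""
    while True:
        batch = collect_batch(stt_queue, MAX_BATCH, MAX_WAIT_S)
        user_ids, audios = zip(*batch)
        texts = stt_model.transcribe_batch(list(audios))  # hypothetical batched call
        for uid, text in zip(user_ids, texts):
            llm_queue.put((uid, text))                    # hand off to the shared LLM stage
```

The `LLM` and `TTS` stages would follow the same pattern, each running as one shared thread pulling from its own queue, while every client keeps only a lightweight `VAD` thread that pushes `(user_id, audio)` items into `stt_queue`.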