h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

[Question] Concurrent user requests handling for Windows system without vLLM? #1189

Open hpxiong opened 9 months ago

hpxiong commented 9 months ago

I've discovered that vLLM is only available for Linux systems. Currently my h2oGPT setup has caused a backlog of requests on our Windows-based system, since requests are processed in a queue one at a time. Many suggestions here are to use vLLM, but it's not compatible with Windows-based systems. Any suggestions for running concurrent requests on a Windows system?

hpxiong commented 9 months ago

@pseudotensor - Can you please share some more details on how to set up h2oGPT as a Gradio inference server? I tried to follow the documentation to run two processes.

pseudotensor commented 9 months ago

If you mean having the LLM itself handle concurrent requests, that's not possible without TGI/vLLM unless one disables the cache for the LLM, which is not recommended as it will be slower. Quantized models have the same issue: llama.cpp is not multi-threaded, so it fails when used with gradio-like multi-threading.
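As a rough illustration of why this breaks and the usual workaround, one can serialize all calls into llama.cpp behind a lock so gradio's worker threads never enter it concurrently (requests still complete one at a time). This is just a sketch, not h2oGPT code; it assumes the llama-cpp-python package, and the model path is a placeholder:

```python
# Sketch only: serialize access to a llama.cpp model so concurrent gradio
# worker threads never call into it at the same time.
import threading

from llama_cpp import Llama

_llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")  # placeholder path
_llm_lock = threading.Lock()

def generate(prompt: str) -> str:
    # Only one thread at a time may use the llama.cpp context; other requests
    # block here, so throughput remains serial even if the UI accepts them.
    with _llm_lock:
        out = _llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]
```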

If you mean having a separate gradio server host the LLM, to free up the main gradio app from being blocked by its own queue, that's possible with the gradio inference server, but it won't speed up the LLM calls themselves.

In principle one could batch things in gradio, like here: https://github.com/gradio-app/gradio/blob/main/demo/diffusers_with_batching/run.py
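For an LLM, a hedged adaptation of that demo might look like the following. It is only a sketch assuming a plain transformers text-generation pipeline; the model name and batch size are illustrative, not what h2oGPT uses:

```python
# Sketch of gradio's batching: with batch=True, gradio collects up to
# max_batch_size queued requests and passes them to the function as a list.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model

def generate(prompts: list[str]) -> list[str]:
    # One forward pass over the whole batch of queued prompts.
    outputs = generator(prompts, max_new_tokens=128)
    return [out[0]["generated_text"] for out in outputs]

demo = gr.Interface(generate, gr.Textbox(), gr.Textbox(), batch=True, max_batch_size=8)
demo.queue().launch()
```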

Then one could use the gradio inference server but change the code a bit to pass fewer things and use the pure predict endpoint.
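Calling such a server from the main app could then go through the gradio_client package, roughly as below. The URL and api_name are placeholders; h2oGPT's actual endpoint names and argument layout may differ:

```python
# Sketch: call another gradio app's predict endpoint from a separate process.
from gradio_client import Client

client = Client("http://localhost:7860")  # placeholder inference-server URL
result = client.predict("What is h2oGPT?", api_name="/predict")  # placeholder api_name
print(result)
```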

I haven't seen demo code where gradio uses batching with LLMs for some reason. Here is a chatbot example for streaming that could be used as an inference server, but it has no batching:

https://www.gradio.app/guides/creating-a-chatbot-fast#example-using-a-local-open-source-llm-with-hugging-face
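For reference, a condensed, hedged version of that guide's pattern (streaming a local Hugging Face model through gr.ChatInterface) is sketched below; the model id is just a small placeholder and this is not h2oGPT's own server code:

```python
# Sketch of the streaming-chatbot pattern from the gradio guide: a local HF
# model whose tokens are streamed back to the UI as they are generated.
from threading import Thread

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(message, history):
    # Build the prompt from prior turns plus the new user message.
    messages = list(history) + [{"role": "user", "content": message}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Generate in a background thread and yield partial text as it streams.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate,
           kwargs=dict(inputs=inputs, streamer=streamer, max_new_tokens=256)).start()

    partial = ""
    for token in streamer:
        partial += token
        yield partial

gr.ChatInterface(chat, type="messages").launch()
```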

Feel free to investigate this.