hpxiong opened this issue 9 months ago
@pseudotensor - Can you please share some more details on how to set up h2oGPT as a Gradio inference server? I tried to follow the documentation and run two processes:
SAVE_DIR=./save/ python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b
python generate.py --inference_server="http://192.168.0.10:7680" --base_model=h2oai/h2ogpt-oasst1-falcon-40b
But apparently this does not work. So my question is whether the current h2oGPT code base can be used as an inference server at all. If yes, how should I set it up properly?
If you mean having the LLM itself handle concurrent requests, that's not possible without TGI/vLLM unless one disables the cache for the LLM, which is not recommended as it will be slower. For quantized models there is the same issue: llama.cpp is not multi-threaded, so it fails when used with Gradio-like multi-threading.
If you mean having a separate Gradio server host the LLM, to free the main Gradio app from being blocked by its own queue, that's possible with the Gradio inference server, but it won't speed up the LLM calls themselves.
In principle one could batch things in gradio, like here: https://github.com/gradio-app/gradio/blob/main/demo/diffusers_with_batching/run.py
Then one could use the Gradio inference server, but change the code a bit to pass fewer arguments and use the pure predict endpoint.
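For illustration, a minimal sketch of querying such a server through its predict endpoint with gradio_client; the URL and api_name below are placeholders, and the real h2oGPT app exposes its own named endpoints with a longer argument list:

```python
# Hedged sketch: call a Gradio app hosting the LLM via its predict endpoint.
# The URL and api_name are placeholders, not h2oGPT's actual endpoint.
from gradio_client import Client

client = Client("http://192.168.0.10:7860")  # Gradio app hosting the LLM
client.view_api()                            # prints the named endpoints the app exposes
result = client.predict("What is h2oGPT?", api_name="/predict")  # hypothetical endpoint name
print(result)
```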
I haven't seen demo code where Gradio uses batching with LLMs, for some reason. Here is a chatbot case for streaming that could be used as an inference server, but it has no batching:
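For a rough idea, here is a minimal sketch of what Gradio batching in front of an LLM could look like, following the pattern of the diffusers demo above but with a small Hugging Face text model; the model name, generation settings, and batch size are illustrative, not what h2oGPT uses:

```python
# Hedged sketch of Gradio batching with an LLM; model and settings are placeholders.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative small model

def generate(prompts):
    # With batch=True, Gradio passes a list of queued prompts; the pipeline runs
    # them as one batch, and we return one list (per output component) of results.
    outputs = generator(list(prompts), max_new_tokens=64)
    texts = [out[0]["generated_text"] for out in outputs]
    return [texts]

demo = gr.Interface(generate, "textbox", "textbox", batch=True, max_batch_size=4)
demo.queue().launch()
```

With batch=True, Gradio collects up to max_batch_size queued requests and hands them to the function as a single list, so the model can process them in one forward pass instead of one request at a time.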
Feel free to investigate this.
I've discovered that vLLM is only available for Linux systems. Currently my h2oGPT setup has caused a backlog of requests on our Windows-based system, since requests are processed one at a time in a queue. Many of the suggestions here are to use vLLM, but it's not compatible with Windows. Any suggestions for running concurrent requests on a Windows system?