hpxiong opened this issue 9 months ago
@pseudotensor - Can you please share some more details on how to set up h2oGPT as a Gradio inference server? I tried to follow the documentation and run two processes:
SAVE_DIR=./save/ python generate.py --base_model=h2oai/h2ogpt-oasst1-512-12b
python generate.py --inference_server="http://192.168.0.10:7680" --base_model=h2oai/h2ogpt-oasst1-falcon-40b
But apparently this does not work. So my question is whether the current h2oGPT code base can be used as an inference server at all. If yes, how should I set it up properly?
If you mean having the LLM itself handle concurrent requests, that's not possible without TGI/vLLM unless one disables the cache for the LLM, which is not recommended as it will be slower. For quantized models there is the same issue: llama.cpp is not multi-threaded, so it fails when used with Gradio-like multi-threading.
If you mean having a separate Gradio server host the LLM, to free the main Gradio app from being blocked by its own queue, that's possible with the Gradio inference server, but it won't speed up the LLM calls themselves.
In principle one could batch things in gradio, like here: https://github.com/gradio-app/gradio/blob/main/demo/diffusers_with_batching/run.py
Then one could use the Gradio inference server, but change the code a bit to pass fewer arguments and use the pure predict endpoint.
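For illustration, a minimal sketch of querying such a server through its predict endpoint with gradio_client; the URL and api_name below are placeholders, and the real h2oGPT app exposes its own named endpoints with a longer argument list:

```python
# Hedged sketch: call a Gradio app hosting the LLM via its predict endpoint.
# The URL and api_name are placeholders, not h2oGPT's actual endpoint.
from gradio_client import Client

client = Client("http://192.168.0.10:7860")  # Gradio app hosting the LLM
client.view_api()                            # prints the named endpoints the app exposes
result = client.predict("What is h2oGPT?", api_name="/predict")  # hypothetical endpoint name
print(result)
```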
I haven't seen demo code where Gradio uses batching with LLMs, for some reason. Here is a chatbot case for streaming that could be used as an inference server, but it has no batching:
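For a rough idea, here is a minimal sketch of what Gradio batching in front of an LLM could look like, following the pattern of the diffusers demo above but with a small Hugging Face text model; the model name, generation settings, and batch size are illustrative, not what h2oGPT uses:

```python
# Hedged sketch of Gradio batching with an LLM; model and settings are placeholders.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative small model

def generate(prompts):
    # With batch=True, Gradio passes a list of queued prompts; the pipeline runs
    # them as one batch, and we return one list (per output component) of results.
    outputs = generator(list(prompts), max_new_tokens=64)
    texts = [out[0]["generated_text"] for out in outputs]
    return [texts]

demo = gr.Interface(generate, "textbox", "textbox", batch=True, max_batch_size=4)
demo.queue().launch()
```

With batch=True, Gradio collects up to max_batch_size queued requests and hands them to the function as a single list, so the model can process them in one forward pass instead of one request at a time.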
Feel free to investigate this.
I've discovered that vLLM is only available for Linux systems. Currently my h2oGPT setup has caused a backlog of requests on our Windows-based system, since requests are processed one at a time in a queue. Many of the suggestions here are to use vLLM, but it's not compatible with Windows. Any suggestions for running concurrent requests on a Windows system?