kimanli opened this issue 1 year ago
Is it solved?
The default worker uses batch size = 1. To enable advanced batching, you can use the vLLM worker for LLaMA models. However, vLLM does not support ChatGLM.
To handle parallel requests right now, you can either launch more machines or use streaming (which interleaves the execution of parallel requests).
Contributions/PRs are welcome if you can implement a model worker that supports real batching for ChatGLM.
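To make the request concrete, here is a minimal, framework-agnostic sketch of the batching pattern such a worker would need: concurrent requests are collected into a queue, drained into a batch (bounded by a maximum size and a short timeout), and run through the model in one call instead of one at a time. The class name, parameters, and `model_fn` callback are all illustrative assumptions, not FastChat's actual worker API.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BatchingWorker:
    """Toy request batcher (illustrative only, not FastChat's API).

    Collects up to max_batch_size requests, or whatever arrives within
    batch_timeout seconds, and invokes the model once per batch rather
    than once per request.
    """

    def __init__(self, model_fn, max_batch_size=4, batch_timeout=0.05):
        self.model_fn = model_fn            # callable: list[str] -> list[str]
        self.max_batch_size = max_batch_size
        self.batch_timeout = batch_timeout
        self.requests = queue.Queue()
        self.num_batches = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt):
        """Called from any request-handler thread; blocks until the result is ready."""
        done = threading.Event()
        box = {}
        self.requests.put((prompt, done, box))
        done.wait()
        return box["output"]

    def _loop(self):
        while True:
            # Block for the first request, then greedily drain more until
            # the batch is full or the timeout expires.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.batch_timeout
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            prompts = [item[0] for item in batch]
            outputs = self.model_fn(prompts)  # one model call for the whole batch
            self.num_batches += 1
            for (_, done, box), out in zip(batch, outputs):
                box["output"] = out
                done.set()


if __name__ == "__main__":
    # Stand-in "model" that uppercases prompts; 8 parallel callers
    # should be served in fewer than 8 model calls.
    worker = BatchingWorker(lambda prompts: [p.upper() for p in prompts])
    with ThreadPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(worker.submit, [f"req{i}" for i in range(8)]))
    print(results, worker.num_batches)
```

A real ChatGLM worker would additionally need padding/attention-mask handling for variable-length prompts and per-request streaming, which this sketch omits.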
Would it be possible to make this conditional, so that LLaMA models always run under vLLM?
I need to handle parallel requests with a ChatGLM model. But when I send multiple requests to the FastChat server, I found that the server can only process them one by one. Can someone give me some advice? Thanks!
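The one-by-one behavior is easy to reproduce generically: when the backend serves one request at a time (as with batch size = 1), N concurrent clients each taking T seconds finish in roughly N*T rather than T. The toy server below is just a stand-in for illustration, not FastChat itself; the 0.2 s sleep stands in for model inference.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.2)  # stand-in for model inference
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass


# HTTPServer (unlike ThreadingHTTPServer) handles one request at a time,
# mimicking a worker with batch size = 1.
server = HTTPServer(("127.0.0.1", 0), SlowHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch(_):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
        return resp.read()


start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(fetch, range(4)))
elapsed = time.monotonic() - start
server.shutdown()
# Four parallel 0.2 s requests complete in roughly 0.8 s (serialized),
# not 0.2 s (interleaved).
```

Swapping `HTTPServer` for `ThreadingHTTPServer` drops the total back to about 0.2 s, which is the interleaving effect the streaming suggestion above relies on.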