kimanli opened this issue 1 year ago
Is it solved?
The default worker uses batch size = 1. To enable advanced batching, you can use the vLLM worker for LLaMA models. However, vLLM does not support ChatGLM.
To handle parallel requests right now, you can either launch more machines or use streaming (which interleaves the execution of parallel requests).
Contributions/PRs are welcome if you can implement a model worker that supports real batching for ChatGLM.
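To make the request concrete, here is a minimal, framework-agnostic sketch of the batching pattern such a worker would need: concurrent requests are collected into a queue, drained into a batch (bounded by a maximum size and a short timeout), and run through the model in one call instead of one at a time. The class name, parameters, and `model_fn` callback are all illustrative assumptions, not FastChat's actual worker API.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BatchingWorker:
    """Toy request batcher (illustrative only, not FastChat's API).

    Collects up to max_batch_size requests, or whatever arrives within
    batch_timeout seconds, and invokes the model once per batch rather
    than once per request.
    """

    def __init__(self, model_fn, max_batch_size=4, batch_timeout=0.05):
        self.model_fn = model_fn            # callable: list[str] -> list[str]
        self.max_batch_size = max_batch_size
        self.batch_timeout = batch_timeout
        self.requests = queue.Queue()
        self.num_batches = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt):
        """Called from any request-handler thread; blocks until the result is ready."""
        done = threading.Event()
        box = {}
        self.requests.put((prompt, done, box))
        done.wait()
        return box["output"]

    def _loop(self):
        while True:
            # Block for the first request, then greedily drain more until
            # the batch is full or the timeout expires.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.batch_timeout
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            prompts = [item[0] for item in batch]
            outputs = self.model_fn(prompts)  # one model call for the whole batch
            self.num_batches += 1
            for (_, done, box), out in zip(batch, outputs):
                box["output"] = out
                done.set()


if __name__ == "__main__":
    # Stand-in "model" that uppercases prompts; 8 parallel callers
    # should be served in fewer than 8 model calls.
    worker = BatchingWorker(lambda prompts: [p.upper() for p in prompts])
    with ThreadPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(worker.submit, [f"req{i}" for i in range(8)]))
    print(results, worker.num_batches)
```

A real ChatGLM worker would additionally need padding/attention-mask handling for variable-length prompts and per-request streaming, which this sketch omits.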
Would it be possible to make this conditional, so that LLaMA models always run under vLLM?
I need to handle parallel requests with a ChatGLM model. But when I send multiple requests to the FastChat server, I found that the server can only process them one by one. Can someone give me some advice? Thanks!
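The one-by-one behavior is easy to reproduce generically: when the backend serves one request at a time (as with batch size = 1), N concurrent clients each taking T seconds finish in roughly N*T rather than T. The toy server below is just a stand-in for illustration, not FastChat itself; the 0.2 s sleep stands in for model inference.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.2)  # stand-in for model inference
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass


# HTTPServer (unlike ThreadingHTTPServer) handles one request at a time,
# mimicking a worker with batch size = 1.
server = HTTPServer(("127.0.0.1", 0), SlowHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch(_):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
        return resp.read()


start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(fetch, range(4)))
elapsed = time.monotonic() - start
server.shutdown()
# Four parallel 0.2 s requests complete in roughly 0.8 s (serialized),
# not 0.2 s (interleaved).
```

Swapping `HTTPServer` for `ThreadingHTTPServer` drops the total back to about 0.2 s, which is the interleaving effect the streaming suggestion above relies on.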