lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

inference with multiple GPUs is too slow #2702

Closed: garyyang85 closed this issue 1 month ago

garyyang85 commented 11 months ago

I'm using FastChat to serve the Baichuan2 LLM through the OpenAI-compatible API on two V100 32GB GPUs. Inference is much slower than running the model on a single GPU: roughly 3 tokens every 5 seconds.

python3 -m fastchat.serve.controller

python3 -m fastchat.serve.model_worker --model-path /app/model --num-gpus=2 --gpus=0,1 --max-gpu-memory=32GB --model-names=Baichuan2-13B-Chat

python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000

infwinston commented 11 months ago

Can you try fastchat.serve.vllm_worker, which has better tensor parallelism support?
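For reference, a minimal sketch of that launch under the same setup (model path and name copied from the commands above; assuming a recent FastChat version, where the vLLM worker uses --num-gpus as the tensor-parallel size):

python3 -m fastchat.serve.vllm_worker --model-path /app/model --num-gpus 2 --model-names Baichuan2-13B-Chat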

garyyang85 commented 11 months ago

@infwinston Thanks for your advice. I tried fastchat.serve.vllm_worker and it works well, but it uses about 27GB on each of the two GPUs. With fastchat.serve.model_worker it only uses 14GB on each GPU, but the speed is too bad. FastChat's model_worker uses device_map while vLLM seems not to; how can I speed things up in that scenario? Thanks
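If the 27GB footprint is the concern: vLLM pre-allocates most of each GPU's memory for its paged KV cache (controlled by gpu_memory_utilization, which defaults to about 0.9), so the high usage is mostly reserved cache rather than weights. Assuming the worker forwards vLLM's engine flags (recent FastChat versions do), the reservation can be lowered, for example:

python3 -m fastchat.serve.vllm_worker --model-path /app/model --num-gpus 2 --model-names Baichuan2-13B-Chat --gpu-memory-utilization 0.6

A smaller value leaves less room for the KV cache, which can reduce throughput under many concurrent requests, so this is a trade-off rather than a free win.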

Dandelionym commented 2 months ago

Same question here, for inference with a fine-tuned LLaMA-70B model on 2x A800 GPUs (80GB each).

xymou commented 1 month ago

Same question here.

surak commented 1 month ago

It makes sense. The transformer architecture is such that a huge part of the model's memory is touched for every generated token. If that memory is distributed across multiple GPUs, you incur the extra overhead of moving data around over the GPUs' interconnect: NVLink or, worse, the PCIe bus, depending on your GPUs. The more data movement over slow buses, the worse it gets. A model will not run faster on more GPUs when it already fits on a single one.
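As a rough, illustrative calculation (ballpark figures, not measurements): Baichuan2-13B in fp16 is about 26 GB of weights (13B parameters at 2 bytes each), and essentially all of them are read for every generated token. Streamed out of a single V100's HBM2 at roughly 900 GB/s, the weight reads alone take on the order of 30 ms per token; any tensors that instead have to cross PCIe 3.0 at roughly 16 GB/s move about 50x more slowly. On top of that, device_map-style layer splitting runs the two GPUs one after the other rather than in parallel, so splitting a model that fits on one GPU adds transfer and synchronization cost without adding compute.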