Can you try fastchat.serve.vllm_worker, which has better tensor parallelism support?
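For reference, a minimal sketch of launching the vLLM-backed worker with the model sharded across two GPUs. The flag names follow the FastChat docs but may differ between FastChat/vLLM versions, and the model path is just a placeholder; check `python3 -m fastchat.serve.vllm_worker --help` for your install.

```python
# Minimal sketch: launch fastchat.serve.vllm_worker with 2-way tensor parallelism.
# Flag names (--num-gpus vs. vLLM's --tensor-parallel-size) may vary by version;
# the model path below is only an example.
import subprocess

subprocess.run(
    [
        "python3", "-m", "fastchat.serve.vllm_worker",
        "--model-path", "lmsys/vicuna-13b-v1.5",  # placeholder model path
        "--num-gpus", "2",                        # shard the model across 2 GPUs
    ],
    check=True,
)
```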
@infwinston Thanks for your advice. I tried fastchat.serve.vllm_worker and it works well, but it uses about 27G on each of the two GPUs. With fastchat.serve.model_worker it only uses about 14G on each GPU, but the speed is too bad. FastChat uses device_map and vLLM apparently does not; how can I speed things up in this scenario? Thanks
Same question for inference with a fine-tuned LLaMA-70B, on 2x A800 (80G each).
same question here.
It makes sense. The transformer architecture is built in a way that a huge part of the model's memory is visited for each token. If you distribute this across multiple GPUs, you incur the extra overhead of moving that data around over the GPU interconnect: NVLink or, worse, the PCI Express bus, depending on your GPUs. The more data movement over slow buses, the worse it gets. There's no way a model can run faster on more GPUs when it already fits into a single one.
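To make that concrete, here is a rough back-of-envelope sketch for a naive layer split (device_map-style sharding). All numbers are illustrative assumptions, not measurements: the point is only that every token still streams all the weights from HBM, so two GPUs can at best match one GPU and then pay extra interconnect and synchronization cost on top.

```python
# Back-of-envelope sketch: why a device_map-style layer split across 2 GPUs
# cannot beat a single GPU for a model that already fits on one.
# All constants below are rough assumptions for illustration.

model_bytes   = 26e9       # e.g. ~13B params in fp16 (assumed)
hbm_bw        = 900e9      # HBM bandwidth per GPU, bytes/s (V100-class, approx.)
pcie_bw       = 12e9       # effective PCIe gen3 x16 bandwidth, bytes/s (approx.)
sync_overhead = 2e-3       # assumed per-token cross-GPU sync/launch overhead, s

hidden    = 5120           # assumed hidden size
act_bytes = hidden * 2     # one fp16 activation vector crossing the GPU boundary

# One GPU: each generated token costs roughly one full pass over the weights.
t_single = model_bytes / hbm_bw

# Two GPUs, layers split in half and run one after the other: same total
# weight traffic, plus the activation hop over PCIe and the sync overhead.
t_split = 2 * (model_bytes / 2) / hbm_bw + act_bytes / pcie_bw + sync_overhead

print(f"1 GPU : ~{t_single * 1e3:.1f} ms/token")
print(f"2 GPUs: ~{t_split  * 1e3:.1f} ms/token (no speedup, only added overhead)")
```

In practice the gap is usually much larger than this sketch suggests, because the per-layer transfers and framework overhead happen many times per token rather than once.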
I use FastChat to serve a Baichuan2 LLM through the OpenAI-compatible API on two V100 32G GPUs. Inference is slower than running the model on a single GPU: roughly 3 tokens every 5 seconds.