skylee-01 opened this issue 2 months ago
Related experiments: https://github.com/vllm-project/vllm/issues/7540
@WoosukKwon @youkaichao Please provide some assistance.
What vLLM version are you using?
0.5.5
We are optimizing the CPU time; please stay tuned. It should not be so dependent on CPU performance in the future.
What is the reason for vLLM's current heavy dependence on the CPU, and what are the directions for optimization? Our team would also like to participate in vLLM's development and contribute to the community; we hope to be able to submit code to vLLM.
The CPU needs to serve HTTP requests and also prepare lots of input data for the GPU, which changes every step (because of continuous batching and auto-regressive LLM decoding).
For some examples along this line of optimization, see #7000 and #8092.
Contributions are definitely welcome!
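To illustrate why this work recurs every step, here is a minimal sketch (all names are hypothetical, not vLLM internals): with continuous batching, the set of running sequences and their lengths change each decode step, so the flat input tensors have to be rebuilt on the CPU every iteration.

```python
# Toy sketch of per-step CPU input preparation under continuous batching.
# All names here are illustrative, not vLLM's actual code.

def prepare_inputs(running_seqs):
    """Rebuild flat input lists for the current batch.

    Because sequences join/leave the batch (continuous batching) and each
    sequence grows by one token per step (auto-regressive decoding), this
    work cannot simply be cached across steps.
    """
    input_ids, positions, slot_mapping = [], [], []
    for seq in running_seqs:
        input_ids.append(seq["tokens"][-1])        # last generated token
        positions.append(len(seq["tokens"]) - 1)   # its position in the sequence
        slot_mapping.append(seq["kv_slot"])        # where its KV cache lives
    return input_ids, positions, slot_mapping


# Toy decode loop: every iteration pays the CPU prep cost again.
seqs = [
    {"tokens": [1, 5, 9], "kv_slot": 0},
    {"tokens": [2, 7], "kv_slot": 1},
]
for step in range(3):
    ids, pos, slots = prepare_inputs(seqs)
    # ... launch the GPU forward pass with (ids, pos, slots) ...
    for s in seqs:
        s["tokens"].append(0)  # pretend the model emitted a token
```

The faster the GPU finishes each forward pass, the more this CPU-side loop becomes the bottleneck.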
Our team has developed some speculative decoding features on top of vLLM, which have been used internally and have yielded good performance benefits. How can we join the vLLM project, and where would be a good place to start?
You're welcome to send emails to vllm-questions@lists.berkeley.edu.
Really interesting. Thanks for reporting. The GPUs are getting fast :)
Hi @skylee-01 Thanks for reporting this! We also recently discovered the same problem. We plan to do more optimizations to mitigate the CPU effect.
vLLM is a fully open community-driven project, so we'd appreciate any contributions, including submitting or reviewing PRs, answering questions, and helping documentation.
Proposal to improve performance
We used the same GPU on two machines but different CPUs (GPU: RTX 3090; CPU upgraded from a Xeon Gold 6240 to an i9-12900K). The impact is as follows:
a. vLLM achieved a 3.8x speedup in the agent scenario.
b. TGI achieved a 1.23x speedup in the agent scenario.
c. vLLM still has latency issues, but the time has been reduced to 100 ms (previously 300 ms).
d. GPU utilization increased from 70% to 90%.
From the stress-test data, it is evident that vLLM heavily relies on CPU performance. What are the main factors affecting CPU performance, and how can they be optimized?
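One way to see this dependence directly is to time the CPU-side preparation separately from the GPU step. This is a hedged sketch with toy stand-ins for both phases (neither function is real vLLM code); the point is only that per-step GPU utilization is roughly the GPU time divided by the total step time, so a slower CPU leaves the GPU idle longer.

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Toy stand-ins for the two phases of one decode step.
def cpu_prepare(batch_size):
    return list(range(batch_size))      # pretend: build input tensors on CPU

def gpu_forward(inputs):
    return [x + 1 for x in inputs]      # pretend: model forward pass on GPU

cpu_total = gpu_total = 0.0
for _ in range(100):
    inputs, t_cpu = timed(cpu_prepare, 32)
    _, t_gpu = timed(gpu_forward, inputs)
    cpu_total += t_cpu
    gpu_total += t_gpu

# Approximate per-step GPU utilization: the faster the CPU phase,
# the less time the GPU sits idle between kernels.
util = gpu_total / (gpu_total + cpu_total)
```

Profiling a real deployment the same way (e.g. wrapping the engine's step with timers) would show which CPU-side phase dominates on the slower Xeon.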