skylee-01 opened this issue 2 months ago
Related experiments: https://github.com/vllm-project/vllm/issues/7540
@WoosukKwon @youkaichao Please provide some assistance.
What vLLM version are you using?
0.5.5
We are optimizing the CPU time; please stay tuned. It should not be so dependent on CPU performance in the future.
What is the reason for vLLM's current heavy dependence on the CPU, and what are the directions for optimization? Our team would also like to participate in vLLM's development and contribute to the community; we hope to be able to submit code to vLLM.
The CPU needs to serve HTTP requests and also prepare lots of input data for the GPU, which changes every step (because of continuous batching and auto-regressive LLM decoding).
For some examples along this line of optimization, see #7000 and #8092.
Contributions are definitely welcome!
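To illustrate why this work recurs every step, here is a minimal sketch (all names are hypothetical, not vLLM internals): with continuous batching, the set of running sequences and their lengths change each decode step, so the flat input tensors have to be rebuilt on the CPU every iteration.

```python
# Toy sketch of per-step CPU input preparation under continuous batching.
# All names here are illustrative, not vLLM's actual code.

def prepare_inputs(running_seqs):
    """Rebuild flat input lists for the current batch.

    Because sequences join/leave the batch (continuous batching) and each
    sequence grows by one token per step (auto-regressive decoding), this
    work cannot simply be cached across steps.
    """
    input_ids, positions, slot_mapping = [], [], []
    for seq in running_seqs:
        input_ids.append(seq["tokens"][-1])        # last generated token
        positions.append(len(seq["tokens"]) - 1)   # its position in the sequence
        slot_mapping.append(seq["kv_slot"])        # where its KV cache lives
    return input_ids, positions, slot_mapping


# Toy decode loop: every iteration pays the CPU prep cost again.
seqs = [
    {"tokens": [1, 5, 9], "kv_slot": 0},
    {"tokens": [2, 7], "kv_slot": 1},
]
for step in range(3):
    ids, pos, slots = prepare_inputs(seqs)
    # ... launch the GPU forward pass with (ids, pos, slots) ...
    for s in seqs:
        s["tokens"].append(0)  # pretend the model emitted a token
```

The faster the GPU finishes each forward pass, the more this CPU-side loop becomes the bottleneck.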
Our team has developed some speculative decoding features on top of vLLM, which have been used internally and have yielded good performance benefits. How can we join the vLLM project, and where would be a good place to start?
You're welcome to send emails to vllm-questions@lists.berkeley.edu.
Really interesting. Thanks for reporting. The GPUs are getting fast :)
Hi @skylee-01 Thanks for reporting this! We also recently discovered the same problem. We plan to do more optimizations to mitigate the CPU effect.
vLLM is a fully open community-driven project, so we'd appreciate any contributions, including submitting or reviewing PRs, answering questions, and helping documentation.
Proposal to improve performance
We used the same GPU on two machines but different CPUs (GPU: RTX 3090; CPU upgraded from a Xeon Gold 6240 to an i9-12900K). The impact is as follows:
a. vLLM achieved a 3.8x speedup in the agent scenario.
b. TGI achieved a 1.23x speedup in the agent scenario.
c. vLLM still has latency issues, but the time has been reduced to 100 ms (previously 300 ms).
d. GPU utilization increased from 70% to 90%.
From the stress-test data, it is evident that vLLM heavily relies on CPU performance. What are the main factors affecting CPU performance, and how can they be optimized?
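One way to see this dependence directly is to time the CPU-side preparation separately from the GPU step. This is a hedged sketch with toy stand-ins for both phases (neither function is real vLLM code); the point is only that per-step GPU utilization is roughly the GPU time divided by the total step time, so a slower CPU leaves the GPU idle longer.

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Toy stand-ins for the two phases of one decode step.
def cpu_prepare(batch_size):
    return list(range(batch_size))      # pretend: build input tensors on CPU

def gpu_forward(inputs):
    return [x + 1 for x in inputs]      # pretend: model forward pass on GPU

cpu_total = gpu_total = 0.0
for _ in range(100):
    inputs, t_cpu = timed(cpu_prepare, 32)
    _, t_gpu = timed(gpu_forward, inputs)
    cpu_total += t_cpu
    gpu_total += t_gpu

# Approximate per-step GPU utilization: the faster the CPU phase,
# the less time the GPU sits idle between kernels.
util = gpu_total / (gpu_total + cpu_total)
```

Profiling a real deployment the same way (e.g. wrapping the engine's step with timers) would show which CPU-side phase dominates on the slower Xeon.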