m-harmonic opened this issue 8 months ago
I believe @Yard1 is upstreaming the work to solve this problem
@rkooo567 @Yard1 Thanks — is there an existing task or issue you can link to so we can understand the proposal and follow along?
@m-harmonic this is not correct - CUDA calls are asynchronous and are only synced by specific operations (like `.cpu()`, `.tolist()`, `.item()`). What you are seeing is the CPU operation for the GPU-to-CPU transfer waiting for the previously enqueued GPU operations to finish - and the lion's share of those GPU operations is the model forward pass, not sampler operations. Indeed, the fact that the CPU is blocked that far into the pipeline is an indication that the CPU is not the bottleneck. Using CPU profilers to profile GPU code is not recommended, as it doesn't show you the whole picture. I recommend using the PyTorch profiler or Nsight. See https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution for more information.
tl;dr sampling code accounts for a small fraction of the forward pass (though it can be further optimized, which is what I am working on).
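For reference, here is a minimal `torch.profiler` sketch (the model and input are placeholders, not vLLM code) that attributes time to the CUDA kernels themselves rather than to whichever CPU call happens to block on them:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input -- substitute your actual inference call.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        y = model(x)
    torch.cuda.synchronize()

# Sorting by CUDA time shows where the GPU actually spends its time,
# instead of blaming the CPU call that happened to wait on it.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```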
I found this thread when looking for potential CPU bottlenecks. I moved from a Ryzen 5600X to an older Xeon E5 v4 chip and saw tokens/s halve, even though I went from x4 to x8 PCIe lanes. CPU was at 100% during inference with 4 GPUs and tensor parallelism.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
When running inference we see that the CPU of the vLLM process is maxed out at 100%, while GPU utilization varies between 50% and 70%. For a single request, our average generation throughput is only about 100 tokens/second.
We used a Python profiler and found that more than 90% of the overall CPU time is spent in `sampler.py:_sample()`. In particular, the slowness is due exclusively to the GPU->CPU sync triggered by the `.cpu()` call in that function.

Is this a known performance issue, and are there any plans for a fix? Are there any other settings we should be aware of to increase our throughput? Of course, it shouldn't be the case that the CPU, rather than the GPU, is the bottleneck. Thanks
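For context, here is a toy sketch of the pattern we mean (illustrative only, not the actual vLLM sampler; the names and shapes are made up). A CPU profiler pins nearly all of the wall time on the `.cpu()` line, because that transfer has to wait for every GPU kernel queued before it:

```python
import torch

# Toy stand-in for a forward pass plus sampling step; shapes are arbitrary.
def generate_step(hidden: torch.Tensor, lm_head: torch.nn.Linear) -> list:
    logits = lm_head(hidden)                # GPU kernel is only enqueued here
    probs = torch.softmax(logits, dim=-1)   # still asynchronous
    next_ids = torch.multinomial(probs, 1)  # still asynchronous
    # The GPU->CPU copy below is the first sync point, so a CPU profiler
    # attributes the time of all the kernels above to this single line.
    return next_ids.cpu().flatten().tolist()

lm_head = torch.nn.Linear(4096, 32000).cuda()
hidden = torch.randn(8, 4096, device="cuda")
print(generate_step(hidden, lm_head))
```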