vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Sampling is very slow, causing a CPU bottleneck #3384

Open · m-harmonic opened this issue 6 months ago

m-harmonic commented 6 months ago

When running inference we see that the CPU of the vLLM process is maxed out at 100%, while GPU utilization varies between 50-70%. For a single request, our average generation throughput is only about 100 tokens/second.

We used a Python profiler and found that more than 90% of the overall CPU time is spent in sampler.py:_sample(). In particular, the slowness is exclusively due to the GPU->CPU sync from the .cpu() call in this code:

def _random_sample(
    selected_seq_groups: List[Tuple[List[int], SamplingParams]],
    is_prompts: List[bool],
    random_samples: torch.Tensor,
) -> List[Tuple[List[int], List[int]]]:
    # Find the maximum best_of value of the prompt phase requests.
    random_samples = random_samples.cpu()  # GPU -> CPU transfer; the call the profiler attributes the time to
    ...

Is this a known performance issue and are there any plans for a fix? Are there any other settings we should be aware of to increase our throughput? Of course, it shouldn't be the case that the CPU is the bottleneck as opposed to the GPU. Thanks
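For reference, a minimal standalone sketch (plain PyTorch, not vLLM code) of why a CPU profiler attributes the time this way: the matmul kernels are launched asynchronously and return almost immediately, so nearly all of the wall-clock time shows up on the first synchronizing call, here .cpu().

import time
import torch

assert torch.cuda.is_available()
x = torch.randn(8192, 8192, device="cuda")
torch.cuda.synchronize()  # exclude CUDA context/init from the timing

t0 = time.perf_counter()
for _ in range(20):
    y = x @ x             # asynchronous kernel launches; these return almost immediately
t1 = time.perf_counter()

out = y.cpu()             # first synchronizing op: waits for all queued kernels, then copies
t2 = time.perf_counter()

print(f"kernel launches: {t1 - t0:.4f}s, .cpu(): {t2 - t1:.4f}s")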

rkooo567 commented 6 months ago

I believe @Yard1 is upstreaming the work to solve this problem.

m-harmonic commented 6 months ago

@rkooo567 @Yard1 Thanks — is there an existing task or issue you can link to so we can understand the proposal and follow along?

Yard1 commented 6 months ago

@m-harmonic this is not correct - CUDA calls are asynchronous and are only synced by specific operations (like .cpu(), .tolist(), .item()). What you are seeing is the CPU-side operation for the GPU-to-CPU transfer waiting for the previously enqueued GPU operations to finish - and the lion's share of those GPU operations is the model forward pass, not sampler operations. Indeed, the fact that the CPU is blocked that far into the pipeline is an indication that the CPU is not the bottleneck. Using CPU profilers to profile GPU code is not recommended, as it doesn't show you the whole picture; I recommend using the PyTorch profiler or Nsight instead. See https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution for more information.
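As a sketch of that recommendation: with torch.profiler, CUDA kernel time is attributed to the kernels themselves rather than to the synchronizing .cpu() call. The Linear layer and tensor shapes below are placeholders, not vLLM internals.

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()       # stand-in model, not the real LLM
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    logits = model(x)                            # forward pass, enqueued asynchronously
    token_ids = logits.argmax(dim=-1).cpu()      # sync point, analogous to _random_sample

# GPU time now shows up under the linear/matmul kernels, not under the .cpu() call
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))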

tl;dr sampling code accounts for a small fraction of the forward pass (though it can be further optimized, which is what I am working on).
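As an aside, one generic PyTorch pattern that is sometimes used to hide a device-to-host copy (a sketch only, not a description of the optimization being upstreamed) is to copy into pre-allocated pinned memory on a side stream with non_blocking=True and synchronize only when the values are actually needed; the tensor names and sizes below are illustrative.

import torch

samples_gpu = torch.randint(0, 32000, (256,), device="cuda")          # stand-in for random_samples
samples_cpu = torch.empty_like(samples_gpu, device="cpu").pin_memory()

copy_stream = torch.cuda.Stream()
copy_stream.wait_stream(torch.cuda.current_stream())                  # order after the producing kernels
with torch.cuda.stream(copy_stream):
    samples_cpu.copy_(samples_gpu, non_blocking=True)                 # async device-to-host copy into pinned memory
done = torch.cuda.Event()
done.record(copy_stream)

# ... the CPU could do other scheduling work here while the copy is in flight ...

done.synchronize()                                                    # block only once the values are needed
token_ids = samples_cpu.tolist()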

cduk commented 3 months ago

I found this thread when looking for potential CPU bottlenecks. I moved from a Ryzen 5600X to an older Xeon E5 v4 chip and saw tok/s halve, even though I went from x4 to x8 PCIe lanes. CPU usage was at 100% during inference with 4 GPUs and tensor parallelism.