vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs #5907

Open robertgshaw2-neuralmagic opened 4 months ago

robertgshaw2-neuralmagic commented 4 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

vLLM has an issue where we can go OOM if too many logprobs are requested.

The reason this happens is that the memory profiling does not account for logprobs:

- When determining the KV cache size, we calculate peak memory by running a long prefill without logprobs.
- If a prompt requests many logprobs, this is an additional source of memory usage that is not considered during warmup and can cause an OOM, because there is nothing in the scheduler to prevent it.

We have received several examples of this:

Attempt to fix this:

I am working on a design to address this issue
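For reference, a minimal sketch of a request pattern that can trigger this (the model name, prompt length, and logprob counts are placeholders; the exact threshold depends on the GPU and the KV cache configuration):

```python
from vllm import LLM, SamplingParams

# A long prompt plus prompt_logprobs forces the engine to materialize a
# [prompt_len x vocab_size] logits tensor that the KV-cache memory
# profiling never accounted for.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")  # placeholder model

long_prompt = "word " * 30_000  # far longer than the profiling run assumed

params = SamplingParams(
    max_tokens=1,
    logprobs=5,          # top-5 logprobs for generated tokens
    prompt_logprobs=5,   # logprobs for every prompt token -- the expensive part
)

# Depending on available GPU memory, this call can OOM even though the
# KV cache itself fits comfortably.
outputs = llm.generate([long_prompt], params)
```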

neubig commented 3 months ago

Hi @robertgshaw2-neuralmagic, I was wondering if there were any ongoing attempts to resolve this issue, or if #5355 seems like an acceptable fix? Having this fixed would be very helpful to us!

binxuan commented 3 months ago

In my case, with a large vocab size (200k+) and a long sequence length (8k+), the logits sort is the most memory-consuming part and can easily trigger an OOM on a single GPU. Is it possible to do some kind of sequence parallelism to distribute it across all TP workers?
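A rough back-of-envelope for the scale involved here (simple arithmetic, assuming float32 logits after upcasting; the exact overhead of the sort depends on the kernel):

```python
# One float32 copy of the logits for a full-sequence logprobs request,
# before any sort/softmax temporaries are added on top.
vocab_size = 200_000
seq_len = 8_192
bytes_per_elem = 4  # float32 after upcasting

logits_bytes = vocab_size * seq_len * bytes_per_elem
print(f"single logits copy: {logits_bytes / 2**30:.1f} GiB")  # ~6.1 GiB

# torch.sort returns sorted values plus int64 indices, so the sort alone
# can add roughly 3x the input tensor on top (1x values + 2x indices).
sort_overhead = logits_bytes * 3
print(f"extra for sort outputs: {sort_overhead / 2**30:.1f} GiB")  # ~18.3 GiB
```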

cjfcsjt commented 1 month ago

@robertgshaw2-neuralmagic Any update? Thanks.

robertgshaw2-neuralmagic commented 1 month ago

For now, running with --enable-chunked-prefill should avoid the issue. I have been focused on the performance side of vLLM, since that is currently the biggest priority; I will return to this once we finalize the wave of optimizations.

Apologies for the delay. I agree this is a significant blemish in vLLM right now.
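For reference, a minimal sketch of the suggested workaround via the Python API (the model name and token budget are placeholders; the same behavior is enabled on the OpenAI-compatible server with --enable-chunked-prefill):

```python
from vllm import LLM, SamplingParams

# Chunked prefill caps how many prompt tokens are processed per step,
# which also bounds the per-step logits tensor handed to the Sampler.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget; tune for your GPU
)

outputs = llm.generate(
    ["A long prompt ..."],
    SamplingParams(max_tokens=16, logprobs=5),
)
```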

njhill commented 1 month ago

@tjohnson31415 is looking into this.

tjohnson31415 commented 1 month ago

I measured the peak GPU memory usage during the processing in Sampler and found that it balloons to 9x the size of the input logits tensor. The increase comes from upcasting the logits to float32 (+2x), copying the tensor for probs and logprobs (+4x), and processing in _sample, which creates another temporary copy (+2x). I'm sure there are ways to reduce this spike in memory, but we'd still need to limit the size of the input tensor.
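A standalone way to observe this kind of blow-up (a rough sketch, not the actual Sampler code; the shapes are arbitrary and it needs a GPU with enough free memory for the fp32 copies):

```python
import torch

num_tokens, vocab_size = 8_192, 128_256
logits = torch.randn(num_tokens, vocab_size, dtype=torch.float16, device="cuda")

torch.cuda.reset_peak_memory_stats()
base = torch.cuda.memory_allocated()

logits32 = logits.float()                       # upcast to float32: +2x the fp16 input
probs = torch.softmax(logits32, dim=-1)         # another full-size fp32 tensor: +2x
logprobs = torch.log_softmax(logits32, dim=-1)  # and another: +2x

peak_extra = torch.cuda.max_memory_allocated() - base
print(f"peak extra memory: {peak_extra / 2**30:.1f} GiB "
      f"({peak_extra / (logits.numel() * 2):.1f}x the fp16 logits tensor)")
```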

The main challenge is when prompt_logprobs are requested, since the logits tensor will then contain logits for every token in the prompt. For a model with a large context and vocab size, the logits tensor alone will hit memory limits even before any processing in Sampler. With Llama 3.1 models, which have a vocab_size of 128256 and a max sequence length of 131072, a single request with prompt_logprobs could produce a logits tensor of 128256 × 131072 × 2 B ≈ 31 GiB.

Limiting the number of tokens processed at a time with chunked prefill seems like the right solution to me. However, for cases where chunked prefill is not supported, we may still need "chunked logits processing" to limit the number of tokens processed at a time in the hidden states -> logprobs output path. This may be difficult to implement, though.
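A rough sketch of what that chunked processing could look like as a standalone helper (hypothetical code, not vLLM internals; `lm_head`, the chunk size, and keeping only the chosen-token logprob are all assumptions):

```python
import torch

def chunked_prompt_logprobs(hidden_states: torch.Tensor,
                            lm_head: torch.nn.Linear,
                            prompt_token_ids: torch.Tensor,
                            chunk_size: int = 1024) -> torch.Tensor:
    """Compute per-token prompt logprobs without materializing the full
    [num_tokens, vocab_size] logits tensor at once (hypothetical helper)."""
    out = []
    for start in range(0, hidden_states.shape[0], chunk_size):
        end = start + chunk_size
        # Only chunk_size x vocab_size logits live at any one time.
        logits = lm_head(hidden_states[start:end]).float()
        logprobs = torch.log_softmax(logits, dim=-1)
        # Keep just the logprob of each actual prompt token (top-k would be similar).
        ids = prompt_token_ids[start:end].unsqueeze(-1)
        out.append(logprobs.gather(-1, ids).squeeze(-1))
        del logits, logprobs  # free the chunk before the next iteration
    return torch.cat(out)
```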

@robertgshaw2-neuralmagic: What do you think about this approach of "chunked logits processing"?

robertgshaw2-neuralmagic commented 1 month ago

Is there any fundamental reason we need to make all of these copies? Otherwise, it would make sense to me that chunking could work.

patrickvonplaten commented 3 weeks ago

Also running into this issue with small models (2B) when returning logprobs.