vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs #5907

Open robertgshaw2-neuralmagic opened 2 months ago

robertgshaw2-neuralmagic commented 2 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

vLLM has an issue where we can go OOM if too many logprobs are requested.

The reason this happens:

- When determining the KV cache size, we calculate peak memory by running a long prefill without logprobs.
- If a prompt requests many logprobs, that is an additional source of memory usage which is not considered during warmup, and nothing in the scheduler prevents it, so it can cause an OOM (a rough estimate of the scale is sketched after this list).
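For a sense of scale, here is a back-of-the-envelope sketch. It assumes fp32 logits and that prompt logprobs materialize a full-vocabulary row per prompt token; the exact tensor shapes and dtypes depend on the vLLM version (requests opt in via the `logprobs` / `prompt_logprobs` fields of `SamplingParams`):

```python
# Rough estimate of the extra memory a prompt-logprobs request can need,
# beyond what the KV-cache profiling run (which uses no logprobs) accounts for.
# The shapes below are illustrative assumptions, not measured values.

BYTES_FP32 = 4

def extra_logprob_bytes(prompt_len: int, vocab_size: int, dtype_bytes: int = BYTES_FP32) -> int:
    """Memory to hold one full-vocab logits/logprobs row per prompt token."""
    return prompt_len * vocab_size * dtype_bytes

# Example: 8k-token prompt, 128k vocabulary, fp32 logits.
est = extra_logprob_bytes(prompt_len=8192, vocab_size=128_000)
print(f"~{est / 2**30:.1f} GiB of logits for a single prompt")  # ~3.9 GiB
```

Several gibibytes for one request is easily enough to exceed the headroom left after the profiling run, which never sees this allocation.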

We have received several examples of this:

Attempt to fix this:

I am working on a design to address this issue.

neubig commented 1 month ago

Hi @robertgshaw2-neuralmagic, I was wondering if there are any ongoing attempts to resolve this issue, or if #5355 seems like an acceptable fix? Having this fixed would be very helpful to us!

binxuan commented 1 month ago

In my case, with a large vocab size (200k+) and a long sequence length (8k+), the logits sort is the most memory-consuming part and can easily trigger an OOM on a single GPU. Is it possible to do some kind of sequence parallelism to distribute it across all TP workers?
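One way the sort could be distributed is not full sequence parallelism but a vocab-sharded top-k merge: each TP worker already holds a vocabulary shard of the logits, so it can take a local top-k over its shard, and only the k candidates per token need to be gathered and merged, rather than materializing and sorting the full 200k-wide row on one GPU. This is a sketch of the idea, not vLLM's actual sampler; a minimal single-process simulation of the merge:

```python
import torch

def sharded_topk(logits: torch.Tensor, k: int, num_shards: int):
    """Simulate TP-sharded top-k: local top-k per vocab shard, then a merge.

    logits: [num_tokens, vocab_size]; vocab_size is assumed divisible by num_shards.
    Returns the same (values, indices) as torch.topk over the full vocabulary,
    but no single step ever sorts more than shard_size (or num_shards * k) columns.
    """
    num_tokens, vocab_size = logits.shape
    shard_size = vocab_size // num_shards

    cand_vals, cand_idx = [], []
    for s in range(num_shards):
        shard = logits[:, s * shard_size:(s + 1) * shard_size]
        vals, idx = shard.topk(k, dim=-1)      # local top-k on this worker's shard
        cand_vals.append(vals)
        cand_idx.append(idx + s * shard_size)  # map back to global vocab ids

    # Merge step: top-k over the num_shards * k candidates per token.
    all_vals = torch.cat(cand_vals, dim=-1)
    all_idx = torch.cat(cand_idx, dim=-1)
    merged_vals, merged_pos = all_vals.topk(k, dim=-1)
    merged_idx = torch.gather(all_idx, -1, merged_pos)
    return merged_vals, merged_idx

# Sanity check against a full-vocab top-k.
x = torch.randn(16, 200_000)
v1, i1 = sharded_topk(x, k=20, num_shards=8)
v2, i2 = x.topk(20, dim=-1)
assert torch.allclose(v1, v2)
```

In a real TP setup the per-shard top-k would run on each worker's existing logits shard and only the k candidates per token would be all-gathered; the log-softmax normalizer needed for logprobs can likewise be reduced as a logsumexp across shards, so the full-vocab row never has to live on one GPU. Whether this fits vLLM's current sampler code path is an open question.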