vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

CUDA error: out of memory #188

Closed SunixLiu closed 1 year ago

SunixLiu commented 1 year ago

I successfully installed vLLM in WSL2. When I tried to run the sample code below, I got the following error:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Python: 3.10.11
GPU: RTX 3090 24GB
Linux: WSL2, Ubuntu 20.04.6 LTS

Can anyone help with this?

zhuohan123 commented 1 year ago

OPT-125M is very small, so this should not happen. While I cannot reproduce the exact error on my side, when you initialize the LLM class, can you try adding the argument gpu_memory_utilization=0.80, or setting the utilization to an even lower number? The default utilization upper bound is 0.90.

Additionally, can you set CUDA_LAUNCH_BLOCKING=1 to see which exact line causes the error?
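
Something like the following sketch (the model path is just the one from your script, and 0.80 is only an example value to start from):

import os

# Make CUDA report the error at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM

llm = LLM(
    model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m",
    gpu_memory_utilization=0.80,  # default upper bound is 0.90; lower it for more headroom
)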

SunixLiu commented 1 year ago

I added two lines:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

and also set gpu_memory_utilization=0.50.

But it seems the error output is the same:

INFO 06-21 22:53:33 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 22:53:41 llm_engine.py:128] # GPU blocks: 19899, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 14, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m", gpu_memory_utilization=0.50)
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

By the way, text-generation-webui can load the same model successfully.

zhuohan123 commented 1 year ago

Just to make sure it's not an issue with the model, can you run the facebook/opt-125m model?

ZQ-Dev8 commented 1 year ago

I'm running into the same issue. I'm on Ubuntu under WSL2, with CUDA 11.8 and an RTX 2080 Ti, which should easily be able to accommodate opt-125m. I've tried other small models, such as cerebras/Cerebras-GPT-111M and distilgpt2, but those give different errors (probably due to architecture incompatibility?).

Edit: I just tried EleutherAI/pythia-70M-deduped and it too gives the OOM error.

I've confirmed that I can run opt-125m and pythia-70m via the standard transformers library... any other ideas about what might be causing the issue, @zhuohan123 @WoosukKwon?

SunixLiu commented 1 year ago

Just to make sure it's not an issue with the model, can you run the facebook/opt-125m model?

Yes, I can use text-generation-webui to load facebook/opt-125m and get it working.

AlpinDale commented 1 year ago

Currently having this issue as well. The CUDA_VISIBLE_DEVICES environment variable has no effect either, and it only loads the models to GPU 0. I'm running on A100s but still get OOM with 125m OPT. There's something fatally wrong with this repo.

EDIT: CUDA_VISIBLE_DEVICES seems to work just fine, but the 125m model is using over 37GB of memory.

WoosukKwon commented 1 year ago

@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM?

AlpinDale commented 1 year ago

@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM?

Thanks, I think I understand now. A somewhat related question - how is multi-GPU handled? If I load a bigger model, will it split across the available GPUs? Also, if I want more memory to accommodate more users on a single, smaller model, can I have the process use multiple GPUs at once?

WoosukKwon commented 1 year ago

@AlpinDale Good question. You can use the tensor_parallel_size argument for multi-GPU inference. First, initialize your Ray cluster by executing

$ ray start --head

Then, use the tensor_parallel_size argument in the LLM class:

llm = LLM(model=<your model>, tensor_parallel_size=2)  # Inference with 2 GPUs 
SunixLiu commented 1 year ago

@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM?

I'm pretty sure no other process was using my GPU. Here's a screenshot: you can see a burst in both GPU 3D and VRAM usage, which happened when the code was trying to load the model.

[screenshot: GPU 3D and VRAM usage spiking while the model loads]

ZQ-Dev8 commented 1 year ago

I'm seeing the same as @SunixLiu. Nothing else is happening on my GPU. Running the code at the top of this issue spikes the GPU and then crashes.

ZQ-Dev8 commented 1 year ago

Any new developments here? @zhuohan123

zhuohan123 commented 1 year ago

@SunixLiu @dcruiz01 We believe this is actually a bug. We are not sure whether this is caused by WSL or vLLM. We don't have a Windows environment with GPUs. How about we set up a 30-min meeting to pair-debug on your environments? Can you send me an email at zhuohan[at]berkeley.edu to set up a call? Thanks!

zhuohan123 commented 1 year ago

Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache:

https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97

pin_memory has a limit in WSL (official doc), and the limit seems to be 2GB. After commenting this out, vLLM should work properly.
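
For anyone wondering why this helps, here is a small standalone sketch (not vLLM code; the 4 GiB size is only illustrative) showing that a large pinned (page-locked) CPU allocation can fail on WSL2 even though the same pageable allocation succeeds:

import torch

num_bytes = 4 * 1024**3  # 4 GiB, above the ~2 GB pinned-memory limit reported for WSL

# An ordinary pageable CPU tensor of the same size allocates fine.
pageable = torch.empty(num_bytes, dtype=torch.uint8)
print("pageable allocation OK")

try:
    # Requesting page-locked memory is what vLLM's CPU cache does via pin_memory=True.
    pinned = torch.empty(num_bytes, dtype=torch.uint8, pin_memory=True)
    print("pinned allocation OK")
except RuntimeError as exc:
    # On WSL2 this is where the misleading "CUDA error: out of memory" shows up.
    print(f"pinned allocation failed: {exc}")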

SunixLiu commented 1 year ago

Thank you @zhuohan123 for helping with this. Issue closed.

ZQ-Dev8 commented 1 year ago

Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache:

https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97

pin_memory has a limit in WSL (official doc), and the limit seems to be 2GB. After commenting this out, vLLM should work properly.

Hey thank you so much for getting to the bottom of this! Never saw an email, but I'm glad you were able to figure it out with @SunixLiu. Looking forward to testing out the fix.

lucasjinreal commented 1 year ago

Got OOM as well, on a 32GB V100 with a 7B LLaMA model. It shouldn't OOM, so why does it?

FarziBuilder commented 1 year ago

Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache:

https://github.com/vllm-project/vllm/blob/4026a049d3ad510bea8e177bf71722bd510fbb46/vllm/worker/cache_engine.py#L89-L97

pin_memory has a limit in WSL (official doc), and the limit seems to be 2GB. After commenting this out, vLLM should work properly.

@zhuohan123 I don't understand how to do that. Do you want us to fork vLLM, change the source code, and then re-run?

Or is there any simpler way?

prashantskit commented 1 year ago

Hi, I am not using WSL and I am still getting the same error. I am running inference with the Llama-2 7B model on a 16GB V100 GPU.

allanwakes commented 1 year ago

I ran into the same error. I tried running Baichuan2-7B-Chat-4bits on its own and it worked, but with vLLM it goes OOM. No WSL; RTX 3060, 12GB.

flexchar commented 11 months ago

@allanwakes @prashantskit have you come across a solution?

I'm seeing the same with Mistral 7B on an RTX 3090 24GB.

nosolosoft commented 11 months ago

Mistral 7B on an RTX 4060 16GB fails. With Ollama, for example, it doesn't.

flexchar commented 11 months ago

After setting --max-model-len 8192, the OOM went away. I also use an AWQ quant with the --quantization awq parameter. Works amazingly well!

@nosolosoft, what if you try using --max-model-len 4096?
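
For what it's worth, the same setup through the Python API would look roughly like this (the AWQ repo name and values here are only examples, not necessarily what I ran):

from vllm import LLM

# Assumes a vLLM build with AWQ support; max_model_len caps the context length
# and therefore the KV-cache size that has to fit in GPU memory.
llm = LLM(
    model="TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization="awq",
    max_model_len=8192,
)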

gordicaleksa commented 11 months ago

I'm hitting this same issue on Ubuntu with 48GB of VRAM (2x RTX 3090), using the OPT-125M model.

I'm comfortably using large ML systems on my rig in general.

The pin_memory hack works on Ubuntu as well; I've just set it to False as a temporary workaround.

gordicaleksa commented 11 months ago

OK, I've done some debugging: this function is literally designed to allocate as much GPU memory as possible (up to what's set in gpu_memory_utilization). What's the reasoning behind this design decision?

Edit: I now see that this is by design; you occupy all available GPU memory in case new instances are spawned? So then it must be some other bug. I stepped into the code, and OPT-125M by itself only occupied a small amount of memory until this GPU allocation happened. I'll limit gpu_memory_utilization as a temporary workaround.

mendhak commented 11 months ago

With Mistral as well on Ubuntu, I'm getting the CUDA out of memory error, and playing with --gpu-memory-utilization doesn't seem to make a difference.

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1 --dtype half --gpu-memory-utilization 0.8  --max-model-len 4096

results in:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 10.73 GiB total capacity; 9.85 GiB already allocated; 46.44 MiB free; 9.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

I tried to follow the AWQ advice above and this works:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mistral-7B-OpenOrca-AWQ --dtype half --gpu-memory-utilization 0.7  --max-model-len 4096 --quantization awq

(Ubuntu 22.04, RTX 2080 Ti, 32GB RAM)

But I'm not sure how to get the original Mistral running. Anything I'm missing?

tongyx361 commented 6 months ago

In my case (pure Linux with an A100 (80GB)), decreasing swap_space (perhaps to below GPU memory * gpu_memory_utilization) helps. Actually, the default value of swap_space is only 4 GiB; it seems easy to end up setting it far larger than that.
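
A rough sketch of that adjustment via the LLM class (the model name and values are only illustrative):

from vllm import LLM

# swap_space is the CPU swap space per GPU in GiB (default 4); a smaller value
# shrinks the pinned CPU cache allocation discussed above.
llm = LLM(model="facebook/opt-125m", swap_space=2)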

yiakwy-xpu-ml-framework-team commented 5 months ago

OK, I've done some debugging: this function is literally designed to allocate as much GPU memory as possible (up to what's set in gpu_memory_utilization). What's the reasoning behind this design decision?

Edit: I now see that this is by design; you occupy all available GPU memory in case new instances are spawned? So then it must be some other bug. I stepped into the code, and OPT-125M by itself only occupied a small amount of memory until this GPU allocation happened. I'll limit gpu_memory_utilization as a temporary workaround.

Note that in the latest code, vLLM allocates blocks with pin_memory. The problem is that this section also tries to use all of the memory, while the memory available for the pinned DMA buffer is limited, so even when you load a small model like Llama 7B on an A100 it can go out of memory on either the CPU or the GPU side. Hope this message helps.

LGLG42 commented 5 months ago

Same situation here. I tried every single option: gpu_memory_utilization in [0.2, 0.9], enforce_eager, batch_size, smaller models like "pretrained=facebook/opt-125m", max_model_len in [128, 4096], etc., and the only thing that worked was manually hacking pin_memory to False in anaconda3/envs/vllm_p39/lib/python3.9/site-packages/vllm/worker/cache_engine.py.

LLM engine (v0.4.2)

yiakwy-xpu-ml-framework-team commented 5 months ago

Hello, the owner of this mailbox will read your message carefully! Thanks for your attention.

piotrmasior commented 4 months ago
[rank0]:   File "/home/pmasior/miniconda3/envs/vllm2/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 64, in _allocate_kv_cache
[rank0]:     torch.empty(kv_cache_shape,
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.70 GiB. GPU 

1.7 GB, my ass :D

Same issue as @LGLG42.

Not working on Python 3.9 or 3.10.

nvidia-smi
Sat May 25 23:52:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   58C    P8              21W / 450W |   2539MiB / 24564MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
piotrmasior commented 4 months ago

this works:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.8

but the memory being reserved... is funny

piotrmasior commented 4 months ago

I don't know if it will help someone, but it looks like you need to manually fine-tune --gpu-memory-utilization. For instance, this works:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.05

but 0.04 throws:

ValueError: The model's max seq len (2048) is larger than the maximum number of tokens that can be stored in KV cache (1344). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

so adjusting to:

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.04 --max-model-len 1024

also works

So in my case, manually tuning both parameters sets you up... for any model. A little irritating, though.

cheers

Daya-Jin commented 3 months ago

Lowering --gpu-memory-utilization works for me (8x A800 80GB).

python -m vllm.entrypoints.openai.api_server --model /Qwen-7B-Chat --dtype bfloat16 --api-key token-abc123 --trust-remote-code --gpu-memory-utilization 0.3 --max-model-len 4096