[Closed] SunixLiu closed this issue 1 year ago
The OPT-125M model is very small, so this should not happen. While I cannot reproduce the exact error on my side, when you initialize the LLM class, can you try adding the argument gpu_memory_utilization=0.80, or set the utilization to an even lower number? The default utilization upper bound is 0.90.
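For example (a minimal sketch; substitute your own model path):

from vllm import LLM

# Cap vLLM at 80% of GPU memory instead of the default 90%.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.80)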
Additionally, can you set CUDA_LAUNCH_BLOCKING=1 to see which exact line causes the error?
I added two lines:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
and this: gpu_memory_utilization=0.50.
But it seems the error info is the same:
INFO 06-21 22:53:33 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 22:53:41 llm_engine.py:128] # GPU blocks: 19899, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 14, in <module>
  [... same CUDA out-of-memory traceback as in the original report ...]
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
By the way, text-generation-webui can load the model successfully.
Just to make sure it's not the model's issue, can you run the facebook/opt-125m model?
I'm running into the same issue. I'm on Ubuntu WSL2, CUDA 11.8, and RTX 2080Ti, which should easily be able to accommodate opt-125m. I've tried other small models, such as cerebras/Cerebras-GPT-111M and distilgpt2, but those give different errors (probably due to architecture incompatibility?).
Edit: I just tried EleutherAI/pythia-70M-deduped and it too gives the OOM error.
I've confirmed I can run opt-125m and pythia-70m via the standard transformers library... any other ideas what might be causing the issue @zhuohan123 @WoosukKwon?
> Just to make sure it's not the model's issue, can you run the facebook/opt-125m model?

Yes, I can use text-generation-webui to load facebook/opt-125m and get it working.
Currently having this issue as well. The CUDA_VISIBLE_DEVICES environment variable has no effect either, and it only loads the models to GPU 0. I'm running on A100s but still get OOM with 125m OPT. There's something fatally wrong with this repo.
EDIT: CUDA_VISIBLE_DEVICES seems to work just fine, but the 125m model is using over 37GB of memory.
@dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM?
Thanks, I think I understand now. A somewhat related question - how is multi-GPU handled? If I load a bigger model, will it split across the available GPUs? Also, if I want more memory to accommodate more users on a single, smaller model, can I have the process use multiple GPUs at once?
@AlpinDale Good question. You can use the tensor_parallel_size argument for multi-GPU inference.
First, initialize your Ray cluster by executing
$ ray start --head
Then, use the tensor_parallel_size argument in the LLM class:
llm = LLM(model=<your model>, tensor_parallel_size=2) # Inference with 2 GPUs
> @dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM?
I'm pretty sure no other process was using my GPU. Here's the screenshot; you can see a burst in both GPU 3D usage and VRAM that happened when the code was trying to load the model.
I'm seeing the same as @SunixLiu. Nothing else is happening on my GPU. Running the code at the top of this issue spikes the GPU and then crashes.
Any new developments here? @zhuohan123
@SunixLiu @dcruiz01 We believe this is actually a bug. We are not sure whether this is caused by WSL or vLLM. We don't have a Windows environment with GPUs. How about we set up a 30-min meeting to pair-debug on your environments? Can you send me an email at zhuohan[at]berkeley.edu to set up a call? Thanks!
Just pair-debugged with @SunixLiu and we successfully located the issue. As a temporary fix, please comment out pin_memory=True in the vLLM code when allocating the CPU cache: pin_memory has a limit in WSL (see the official doc) and the limit seems to be 2GB. After commenting this out, vLLM should work properly.
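For reference, a minimal standalone sketch of what the workaround changes (the block count is taken from the log above; the block shape is made up for illustration and is not vLLM's real KV block shape):

import torch

# Pinned (page-locked) host memory is what WSL caps at roughly 2GB.
num_cpu_blocks = 7281          # from the "# CPU blocks" log line above
block_shape = (16, 16, 128)    # hypothetical KV block shape, for illustration only

# Original: fails on WSL once total pinned memory exceeds the cap.
# key_blocks = torch.empty((num_cpu_blocks, *block_shape), dtype=torch.float16, pin_memory=True)

# Workaround: allocate ordinary pageable host memory instead.
key_blocks = torch.empty((num_cpu_blocks, *block_shape), dtype=torch.float16, pin_memory=False)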
Thank you @zhuohan123 for helping with this. Issue closed.
Hey thank you so much for getting to the bottom of this! Never saw an email, but I'm glad you were able to figure it out with @SunixLiu. Looking forward to testing out the fix.
Got OOM as well, on a 32GB V100 with a 7B LLaMA model. It shouldn't OOM. Why?
@zhuohan123 I don't understand how to do that (commenting out pin_memory=True in the vLLM source). Do you want us to fork vLLM, change the source code, and then re-run? Or is there a simpler way?
Hi, I am not using WSL and am still getting the same error. I am running inference with the Llama-2 7B model on a 16GB V100 GPU.
I ran into the same error. I tried to run Baichuan2-7B-Chat-4bits alone and it worked, but with vLLM it went OOM. No WSL, RTX 3060, 12GB.
@allanwakes @prashantskit have you come across a solution?
I'm seeing the same with Mistral 7B on an RTX 3090 24GB.
Mistral 7B on an RTX 4060 16GB fails. With Ollama, for example, it doesn't.
After setting --max-model-len 8192, the OOM went away. I also use an AWQ quant with the --quantization awq parameter. Works amazingly!
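Put together, the command looks something like this (the model name here is just an example AWQ checkpoint, the same one that appears later in this thread):

python -m vllm.entrypoints.openai.api_server --model TheBloke/Mistral-7B-OpenOrca-AWQ --quantization awq --max-model-len 8192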
@nosolosoft, what if you try using --max-model-len 4096?
I'm hitting this same issue on Ubuntu, 48GB of VRAM (2x RTX 3090), using the OPT-125M model. I'm comfortably using large ML systems on my rig in general.
The pin_memory hack works for Ubuntu as well; I've just set it to False as a temporary workaround.
OK, I've done some debugging. This function is literally designed to allocate as much GPU memory as possible (up to what's set in gpu_memory_utilization). What's the reasoning behind this design decision?
Edit: I now see that this is by design; you occupy all available GPU memory in case new instances are spawned? So then it must be some other bug. I stepped into the code, and OPT-125M by itself only occupied a small amount of memory until this GPU allocation happened. I'll limit gpu_memory_utilization as a temporary workaround.
With Mistral as well on Ubuntu, I'm getting the CUDA out of memory error, and playing with --gpu-memory-utilization doesn't seem to make a difference.
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-v0.1 --dtype half --gpu-memory-utilization 0.8 --max-model-len 4096
results in:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 10.73 GiB total capacity; 9.85 GiB already allocated; 46.44 MiB free; 9.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
I tried to follow the AWQ advice above and this works:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mistral-7B-OpenOrca-AWQ --dtype half --gpu-memory-utilization 0.7 --max-model-len 4096 --quantization awq
(Ubuntu 22.04, RTX 2080 Ti, 32GB RAM)
But I'm not sure how to get the original Mistral running. Anything I'm missing?
In my case (pure Linux with an A100 80GB), decreasing swap_space (perhaps to below GPU memory x gpu_memory_utilization) helps. Actually, the default value of swap_space is only 4 GiB; it seems problems arise when it is set much larger than this value.
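For example (a sketch; swap_space is the per-GPU CPU swap size in GiB, and 4 is the default):

from vllm import LLM

# Reduce the CPU swap space from the default 4 GiB to 2 GiB.
llm = LLM(model="facebook/opt-125m", swap_space=2)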
Note that in the latest code, vLLM allocates blocks with pin_memory. The problem is that this section also tries to utilize all of the memory, while memory for the pin_memory DMA buffer is limited, so when you load even a small model like LLaMA 7B on an A100, it can go out of memory on either the CPU or GPU side. Hope this message helps.
Same situation here. I tried every single option: gpu_memory_utilization in [0.2, 0.9], enforce_eager, batch_size, smaller models like pretrained=facebook/opt-125m, max_model_len in [128, 4096], etc., etc., and the only thing that worked was manually hacking pin_memory to False in the code, in anaconda3/envs/vllm_p39/lib/python3.9/site-packages/vllm/worker/cache_engine.py. LLM engine (v0.4.2)
[rank0]: File "/home/pmasior/miniconda3/envs/vllm2/lib/python3.10/site-packages/vllm/worker/cache_engine.py", line 64, in _allocate_kv_cache
[rank0]: torch.empty(kv_cache_shape,
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.70 GiB. GPU
1.7 GB my ass :D
Same issue as @LGLG42; not working on Python 3.9 nor 3.10.
nvidia-smi
Sat May 25 23:52:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 On | Off |
| 0% 58C P8 21W / 450W | 2539MiB / 24564MiB | 7% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
This works:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.8
but the memory being reserved... is funny.
I do not know if it will help someone, but it looks like you need to manually fine-tune --gpu-memory-utilization. For instance, this works:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.05
but 0.04 throws:
ValueError: The model's max seq len (2048) is larger than the maximum number of tokens that can be stored in KV cache (1344). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
so adjusting to:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --gpu-memory-utilization 0.04 --max-model-len 1024
also works.
So in my case, manipulating both parameters manually sets you up... for any model. A little irritating, though.
Cheers
Lowering the gpu-memory-utilization works for me (8x A800 80GB).
python -m vllm.entrypoints.openai.api_server --model /Qwen-7B-Chat --dtype bfloat16 --api-key token-abc123 --trust-remote-code --gpu-memory-utilization 0.3 --max-model-len 4096
I successfully installed vLLM in WSL2. When I was trying to run the sample code, I got error info like this:
INFO 06-21 21:40:02 llm_engine.py:59] Initializing an LLM engine with config: model='/mnt/d/github/text-generation-webui/models/facebook_opt-125m', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-21 21:40:12 llm_engine.py:128] # GPU blocks: 37375, # CPU blocks: 7281
Traceback (most recent call last):
  File "/mnt/d/01Projects/vllm/prac_1.py", line 11, in <module>
    llm = LLM(model="/mnt/d/github/text-generation-webui/models/facebook_opt-125m")
  File "/mnt/d/github/vllm/vllm/entrypoints/llm.py", line 55, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 145, in from_engine_args
    engine = cls(*engine_configs, distributed_init_method, devices,
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 102, in __init__
    self._init_cache()
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 134, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/mnt/d/github/vllm/vllm/engine/llm_engine.py", line 307, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/d/github/vllm/vllm/worker/worker.py", line 126, in init_cache_engine
    self.cache_engine = CacheEngine(
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 41, in __init__
    self.cpu_cache = self.allocate_cpu_cache()
  File "/mnt/d/github/vllm/vllm/worker/cache_engine.py", line 89, in allocate_cpu_cache
    key_blocks = torch.empty(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Python: 3.10.11
GPU: RTX 3090 24G
Linux: WSL2, Ubuntu 20.04.6 LTS

Can anyone help to answer this?