[Bug]: high gpu_memory_utilization with 'OOM' and low gpu_memory_utilization with 'No available memory for the cache blocks' #5274

Open mars-ch opened 3 months ago

mars-ch commented 3 months ago

Your current environment

V100 32GB × 8

🐛 Describe the bug

I tried to run a 32B model with LoRA adapters and tested different gpu_memory_utilization values.

When gpu_memory_utilization = 0.9, it failed with an OOM error. When gpu_memory_utilization = 0.8, it failed with 'No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.'

Does this mean I need to find a suitable value for gpu_memory_utilization, or is something else going wrong?

mgoin commented 3 months ago

@mars-ch what if you try using a smaller max_model_len? Could you share your script? It is important to know how many lora adapters and what tensor parallelism you are using.
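For reference, here is a minimal sketch of how those knobs are typically passed to the LLM class; the model path, adapter path, and the specific values are placeholders rather than anything confirmed in this issue:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder paths and values; the point is which knobs mgoin is asking about.
llm = LLM(
    model="/path/to/32b-model",   # placeholder model path
    tensor_parallel_size=8,       # shard the weights across 8 GPUs
    max_model_len=4096,           # a smaller context length shrinks the KV cache
    enable_lora=True,             # needed to serve LoRA adapters
    max_loras=1,                  # how many adapters can be active at once
)

outputs = llm.generate(
    ["Hello"],
    SamplingParams(max_tokens=64),
    # Hypothetical adapter name and path, purely for illustration.
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora-adapter"),
)
```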

riverind commented 2 months ago

Please try decreasing max_model_len, increasing gpu_memory_utilization, or increasing tensor_parallel_size so the model is spread across more GPUs.
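As a rough illustration of that advice (the values below are made up, not taken from this issue), each suggestion maps to one constructor argument:

```python
from vllm import LLM

# Illustrative values only; tune them for your own hardware.
llm = LLM(
    model="/path/to/32b-model",    # placeholder
    max_model_len=2048,            # decrease: smaller KV cache per sequence
    gpu_memory_utilization=0.95,   # increase: give vLLM a larger memory budget
    tensor_parallel_size=8,        # increase: spread the weights over more GPUs
)
```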

mars-ch commented 2 months ago

> @mars-ch what if you try using a smaller max_model_len? Could you share your script? It is important to know how many lora adapters and what tensor parallelism you are using.

I used the LLM class like this:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

llm = LLM(model="output_merged", dtype="half", gpu_memory_utilization=0.95, tensor_parallel_size=8, enforce_eager=True)

Thanks.

mars-ch commented 2 months ago

What's more, the error shows that there are 2 processes:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 234.00 MiB. GPU  has a total capacity of 31.75 GiB of which 98.75 MiB is free. Process 2852145 has 8.72 GiB memory in use. Process 2878797 has 22.92 GiB memory in use. Of the allocated memory 20.56 GiB is allocated by PyTorch, and 480.11 MiB is reserved by PyTorch but unallocated.
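Not part of the original replies, but one way to confirm what is already occupying each GPU before constructing the LLM (for example a leftover process like the 8.72 GiB one in the traceback above, which nvidia-smi would also show) is to query free and total memory with torch.cuda.mem_get_info; a minimal sketch:

```python
import torch

# Report free vs. total memory on every visible GPU so pre-existing
# allocations are easy to spot before vLLM starts its own profiling run.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```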

DarkLight1337 commented 2 months ago

There is currently a bug in the model profiling logic which causes the memory profiler to underestimate the amount of memory to be used by the model. To circumvent this, you can reduce the value of gpu_memory_utilization.
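A minimal sketch of that workaround applied to the script above; the 0.85 value is only an illustrative guess, not a number recommended in this thread:

```python
from vllm import LLM

llm = LLM(
    model="output_merged",
    dtype="half",
    tensor_parallel_size=8,
    enforce_eager=True,
    gpu_memory_utilization=0.85,  # lowered from 0.95 to leave headroom for the underestimated usage
)
```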