vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: cuda OOM errors persist across requests. #6907

Open servient-ashwin opened 3 months ago

servient-ashwin commented 3 months ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.27.7
Libc version: glibc-2.26

Python version: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.10.219-208.866.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7R32
Stepping:            0
CPU MHz:             3257.105
BogoMIPS:            5600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-nccl-cu11==2.14.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu12==2.18.1.0.4.0
[conda] numpy                     1.26.2                   pypi_0    pypi
[conda] nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
[conda] vllm-nccl-cu12            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  0-3 0       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When requests are too large and arrive too frequently, I get CUDA OOM errors. That part is a user/application issue, i.e. how requests are handled before they reach the server.

However, once that happens, every subsequent request, regardless of its size, also fails with a CUDA OOM error until the server is restarted. Is there a way to soft-reload when an OOM error is hit, or any other way to recover? Restarting the server every time this happens is not practical.

Other steps I have already tried include reducing GPU memory utilization, adding timeouts, and changing context lengths, but these feel like stop-gaps for the underlying GPU memory issue.

Steps to reproduce

  1. Use vLLM 0.5.1 with NVIDIA driver 550.54.14 and CUDA 12.4.
  2. Start the server on the NVIDIA A10G with: python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --download-dir /tmp/ --port 8006 --tensor-parallel-size 1 --gpu-memory-utilization 1 &
  3. Send large requests at a high rate; by large I mean requests that use the full context length (a client sketch that does this is shown after this list).
  4. Observe the first CUDA OOM error.
  5. All requests that follow this error also fail with CUDA OOM.
  6. Restart the server.
  7. The issue is resolved; repeat steps 1 through 6 to reproduce again.
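For illustration, a minimal client sketch that exercises step 3 might look like the following. It assumes the server from step 2 is reachable at http://localhost:8006 and uses the OpenAI-compatible /v1/completions endpoint; the prompt construction is a stand-in for the real workload, not the exact client from this report.

```python
# Reproduction sketch (not the exact client from the report): fire several
# near-full-context requests at the server concurrently, then observe that
# later small requests also fail with OOM.
import concurrent.futures

import requests

URL = "http://localhost:8006/v1/completions"
MODEL = "mistralai/Mistral-7B-Instruct-v0.3"

# Build a prompt long enough to approach the model's context window.
# The token count is approximate; the point is simply "very long input".
long_prompt = "Summarize the following text.\n" + ("lorem ipsum " * 12000)

def send_request(_):
    payload = {
        "model": MODEL,
        "prompt": long_prompt,
        "max_tokens": 512,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    return resp.status_code, resp.text[:200]

# Many long requests at once trigger the first OOM (step 4); sending a small
# request afterwards shows the persistent failures described in step 5.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for status, body in pool.map(send_request, range(16)):
        print(status, body)
```
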
WoosukKwon commented 3 months ago

@servient-ashwin Could you please reduce gpu-memory-utilization to 0.9 (the default) or 0.95? Because vLLM's memory profiling is not 100% accurate, setting gpu-memory-utilization too high may lead to OOMs when there is extra memory usage that the profiling does not capture.

servient-ashwin commented 3 months ago

Got it. The reason we set it to 1 was that the A10G has just enough memory (24 GB) to load the model mentioned above in full precision, non-quantized. At 0.9 the model wouldn't load with its full context length.
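For reference, here is a rough back-of-envelope sketch of the memory budget (my own numbers, not from the thread): the layer count, GQA head count, head dimension, and 32k context are the published Mistral-7B-v0.3 figures, the weights are assumed to load in 16-bit, and everything is approximate.

```python
# Rough memory budget on a 24 GB A10G for Mistral-7B-Instruct-v0.3 with
# 16-bit weights. Approximate: ignores activation workspace, CUDA context,
# and fragmentation, which is exactly the headroom gpu-memory-utilization
# is meant to leave.
GiB = 1024 ** 3

params = 7.25e9                     # ~7.25B parameters
weights = params * 2 / GiB          # 2 bytes/param -> ~13.5 GiB

# KV cache per token: layers * kv_heads * head_dim * 2 (K and V) * 2 bytes
layers, kv_heads, head_dim = 32, 8, 128
kv_per_token = layers * kv_heads * head_dim * 2 * 2   # 131072 B = 128 KiB

context = 32768                     # full context window
kv_full_seq = context * kv_per_token / GiB            # ~4 GiB per full-context sequence

total_mem = 24                      # nominal 24 GB, treated as GiB for a rough estimate
for util in (1.0, 0.95, 0.9):
    kv_budget = total_mem * util - weights
    print(f"util={util}: ~{kv_budget:.1f} GiB left for KV cache "
          f"(~{kv_budget / kv_full_seq:.1f} full-context sequences)")
```

Even at a utilization of 1.0 only a few GiB remain beyond the weights for KV cache plus activation workspace, which is why the unprofiled-memory headroom mentioned above matters so much at high utilization with full-context requests.
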

servient-ashwin commented 2 months ago

@WoosukKwon I tried all the memory-utilization settings, including the one you suggested, and I continue to see this error with long-form contexts. At this point I am unable to pin down the root cause of the memory leak behind the OOM, beyond my observations of request lengths, GPU usage, and token-generation latencies. What I'd like to know as a stop-gap is whether there is any way to hot-reload the running server when it hits a CUDA OOM. Since there could be a variety of reasons for OOM errors, I also came across NVIDIA's Compute Sanitizer toolkit, but that only scratches the surface of this issue.

For now, what can I do to implement a hot reload of the model server on OOM?
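One possible stop-gap, not a vLLM feature but an external supervisor, is a small watchdog that launches the server, polls its health endpoint, and restarts the process once it stops responding (for example after the engine has died on a CUDA OOM). The sketch below is only illustrative: it reuses the launch command from the reproduction steps, assumes the OpenAI-compatible server exposes /health on the same port, and the intervals are arbitrary values to tune for your setup.

```python
# External watchdog sketch: restart the vLLM server when it stops answering
# health checks (e.g. after the engine has crashed with a CUDA OOM).
# The command, port, /health endpoint, and timings are assumptions to adjust.
import subprocess
import time

import requests

CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mistralai/Mistral-7B-Instruct-v0.3",
    "--download-dir", "/tmp/",
    "--port", "8006",
    "--tensor-parallel-size", "1",
    "--gpu-memory-utilization", "0.95",
]
HEALTH_URL = "http://localhost:8006/health"

def healthy():
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def run_forever():
    while True:
        proc = subprocess.Popen(CMD)
        time.sleep(120)  # give the server time to load the model
        # Poll until the process exits or health checks start failing.
        while proc.poll() is None and healthy():
            time.sleep(10)
        proc.terminate()
        try:
            proc.wait(timeout=60)
        except subprocess.TimeoutExpired:
            proc.kill()
        # Loop around and start a fresh server process.

if __name__ == "__main__":
    run_forever()
```

This does not free GPU memory in place; it only automates the "restart the server" step, so clients see a brief outage instead of a permanently broken endpoint.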