[Bug]: memory leak - Githubissues

wciq1208 commented 2 days ago

Your current environment

The output of `python collect_env.py`

```text
Collecting environment information...
PyTorch version: 2.4.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
BIOS Vendor ID: Red Hat
Model name: Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz
BIOS Model name: RHEL 7.6.0 PC (i440FX + PIIX, 1996)
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 4
Stepping: 4
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke md_clear spec_ctrl intel_stibp arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 64 MiB (16 instances)
L3 cache: 64 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS (kernel), IBPB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.560.30
[pip3] onnxruntime==1.16.3
[pip3] optree==0.12.1
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] transformers-stream-generator==0.0.4
[pip3] triton==3.0.0
[conda] blas 1.0 mkl
[conda] cuda-cudart 12.1.105 0 nvidia
[conda] cuda-cupti 12.1.105 0 nvidia
[conda] cuda-libraries 12.1.0 0 nvidia
[conda] cuda-nvrtc 12.1.105 0 nvidia
[conda] cuda-nvtx 12.1.105 0 nvidia
[conda] cuda-opencl 12.5.39 0 nvidia
[conda] cuda-runtime 12.1.0 0 nvidia
[conda] cuda-version 12.5 3 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libcublas 12.1.0.26 0 nvidia
[conda] libcufft 11.0.2.4 0 nvidia
[conda] libcufile 1.10.1.7 0 nvidia
[conda] libcurand 10.3.6.82 0 nvidia
[conda] libcusolver 11.4.4.55 0 nvidia
[conda] libcusparse 12.0.2.55 0 nvidia
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] libnpp 12.0.2.50 0 nvidia
[conda] libnvjitlink 12.1.105 0 nvidia
[conda] libnvjpeg 12.1.1.14 0 nvidia
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py311h5eee18b_1
[conda] mkl_fft 1.3.8 py311h5eee18b_0
[conda] mkl_random 1.2.4 py311hdb19cb5_0
[conda] numpy 1.26.4 py311h08b1b3b_0
[conda] numpy-base 1.26.4 py311hf175353_0
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] optree 0.12.1 pypi_0 pypi
[conda] pytorch 2.4.0 py3.11_cuda12.1_cudnn9.1.0_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] sentence-transformers 3.0.1 pypi_0 pypi
[conda] torchaudio 2.4.0 py311_cu121 pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchtriton 3.0.0 py311 pytorch
[conda] torchvision 0.19.0 py311_cu121 pytorch
[conda] transformers 4.44.2 pypi_0 pypi
[conda] transformers-stream-generator 0.0.4 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.1.post2@9ba0817ff1eb514f51cc6de9cb8e16c98d6ee44f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-15            0               N/A
GPU1    PHB      X      0-15            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps

No response

🐛 Describe the bug

vllm serve /hestia/model/Qwen2.5-14B-Instruct-AWQ --max-model-len 16384 --quantization awq --port 8001 --swap-space 0 --served-model-name qwen --enable-auto-tool-choice --tool-call-parser hermes --num-gpu-blocks-override 1024

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

robertgshaw2-neuralmagic commented 2 days ago

Is this while the system is running?

wciq1208 commented 2 days ago

> Is this while the system is running?

Yes. I also set the guided decoding backend to lm-format-enforcer, but I don't yet know where the problem occurs. The system had been running for about three hours when memory usage reached this level.

robertgshaw2-neuralmagic commented 2 days ago

So why is this evidence of a memory leak?

wciq1208 commented 2 days ago

> So why is this evidence of a memory leak?

My concurrency is only 1, and memory usage did not decrease after I stopped sending requests. Since then, I have not observed any memory being reclaimed.
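
One way to back up an observation like this is to log the server process's resident memory over time, both under load and after requests stop. The sketch below is only an illustration, assuming a Linux host where `/proc/PID/status` is readable. The helper names (`parse_vmrss_kib`, `read_rss_kib`, `sample_rss`) are hypothetical and not part of vLLM.

```python
import re
import time

# Matches the resident-set-size line of /proc/PID/status, e.g. "VmRSS:  1536 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_kib(status_text: str) -> int:
    """Extract the resident set size (in KiB) from /proc/PID/status text."""
    match = _VMRSS.search(status_text)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="ascii") as fh:
        return parse_vmrss_kib(fh.read())


def sample_rss(pid: int, samples: int, interval_s: float) -> list[int]:
    """Collect `samples` RSS readings spaced `interval_s` seconds apart."""
    readings = []
    for i in range(samples):
        readings.append(read_rss_kib(pid))
        if i != samples - 1:
            time.sleep(interval_s)
    return readings
```

Calling `sample_rss(server_pid, samples=60, interval_s=60.0)` once while requests are flowing and once after they stop would give two series to compare. A series that keeps rising, or never drops back, is more convincing than a single reading.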

wciq1208 commented 1 day ago

> So why is this evidence of a memory leak?

I tried triggering GC periodically through a plugin, but this didn't solve the problem. I also observed this issue on version 0.6.0, but when I rolled back to 0.5.5, the problem disappeared. The same model ran for 6 hours on version 0.5.0, and with the same request volume, the memory peak never exceeded 4.5GB. So, the issue seems to have been introduced in version 0.6.0.
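
If periodic GC does not help, it can be useful to check whether the growth comes from live Python objects at all. The sketch below uses the standard-library `tracemalloc` module to compare two snapshots. The helper names are hypothetical, and it is an assumption that this code could be run inside the serving process (for example from a plugin). `tracemalloc` only sees allocations made through Python's allocator, so growth in native or CUDA memory would not appear here.

```python
import gc
import tracemalloc


def take_snapshot() -> tracemalloc.Snapshot:
    """Force a GC pass, then capture the Python allocations still alive.

    tracemalloc.start() must have been called beforehand.
    """
    gc.collect()
    return tracemalloc.take_snapshot()


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew the most."""
    stats = after.compare_to(before, "lineno")
    grown = [s for s in stats if s.size_diff > 0]
    grown.sort(key=lambda s: s.size_diff, reverse=True)
    return grown[:limit]
```

A typical use would be to call `tracemalloc.start()`, take one snapshot after warm-up and another a few hours later, and print the entries from `top_growth`. If nothing significant shows up while process memory keeps rising, the growth is more likely outside the Python heap.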

lcvcl commented 1 day ago

I'm having the same problem: a GPU memory leak. The memory keeps growing.

wciq1208 commented 1 day ago

> I'm having the same problem: a GPU memory leak. The memory keeps growing.

I have also encountered continuous GPU memory growth, and I eventually found that it was caused by memory fragmentation. I wrote a custom plugin to work around it, but the memory issue reported here might have a different cause.

env: VLLM_PLUGINS: clean_cuda_cache

plugin code:

# coding=utf-8

import gc
import time
from typing import List, Optional

import torch
from vllm.model_executor.layers.sampler import SamplerOutput
from vllm.sequence import ExecuteModelRequest
from vllm.worker.worker_base import LocalOrDistributedWorkerBase

# Enable with: export VLLM_PLUGINS=clean_cuda_cache


class CleanTimepoint:
    # Timestamp of the last cleanup and the minimum interval between cleanups.
    last_timepoint = 0
    interval_s = 60


def execute_model(
    self,
    execute_model_req: Optional[ExecuteModelRequest] = None,
) -> Optional[List[SamplerOutput]]:
    # Run the original method, then clean up at most once per interval.
    res = self.origin_execute_model(execute_model_req)
    now = time.time()
    if now - CleanTimepoint.last_timepoint > CleanTimepoint.interval_s:
        # Collect garbage first so that blocks held by dead tensors can also be released.
        gc.collect()
        torch.cuda.empty_cache()
        CleanTimepoint.last_timepoint = now
    return res


def register():
    # Patch only once, and only when CUDA is available.
    if torch.cuda.is_available() and not getattr(LocalOrDistributedWorkerBase, "origin_execute_model", None):
        LocalOrDistributedWorkerBase.origin_execute_model = LocalOrDistributedWorkerBase.execute_model
        LocalOrDistributedWorkerBase.execute_model = execute_model
        print(LocalOrDistributedWorkerBase.origin_execute_model)
        print(LocalOrDistributedWorkerBase.execute_model)
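
To tell caching or fragmentation apart from a real GPU leak, one rough check is to compare the memory occupied by live tensors with the memory held by PyTorch's caching allocator. The sketch below is a minimal illustration, not part of the plugin. The function names are hypothetical, and only `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` are actual PyTorch calls.

```python
def cache_overhead(allocated_bytes: int, reserved_bytes: int) -> float:
    """Fraction of reserved memory not backing live tensors (0.0 to 1.0)."""
    if reserved_bytes <= 0:
        return 0.0
    return max(reserved_bytes - allocated_bytes, 0) / reserved_bytes


def cuda_cache_overhead(device: int = 0) -> float:
    """Measure the overhead on one CUDA device; needs PyTorch with CUDA."""
    import torch  # imported lazily so cache_overhead works without a GPU

    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    return cache_overhead(allocated, reserved)
```

If the allocated figure stays flat while the reserved figure grows, cached or fragmented blocks are the likelier cause, and `torch.cuda.empty_cache()` can help. If the allocated figure itself keeps growing, tensors are still being referenced somewhere and clearing the cache will not fix it.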

vllm-project / vllm

[Bug]: memory leak #8629
