vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: v0.6.4.post1 crashed: Error in model execution: CUDA error: an illegal memory access was encountered #10389

Open wciq1208 opened 1 week ago

wciq1208 commented 1 week ago

Your current environment

The output of `python collect_env.py`

```text
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6140M CPU @ 2.30GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       4
Stepping:                        4
BogoMIPS:                        4599.99
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke md_clear spec_ctrl intel_stibp arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       512 KiB (16 instances)
L1i cache:                       512 KiB (16 instances)
L2 cache:                        64 MiB (16 instances)
L3 cache:                        64 MiB (4 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS (kernel), IBPB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.13.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1+cu124
[pip3] torchaudio==2.5.1+cu124
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.20.1+cu124
[pip3] transformers==4.46.2
[pip3] triton==3.1.0
[conda] numpy                    1.26.4          pypi_0  pypi
[conda] nvidia-cublas-cu12       12.4.5.8        pypi_0  pypi
[conda] nvidia-cuda-cupti-cu12   12.4.127        pypi_0  pypi
[conda] nvidia-cuda-nvrtc-cu12   12.4.127        pypi_0  pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127        pypi_0  pypi
[conda] nvidia-cudnn-cu12        9.1.0.70        pypi_0  pypi
[conda] nvidia-cufft-cu12        11.2.1.3        pypi_0  pypi
[conda] nvidia-curand-cu12       10.3.5.147      pypi_0  pypi
[conda] nvidia-cusolver-cu12     11.6.1.9        pypi_0  pypi
[conda] nvidia-cusparse-cu12     12.3.1.170      pypi_0  pypi
[conda] nvidia-ml-py             12.560.30       pypi_0  pypi
[conda] nvidia-nccl-cu12         2.21.5          pypi_0  pypi
[conda] nvidia-nvjitlink-cu12    12.4.127        pypi_0  pypi
[conda] nvidia-nvtx-cu12         12.4.127        pypi_0  pypi
[conda] optree                   0.13.0          pypi_0  pypi
[conda] pyzmq                    26.2.0          pypi_0  pypi
[conda] torch                    2.5.1+cu124     pypi_0  pypi
[conda] torchaudio               2.5.1+cu124     pypi_0  pypi
[conda] torchelastic             0.2.2           pypi_0  pypi
[conda] torchvision              0.20.1+cu124    pypi_0  pypi
[conda] transformers             4.46.2          pypi_0  pypi
[conda] triton                   3.1.0           pypi_0  pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.4.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-15            0               N/A
GPU1    PHB      X      0-15            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
PYTORCH_VERSION=2.5.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
VLLM_PLUGINS=clean_cuda_cache
LD_LIBRARY_PATH=/opt/conda/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
VLLM_RPC_TIMEOUT=600000
CUDA_MODULE_LOADING=LAZY
```

Model Input Dumps

err_execute_model_input_20241116-081810.zip
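
For anyone who wants to inspect the attached dump, here is a minimal sketch, assuming the zip simply wraps the `err_execute_model_input_*.pkl` file that `model_runner_base.py` writes (the exact object layout depends on the vLLM build, and unpickling will need vLLM and torch importable in the same environment):

```python
import pickle
import zipfile

# Hypothetical local path to the attached archive.
archive = "err_execute_model_input_20241116-081810.zip"

with zipfile.ZipFile(archive) as zf:
    # Assume the archive contains a single pickled dump file.
    pkl_name = next(n for n in zf.namelist() if n.endswith(".pkl"))
    with zf.open(pkl_name) as f:
        dump = pickle.load(f)

# Print the top-level structure without assuming exact field names.
print(type(dump))
for attr in (a for a in dir(dump) if not a.startswith("_")):
    print(attr, "->", type(getattr(dump, attr, None)))
```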

🐛 Describe the bug

command

vllm serve /hestia/model/Qwen2.5-14B-Instruct-AWQ --max-model-len 32768 --quantization awq_marlin --port 8001 --served-model-name qwen --num-gpu-blocks-override 2048 --disable-log-requests --swap-space 4 --enable-prefix-caching --enable-chunked-prefill
INFO 11-16 10:37:50 metrics.py:449] Avg prompt throughput: 5941.0 tokens/s, Avg generation throughput: 16.5 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 13 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%.
INFO 11-16 10:37:50 metrics.py:465] Prefix cache hit rate: GPU: 94.87%, CPU: 0.00%
INFO:     ::1:59242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 11-16 10:37:53 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241116-103753.pkl...
WARNING 11-16 10:37:53 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 11-16 10:37:53 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 11-16 10:37:53 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 11-16 10:37:53 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING 11-16 10:37:53 model_runner_base.py:143] 
CRITICAL 11-16 10:37:53 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:59242 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 11-16 10:37:53 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     ::1:59468 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 11-16 10:37:53 engine.py:135] RuntimeError('Error in model execution: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 11-16 10:37:53 engine.py:135] Traceback (most recent call last):
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-16 10:37:53 engine.py:135]     return func(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1687, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     logits = self.model.compute_logits(hidden_or_intermediate_states,
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 478, in compute_logits
ERROR 11-16 10:37:53 engine.py:135]     logits = self.logits_processor(self.lm_head, hidden_states,
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 11-16 10:37:53 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 11-16 10:37:53 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 74, in forward
ERROR 11-16 10:37:53 engine.py:135]     logits = _apply_logits_processors(logits, sampling_metadata)
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 150, in _apply_logits_processors
ERROR 11-16 10:37:53 engine.py:135]     logits_row = logits_processor(past_tokens_ids,
ERROR 11-16 10:37:53 engine.py:135]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/guided_decoding/outlines_logits_processors.py", line 87, in __call__
ERROR 11-16 10:37:53 engine.py:135]     allowed_tokens = torch.tensor(allowed_tokens, device=scores.device)
ERROR 11-16 10:37:53 engine.py:135]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 11-16 10:37:53 engine.py:135] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-16 10:37:53 engine.py:135] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-16 10:37:53 engine.py:135] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] The above exception was the direct cause of the following exception:
ERROR 11-16 10:37:53 engine.py:135] 
ERROR 11-16 10:37:53 engine.py:135] Traceback (most recent call last):
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 11-16 10:37:53 engine.py:135]     self.run_engine_loop()
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 11-16 10:37:53 engine.py:135]     request_outputs = self.engine_step()
ERROR 11-16 10:37:53 engine.py:135]                       ^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 11-16 10:37:53 engine.py:135]     raise e
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 11-16 10:37:53 engine.py:135]     return self.engine.step()
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 1454, in step
ERROR 11-16 10:37:53 engine.py:135]     outputs = self.model_executor.execute_model(
ERROR 11-16 10:37:53 engine.py:135]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 125, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 11-16 10:37:53 engine.py:135]     output = self.model_runner.execute_model(
ERROR 11-16 10:37:53 engine.py:135]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 11-16 10:37:53 engine.py:135]     return func(*args, **kwargs)
ERROR 11-16 10:37:53 engine.py:135]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-16 10:37:53 engine.py:135]   File "/opt/conda/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 11-16 10:37:53 engine.py:135]     raise type(err)(f"Error in model execution: "
ERROR 11-16 10:37:53 engine.py:135] RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
ERROR 11-16 10:37:53 engine.py:135] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 11-16 10:37:53 engine.py:135] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 11-16 10:37:53 engine.py:135] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 11-16 10:37:53 engine.py:135] 
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [618]
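
The traceback dies inside `outlines_logits_processors.py` while building the `allowed_tokens` tensor, so the failing request is going through guided (outlines) decoding on top of chunked prefill and prefix caching. Below is a minimal client sketch of the kind of request that exercises that code path, assuming the server launched above is listening on port 8001; the JSON schema and prompt are made up for illustration:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

# Hypothetical schema: any guided_json / guided_regex request is routed
# through the outlines logits processor shown in the traceback.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = client.chat.completions.create(
    model="qwen",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Reply with a JSON object."}],
    extra_body={"guided_json": schema},  # vLLM-specific extra parameter
)
print(resp.choices[0].message.content)
```

Running several of these concurrently (the metrics line above shows 3 running and 13 pending requests) is presumably what trips the illegal access; relaunching the server with CUDA_LAUNCH_BLOCKING=1, as the log itself suggests, should report the failing kernel closer to its launch site.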


DaBossCoda commented 1 week ago

Getting this a lot since 0.6.3; it seems to be related to AWQ models.

llmforever commented 1 week ago

Same situation here. Can anyone solve this?

epark001 commented 1 week ago

Experiencing this as well. I thought this would be fixed by #9532, but I'm still hitting it since 0.6.3.

Edit: still experiencing this in 0.6.2.

sunyicode0012 commented 1 week ago

I encountered the same problem and was quite confused by it. Version: 0.6.3.post1, model: llama-3.1-405B-FP8.

DaBossCoda commented 1 week ago

INFO 11-19 11:15:57 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241119-111557.pkl...
WARNING 11-19 11:15:57 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 11-19 11:15:57 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 11-19 11:15:57 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 11-19 11:15:57 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING 11-19 11:15:57 model_runner_base.py:143]
[rank0]:[E1119 11:15:57.623518240 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

seven1122 commented 5 days ago

Same problem in 0.6.1, 0.6.3.post1, and 0.6.4.post1.

DaBossCoda commented 4 days ago

Happens to me in 0.6.2 too.

badrjd commented 2 days ago

Same for me on Llama 3.1 70B AWQ, from 0.6.1 to 0.6.4.post1.

BIGWangYuDong commented 2 days ago

Same for me on Qwen2.5-72B.

linfan commented 19 hours ago

Same issue for QWen-2.5-72B-GPTQ-INT4 on 0.6.4.post1, with enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=False.

badrjd commented 12 hours ago

Going back to 0.6.0 fixed the issue for me, but unfortunately it's quite a bit slower.