vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: qwen2-72b-instruct model with RuntimeError: CUDA error: an illegal memory access was encountered #6776

Open izhuhaoran opened 3 months ago

izhuhaoran commented 3 months ago

Your current environment

PyTorch version: 2.3.0a0+ebedce2
Is debug build: False
CUDA used to build PyTorch: 12.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.19.91-014-kangaroo.2.10.13.5c249cdaf.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          48
On-line CPU(s) list:             0-47
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              48
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       2.3 MiB (48 instances)
L1i cache:                       1.5 MiB (48 instances)
L2 cache:                        60 MiB (48 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-47
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] onnx==1.15.0rc2
[pip3] optree==0.10.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.3.0a0+ebedce2
[pip3] torch-tensorrt==2.3.0a0
[pip3] torchdata==0.7.1a0
[pip3] torchtext==0.17.0a0
[pip3] torchvision==0.18.0a0
[pip3] transformers==4.42.4
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV6     NV6     NV6     PHB     PHB     0-47            N/A             N/A
GPU1    NV6      X      NV6     NV6     PHB     PHB     0-47            N/A             N/A
GPU2    NV6     NV6      X      NV6     PHB     PHB     0-47            N/A             N/A
GPU3    NV6     NV6     NV6      X      PHB     PHB     0-47            N/A             N/A
NIC0    PHB     PHB     PHB     PHB      X      PHB
NIC1    PHB     PHB     PHB     PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

🐛 Describe the bug

I am running a test Python script, test_llm.py; its code is as follows:

Click to expand test_llm.py

```python
import torch
from vllm import LLM, SamplingParams
import random
import argparse
import time

random.seed(0)  # Set the random seed for reproducibility

dummy_prompt = "hello " * 30
# print(dummy_prompt)

prompts = []
with open("./benchmarks/sonnet.txt", "r") as f:
    prompts = f.readlines()
    prompts = [prompt.strip() for prompt in prompts]
# random.shuffle(prompts)


def test_llm(model: str, n, max_tokens, tp_size):
    prompts_choose = prompts[:n]
    # print(prompts_choose)

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0,
                                     top_p=1.0,
                                     max_tokens=max_tokens,
                                     ignore_eos=True)

    # Create an LLM.
    llm = LLM(model=model,
              trust_remote_code=True,
              enforce_eager=True,
              disable_log_stats=False,
              max_num_seqs=n,
              tensor_parallel_size=tp_size,
              disable_custom_all_reduce=True,
              gpu_memory_utilization=0.9)

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    torch.cuda.synchronize()
    time1 = time.perf_counter()
    outputs = llm.generate(prompts_choose, sampling_params)
    torch.cuda.synchronize()
    time2 = time.perf_counter()
    print(f"\nllm.generate over. All Generate Time: {time2 - time1:.5f} s\n")

    # # Print the outputs.
    # for output in outputs:
    #     prompt = output.prompt
    #     generated_text = output.outputs[0].text
    #     # print(f"Prompt: {prompt!r},\n")
    #     print(f"Generated text: {generated_text!r}\n")


def test():
    parser = argparse.ArgumentParser(description='Test LLM')
    parser.add_argument('-n', type=int, default=4, help='Number of prompts')
    parser.add_argument('-max_tokens', type=int, default=16, help='Maximum number of tokens')
    parser.add_argument('-tp_size', type=int, default=1, help='Tensor Parallel Size')
    parser.add_argument('-model', type=str, help='Model path')

    args = parser.parse_args()
    n = args.n
    max_tokens = args.max_tokens
    tp_size = args.tp_size
    model = args.model

    test_llm(model, n, max_tokens, tp_size)


test()
```

The run command is as follows:

python test_llm.py -n 128 -max_tokens 256 -tp_size 4 -model {YOUR_PATH}

When I use the qwen2-72b-instruct model with max_num_seqs=128, tensor_parallel_size=4, enforce_eager=True, and prompts from vllm/benchmarks/sonnet.txt, it crashes inexplicably with RuntimeError: CUDA error: an illegal memory access was encountered.

It has been verified to occur with batch_size=128 (batch_size 64 and 256 are normal), and with every max_tokens value tried (4, 8, 16, 32, 64, 128, 256, etc. all trigger it).

Using dummy_prompt = "hello " * 30 as the prompt also triggers this error; see the sketch below.
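
For reference, a minimal sketch of that dummy-prompt variant (not part of the original script; it mirrors the LLM arguments from test_llm.py and the crashing configuration n=128, max_tokens=256, tp_size=4 described above — the model path is a placeholder):

```python
# Hypothetical minimal repro using the dummy prompt instead of sonnet.txt.
from vllm import LLM, SamplingParams

n, max_tokens, tp_size = 128, 256, 4  # crashing configuration from this report
dummy_prompt = "hello " * 30
prompts = [dummy_prompt] * n

sampling_params = SamplingParams(temperature=0.0, top_p=1.0,
                                 max_tokens=max_tokens, ignore_eos=True)

# Same arguments as in test_llm.py above; replace the model path with yours.
llm = LLM(model="/path/to/qwen2-72b-instruct",
          trust_remote_code=True,
          enforce_eager=True,
          max_num_seqs=n,
          tensor_parallel_size=tp_size,
          disable_custom_all_reduce=True,
          gpu_memory_utilization=0.9)

outputs = llm.generate(prompts, sampling_params)  # crashes here per the report
```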

The error output is:

Processed prompts:   0%|                                                                                              | 0/128 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]Traceback (most recent call last):
  File "/mnt/data/zhuhr/vllm-main/test.log/test_llm.py", line 240, in <module>
    test()
  File "/mnt/data/zhuhr/vllm-main/test.log/test_llm.py", line 238, in test
    test_llm(model, n, max_tokens, tp_size)
  File "/mnt/data/zhuhr/vllm-main/test.log/test_llm.py", line 208, in test_llm
    outputs = llm.generate(prompts_choose, sampling_params)
  File "/mnt/data/zhuhr/vllm-main/vllm/utils.py", line 844, in inner
    return fn(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/entrypoints/llm.py", line 316, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
  File "/mnt/data/zhuhr/vllm-main/vllm/entrypoints/llm.py", line 569, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/mnt/data/zhuhr/vllm-main/vllm/engine/llm_engine.py", line 911, in step
    output = self.model_executor.execute_model(
  File "/mnt/data/zhuhr/vllm-main/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
    driver_outputs = self._driver_execute_model(execute_model_req)
  File "/mnt/data/zhuhr/vllm-main/vllm/executor/multiproc_gpu_executor.py", line 141, in _driver_execute_model
    return self.driver_worker.execute_model(execute_model_req)
  File "/mnt/data/zhuhr/vllm-main/vllm/worker/worker_base.py", line 272, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/worker/model_runner.py", line 1354, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/model_executor/models/qwen2.py", line 336, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/model_executor/models/qwen2.py", line 257, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/model_executor/models/qwen2.py", line 209, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/model_executor/models/qwen2.py", line 156, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/zhuhr/vllm-main/vllm/attention/layer.py", line 97, in forward
    return self.impl.forward(query,
  File "/mnt/data/zhuhr/vllm-main/vllm/attention/backends/flash_attn.py", line 543, in forward
    output[num_prefill_tokens:] = flash_attn_with_kvcache(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]:[E ProcessGroupNCCL.cpp:1335] [PG 2 Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7ff617f9bdc9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7ff617f4c2d0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7ff6221c2142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6f (0x7ff5b6daf11f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7ff5b6db1dd8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1f3 (0x7ff5b6db9083 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x125 (0x7ff5b6db9ed5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7ff617ab0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7ff623912ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7ff6239a3a04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 2 Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7ff617f9bdc9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7ff617f4c2d0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7ff6221c2142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6f (0x7ff5b6daf11f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7ff5b6db1dd8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1f3 (0x7ff5b6db9083 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x125 (0x7ff5b6db9ed5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7ff617ab0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7ff623912ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7ff6239a3a04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1339 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7ff617f9bdc9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf6f04e (0x7ff5b6de604e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xca016a (0x7ff5b6b1716a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7ff617ab0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7ff623912ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7ff6239a3a04 in /lib/x86_64-linux-gnu/libc.so.6)

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted

This seems to happen only when batch_size=128; the error is raised in the flash_attn backend at output[num_prefill_tokens:] = flash_attn_with_kvcache(...). I haven't figured out the cause yet.
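
One possible next step (not from the original report, just a common debugging approach): re-run with synchronous CUDA launches so the illegal access is reported at the offending kernel rather than at a later synchronization point. A minimal sketch, assuming the same test_llm.py entry point:

```python
# Hypothetical debugging setup: these environment variables must be set
# before torch/vllm are imported (or exported in the shell before launch).
import os

# Force synchronous kernel launches so the illegal memory access surfaces
# at the faulting call instead of a later sync point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Optionally try a different attention backend to check whether the crash
# is specific to the flash_attn path (assumption: this variable is honored
# by the installed vLLM version).
# os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

# ...then run the same reproduction as in test_llm.py above, e.g.
# python test_llm.py -n 128 -max_tokens 256 -tp_size 4 -model {YOUR_PATH}
```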

chenchunhui97 commented 3 months ago

mark

wangye360 commented 3 months ago

+1, same problem, hope it gets fixed

github-actions[bot] commented 19 hours ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!