vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache #5152

Open khluu opened 3 months ago

khluu commented 3 months ago

My environment setup

1st environment (running on ec2 g6.4xlarge)

[2024-06-01T10:14:23Z] Collecting environment information...
[2024-06-01T10:14:26Z] PyTorch version: 2.3.0+cu121
[2024-06-01T10:14:26Z] Is debug build: False
[2024-06-01T10:14:26Z] CUDA used to build PyTorch: 12.1
[2024-06-01T10:14:26Z] ROCM used to build PyTorch: N/A
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] OS: Ubuntu 22.04.4 LTS (x86_64)
[2024-06-01T10:14:26Z] GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
[2024-06-01T10:14:26Z] Clang version: Could not collect
[2024-06-01T10:14:26Z] CMake version: version 3.29.3
[2024-06-01T10:14:26Z] Libc version: glibc-2.35
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
[2024-06-01T10:14:26Z] Python platform: Linux-6.1.90-99.173.amzn2023.x86_64-x86_64-with-glibc2.35
[2024-06-01T10:14:26Z] Is CUDA available: True
[2024-06-01T10:14:26Z] CUDA runtime version: Could not collect
[2024-06-01T10:14:26Z] CUDA_MODULE_LOADING set to: LAZY
[2024-06-01T10:14:26Z] GPU models and configuration: GPU 0: NVIDIA L4
[2024-06-01T10:14:26Z] Nvidia driver version: 525.147.05
[2024-06-01T10:14:26Z] cuDNN version: Could not collect
[2024-06-01T10:14:26Z] HIP runtime version: N/A
[2024-06-01T10:14:26Z] MIOpen runtime version: N/A
[2024-06-01T10:14:26Z] Is XNNPACK available: True
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] CPU:
[2024-06-01T10:14:26Z] Architecture:                         x86_64
[2024-06-01T10:14:26Z] CPU op-mode(s):                       32-bit, 64-bit
[2024-06-01T10:14:26Z] Address sizes:                        48 bits physical, 48 bits virtual
[2024-06-01T10:14:26Z] Byte Order:                           Little Endian
[2024-06-01T10:14:26Z] CPU(s):                               16
[2024-06-01T10:14:26Z] On-line CPU(s) list:                  0-15
[2024-06-01T10:14:26Z] Vendor ID:                            AuthenticAMD
[2024-06-01T10:14:26Z] Model name:                           AMD EPYC 7R13 Processor
[2024-06-01T10:14:26Z] CPU family:                           25
[2024-06-01T10:14:26Z] Model:                                1
[2024-06-01T10:14:26Z] Thread(s) per core:                   2
[2024-06-01T10:14:26Z] Core(s) per socket:                   8
[2024-06-01T10:14:26Z] Socket(s):                            1
[2024-06-01T10:14:26Z] Stepping:                             1
[2024-06-01T10:14:26Z] BogoMIPS:                             5299.99
[2024-06-01T10:14:26Z] Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
[2024-06-01T10:14:26Z] Hypervisor vendor:                    KVM
[2024-06-01T10:14:26Z] Virtualization type:                  full
[2024-06-01T10:14:26Z] L1d cache:                            256 KiB (8 instances)
[2024-06-01T10:14:26Z] L1i cache:                            256 KiB (8 instances)
[2024-06-01T10:14:26Z] L2 cache:                             4 MiB (8 instances)
[2024-06-01T10:14:26Z] L3 cache:                             32 MiB (1 instance)
[2024-06-01T10:14:26Z] NUMA node(s):                         1
[2024-06-01T10:14:26Z] NUMA node0 CPU(s):                    0-15
[2024-06-01T10:14:26Z] Vulnerability Gather data sampling:   Not affected
[2024-06-01T10:14:26Z] Vulnerability Itlb multihit:          Not affected
[2024-06-01T10:14:26Z] Vulnerability L1tf:                   Not affected
[2024-06-01T10:14:26Z] Vulnerability Mds:                    Not affected
[2024-06-01T10:14:26Z] Vulnerability Meltdown:               Not affected
[2024-06-01T10:14:26Z] Vulnerability Mmio stale data:        Not affected
[2024-06-01T10:14:26Z] Vulnerability Reg file data sampling: Not affected
[2024-06-01T10:14:26Z] Vulnerability Retbleed:               Not affected
[2024-06-01T10:14:26Z] Vulnerability Spec rstack overflow:   Mitigation; safe RET, no microcode
[2024-06-01T10:14:26Z] Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
[2024-06-01T10:14:26Z] Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
[2024-06-01T10:14:26Z] Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
[2024-06-01T10:14:26Z] Vulnerability Srbds:                  Not affected
[2024-06-01T10:14:26Z] Vulnerability Tsx async abort:        Not affected
[2024-06-01T10:14:26Z]
[2024-06-01T10:14:26Z] Versions of relevant libraries:
[2024-06-01T10:14:26Z] [pip3] mypy==1.9.0
[2024-06-01T10:14:26Z] [pip3] mypy-extensions==1.0.0
[2024-06-01T10:14:26Z] [pip3] numpy==1.26.4
[2024-06-01T10:14:26Z] [pip3] nvidia-nccl-cu12==2.20.5
[2024-06-01T10:14:26Z] [pip3] torch==2.3.0
[2024-06-01T10:14:26Z] [pip3] triton==2.3.0
[2024-06-01T10:14:26Z] [conda] Could not collect
[2024-06-01T10:14:26Z] ROCM Version: Could not collect
[2024-06-01T10:14:26Z] Neuron SDK Version: N/A
[2024-06-01T10:14:26Z] vLLM Version: 0.4.3
[2024-06-01T10:14:26Z] vLLM Build Flags:
[2024-06-01T10:14:26Z] CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
[2024-06-01T10:14:26Z] GPU Topology:
[2024-06-01T10:14:26Z] GPU0 CPU Affinity    NUMA Affinity
[2024-06-01T10:14:26Z] GPU0  X  0-15        N/A

2nd environment (running on GCP g2-standard-12):

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-29-cloud-amd64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             4400.45
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            192 KiB (6 instances)
L1i cache:                            192 KiB (6 instances)
L2 cache:                             6 MiB (6 instances)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-11
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-11    0               N/A

🐛 Describe the bug

DeJoker commented 2 months ago

The same problem happens to me. Is a fix for this bug in progress?

khluu commented 2 months ago

@DeJoker do you also see it in unit tests or elsewhere? How are you running it?

khluu commented 2 months ago

This issue in the spec decoding tests should already be fixed.

DeJoker commented 2 months ago

@khluu I don't have a minimal demo that reproduces the problem right now, just the same failure in flash_attn_cuda.fwd_kvcache. The setup is vLLM running inside NVIDIA Triton Server (nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3), with requests sent directly via a gRPC client.

My environment setup:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.134-13.1.al8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 530.30.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-63
NUMA node1 CPU(s):               64-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Error message:

INFO 06-18 08:04:36 metrics.py:341] Avg prompt throughput: 17673.7 tokens/s, Avg generation throughput: 204.0 tokens/s, Running: 233 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.8%, CPU KV cache usage: 0.0%.
INFO 06-18 08:04:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 313.0 tokens/s, Running: 190 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.4%, CPU KV cache usage: 0.0%.
ERROR 06-18 08:04:44 async_llm_engine.py:52] Engine background task failed
ERROR 06-18 08:04:44 async_llm_engine.py:52] Traceback (most recent call last):
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return_value = task.result()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
ERROR 06-18 08:04:44 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return fut.result()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
ERROR 06-18 08:04:44 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-18 08:04:44 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return func(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = model_executable(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states, residual = layer(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     hidden_states = self.self_attn(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 355, in forward
ERROR 06-18 08:04:44 async_llm_engine.py:52]     output[num_prefill_tokens:] = flash_attn_with_kvcache(
ERROR 06-18 08:04:44 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1233, in flash_attn_with_kvcache
ERROR 06-18 08:04:44 async_llm_engine.py:52]     out, softmax_lse = flash_attn_cuda.fwd_kvcache(
ERROR 06-18 08:04:44 async_llm_engine.py:52] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 06-18 08:04:44 async_llm_engine.py:52] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-18 08:04:44 async_llm_engine.py:52] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 06-18 08:04:44 async_llm_engine.py:52] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-18 08:04:44 async_llm_engine.py:52] 
Exception in callback _log_task_completion(error_callback=<bound method...7eff2e47e500>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32
handle: <Handle _log_task_completion(error_callback=<bound method...7eff2e47e500>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:32>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 355, in forward
    output[num_prefill_tokens:] = flash_attn_with_kvcache(
  File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1233, in flash_attn_with_kvcache
    out, softmax_lse = flash_attn_cuda.fwd_kvcache(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
I0618 08:04:44.709818 1084 model.py:368] "[vllm] Error generating stream: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"
I0618 08:04:44.710252 1084 model.py:368] "[vllm] Error generating stream: CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"
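As the traceback itself suggests, CUDA errors are reported asynchronously, so the stack line above may not be the real faulting call. Re-running with synchronous kernel launches usually pins the trace to the kernel that actually faulted. A minimal sketch; the tritonserver invocation and model path are placeholders, substitute your own launch command:

```shell
# Force synchronous CUDA kernel launches so the Python stack trace points
# at the call that actually triggered the illegal memory access.
export CUDA_LAUNCH_BLOCKING=1

# Note: TORCH_USE_CUDA_DSA is a *build-time* flag, not a runtime env var;
# device-side assertions require a PyTorch build compiled with it enabled.

tritonserver --model-repository=/models   # placeholder launch command
```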
rain7996 commented 3 weeks ago

I get the same error. When I set max_num_seqs=20 the error appears; with max_num_seqs=18 everything works. It seems like some kind of memory overflow? BTW, my GPU is an H20, and the same code runs fine on my H800 machine.
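For anyone else trying the workaround above, the concurrency cap can be passed as an engine argument when serving. A hedged sketch, assuming the OpenAI-compatible server entrypoint; the model name is a placeholder and 18 is specific to the reporter's H20 setup:

```shell
# Cap the number of concurrently scheduled sequences; lowering this from
# 20 to 18 avoided the illegal memory access on the reporter's H20.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-7B-Instruct \
    --max-num-seqs 18
```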