vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: DeepSeek-V2 FP8 CUDA graph capture error on L40 GPUs; runs properly with `--enforce-eager` #8494

Closed (fengyang95 closed this 1 day ago)

fengyang95 commented 3 days ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Debian GNU/Linux 11 (bullseye) (x86_64) GCC version: (Debian 10.2.1-6) 10.2.1 20210110 Clang version: Could not collect CMake version: version 3.30.3 Libc version: glibc-2.31 Python version: 3.11.2 (main, Jul 23 2024, 17:09:09) [GCC 10.2.1 20210110] (64-bit runtime) Python platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA L40 GPU 1: NVIDIA L40 GPU 2: NVIDIA L40 GPU 3: NVIDIA L40 GPU 4: NVIDIA L40 GPU 5: NVIDIA L40 GPU 6: NVIDIA L40 GPU 7: NVIDIA L40 Nvidia driver version: Could not collect cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 52 bits physical, 57 bits virtual CPU(s): 180 On-line CPU(s) list: 0-179 Thread(s) per core: 2 Core(s) per socket: 45 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8457C Stepping: 8 CPU MHz: 2599.583 BogoMIPS: 5199.16 Hypervisor vendor: KVM Virtualization type: full L1d cache: 4.2 MiB L1i cache: 2.8 MiB L2 cache: 180 MiB L3 cache: 195 MiB NUMA node0 CPU(s): 0-89 NUMA node1 CPU(s): 90-179 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear arch_capabilities Versions of relevant libraries: [pip3] byted-torch==2.4.0.post1 [pip3] flashinfer==0.1.6+cu124torch2.4 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] 
nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.68 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchao==0.5.0 [pip3] torchaudio==2.4.0+cu124 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] triton==3.0.0 [pip3] zmq==0.0.0 [conda] Could not collect ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.1.post2@3724d5f6b59d9859e5b47c047535bb8edc124eab vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NODE NODE NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU1 NODE X NODE NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU2 NODE NODE X NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU3 NODE NODE NODE X SYS SYS SYS SYS SYS 2-89 0 N/A GPU4 SYS SYS SYS SYS X NODE NODE NODE SYS 92-177 1 N/A GPU5 SYS SYS SYS SYS NODE X NODE NODE SYS 92-177 1 N/A GPU6 SYS SYS SYS SYS NODE NODE X NODE SYS 92-177 1 N/A GPU7 SYS SYS SYS SYS NODE NODE NODE X SYS 92-177 1 N/A NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks NIC Legend: NIC0: mlx5_0 ```

Model Input Dumps

```shell
python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 8192 --port $PORT0 --gpu-memory-utilization 0.99 --kv-cache-dtype fp8 >> deepseek_v2.log 2>&1
```

🐛 Describe the bug

```text
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1679, in capture
    output_hidden_or_intermediate_states = self.model(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
    final_hidden_states = self.experts(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
    final_hidden_states = self.quant_method.apply(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 496, in apply
    return fused_experts(x,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 553, in fused_experts
    invoke_fused_moe_kernel(curr_hidden_states,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 258, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: operation failed due to a previous error during capture
```


fengyang95 commented 3 days ago

To adapt to the L40, I modified https://github.com/vllm-project/vllm/blob/main/csrc/moe_align_block_size_kernels.cu to use global memory allocated with cudaMalloc instead of shared memory. Could this modification have affected CUDA graph capture? (See also the note after the snippet below.)

```cpp
void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
                          int64_t block_size, torch::Tensor sorted_token_ids,
                          torch::Tensor experts_ids,
                          torch::Tensor num_tokens_post_pad) {
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  VLLM_DISPATCH_INTEGRAL_TYPES(
      topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] {
        // calc needed amount of shared mem for `tokens_cnts` and `cumsum`
        // tensors
        // const int32_t shared_mem =
        //     ((num_experts + 1) * num_experts + (num_experts + 1)) *
        //     sizeof(int32_t);

        const int32_t mem_tokens_cnts =
            ((num_experts + 1) * num_experts) * sizeof(int32_t);
        const int32_t mem_cumsum = (num_experts + 1) * sizeof(int32_t);

        // allocate global memory instead of dynamic shared memory
        int32_t* tokens_cnts;
        int32_t* cumsum;
        cudaMalloc(&tokens_cnts, mem_tokens_cnts);
        cudaMalloc(&cumsum, mem_cumsum);

        // set dynamic shared mem
        auto kernel = vllm::moe_align_block_size_kernel<scalar_t>;
        // AT_CUDA_CHECK(VLLM_DevFuncAttribute_SET_MaxDynamicSharedMemorySize(
        //     (void*)kernel, shared_mem));
        kernel<<<1, num_experts, 0, stream>>>(
            topk_ids.data_ptr<scalar_t>(), sorted_token_ids.data_ptr<int32_t>(),
            experts_ids.data_ptr<int32_t>(),
            num_tokens_post_pad.data_ptr<int32_t>(), num_experts, block_size,
            topk_ids.numel(), tokens_cnts, cumsum);

        // free global memory
        cudaFree(tokens_cnts);
        cudaFree(cumsum);
      });
}
```
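On the CUDA graph question: raw cudaMalloc/cudaFree are among the calls that are not safe while a stream is under graph capture (cudaFree in particular implies a device synchronization), so doing them inside an op that runs during vLLM's capture phase would plausibly invalidate the capture and later surface as the Triton "operation failed due to a previous error during capture". A minimal standalone sketch of that restriction (not vLLM code; exact error codes vary by driver version and capture mode):

```cpp
// Standalone sketch (not vLLM code): calling the raw CUDA allocator while a
// stream is under graph capture is a capture-unsafe operation; cudaFree in
// particular implies a device synchronization, which is disallowed during
// capture. Exact error codes depend on driver version and capture mode.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int* buf) { buf[threadIdx.x] = threadIdx.x; }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

  int* scratch = nullptr;
  cudaError_t e_malloc = cudaMalloc(&scratch, 32 * sizeof(int));  // capture-unsafe
  touch<<<1, 32, 0, stream>>>(scratch);                           // recorded into the graph
  cudaError_t e_free = cudaFree(scratch);                         // capture-unsafe (implicit sync)

  cudaError_t e_end = cudaStreamEndCapture(stream, &graph);       // capture is typically invalidated
  printf("cudaMalloc: %s\ncudaFree: %s\nEndCapture: %s\n",
         cudaGetErrorString(e_malloc), cudaGetErrorString(e_free),
         cudaGetErrorString(e_end));
  return 0;
}
```

If you want to keep the global-memory workaround, a capture-friendly direction (my assumption, not necessarily what vLLM ships) would be to pre-allocate the scratch buffers outside the captured region, for example as tensors created through PyTorch's caching allocator, and pass them into the kernel rather than calling cudaMalloc/cudaFree on every invocation.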
mgoin commented 2 days ago

Thanks for reporting. I was able to reproduce on an L40 system with `vllm==0.6.1.post2` using `CUDA_LAUNCH_BLOCKING=1 vllm serve neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tensor-parallel-size 8`.

I'm not sure what the cause is yet, but it is clearly an issue with `moe_align_block_size`. It fails with and without CUDA graphs for me. We will continue to look into it.
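One quick way to sanity-check the shared-memory angle is to compare the dynamic shared memory the original kernel would request against the device's opt-in per-block limit. A standalone sketch under my own assumptions (not vLLM code; it assumes DeepSeek-V2's 160 routed experts):

```cpp
// Compare moe_align_block_size's dynamic shared-memory request (as computed in
// the original kernel) with the device's opt-in per-block limit.
// Assumption: num_experts = 160 (DeepSeek-V2 routed experts).
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device = 0;
  cudaSetDevice(device);

  int max_optin_smem = 0;
  cudaDeviceGetAttribute(&max_optin_smem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  const int num_experts = 160;
  const size_t shared_mem =
      ((num_experts + 1) * num_experts + (num_experts + 1)) * sizeof(int32_t);

  printf("kernel requests %zu bytes of dynamic shared memory\n", shared_mem);
  printf("device allows   %d bytes per block (opt-in)\n", max_optin_smem);
  return 0;
}
```

On an sm_89 part like the L40, the opt-in limit should come back at roughly 99 KiB (101,376 bytes), which is below the roughly 103,684 bytes the kernel asks for at 160 experts.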

mgoin commented 1 day ago

@fengyang95 how was this resolved? Did a fix land?

fengyang95 commented 4 hours ago

> @fengyang95 how was this resolved? Did a fix land?

@mgoin To work around the insufficient shared memory on the L40, I changed the data type of `tokens_cnts` to uint16, and then it worked.
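For anyone hitting this with other expert counts, here is a rough sketch of the arithmetic behind that workaround (my own numbers and assumptions: DeepSeek-V2's 160 routed experts and roughly 99 KiB of opt-in dynamic shared memory per block on the L40 / sm_89):

```cpp
// Back-of-the-envelope shared-memory sizing (standalone sketch, not vLLM code).
// Assumptions: 160 routed experts (DeepSeek-V2) and ~99 KiB (101376 B) opt-in
// dynamic shared memory per block on the L40 (sm_89).
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t num_experts = 160;
  const int64_t cnt_elems = (num_experts + 1) * num_experts;  // tokens_cnts
  const int64_t cum_elems = num_experts + 1;                  // cumsum

  const int64_t all_int32 =
      (cnt_elems + cum_elems) * sizeof(int32_t);              // original layout
  const int64_t cnts_uint16 =
      cnt_elems * sizeof(uint16_t) + cum_elems * sizeof(int32_t);

  printf("int32 tokens_cnts : %lld bytes\n", (long long)all_int32);    // ~103684 B, over the limit
  printf("uint16 tokens_cnts: %lld bytes\n", (long long)cnts_uint16);  // ~52164 B, fits comfortably
  return 0;
}
```

That is consistent with 16-bit counters bringing the kernel's shared-memory footprint back under the L40's per-block limit.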