vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: DeepSeek-V2 FP8 CUDA graph capture error on L40 GPUs; runs properly with `--enforce-eager` #8494

Closed (fengyang95 closed this 1 day ago)

fengyang95 commented 3 days ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Debian GNU/Linux 11 (bullseye) (x86_64) GCC version: (Debian 10.2.1-6) 10.2.1 20210110 Clang version: Could not collect CMake version: version 3.30.3 Libc version: glibc-2.31 Python version: 3.11.2 (main, Jul 23 2024, 17:09:09) [GCC 10.2.1 20210110] (64-bit runtime) Python platform: Linux-5.4.143.bsk.8-amd64-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA L40 GPU 1: NVIDIA L40 GPU 2: NVIDIA L40 GPU 3: NVIDIA L40 GPU 4: NVIDIA L40 GPU 5: NVIDIA L40 GPU 6: NVIDIA L40 GPU 7: NVIDIA L40 Nvidia driver version: Could not collect cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 52 bits physical, 57 bits virtual CPU(s): 180 On-line CPU(s) list: 0-179 Thread(s) per core: 2 Core(s) per socket: 45 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8457C Stepping: 8 CPU MHz: 2599.583 BogoMIPS: 5199.16 Hypervisor vendor: KVM Virtualization type: full L1d cache: 4.2 MiB L1i cache: 2.8 MiB L2 cache: 180 MiB L3 cache: 195 MiB NUMA node0 CPU(s): 0-89 NUMA node1 CPU(s): 90-179 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; TSX disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear arch_capabilities Versions of relevant libraries: [pip3] byted-torch==2.4.0.post1 [pip3] flashinfer==0.1.6+cu124torch2.4 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] 
nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.68 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchao==0.5.0 [pip3] torchaudio==2.4.0+cu124 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] triton==3.0.0 [pip3] zmq==0.0.0 [conda] Could not collect ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.1.post2@3724d5f6b59d9859e5b47c047535bb8edc124eab vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NODE NODE NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU1 NODE X NODE NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU2 NODE NODE X NODE SYS SYS SYS SYS SYS 2-89 0 N/A GPU3 NODE NODE NODE X SYS SYS SYS SYS SYS 2-89 0 N/A GPU4 SYS SYS SYS SYS X NODE NODE NODE SYS 92-177 1 N/A GPU5 SYS SYS SYS SYS NODE X NODE NODE SYS 92-177 1 N/A GPU6 SYS SYS SYS SYS NODE NODE X NODE SYS 92-177 1 N/A GPU7 SYS SYS SYS SYS NODE NODE NODE X SYS 92-177 1 N/A NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks NIC Legend: NIC0: mlx5_0 ```

Model Input Dumps

```shell
python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 8192 --port $PORT0 --gpu-memory-utilization 0.99 --kv-cache-dtype fp8 >> deepseek_v2.log 2>&1
```

🐛 Describe the bug

```text
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1679, in capture
    output_hidden_or_intermediate_states = self.model(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
    final_hidden_states = self.experts(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
    final_hidden_states = self.quant_method.apply(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 496, in apply
    return fused_experts(x,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 553, in fused_experts
    invoke_fused_moe_kernel(curr_hidden_states,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 258, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: operation failed due to a previous error during capture
```


fengyang95 commented 3 days ago

To adapt to the L40, I modified https://github.com/vllm-project/vllm/blob/main/csrc/moe_align_block_size_kernels.cu to use global memory allocated with cudaMalloc instead of shared memory. Could this modification have affected CUDA graph capture? (See also the note after the snippet below.)

```cpp
void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
                          int64_t block_size, torch::Tensor sorted_token_ids,
                          torch::Tensor experts_ids,
                          torch::Tensor num_tokens_post_pad) {
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  VLLM_DISPATCH_INTEGRAL_TYPES(
      topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] {
        // calc needed amount of shared mem for `tokens_cnts` and `cumsum`
        // tensors
        // const int32_t shared_mem =
        //     ((num_experts + 1) * num_experts + (num_experts + 1)) *
        //     sizeof(int32_t);

        const int32_t mem_tokens_cnts =
            ((num_experts + 1) * num_experts) * sizeof(int32_t);
        const int32_t mem_cumsum = (num_experts + 1) * sizeof(int32_t);

        // allocate global memory instead of dynamic shared memory
        int32_t* tokens_cnts;
        int32_t* cumsum;
        cudaMalloc(&tokens_cnts, mem_tokens_cnts);
        cudaMalloc(&cumsum, mem_cumsum);

        // set dynamic shared mem
        auto kernel = vllm::moe_align_block_size_kernel<scalar_t>;
        // AT_CUDA_CHECK(VLLM_DevFuncAttribute_SET_MaxDynamicSharedMemorySize(
        //     (void*)kernel, shared_mem));
        kernel<<<1, num_experts, 0, stream>>>(
            topk_ids.data_ptr<scalar_t>(), sorted_token_ids.data_ptr<int32_t>(),
            experts_ids.data_ptr<int32_t>(),
            num_tokens_post_pad.data_ptr<int32_t>(), num_experts, block_size,
            topk_ids.numel(), tokens_cnts, cumsum);

        // free global memory
        cudaFree(tokens_cnts);
        cudaFree(cumsum);
      });
}
```
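On the CUDA graph question: raw cudaMalloc/cudaFree are among the calls that are not safe while a stream is under graph capture (cudaFree in particular implies a device synchronization), so doing them inside an op that runs during vLLM's capture phase would plausibly invalidate the capture and later surface as the Triton "operation failed due to a previous error during capture". A minimal standalone sketch of that restriction (not vLLM code; exact error codes vary by driver version and capture mode):

```cpp
// Standalone sketch (not vLLM code): calling the raw CUDA allocator while a
// stream is under graph capture is a capture-unsafe operation; cudaFree in
// particular implies a device synchronization, which is disallowed during
// capture. Exact error codes depend on driver version and capture mode.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int* buf) { buf[threadIdx.x] = threadIdx.x; }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

  int* scratch = nullptr;
  cudaError_t e_malloc = cudaMalloc(&scratch, 32 * sizeof(int));  // capture-unsafe
  touch<<<1, 32, 0, stream>>>(scratch);                           // recorded into the graph
  cudaError_t e_free = cudaFree(scratch);                         // capture-unsafe (implicit sync)

  cudaError_t e_end = cudaStreamEndCapture(stream, &graph);       // capture is typically invalidated
  printf("cudaMalloc: %s\ncudaFree: %s\nEndCapture: %s\n",
         cudaGetErrorString(e_malloc), cudaGetErrorString(e_free),
         cudaGetErrorString(e_end));
  return 0;
}
```

If you want to keep the global-memory workaround, a capture-friendly direction (my assumption, not necessarily what vLLM ships) would be to pre-allocate the scratch buffers outside the captured region, for example as tensors created through PyTorch's caching allocator, and pass them into the kernel rather than calling cudaMalloc/cudaFree on every invocation.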
mgoin commented 2 days ago

Thanks for reporting. I was able to reproduce on an L40 system with `vllm==0.6.1.post2` using `CUDA_LAUNCH_BLOCKING=1 vllm serve neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tensor-parallel-size 8`.

I'm not sure what the cause is yet, but it is clearly an issue with `moe_align_block_size`. It fails with and without CUDA graphs for me. We will continue to look into it.
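One quick way to sanity-check the shared-memory angle is to compare the dynamic shared memory the original kernel would request against the device's opt-in per-block limit. A standalone sketch under my own assumptions (not vLLM code; it assumes DeepSeek-V2's 160 routed experts):

```cpp
// Compare moe_align_block_size's dynamic shared-memory request (as computed in
// the original kernel) with the device's opt-in per-block limit.
// Assumption: num_experts = 160 (DeepSeek-V2 routed experts).
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device = 0;
  cudaSetDevice(device);

  int max_optin_smem = 0;
  cudaDeviceGetAttribute(&max_optin_smem,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  const int num_experts = 160;
  const size_t shared_mem =
      ((num_experts + 1) * num_experts + (num_experts + 1)) * sizeof(int32_t);

  printf("kernel requests %zu bytes of dynamic shared memory\n", shared_mem);
  printf("device allows   %d bytes per block (opt-in)\n", max_optin_smem);
  return 0;
}
```

On an sm_89 part like the L40, the opt-in limit should come back at roughly 99 KiB (101,376 bytes), which is below the roughly 103,684 bytes the kernel asks for at 160 experts.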

mgoin commented 1 day ago

@fengyang95 how was this resolved? Did a fix land?

fengyang95 commented 4 hours ago

> @fengyang95 how was this resolved? Did a fix land?

@mgoin To work around the insufficient shared memory on the L40, I changed the data type of `tokens_cnts` to uint16, and then it worked.
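For anyone hitting this with other expert counts, here is a rough sketch of the arithmetic behind that workaround (my own numbers and assumptions: DeepSeek-V2's 160 routed experts and roughly 99 KiB of opt-in dynamic shared memory per block on the L40 / sm_89):

```cpp
// Back-of-the-envelope shared-memory sizing (standalone sketch, not vLLM code).
// Assumptions: 160 routed experts (DeepSeek-V2) and ~99 KiB (101376 B) opt-in
// dynamic shared memory per block on the L40 (sm_89).
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t num_experts = 160;
  const int64_t cnt_elems = (num_experts + 1) * num_experts;  // tokens_cnts
  const int64_t cum_elems = num_experts + 1;                  // cumsum

  const int64_t all_int32 =
      (cnt_elems + cum_elems) * sizeof(int32_t);              // original layout
  const int64_t cnts_uint16 =
      cnt_elems * sizeof(uint16_t) + cum_elems * sizeof(int32_t);

  printf("int32 tokens_cnts : %lld bytes\n", (long long)all_int32);    // ~103684 B, over the limit
  printf("uint16 tokens_cnts: %lld bytes\n", (long long)cnts_uint16);  // ~52164 B, fits comfortably
  return 0;
}
```

That is consistent with 16-bit counters bringing the kernel's shared-memory footprint back under the L40's per-block limit.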