File "/home/ray/default/vllm/./benchmarks/kernels/benchmark_moe.py", line 70, in run
fused_moe(
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 519, in fused_moe
return fused_experts(hidden_states,
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 449, in fused_experts
invoke_fused_moe_kernel(intermediate_cache2,
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 255, in __getattribute__
self._init_handles()
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 250, in _init_handles
self.module, self.function, self.n_regs, self.n_spills = driver.utils.load_binary(
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
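The exact benchmark invocation isn't shown above, so as a stand-in, here is a minimal sketch of the kind of direct call that exercises this path. This is an assumption-laden sketch, not the actual repro: the `fused_moe` signature follows `vllm/model_executor/layers/fused_moe/fused_moe.py`, while the Mixtral-like shapes and the token count needed to trigger the crash are illustrative guesses.

```python
# Hypothetical repro sketch: drive vLLM's fused MoE kernel directly with a
# very large number of tokens. Shapes, dtypes, and the token count are
# illustrative assumptions, not the exact benchmark settings.
import torch

from vllm.model_executor.layers.fused_moe import fused_moe

num_tokens = 65536        # the "batch size" that is too large (assumed value)
hidden_size = 4096        # Mixtral-like dimensions
intermediate_size = 14336
num_experts = 8
topk = 2

hidden_states = torch.randn(num_tokens, hidden_size,
                            dtype=torch.float16, device="cuda")
# w1 fuses the gate and up projections, hence 2 * intermediate_size rows.
w1 = torch.randn(num_experts, 2 * intermediate_size, hidden_size,
                 dtype=torch.float16, device="cuda")
w2 = torch.randn(num_experts, hidden_size, intermediate_size,
                 dtype=torch.float16, device="cuda")
# Router logits, one score per (token, expert) pair.
gating_output = torch.randn(num_tokens, num_experts,
                            dtype=torch.float32, device="cuda")

out = fused_moe(hidden_states, w1, w2, gating_output,
                topk=topk, renormalize=True)
torch.cuda.synchronize()  # the illegal access usually surfaces at a sync point
```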
We have seen this problem on L4 and A100 GPUs. I also tried tuning this particular workload with different block sizes (see the sketch below), but none of the configurations got past the error. Since we usually don't use such a large batch size (number of tokens), this bug should not be critical, at least for now.
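For reference, the tuning mentioned above amounts to forcing specific Triton tile parameters. A sketch, reusing the tensors from the repro sketch and assuming `fused_moe` accepts an `override_config` dict with the same keys as vLLM's fused-MoE config JSON files (the values below are illustrative):

```python
# Hypothetical tuning sketch: force one specific kernel config instead of the
# shipped JSON configs. Keys mirror vLLM's fused-MoE config files; the values
# are illustrative, and none of the combinations tried avoided the error.
config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 4,
}
out = fused_moe(hidden_states, w1, w2, gating_output,
                topk=topk, renormalize=True,
                override_config=config)
```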
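One more note for whoever picks this up: CUDA reports illegal accesses asynchronously, which is why the traceback above ends in Triton's `load_binary` rather than at the faulting launch. Serializing launches (a standard CUDA debugging step, not vLLM-specific) makes the error surface at the offending kernel:

```python
# Debugging aid: serialize CUDA kernel launches so the error is raised at the
# faulting launch instead of a later call. Must be set before CUDA initializes.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```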
Also cc @pcmoritz @WoosukKwon @Yard1