The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post2.dev338+gf0f2e563
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
CUBLAS_VERSION=12.6.3.3
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
CUDA_VERSION=12.6.2.004
PYTORCH_VERSION=2.5.0a0+e000cf0
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.7.0
CUDNN_VERSION=9.5.0.50
PYTORCH_HOME=/opt/pytorch/pytorch
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+e000cf0
CUDA_MODULE_LOADING=LAZY
NVIDIA_PYTORCH_VERSION=24.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
```
Model Input Dumps
No response
🐛 Describe the bug
Environment
H100
Description
During decoding, the FusedMoE kernel should not depend on the input prompt length: it only operates on the hidden states of the current decode tokens, unlike attention, which reads the KV cache. However, the decoding throughput (output tokens/sec) of the Mixtral model changes significantly with input prompt length, degrading by around 50% when the input length is doubled.
To verify this, we commented out the attention code in the Mixtral model (the only part whose cost depends on the input/output token lengths) and confirmed that the decoding speed still degrades as the input prompt length increases, even with attention disabled.
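For reference, a rough sketch of the kind of measurement described above is shown below. The model name, tensor-parallel size, prompt construction, batch size, and token counts are illustrative assumptions rather than our exact setup, and the timing includes the single prefill pass (which is small relative to 256 decode steps).

```python
# Rough throughput comparison: same batch and same number of generated tokens,
# only the prompt length changes. Assumes vLLM's offline LLM API.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.0, max_tokens=256, ignore_eos=True)

def output_tokens_per_sec(prompt_len: int, batch_size: int = 32) -> float:
    # Synthetic prompts of roughly `prompt_len` tokens, built by repeating a word.
    prompts = [("hello " * prompt_len).strip()] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed

for n in (512, 1024):
    print(f"prompt ~{n} tokens: {output_tokens_per_sec(n):.1f} output tok/s")
```

If the MoE path were truly length-independent during decode, the two numbers should be close once prefill is amortized; on the affected version we instead see the ~50% drop described above.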
How to resolve
My guess is that there is a bug in the fused MoE kernel, so I looked through its commit history. There is a commit that improves fused MoE performance (https://github.com/vllm-project/vllm/pull/9384), but I'm not sure whether it is the root cause. I simply rolled back from the latest version to the 0.6.3.post1 release.
In the 0.6.3.post1 release, the bug disappeared: the decoding speed of Mixtral without attention no longer depends on the input prompt length.
Bug found with @Byeong-Chan
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.