vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: FusedMoE kernel performance depends on input prompt length while decoding #10313

Open taegeonum opened 5 days ago

taegeonum commented 5 days ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.4
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.77
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post2.dev338+gf0f2e563
vLLM Build Flags: CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
CUBLAS_VERSION=12.6.3.3
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
CUDA_VERSION=12.6.2.004
PYTORCH_VERSION=2.5.0a0+e000cf0
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.7.0
CUDNN_VERSION=9.5.0.50
PYTORCH_HOME=/opt/pytorch/pytorch
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+e000cf0
CUDA_MODULE_LOADING=LAZY
NVIDIA_PYTORCH_VERSION=24.10
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
```

Model Input Dumps

No response

🐛 Describe the bug

Environment

Description

How to resolve

Bug found with @Byeong-Chan
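Since the detailed description above is not preserved here, the following is a minimal, generic timing harness one might use to check the reported symptom (per-step decode latency varying with the original prompt length). The names `measure_decode_latency` and `fake_decode_step` are hypothetical, and the stub workload merely stands in for a single-token forward pass through the fused-MoE layer; in a real reproduction you would substitute an actual vLLM decode step:

```python
import time
from statistics import median

def measure_decode_latency(decode_step, prompt_lens, steps=32):
    """Time `decode_step(prompt_len)` repeatedly and return the median
    per-step latency (seconds) keyed by prompt length."""
    results = {}
    for plen in prompt_lens:
        decode_step(plen)  # warm-up call (e.g. to absorb one-time Triton autotuning)
        samples = []
        for _ in range(steps):
            t0 = time.perf_counter()
            decode_step(plen)
            samples.append(time.perf_counter() - t0)
        results[plen] = median(samples)
    return results

# Stub standing in for one decode step; a real reproduction would run the
# model's single-token forward pass after a prompt of `prompt_len` tokens.
def fake_decode_step(prompt_len):
    sum(range(prompt_len))

latencies = measure_decode_latency(fake_decode_step, [128, 1024, 8192])
for plen, lat in latencies.items():
    print(f"prompt_len={plen}: {lat * 1e6:.1f} us/step")
```

If decoding were independent of prompt length, the measured per-step latencies should be roughly flat across prompt lengths; a clear upward trend would match the behavior this issue reports.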


taegeonum commented 4 days ago

@charlifu Do you have any guesses?