[Bug] Abnormal output when CUDA graph is enabled for Qwen2.5-Coder

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
[X] 5. Please use English, otherwise it will be closed.

Describe the bug

I'm using sglang to launch an OpenAI-compatible server with Qwen2.5-Coder-7B-Instruct, one of the best code models. I then use the OpenAI library to send concurrent requests for data generation. However, I noticed abnormal outputs in the generated results: dozens of garbled responses out of thousands of requests, while the rest look fine. They resemble the following (the user content is what I sent; the assistant content is what sglang returned):

[
    {
        "role": "user",
        "content": "Develop a class in Python that accepts three arguments, \"username\", \"birthdate\", and \"email\"."
    },
    {
        "role": "assistant",
        "content": "Here can create to and"
    }
],
[
    {
        "role": "user",
        "content": "Create a Node.js module for validating email addresses."
    },
    {
        "role": "assistant",
        "content": "Here static1\n2\n\n"
    }
]

Initially, I thought the issue was with the tokenizer, but after careful examination, I found nothing unusual. I then tested Llama 3.1 and DeepSeek-Coder on the same setup, and their outputs were correct, which ruled out an environment issue.

Finally, using --disable-cuda-graph resolved the issue. In large-batch scenarios, disabling CUDA graph has minimal impact, but given that it’s enabled by default, I hope it can be fixed.

Reproduction

This command produces the abnormal output:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-Coder-7B-Instruct --port 10086 --schedule-conservativeness 0.3 --mem-fraction-static 0.8

This command works correctly:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-Coder-7B-Instruct --port 10086 --schedule-conservativeness 0.3 --mem-fraction-static 0.8 --disable-cuda-graph

I use LLMs to detect unusual output, so providing a quick repro script is a bit complicated. If this is confirmed as a bug, I will try to provide one.
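In the meantime, a rough sketch of such a script might look like the following. It is not the script that produced the outputs above. It assumes the server from the launch command is listening on port 10086 and exposes the OpenAI-compatible /v1/chat/completions endpoint, and it uses only the standard library instead of the OpenAI client. The names build_payload, looks_suspicious, ask, and run are hypothetical, as are the length threshold, the sampling settings, and the concurrency level. The length check is a crude stand-in for the LLM-based detection, based only on the fact that the bad replies shown above are very short.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: the port matches the launch command above; the threshold and
# sampling settings below are illustrative guesses, not values from the report.
BASE_URL = "http://localhost:10086/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"
MIN_CHARS = 40  # hypothetical cutoff: a real code answer should be longer


def build_payload(prompt, model=MODEL):
    """Build an OpenAI-style chat completion request body for one prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # assumed, to make runs easier to compare
        "max_tokens": 512,  # assumed
    }


def looks_suspicious(text, min_chars=MIN_CHARS):
    """Crude heuristic: flag replies that are far too short for a coding task."""
    return len(text.strip()) < min_chars


def ask(prompt):
    """Send one prompt to the server and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def run(prompts, concurrency=64):
    """Send all prompts concurrently and return the (prompt, reply) pairs flagged as suspicious."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        replies = list(pool.map(ask, prompts))
    return [(p, r) for p, r in zip(prompts, replies) if looks_suspicious(r)]
```

Calling run() with a few thousand coding prompts (for example, the two user messages above repeated many times) against each of the two launch commands should show whether the flagged replies only appear when CUDA graph is enabled.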

Environment

Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.104.12
PyTorch: 2.4.0+cu121
sglang: 0.3.4.post1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.45.2
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.6.3.post1
multipart: 0.0.9
openai: 1.52.1
anthropic: 0.31.2

NVIDIA Topology:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     PIX   PXB   PXB   SYS   SYS   SYS   SYS   PXB   0-15,32-47    0              N/A
GPU1    PIX   X     PXB   PXB   SYS   SYS   SYS   SYS   PXB   0-15,32-47    0              N/A
GPU2    PXB   PXB   X     PXB   SYS   SYS   SYS   SYS   PXB   0-15,32-47    0              N/A
GPU3    PXB   PXB   PXB   X     SYS   SYS   SYS   SYS   PIX   0-15,32-47    0              N/A
GPU4    SYS   SYS   SYS   SYS   X     PIX   PXB   PXB   SYS   16-31,48-63   1              N/A
GPU5    SYS   SYS   SYS   SYS   PIX   X     PXB   PXB   SYS   16-31,48-63   1              N/A
GPU6    SYS   SYS   SYS   SYS   PXB   PXB   X     PXB   SYS   16-31,48-63   1              N/A
GPU7    SYS   SYS   SYS   SYS   PXB   PXB   PXB   X     SYS   16-31,48-63   1              N/A
NIC0    PXB   PXB   PXB   PIX   SYS   SYS   SYS   SYS   X

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

ulimit soft: 1048576
