sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

[Bug] Bad outputs with fp8 quantization at high RPS #1195

Closed: siddhatiwari closed this issue 1 month ago

siddhatiwari commented 2 months ago

Describe the bug

I ran an RPS benchmark script with prompts averaging around 1,600 input tokens and got bad outputs as the RPS increased. For example:

*给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给追给给给给给给给给迫你。

It seems to be related to quantization and concurrent requests. I've listed some commands below with various models, quantizations, and max request limits, noting whether they produced good or bad outputs at high RPS and the observed max #running-req.

Unfortunately I can't share the exact prompts used, but I'll update as I find other reproducible prompts.

Here's a summary:

Reproduction

BAD OUTPUTS @ 5.5 RPS
#running-req: 137

CUDA_VISIBLE_DEVICES=2,3  python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8

-----------------------------------

GOOD OUTPUTS @ 5.5 RPS
#running-req: 8

CUDA_VISIBLE_DEVICES=2,3  python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3.1-70B-Instruct --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10

-----------------------------------

BAD OUTPUTS @ 5.5 RPS
#running-req: 135

CUDA_VISIBLE_DEVICES=2,3  python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8

-----------------------------------

GOOD OUTPUTS @ 5.5 RPS
#running-req: 8

CUDA_VISIBLE_DEVICES=2,3  python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048 --quantization fp8 --max-num-reqs 10

-----------------------------------

GOOD OUTPUTS @ 5.5 RPS
#running-req: 136

CUDA_VISIBLE_DEVICES=2,3  python -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --port 30003 --tp 2 --mem-fraction-static 0.90 --host 0.0.0.0 --context-length 2048
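
For reference, here is a minimal sketch of the kind of RPS benchmark described above, written against the OpenAI-compatible endpoint the server exposes. The prompt, model name, rate, and the garbled-output check are placeholders (the real prompts can't be shared):

```python
# Hypothetical reproduction sketch, not the original benchmark script.
# Fires requests at a fixed rate against the sglang OpenAI-compatible server
# and counts how many completions look garbled.
import asyncio
import openai

client = openai.AsyncOpenAI(base_url="http://localhost:30003/v1", api_key="EMPTY")
PROMPT = "Summarize the following text. " * 250  # rough stand-in for a ~1600-token prompt

async def one_request() -> str:
    resp = await client.completions.create(
        model="NousResearch/Meta-Llama-3.1-70B-Instruct",
        prompt=PROMPT,
        max_tokens=128,
    )
    return resp.choices[0].text

async def main(rps: float = 5.5, duration_s: int = 60) -> None:
    tasks = []
    for _ in range(int(rps * duration_s)):
        tasks.append(asyncio.create_task(one_request()))
        await asyncio.sleep(1.0 / rps)  # pace submissions at ~rps requests/second
    outputs = await asyncio.gather(*tasks)
    # Crude heuristic: flag outputs containing long runs of a repeated character
    bad = sum(any(ch * 5 in out for ch in set(out)) for out in outputs)
    print(f"{bad}/{len(outputs)} outputs look garbled")

asyncio.run(main())
```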

Environment


CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.129.03
PyTorch: 2.4.0+cu121
sglang: 0.2.13
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.6
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.42.0
anthropic: 0.34.1
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS     0-103   0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     0-103   0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     0-103   0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     0-103   0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     104-207 1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     104-207 1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     104-207 1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     104-207 1               N/A
NIC0    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

ulimit soft: 4096
merrymercy commented 2 months ago

It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.

In the meantime, you can help us find the source of this bug by trying the following options (one at a time):

  • --disable-cuda-graph
  • --disable-flashinfer
  • --disable-flashinfer-sampling
  • --chunked-prefill-size -1

sunflower-leaf commented 2 months ago

Hi, I ran into the same problem when I made too many requests at the same time. The output just became 梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦... May I ask if this bug will be fixed any time soon?

Thanks for your excellent library!

zhyncs commented 2 months ago

v0.2.15 fixes some fp8 weight loading bugs. Could you give it a try?

sunflower-leaf commented 2 months ago

Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but got invalid tokens at 100. I haven't tested in more detail, though.

qeternity commented 2 months ago

Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: https://github.com/sgl-project/sglang/commit/47f20da223c62473577231cec49dedb86c56220f

sunflower-leaf commented 2 months ago

Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: 47f20da

No, I'm not using any constrained generation, just simple text input and output. Currently I work around this problem by submitting only 50 requests at a time, since sglang is still much faster than the other libraries I've tried.
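
For reference, a minimal sketch of that kind of client-side cap, assuming the OpenAI-compatible endpoint (the URL, model name, and limit below are placeholders):

```python
# Hypothetical workaround sketch: cap in-flight requests at 50 so the server
# never sees more than 50 concurrent generations from this client.
import asyncio
import openai

client = openai.AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
limit = asyncio.Semaphore(50)

async def generate(prompt: str) -> str:
    async with limit:  # waits here while 50 requests are already in flight
        resp = await client.chat.completions.create(
            model="default",  # placeholder; use the served model name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
    return resp.choices[0].message.content

async def run_all(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(p) for p in prompts))

# Example: results = asyncio.run(run_all(["Hello"] * 200))
```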

qeternity commented 1 month ago

Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but got invalid tokens at 100. I haven't tested in more detail, though.

I have observed this as well on A100s at TP4 with AWQ, GPTQ, and w8a8.

qeternity commented 1 month ago

This is perhaps related: https://github.com/vllm-project/vllm/issues/7228

user-0a commented 1 month ago

It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.

In the meantime, you can help us find the source of this bug by trying the following options (one at a time):

  • --disable-cuda-graph
  • --disable-flashinfer
  • --disable-flashinfer-sampling
  • --chunked-prefill-size -1

It happens even with those options disabled (tested on 2x H100 with Llama 2 70B fp8).

fengyang95 commented 1 month ago

Has this issue been resolved? I sometimes encounter it too (deepseek-v2.5-fp8). I didn't encounter this issue at commit 2abe4f1cb6e9b4d36c332b0fb04c0dec2ad38bc6, but it appeared in the latest commit (8f527e29409f714f9de839ece1e7aace15d9b27a). @zhyncs @merrymercy This looks like a rather serious bug.

user-0a commented 1 month ago

I'm still waiting on this as well. Please let me know if I can be of any help in the meantime; I can test any models or configurations.

fengyang95 commented 1 month ago

vllm-project/vllm#7228

@qeternity This issue seems unrelated to the model; I encountered the same problem using another model (deepseek-v2) as well.

fengyang95 commented 1 month ago

After updating to this PR, I no longer have the issue: https://github.com/sgl-project/sglang/pull/1482 @user-0a