Closed: siddhatiwari closed this issue 1 month ago.
It seems to be a critical bug. Although we have not seen this before, it would be very helpful if you could share some reproducible examples.
In the meantime, you can help us find the source of this bug by trying the following options, one at a time (a rough sketch of scripting this follows the list):
--disable-cuda-graph
--disable-flashinfer
--disable-flashinfer-sampling
--chunked-prefill-size -1
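A minimal sketch of trying these options one at a time, assuming the server is started with `python -m sglang.launch_server` and queried through its OpenAI-compatible endpoint; the model path, port, fixed startup wait, and `model="default"` name are placeholders, not taken from this report:

```python
# Sketch only: restart the server with one debug option at a time and run a
# smoke request against the OpenAI-compatible endpoint after each restart.
# The model path, port, and startup wait below are placeholders.
import subprocess
import time

import openai

DEBUG_OPTIONS = [
    ["--disable-cuda-graph"],
    ["--disable-flashinfer"],
    ["--disable-flashinfer-sampling"],
    ["--chunked-prefill-size", "-1"],
]

BASE_CMD = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "/path/to/model",  # placeholder
    "--port", "30000",
]

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

for option in DEBUG_OPTIONS:
    server = subprocess.Popen(BASE_CMD + option)
    try:
        time.sleep(180)  # crude wait for model load; poll the server in real use
        resp = client.chat.completions.create(
            model="default",  # placeholder served-model name
            messages=[{"role": "user", "content": "Write one sentence about cats."}],
        )
        print(option, "->", resp.choices[0].message.content[:80])
    finally:
        server.terminate()
        server.wait()
```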
Hi, I ran into the same problem when I made too many requests at the same time. The output just became 梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦... May I ask if this bug will be fixed any time soon?
Thanks for your excellent library!
v0.2.15 fixes some fp8 weight-loading bugs. Could you give it a try?
Actually, I'm already using v0.2.15. I loaded an AWQ quantization of Mistral Large 2407 on 4 A40s. I got normal outputs with 50 simultaneous requests using the asynchronous client from the openai library, but invalid tokens at 100. I haven't tested in more detail, though.
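For reference, a rough sketch of that request pattern (many simultaneous requests through the async OpenAI client against sglang's OpenAI-compatible endpoint); the base URL, model name, prompts, and max_tokens below are placeholders, not the actual benchmark:

```python
# Sketch only: fire N requests concurrently with the async OpenAI client.
# Base URL, model name, prompts, and max_tokens are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder served-model name
        messages=[{"role": "user", "content": f"Request {i}: write a short paragraph."}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

async def main(n: int) -> None:
    outputs = await asyncio.gather(*(one_request(i) for i in range(n)))
    # At 50 concurrent requests outputs looked normal; at 100 they contained garbage tokens.
    for out in outputs[:3]:
        print(out[:120])

asyncio.run(main(100))
```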
Are you using regex or other constrained generation in your pipeline? If so, you were probably affected by the bug that was fixed here: https://github.com/sgl-project/sglang/commit/47f20da223c62473577231cec49dedb86c56220f
No, I'm not using any constrained generation, just simple text input and output. Currently I work around this problem by submitting only 50 requests at a time, since sglang is still much faster than the other libraries I tried.
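A sketch of that workaround, assuming an asyncio.Semaphore is used to keep at most 50 requests in flight; the endpoint, model name, and prompts are again placeholders:

```python
# Sketch of the workaround: cap in-flight requests at 50 with a semaphore, so no
# more than 50 requests are ever outstanding at once. Client setup is a placeholder.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

async def bounded_request(limit: asyncio.Semaphore, prompt: str) -> str:
    async with limit:
        resp = await client.chat.completions.create(
            model="default",  # placeholder served-model name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return resp.choices[0].message.content

async def main(prompts: list[str]) -> None:
    limit = asyncio.Semaphore(50)  # at most 50 requests in flight at any time
    results = await asyncio.gather(*(bounded_request(limit, p) for p in prompts))
    print(f"{len(results)} completions received")

asyncio.run(main([f"Prompt {i}" for i in range(200)]))
```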
I have observed this as well on A100s at TP4 with AWQ, GPTQ, and w8a8.
This is perhaps related: https://github.com/vllm-project/vllm/issues/7228
It happens even with those options disabled (tested on 2x H100 with Llama 2 70B fp8).
Has this issue been resolved? I sometimes encounter it too (with deepseek-v2.5-fp8). I didn't encounter this issue at commit 2abe4f1cb6e9b4d36c332b0fb04c0dec2ad38bc6, but it appeared at the latest commit (8f527e29409f714f9de839ece1e7aace15d9b27a). @zhyncs @merrymercy This looks like a rather serious bug.
I'm still waiting on this as well. Please let me know if I can be of any help in the meantime; I can test any models or configurations.
@qeternity This issue seems unrelated to the model; I encountered the same problem using another model (deepseek-v2) as well.
After updating to this PR, I no longer have this issue. https://github.com/sgl-project/sglang/pull/1482 @user-0a
Describe the bug
I ran an RPS benchmark script with prompts averaging about 1600 input tokens and got bad outputs as the RPS increased. For example:
*给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给追给给给给给给给给迫你。
It seems to be related to quantization and concurrent requests. I've listed some commands below with various models, quants, and max-num-reqs settings, along with whether they had good or bad outputs at a high RPS and max running-req.
Unfortunately I can't share the exact prompts used, but I'll update as I find other reproducible prompts.
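For context, a minimal sketch of this kind of open-loop RPS pacing, with a placeholder endpoint, model name, and prompts standing in for the real ~1600-token inputs:

```python
# Sketch of an open-loop RPS ramp: submit one request every 1/rps seconds without
# waiting for earlier ones to finish, so the number of running requests grows with
# the rate. Endpoint, model name, and prompts are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

async def send(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder served-model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def run_at_rps(prompts: list[str], rps: float) -> list[str]:
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(send(prompt)))
        await asyncio.sleep(1.0 / rps)  # pace submissions; concurrency is unbounded
    return await asyncio.gather(*tasks)

outputs = asyncio.run(run_at_rps([f"Long prompt {i} ..." for i in range(100)], rps=10.0))
```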
Here's a summary:
Commands with only the --quantization fp8 flag have bad outputs at high RPS.
Commands with both the --quantization fp8 and --max-num-reqs 10 flags have good outputs at high RPS.
Commands with neither the --quantization nor the --max-num-reqs flag had good outputs at high RPS.

Reproduction
Environment