Open Amber-Believe opened 1 month ago
In v0.6.0, I ran a benchmark with 1000 total requests and --request-rate 100. 1) Only about half of the requests get a successful response. Why do so many fail? 2) When I checked the outputs, I found only 502 non-empty replies, which means some of the responses counted as successful are also None. What is the reason?
cc @KuntaiDu
@Amber-Believe make sure you follow https://github.com/vllm-project/vllm/issues/8176
Hello, so does this mean that only half of the responses succeed because of a problem with the multi-step scheduler (--num-scheduler-steps 10)?
IIRC vLLM's request buffer is not large enough to hold 1000 requests. Try using 500 requests instead.
It would also be great if you could use the latest vLLM version; it includes several stability-related bug fixes.
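When comparing a 1000-request run against a 500-request run, it may help to count failed requests separately from successful-but-empty replies, since the report above mixes the two. The following is a minimal sketch, not vLLM's benchmark code: `RequestResult`, its fields, and `summarize` are hypothetical names chosen for illustration, and you would need to fill them from whatever your benchmark script actually records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestResult:
    # Hypothetical record for one benchmark request (not a vLLM type).
    success: bool
    generated_text: Optional[str]


def summarize(results: List[RequestResult]) -> dict:
    """Split results into failed, successful-but-empty, and non-empty."""
    total = len(results)
    failed = sum(1 for r in results if not r.success)
    # A reply counts as empty if it is None or contains only whitespace.
    empty_success = sum(
        1 for r in results
        if r.success and not (r.generated_text or "").strip()
    )
    non_empty = total - failed - empty_success
    return {
        "total": total,
        "failed": failed,
        "empty_success": empty_success,
        "non_empty": non_empty,
    }
```

If `empty_success` is non-zero, the empty replies are a separate problem from the outright failures and are worth reporting as such.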
Okay, I'll try the new version of vLLM.