Open syGOAT opened 6 months ago
The performance of guided_json is still a work in progress. Try setting `guided_decoding_backend` to `lm-format-enforcer` and see whether there's a difference?
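For reference, the backend can be selected either server-wide (`--guided-decoding-backend lm-format-enforcer` when launching the OpenAI-compatible server) or per request. Below is a minimal sketch of the per-request form against vLLM's OpenAI-compatible API; the model name and schema are placeholders, not taken from this issue:

```python
import json

# Hypothetical JSON schema used to constrain the output via guided_json.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "temperature": {"type": "number"}},
    "required": ["city", "temperature"],
}

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# guided_json and guided_decoding_backend are vLLM extensions to the OpenAI
# API; with the openai Python client they would go in extra_body.
payload = {
    "model": "placeholder-model",  # placeholder, not the model from this issue
    "messages": [{"role": "user", "content": "Report the weather as JSON."}],
    "guided_json": schema,
    "guided_decoding_backend": "lm-format-enforcer",
}

print(json.dumps(payload, indent=2))
```

The per-request form is handy for A/B comparing backends without restarting the server.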
@simon-mo Unfortunately, its performance was no better after setting `guided_decoding_backend` to `lm-format-enforcer`.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
How would you like to use vllm
I started the model:
My request body is:
I conducted a stress test with 20 concurrent users, a 4-minute runtime, and a 1-minute ramp-up. vLLM performed as follows:

- Total requests: 111
- Requests per second: 0.45
- Minimum response time: 5321 ms
- Average response time: 34024 ms
- Maximum response time: 53862 ms
- 90th-percentile response time: 46041 ms
- Failure rate: 3.6%

Does vLLM not perform well under high concurrency?
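As a side note on reading these numbers: a "90% response time" is the 90th percentile of the raw per-request latencies, not an average. A minimal sketch of how such a figure is derived; the latency samples below are made up for illustration, not from the test run above:

```python
import statistics

# Hypothetical per-request latencies in milliseconds (illustrative only).
latencies_ms = [5321, 12000, 20000, 28000, 34000, 39000, 43000, 46041, 50000, 53862]

# statistics.quantiles with n=10 returns the 9 deciles; index 8 is the
# 90th percentile (the value 90% of requests finished under).
p90 = statistics.quantiles(latencies_ms, n=10)[8]
avg = statistics.mean(latencies_ms)

print(f"avg={avg:.0f} ms, p90={p90:.0f} ms")
```

A large gap between the average and the p90/maximum, as in the report above, usually points to requests queuing behind one another rather than uniformly slow inference.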