Motivation
Since the server itself cannot reject requests, a load balancer may sit above it to control the request traffic. This PR allows the benchmark to simulate that situation. Code is borrowed from https://github.com/vllm-project/vllm/pull/9390.
Modifications
Add an option --max-concurrency to bench_serving.py.
Make sure no more than max-concurrency requests reach the server concurrently (a sketch of this pattern follows the list below).
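The referenced vLLM PR implements this kind of cap with an asyncio semaphore, so the sketch below illustrates that general pattern rather than the exact code in bench_serving.py; `send_request` and `run_benchmark` are hypothetical placeholders for the real per-request coroutine and benchmark driver.

```python
import asyncio
from typing import Optional

async def send_request(payload: dict) -> dict:
    """Placeholder for the real per-request coroutine in bench_serving.py."""
    await asyncio.sleep(0.1)  # stand-in for the actual HTTP call to the server
    return {"ok": True}

async def run_benchmark(requests: list[dict], max_concurrency: Optional[int]) -> list[dict]:
    # When --max-concurrency is set, gate every request through a semaphore so
    # that at most `max_concurrency` requests are in flight at any moment.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None

    async def limited_request(payload: dict) -> dict:
        if semaphore is None:
            return await send_request(payload)
        async with semaphore:
            return await send_request(payload)

    return await asyncio.gather(*(limited_request(r) for r in requests))

if __name__ == "__main__":
    dummy_requests = [{"prompt": f"req-{i}"} for i in range(32)]
    results = asyncio.run(run_benchmark(dummy_requests, max_concurrency=8))
    print(f"Completed {len(results)} requests")
```

With max_concurrency=8, the 32 requests are issued at most eight at a time, which mimics a load balancer throttling traffic in front of the server; with the option unset, all requests are dispatched as before.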
Checklist