runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

`MAX_CONCURRENCY` parameter doesn't work #36

Closed: antonioglass closed this issue 7 months ago

antonioglass commented 7 months ago

Current behaviour: When sending multiple requests at a short interval (e.g. 1 second) to an endpoint with 1 worker enabled, all the requests skip the queue and are passed straight to the worker (the Queued count stays at 0). This results in very long execution times.

(Screenshot attached: Screenshot 2024-01-13 at 15 47 56)

Steps to reproduce:

  1. Set `MAX_CONCURRENCY` to 1.
  2. Send multiple requests at a short interval (e.g. 1 second); see the reproduction sketch below.
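
As a reproduction aid (not part of the original report), here is a minimal Python sketch that submits several jobs one second apart through RunPod's async `/run` endpoint. `ENDPOINT_ID`, `RUNPOD_API_KEY`, and the prompt payload are placeholders for your own deployment; the worker's exact input schema may differ.

```python
import os
import time

import requests

ENDPOINT_ID = os.environ["ENDPOINT_ID"]      # placeholder: your endpoint ID
API_KEY = os.environ["RUNPOD_API_KEY"]       # placeholder: your RunPod API key
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit 5 jobs, 1 second apart, and collect their job IDs.
job_ids = []
for i in range(5):
    resp = requests.post(
        URL,
        headers=HEADERS,
        json={"input": {"prompt": f"Request {i}: tell me a short story."}},
        timeout=30,
    )
    resp.raise_for_status()
    job_ids.append(resp.json()["id"])
    time.sleep(1)

print("Submitted jobs:", job_ids)
# With MAX_CONCURRENCY=1 and a single worker, only one job should be
# IN_PROGRESS at a time; the rest should remain IN_QUEUE.
```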

Expected behaviour: Only 1 request should be processed at a time; all subsequent requests should wait in the queue.

This is especially important when using AWQ models, since only a small number of concurrent requests can be processed efficiently in that case.
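
For context, here is a minimal sketch of how a `MAX_CONCURRENCY` cap is typically wired into a RunPod serverless worker, assuming the Python SDK's `concurrency_modifier` hook. This illustrates the mechanism rather than the actual worker-vllm implementation; `handler` is a stand-in for the vLLM generation handler.

```python
import os

import runpod

# Assumed convention: the cap is read from the MAX_CONCURRENCY environment variable.
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "1"))


async def handler(job):
    # Stand-in for the vLLM generation handler.
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}


def concurrency_modifier(current_concurrency: int) -> int:
    # Tell the SDK never to take more than MAX_CONCURRENCY jobs at once;
    # excess requests should stay in the endpoint queue instead of
    # reaching the worker.
    return MAX_CONCURRENCY


runpod.serverless.start(
    {
        "handler": handler,
        "concurrency_modifier": concurrency_modifier,
    }
)
```

The reported behaviour suggests the cap was not being honoured, so all submitted jobs reached the worker immediately instead of queueing.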

alpayariyak commented 7 months ago

Thank you for your feedback; this has now been fixed in the latest update!