runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

`MAX_CONCURRENCY` parameter doesn't work #36

Closed: antonioglass closed this issue 7 months ago

antonioglass commented 7 months ago

Current behaviour: When sending multiple requests at a short interval (e.g. 1 second) to an endpoint with 1 worker enabled, all the requests skip the queue and are passed straight to the worker (the Queued count stays at 0). This results in very long execution times.

(Screenshot attached: Screenshot 2024-01-13 at 15 47 56)

Steps to reproduce:

  1. Set `MAX_CONCURRENCY` to 1.
  2. Send multiple requests at a short interval (e.g. 1 second); see the reproduction sketch below.
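
As a reproduction aid (not part of the original report), here is a minimal Python sketch that submits several jobs one second apart through RunPod's async `/run` endpoint. `ENDPOINT_ID`, `RUNPOD_API_KEY`, and the prompt payload are placeholders for your own deployment; the worker's exact input schema may differ.

```python
import os
import time

import requests

ENDPOINT_ID = os.environ["ENDPOINT_ID"]      # placeholder: your endpoint ID
API_KEY = os.environ["RUNPOD_API_KEY"]       # placeholder: your RunPod API key
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit 5 jobs, 1 second apart, and collect their job IDs.
job_ids = []
for i in range(5):
    resp = requests.post(
        URL,
        headers=HEADERS,
        json={"input": {"prompt": f"Request {i}: tell me a short story."}},
        timeout=30,
    )
    resp.raise_for_status()
    job_ids.append(resp.json()["id"])
    time.sleep(1)

print("Submitted jobs:", job_ids)
# With MAX_CONCURRENCY=1 and a single worker, only one job should be
# IN_PROGRESS at a time; the rest should remain IN_QUEUE.
```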

Expected behaviour: Only 1 request should be processed at a time; all subsequent requests should wait in the queue.

This is especially important when using AWQ models, since only a small number of concurrent requests can be processed efficiently in that case.
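
For context, here is a minimal sketch of how a `MAX_CONCURRENCY` cap is typically wired into a RunPod serverless worker, assuming the Python SDK's `concurrency_modifier` hook. This illustrates the mechanism rather than the actual worker-vllm implementation; `handler` is a stand-in for the vLLM generation handler.

```python
import os

import runpod

# Assumed convention: the cap is read from the MAX_CONCURRENCY environment variable.
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "1"))


async def handler(job):
    # Stand-in for the vLLM generation handler.
    prompt = job["input"].get("prompt", "")
    return {"echo": prompt}


def concurrency_modifier(current_concurrency: int) -> int:
    # Tell the SDK never to take more than MAX_CONCURRENCY jobs at once;
    # excess requests should stay in the endpoint queue instead of
    # reaching the worker.
    return MAX_CONCURRENCY


runpod.serverless.start(
    {
        "handler": handler,
        "concurrency_modifier": concurrency_modifier,
    }
)
```

The reported behaviour suggests the cap was not being honoured, so all submitted jobs reached the worker immediately instead of queueing.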

alpayariyak commented 7 months ago

Thank you for your feedback; this has now been fixed in the latest update!