triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Feature Request: Set maximum number of in-flight requests #412

Open · TheCodeWrangler opened this issue 2 months ago

TheCodeWrangler commented 2 months ago

When unexpectedly large bursts of requests hit my application, I would like to be able to limit the number of requests that the trtllm backend will accept. Specifically, I would like to REJECT incoming requests once the number of active requests for a specific backend instance exceeds a threshold.

I have tried:

dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT  # reject requests instead of delaying them
    max_queue_size: 30      # reject new requests once 30 are already queued
  }
}

But I would like a way to actually achieve this behavior so that I can better balance my load (and not have one instance with a large backlog).

TheCodeWrangler commented 2 days ago

Any plans to add this as a controllable feature?

Are there any other suggestions for how I can keep the internal queue from getting too large for a single instance?
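
In the meantime I am experimenting with gating requests on the client side. This is only a minimal sketch, assuming the Python gRPC client (tritonclient[grpc]); the URL, model name, and tensor name below are placeholders, not the actual deployment values:

import threading

import numpy as np
import tritonclient.grpc as grpcclient

MAX_IN_FLIGHT = 30  # illustrative cap, mirrors the max_queue_size above

_slots = threading.Semaphore(MAX_IN_FLIGHT)
# "localhost:8001" is a placeholder for the real Triton gRPC endpoint.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def guarded_infer(input_ids: np.ndarray):
    # Reject immediately instead of queueing once the cap is reached.
    if not _slots.acquire(blocking=False):
        raise RuntimeError("overloaded: too many in-flight requests")
    try:
        # "input_ids" and "tensorrt_llm" are placeholder tensor/model names.
        inp = grpcclient.InferInput("input_ids", list(input_ids.shape), "INT32")
        inp.set_data_from_numpy(input_ids.astype(np.int32))
        return client.infer("tensorrt_llm", inputs=[inp])
    finally:
        _slots.release()

This keeps the rejection decision on the client, but it only works if every producer enforces the same cap, which is why a server-side limit in the backend would still be preferable.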