triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Ability to make preferred_batch_size mandatory #7604

Open riZZZhik opened 2 months ago

riZZZhik commented 2 months ago

Is your feature request related to a problem? Please describe.
I am using a model with dynamic batching, which requires warming the model up for every batch size it can receive. Warming up consumes both time and VRAM, and my goal is to significantly increase RPS without overloading either resource. To do that, I plan to warm up the model only for a set of batch sizes with growing increments (e.g., 1, 2, 3, 4, 8, 16, 32). However, this approach cannot be implemented with the dynamic batching logic described in model_configuration.md, because the batcher is not restricted to those sizes.
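For reference, a minimal sketch of the kind of config.pbtxt this setup implies, assuming a hypothetical model with a single FP32 input named INPUT__0 of shape [3, 224, 224]; each model_warmup entry covers one of the batch sizes the scheduler might produce:

```
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 1, 2, 3, 4, 8, 16, 32 ]
}
model_warmup [
  {
    name: "warmup_bs1"
    batch_size: 1
    inputs {
      key: "INPUT__0"   # hypothetical input name and shape
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  },
  {
    name: "warmup_bs2"
    batch_size: 2
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  }
  # ... one entry per warmed-up batch size (4, 8, 16, 32)
]
```

Warming up every listed size is what costs the time and VRAM; the difficulty is that preferred_batch_size is only a preference, so the dynamic batcher can still form batches of other sizes that were never warmed up.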

Example problem: Consider a scenario with 12 requests in the queue. Currently, these requests would be combined into a single batch of 12. The objective is instead to split them into two batches of sizes 4 and 8.

Describe the solution you'd like
Add a flag that makes preferred_batch_size mandatory, or introduce a new option that strictly controls which dynamic batch sizes may be formed, as sketched below.
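One possible shape for this, purely to illustrate the request (the preferred_batch_size_only field below is hypothetical and does not exist in Triton today):

```
dynamic_batching {
  preferred_batch_size: [ 1, 2, 3, 4, 8, 16, 32 ]
  # Hypothetical flag proposed by this issue: when true, the dynamic
  # batcher would only ever form batches of the listed sizes, e.g.
  # splitting 12 queued requests into 4 + 8 rather than a single batch of 12.
  preferred_batch_size_only: true
}
```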

Describe alternatives you've considered
I haven't come up with any alternatives.