triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Dynamic batching that supports static batch size with padding #7124

Open · ShuaiShao93 opened this issue 4 months ago

ShuaiShao93 commented 4 months ago

Is your feature request related to a problem? Please describe.
Since TensorRT has limited support for dynamic shapes, the dynamic batch sizes that the dynamic batcher requires the model to accept are not ideal for us.

Describe the solution you'd like
Support padding the batch up to the static batch size when there is not enough data to fill a full batch.
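
A minimal sketch of the idea in numpy, assuming a static batch size of 8 (the names and the helper itself are illustrative, not an existing Triton API): pad the batch formed by the dynamic batcher up to the static size before running the engine, then trim the padded rows from the outputs.

```python
import numpy as np

STATIC_BATCH_SIZE = 8  # assumed engine batch size, illustrative only


def pad_to_static_batch(batch: np.ndarray, static_batch_size: int = STATIC_BATCH_SIZE):
    """Pad a dynamically formed batch along dim 0 up to the static batch size.

    Returns the padded batch and the real batch size so outputs can be
    trimmed back after inference.
    """
    real_batch_size = batch.shape[0]
    pad_amount = static_batch_size - real_batch_size
    if pad_amount < 0:
        raise ValueError("batch is larger than the static batch size")
    if pad_amount == 0:
        return batch, real_batch_size
    # Zero-filled rows for the padded slots; their outputs are discarded later.
    pad_block = np.zeros((pad_amount,) + batch.shape[1:], dtype=batch.dtype)
    return np.concatenate([batch, pad_block], axis=0), real_batch_size


def trim_outputs(outputs: np.ndarray, real_batch_size: int) -> np.ndarray:
    """Drop the rows that correspond to padding."""
    return outputs[:real_batch_size]
```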

SunnyGhj commented 4 months ago

Great minds think alike. I'm trying to implement the padding manually on the request side.

ShuaiShao93 commented 4 months ago

> Great minds think alike. I'm trying to implement the padding manually on the request side.

Does this mean you disabled dynamic batching in Triton? That is not ideal, because dynamic batching is one of the most important reasons we use Triton.

SunnyGhj commented 4 months ago

> when there is not enough data to fill a full batch.

Similarly, we have implemented batching of requests manually on the client and fixed the batch size to the static batch size. We pad the data whenever there is not enough of it to fill a full batch.
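
For reference, a rough sketch of this client-side workaround using the Python tritonclient; the model name, tensor names, dtype, input shape, and static batch size below are placeholders for illustration, not anything from this issue.

```python
import numpy as np
import tritonclient.http as httpclient

STATIC_BATCH_SIZE = 8  # assumed static batch size baked into the engine

client = httpclient.InferenceServerClient(url="localhost:8000")

# Suppose we only have 3 real samples but the engine expects a batch of 8.
real_inputs = np.random.rand(3, 224, 224, 3).astype(np.float32)
pad = np.zeros((STATIC_BATCH_SIZE - real_inputs.shape[0], 224, 224, 3), dtype=np.float32)
padded = np.concatenate([real_inputs, pad], axis=0)

infer_input = httpclient.InferInput("INPUT__0", list(padded.shape), "FP32")
infer_input.set_data_from_numpy(padded)

response = client.infer(model_name="my_trt_model", inputs=[infer_input])
# Discard the rows that correspond to padding.
outputs = response.as_numpy("OUTPUT__0")[: real_inputs.shape[0]]
```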

ShuaiShao93 commented 4 months ago

> > when there is not enough data to fill a full batch.
>
> Similarly, we have implemented batching of requests manually on the client and fixed the batch size to the static batch size. We pad the data whenever there is not enough of it to fill a full batch.

OK, it sounds like you re-implemented the dynamic batcher in your own client, which is probably not the best investment of time. I hope Triton can support this natively, but thanks for sharing!

Tabrizian commented 4 months ago

I think this enhancement makes sense. @GuanLuo / @nnshah1 any additional thoughts?

nnshah1 commented 4 months ago

@ShuaiShao93 If I understand correctly, the idea here is to have a static batch size defined in the engine, but then have the dynamic batcher pad whenever it sends in a batch of smaller size?

Is that something to handle in the server or in the backend? It might be more efficient to pad right before sending it to the engine.

ShuaiShao93 commented 4 months ago

@nnshah1 how is this possible?

Let's say a model has static batch size = 8. There are two clients, client A has a request of batch size 4, client B has a request of batch size 3.

Ideally, if A and B call triton server at the same time, dynamic batcher makes a batch of size 7, then pads it to 8.

But if we pad at the client, meaning A pads 4 to 8 and B pads 3 to 8, we need to run inference twice, which doubles the cost.
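
A quick back-of-the-envelope comparison for the scenario above:

```python
import math

static_batch = 8
request_sizes = [4, 3]  # client A and client B

# Server-side padding: the dynamic batcher merges 4 + 3 = 7, pads to 8 -> one engine run.
server_side_runs = math.ceil(sum(request_sizes) / static_batch)

# Client-side padding: each client pads its own request to 8 -> one engine run per request.
client_side_runs = len(request_sizes)

print(server_side_runs, client_side_runs)  # 1 vs 2: client-side padding doubles the cost
```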

nnshah1 commented 4 months ago

> @nnshah1 how is this possible?
>
> Let's say a model has static batch size = 8. There are two clients, client A has a request of batch size 4, client B has a request of batch size 3.
>
> Ideally, if A and B call triton server at the same time, dynamic batcher makes a batch of size 7, then pads it to 8.
>
> But if we pad at the client, meaning A pads 4 to 8 and B pads 3 to 8, we need to run inference twice, which doubles the cost.

No, I get your point. I mean padding in the TRT backend vs. in the core server, not padding at the client.

nnshah1 commented 4 months ago

As an example, for our Stable Diffusion tutorial I ended up padding / splitting on the model side and letting the dynamic batcher provide batches independently of that. (This is just an example; it would need to be implemented in the TRT backend or Triton core.)

https://github.com/triton-inference-server/tutorials/blob/cb2ca257000cd14d59642a7aa86b56d054535d73/Popular_Models_Guide/StableDiffusion/backend/diffusion/model.py#L178
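
In the same spirit, here is an illustrative sketch (not the tutorial code) of splitting plus padding on the model side, so the dynamic batcher can hand the model arbitrary batch sizes while the engine always sees the static size; `engine_fn` is a stand-in for whatever actually runs the TRT engine on a full batch.

```python
import numpy as np


def run_with_static_engine(batch: np.ndarray, static_batch_size: int, engine_fn):
    """Split an arbitrary-sized batch into fixed-size chunks, padding the last one.

    `engine_fn` must accept a batch of exactly `static_batch_size` rows and
    return outputs with the same leading dimension.
    """
    outputs = []
    for start in range(0, batch.shape[0], static_batch_size):
        chunk = batch[start : start + static_batch_size]
        real_size = chunk.shape[0]
        if real_size < static_batch_size:
            pad = np.zeros(
                (static_batch_size - real_size,) + chunk.shape[1:], dtype=chunk.dtype
            )
            chunk = np.concatenate([chunk, pad], axis=0)
        # Drop the rows produced by padding before collecting the results.
        outputs.append(engine_fn(chunk)[:real_size])
    return np.concatenate(outputs, axis=0)
```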

ShuaiShao93 commented 4 months ago

@nnshah1 Ah, gotcha. Thanks! Either should work, but it sounds better to make this a general feature with a flag in the model config, in case other backends also want a static batch size.
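
For illustration only, such a flag might look like the following in the model config; `pad_to_max_batch_size` is a hypothetical field that does not exist in Triton today, shown here alongside real `dynamic_batching` settings.

```
# Hypothetical sketch -- `pad_to_max_batch_size` is NOT an existing Triton option.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 100
  pad_to_max_batch_size: true   # hypothetical: pad partial batches up to max_batch_size
}
```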