Open ShuaiShao93 opened 4 months ago
Great minds think alike, I'm trying to manually implement padding size from the request side
Does this mean you disabled dynamic batching on triton? This is not ideal, because one of the most important reasons for us to use Triton is dynamic batching
> when there is not sufficient amount of data.

Similarly, we have manually implemented batched requests on the client side and fixed the batch size to the static batch size. We are trying to pad the data when there isn't a sufficient amount.
OK, it sounds like you've re-implemented the dynamic batcher in your own client, which is probably not the best investment of time. I hope Triton can support this natively. Thanks for sharing this, though!
I think this enhancement makes sense. @GuanLuo / @nnshah1 any additional thoughts?
@ShuaiShao93 If I understand correctly - the idea here is to have a static batch defined in the engine but then have the dynamic batcher pad if it sends in batches with smaller size?
Is that something to handle in the server or in the backend? It might be more efficient to pad right before sending it to the engine.
@nnshah1 how is this possible?
Let's say a model has static batch size = 8. There are two clients, client A has a request of batch size 4, client B has a request of batch size 3.
Ideally, if A and B call triton server at the same time, dynamic batcher makes a batch of size 7, then pads it to 8.
But if we pad at the client, so that A pads 4 to 8 and B pads 3 to 8, we have to run inference twice, which doubles the cost.
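The server-side scenario above can be sketched in a few lines of NumPy. This is an illustrative sketch, not Triton code: `pad_batch` is a hypothetical helper showing how a merged batch of 7 (4 from A, 3 from B) would be zero-padded up to the engine's static batch size of 8, with the real row count kept so padded outputs can be dropped afterwards.

```python
import numpy as np

STATIC_BATCH = 8  # batch size baked into the TensorRT engine (example value)

def pad_batch(batch: np.ndarray, static_batch: int = STATIC_BATCH):
    """Pad a merged batch up to the engine's static batch size.

    Returns the padded batch and the number of real rows, so the
    padded rows can be stripped from the output before replying.
    """
    real = batch.shape[0]
    if real > static_batch:
        raise ValueError(f"batch of {real} exceeds static size {static_batch}")
    if real == static_batch:
        return batch, real
    pad_rows = np.zeros((static_batch - real,) + batch.shape[1:], dtype=batch.dtype)
    return np.concatenate([batch, pad_rows], axis=0), real

# Dynamic batcher merges A (4) and B (3) into one batch of 7, then pads to 8:
merged = np.ones((7, 3), dtype=np.float32)
padded, real = pad_batch(merged)
# padded.shape == (8, 3); only outputs[:real] are returned to clients
```

Padding once on the server after merging runs the engine a single time, versus twice if each client pads independently.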
No, I get your point. I mean padding in the TRT backend versus the core server piece, not padding at the client.
As a kind of example, for our Stable Diffusion tutorial I ended up padding/splitting on the model side and letting the dynamic batcher provide batches independently of that. (This is just an example and would need to be implemented in the TRT engine or Triton core.)
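The model-side pad/split approach can be sketched as plain Python (a hypothetical stand-in for backend code, not the actual tutorial implementation): concatenate per-request inputs, pad to the static batch, run the engine once, then slice the outputs back to their originating requests. `engine` here is an assumed callable standing in for TensorRT execution.

```python
import numpy as np

def run_padded(requests, engine, static_batch=8):
    """Pad/split on the model side: merge per-request inputs, pad to the
    engine's static batch size, run once, then slice the real outputs
    back to the requests that produced them."""
    sizes = [r.shape[0] for r in requests]
    merged = np.concatenate(requests, axis=0)
    total = merged.shape[0]
    assert total <= static_batch, "merged batch exceeds static batch size"
    pad = np.zeros((static_batch - total,) + merged.shape[1:], dtype=merged.dtype)
    out = engine(np.concatenate([merged, pad], axis=0))
    # return only the real rows, split per request
    results, offset = [], 0
    for n in sizes:
        results.append(out[offset:offset + n])
        offset += n
    return results

# Client A sends batch 4, client B sends batch 3; the fake engine doubles values.
engine = lambda x: x * 2
outs = run_padded([np.ones((4, 2)), np.full((3, 2), 3.0)], engine)
# outs[0] has shape (4, 2), outs[1] has shape (3, 2)
```

Because the padding lives behind the dynamic batcher, the batcher stays free to merge requests however it likes, and clients never see the padded rows.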
@nnshah1 Ah, gotcha. Thanks! Either should work, but it sounds better to make this a general feature with a flag in the model config, in case other backends also want a static batch size.
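To make the config-flag idea concrete, a sketch of what a `config.pbtxt` might look like. `max_batch_size`, `preferred_batch_size`, and `max_queue_delay_microseconds` are existing Triton settings; the `pad_to_preferred_batch_size` field is purely hypothetical and does not exist in Triton today — it only illustrates the feature being requested.

```
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 100
  # hypothetical flag: pad under-full batches up to the preferred size
  pad_to_preferred_batch_size: true
}
```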
**Is your feature request related to a problem? Please describe.**
Since TensorRT has limited support for dynamic shapes, the dynamic batch sizes produced by the dynamic batcher are not ideal.

**Describe the solution you'd like**
Support padding the batch up to the static batch size when there is not a sufficient amount of data.