triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Splitting a batch to max_batch_size if the batch size is larger than max_batch_size #4547

Open omidb opened 2 years ago

omidb commented 2 years ago

Is your feature request related to a problem? Please describe.
We are trying to support batches larger than max_batch_size on the Triton server by leveraging instance groups: splitting the oversized batch and distributing the pieces across the available model instances.

Describe the solution you'd like
Triton would provide configuration for this: it would accept requests of varying batch sizes, split them into batches no larger than max_batch_size, and send those batches to the model instances.

Describe alternatives you've considered
With the current architecture we have two options for implementing this ourselves: (1) business logic scripting, or (2) building another server and taking care of the splitting there.

nv-kmcgill53 commented 2 years ago

Hi @omidb, I'd like to better understand your definition of batching, as Triton defines it as the concatenation of requests rather than the size of the first dimension of the input tensor. Given two requests Req_0 and Req_1 (you can think of these as 2 separate HTTP requests for ease), both with input tensors of size 1xN, these two requests can be batched into one submission to the backend model as a 2xN tensor. With max_batch_size=2, the concatenated tensor is submitted as soon as max_batch_size is hit. A third request will start a new batch and, depending on the scheduling, possibly go to a different model instance.
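For concreteness, a minimal model configuration along these lines might look like the sketch below; the model name, tensor names, and backend are placeholders rather than anything from this issue.

```
name: "my_model"                # placeholder model name
platform: "onnxruntime_onnx"    # placeholder backend
max_batch_size: 2               # largest batch Triton will form or accept
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]                # per-request shape is 1xN; the batch dimension is implicit
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
dynamic_batching { }            # allow Triton to concatenate queued requests up to max_batch_size
instance_group [ { count: 2, kind: KIND_GPU } ]   # two instances to spread batches across
```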

If you are interested in splitting an MxN tensor from a single request into smaller tensors, then you can do this on the client side or, as you describe in the alternatives, use business logic.
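As a rough illustration of the client-side approach (a sketch only, not an official recipe: the server URL, model name, tensor names, and MAX_BATCH_SIZE below are assumptions that would have to match your deployment):

```python
# Hypothetical client-side splitting: chunk an M x N batch into pieces of at
# most MAX_BATCH_SIZE, send one request per chunk, and reassemble the outputs.
import numpy as np
import tritonclient.grpc as grpcclient

MAX_BATCH_SIZE = 2  # must match max_batch_size in the model config


def infer_large_batch(batch: np.ndarray) -> np.ndarray:
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    outputs = []
    # Split along the batch dimension at multiples of MAX_BATCH_SIZE.
    split_points = list(range(MAX_BATCH_SIZE, len(batch), MAX_BATCH_SIZE))
    for chunk in np.array_split(batch, split_points):
        inp = grpcclient.InferInput("INPUT0", list(chunk.shape), "FP32")
        inp.set_data_from_numpy(chunk.astype(np.float32))
        result = client.infer(
            "my_model",
            inputs=[inp],
            outputs=[grpcclient.InferRequestedOutput("OUTPUT0")])
        outputs.append(result.as_numpy("OUTPUT0"))
    # Concatenate the per-chunk results back into one response for the caller.
    return np.concatenate(outputs, axis=0)
```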

cc @tanmayv25

omidb commented 2 years ago

We are currently using business logic to do the splitting. I think this could be a feature of the Triton server: imagine a client offloading a huge batch and Triton taking care of it instead of returning an error to the client.

jbkyang-nvi commented 1 year ago

Closing this issue due to lack of activity. Please re-open it if you would like to follow up.

pseudotensor commented 1 year ago

I agree, splitting can be as important as merging inputs. We have to add extra code to manage the splitting process within the model handler, e.g. in Python backend model code. This would be better handled by Triton, which could generalize the splitting across all model instances.

Doing this on the client would be a poor separation of concerns IMO. The client shouldn't have to worry about max_batch_size or manage multiple concurrent async requests itself. That would be much better handled by Triton. @nv-kmcgill53
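To illustrate what that extra model-handler code tends to look like, here is a rough sketch of a Python-backend wrapper that fans an oversized batch out to the real model via BLS and gathers the results; the model name, tensor names, and split size are placeholders.

```python
# model.py for a hypothetical "splitter" Python-backend model that sits in front
# of the real model and splits oversized batches via BLS. Names are assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils

MAX_BATCH_SIZE = 2  # must match max_batch_size of the wrapped model


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            split_points = list(range(MAX_BATCH_SIZE, len(data), MAX_BATCH_SIZE))
            chunk_outputs = []
            # Fan the oversized batch out as several BLS requests.
            for chunk in np.array_split(data, split_points):
                infer_request = pb_utils.InferenceRequest(
                    model_name="my_model",
                    requested_output_names=["OUTPUT0"],
                    inputs=[pb_utils.Tensor("INPUT0", chunk)])
                infer_response = infer_request.exec()
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message())
                chunk_outputs.append(
                    pb_utils.get_output_tensor_by_name(
                        infer_response, "OUTPUT0").as_numpy())
            # Gather the chunk results back into a single response.
            out = pb_utils.Tensor("OUTPUT0", np.concatenate(chunk_outputs, axis=0))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

Note that the wrapper model itself still needs a max_batch_size large enough to accept the oversized request in the first place (or max_batch_size set to 0 to disable Triton-level batching for it).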

dyastremsky commented 1 year ago

This may break the flow and intent of Triton's scheduler, but it's a feature request worth considering. Filed a ticket to investigate.

fangpings commented 1 year ago

I have the same problem. We have an ensemble model with preprocessing, inference, and postprocessing steps. I have observed that the preprocessing phase sometimes generates a request whose batch_size is larger than the max_batch_size defined for the inference model, and that request is rejected by the inference model. In that case there is nothing I can do on the client side. Is there any workaround for this scenario?

dyastremsky commented 1 year ago

We have filed a ticket to look into the enhancement. However, can you tell me more about your pipeline? It doesn't sound like batching is being used correctly. How are the batch sizes varying between your models? Are you sure you're not trying to use dynamic batching where what you want is a variable dimension? As far as I understand, your models should have no concept of batching: each model should have a first dimension that accepts a batch, and the individual responses get scattered back once inference is done. For an ensemble, they would get sent to the next model.

CC: @GuanLuo Do you know if the above is possible? Perhaps with decoupled models, where a model can send multiple responses for each request? I'd think that, even in that case, the responses get scattered by the model and then gathered by the next model, rather than being sent as a batch (with 2+ requests sent pre-batched).

fangpings commented 1 year ago

For our ensemble pipeline, the input is a web document. In the preprocessing model we tokenize the document, but sometimes the number of tokens exceeds the max_sequence_length of our inference model; in that case we need to split the tokens into multiple batches and send them to the inference model.

So the input to the ensemble is always 1x1, since it's just a single string with batch_size 1. After preprocessing, the input to our inference model becomes N x max_sequence_length, where N depends on the document length and sometimes exceeds the defined max_batch_size.

If Triton could automatically split a single request into two (or more) batches when N is greater than max_batch_size, run them through the model, and assemble the results back into one response, I think that would solve my problem.

dyastremsky commented 1 year ago

Understood, thanks for explaining! That sounds like a very valid use case, and one that I can't think of a workaround for that doesn't require processing the response on the client side and sending new requests (inefficient and inconvenient). Business Logic Scripting (BLS) might work as a workaround.

I have attached your use case to the ticket to help with prioritization. We want to ensure we are supporting user needs. Thank you for sharing.

fangpings commented 1 year ago

Thank you very much for the help!

flexwang2 commented 1 year ago

We would like to see this server-side chunking feature in Triton go live.

ShuaiShao93 commented 5 months ago

This would still be useful for us. Are there any updates?

dyastremsky commented 5 months ago

Unfortunately, not yet. Other features have been prioritized. I updated the associated ticket to help with priority. Triton is also open-source, so any contributions are more than welcome.

Ref: DLIS-5111.

njaramish commented 4 months ago

Adding a +1 to the "would love this feature" bucket. Thanks for your hard work on this awesome project!