triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Batching support by stacking input arrays in python backend #3984

Open shreypandey opened 2 years ago

shreypandey commented 2 years ago

Is your feature request related to a problem? Please describe. The Triton Python backend should provide dynamic batching just like the other backends Triton supports. For example, for the model config below:

max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 81 ]
  }
]

Inputs for the PyTorch/TensorFlow/ONNX backends have shape [k, 81], where k is the batch size chosen by Triton's dynamic batcher. In contrast, the input to the Python backend is a Python list of length k, where each element is a pb_utils.InferenceRequest object containing an array of shape [1, 81].

`execute` MUST be implemented in every Python model. `execute`
function receives a list of pb_utils.InferenceRequest as the only
argument. This function is called when an inference request is made
for this model. Depending on the batching configuration (e.g. Dynamic
Batching) used, `requests` may contain multiple requests. Every
Python model must create one pb_utils.InferenceResponse for every
pb_utils.InferenceRequest in `requests`. If there is an error, you can
set the error argument when creating a pb_utils.InferenceResponse.

Parameters
----------
requests : list
A list of pb_utils.InferenceRequest

Returns
-------
list
A list of pb_utils.InferenceResponse. The length of this list must
be the same as `requests`
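In practice, this means each `execute` call has to stack the per-request arrays itself. A minimal sketch of that workaround, assuming a single input INPUT0 of shape [1, 81] per request (from the example config above) and a single output OUTPUT0 of the same leading dimension; the output name and the `model_fn` call are hypothetical placeholders:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Gather the [1, 81] arrays from every request and stack them into
        # a single [k, 81] batch, emulating what other backends receive
        # directly from the dynamic batcher.
        batched = np.concatenate(
            [pb_utils.get_input_tensor_by_name(r, "INPUT0").as_numpy()
             for r in requests],
            axis=0,
        )

        # Run the model once on the whole batch (hypothetical model call).
        outputs = self.model_fn(batched)

        # Split the batched output back into one response per request,
        # as required by the Python backend contract.
        responses = []
        offset = 0
        for r in requests:
            n = pb_utils.get_input_tensor_by_name(r, "INPUT0").as_numpy().shape[0]
            out_tensor = pb_utils.Tensor("OUTPUT0", outputs[offset:offset + n])
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            offset += n
        return responses
```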

Describe the solution you'd like. The Python backend should provide inputs as a single array with requests batched along the batch axis, as the other backends do.


Tabrizian commented 2 years ago

We already have a ticket filed for this enhancement. https://github.com/triton-inference-server/server/issues/3286

jcuquemelle commented 1 year ago

A nice addition to this feature would be an option to pad the batched request, to avoid recompiling the model (when using torch.jit.trace or torch.compile, for example) whenever it encounters an incomplete batch.
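A rough sketch of the kind of padding described here, usable inside the manual-stacking workaround above; the zero padding value, the helper name, and the trimming of the output are assumptions, with max_batch_size of 8 taken from the example config:

```python
import numpy as np

MAX_BATCH_SIZE = 8  # from max_batch_size in the example config


def pad_batch(batched: np.ndarray, max_batch_size: int = MAX_BATCH_SIZE):
    """Pad a [k, 81] batch up to [max_batch_size, 81] with zeros so a
    traced/compiled model always sees the same input shape."""
    k = batched.shape[0]
    if k == max_batch_size:
        return batched, k
    pad = np.zeros((max_batch_size - k,) + batched.shape[1:], dtype=batched.dtype)
    return np.concatenate([batched, pad], axis=0), k


# Usage inside execute(): run the model on the padded batch, then keep
# only the first k output rows before splitting them per request.
#   padded, k = pad_batch(batched)
#   outputs = self.model_fn(padded)[:k]  # hypothetical model call
```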