triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Batched Prediction for Python backend #3286

Open sumitbinnani opened 3 years ago

sumitbinnani commented 3 years ago

Is your feature request related to a problem? Please describe. I have a Python function that can process multiple requests in parallel (i.e., it supports batched prediction). However, I currently have to perform inference per request in a for loop, leading to a massive slowdown.

Describe the solution you'd like Parse the inference requests in the Python backend's execute as a batched NumPy array. E.g., if the config has an input of dim 3 and dtype int, and I receive 8 requests in execute, I need some way to parse these requests as an 8x3 array. Similarly, I need a way to return an 8x1 array as a single output from the backend.

Describe alternatives you've considered Couldn't think of anything else.

CPFelix commented 3 years ago

You can set the first dimension of the Python model's input to -1, such as [-1, 224, 224, 3], and it can then accept batched input.
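
For reference, a minimal config.pbtxt sketch of this setup (model name, tensor names, and output shape are illustrative, not taken from this issue): with max_batch_size set to 0, the leading -1 dimension belongs to the tensor itself, so the client has to send an already-batched tensor.

```
name: "my_python_model"
backend: "python"
max_batch_size: 0
input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ -1, 224, 224, 3 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ -1, 1000 ]
  }
]
```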

Tabrizian commented 3 years ago

@sumitbinnani I have filed a ticket for this enhancement.

sumitbinnani commented 3 years ago

@CPFelix The current batch-prediction snippet for the Python backend with dynamic batching looks like this:

```python
responses = []
for request in requests:
    ...
    response = ...
    responses.append(response)
return responses
```

This does not make use of any parallelism or vectorized operations across the requests.

With the approach you suggested, it will not be possible to use dynamic batching.
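
For reference, one possible workaround is to batch manually inside execute: concatenate the per-request inputs, run one vectorized call, and slice the outputs back per request. This is only a sketch; the tensor names "INPUT"/"OUTPUT" and the run_model call are placeholders, not part of the Triton API.

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def execute(self, requests):
    # Gather each request's input (shape (b_i, 3) in the example above).
    batches = [pb_utils.get_input_tensor_by_name(r, "INPUT").as_numpy()
               for r in requests]
    sizes = [b.shape[0] for b in batches]

    # One vectorized call over the combined batch of shape (sum(b_i), 3).
    stacked = np.concatenate(batches, axis=0)
    outputs = run_model(stacked)  # placeholder for the actual model call

    # Slice the combined output back into one response per request.
    responses = []
    offset = 0
    for size in sizes:
        out = outputs[offset:offset + size]
        offset += size
        responses.append(pb_utils.InferenceResponse(
            output_tensors=[pb_utils.Tensor("OUTPUT", out)]))
    return responses
```

This avoids the per-request loop over the model itself, but it still relies on Triton's dynamic batcher handing several requests to execute at once.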

TsykunovDmitriy commented 2 years ago

@Tabrizian Is there any progress on this issue?

I also faced this problem. I have a Python backend for a PyTorch model. If I use dynamic batching, then several requests come into the execute function. Why aren't they stacked into one batch?

Tabrizian commented 2 years ago

@TsykunovDmitriy There isn't any update on this yet, but we have a ticket for this feature request.