triton-inference-server / openvino_backend

OpenVINO backend for Triton.
BSD 3-Clause "New" or "Revised" License

Fully-working example with dynamic batching #71

Open mbahri opened 6 months ago

mbahri commented 6 months ago

Hi

Thanks a lot for providing this backend. I have tried to use it but have had some trouble getting Triton to load and run my OpenVINO models.

I found that the backend only loads models correctly if the files are named model.bin and model.xml; in other cases it throws an exception. However, the main issue for me now is using dynamic batching.

It would be very helpful if you could provide a fully working example of how to configure dynamic batching, with values for the different parameters that need to be set.

Related question: the backend doesn't support dynamic axes and one of the parameters mentioned for dynamic batching is about padding batches. Does this mean the backend will pad batches to the max batch size for now?

dtrawins commented 6 months ago

Once the PR https://github.com/triton-inference-server/openvino_backend/pull/72 is merged, it will be possible to use models with dynamic shapes. Note that with a dynamic shape on the model input, you don't need to use dynamic batching. If you want an arbitrary batch size or image resolution, you will be able to get it with a model of shape like [-1,-1,-1,3]. If your goal is to improve throughput, you can use multiple instances with parallel execution (check the throughput mode example).
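
For illustration, a fully dynamic-shape deployment of that kind might look roughly like the sketch below. The model name, tensor names, and dims are assumptions, not taken from this thread; `max_batch_size: 0` is used so the batch dimension is carried by the model's own dynamic first axis rather than by Triton's batcher.

```
# Hypothetical config.pbtxt sketch for a model with a fully dynamic input shape.
# Names and dims are illustrative assumptions.
name: "my_openvino_model"
backend: "openvino"
max_batch_size: 0            # batch dim is part of the model's own dynamic shape
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]  # arbitrary batch size and image resolution
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
```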

mbahri commented 6 months ago

Hi, thanks for your reply. I think I might be a bit confused here, but could you explain why I don't need to use dynamic batching?

The way I thought it worked was that with dynamic batching enabled, Triton waits a predefined amount of time to group requests together in a batch, which would mean batch size could be 3, then 1, then 5, etc.

When using dynamic batching with other backends like ONNX, I've needed to set the input dimensions in the Triton config to, for example, [3, 224, 224], and have the model itself accept [-1, 3, 224, 224].
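
A minimal sketch of that kind of setup, with hypothetical names and values; the batch dimension is omitted from dims because Triton adds it when max_batch_size is set:

```
# Hypothetical config.pbtxt sketch of the ONNX-style dynamic batching setup described above.
name: "my_onnx_model"
backend: "onnxruntime"
max_batch_size: 8            # Triton prepends the batch dimension
dynamic_batching {
  max_queue_delay_microseconds: 100   # how long Triton waits to group requests
}
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]    # batch dim omitted; the model itself accepts [-1, 3, 224, 224]
  }
]
```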

Does it work differently with OpenVINO?

I've used parallel model execution in combination with dynamic batching with ONNX before and needed to tune the number of threads each model instance could use to avoid overloading the CPU. Is it done differently with OpenVINO?

dtrawins commented 6 months ago

@mbahri You could use dynamic batching, but it will not give top efficiency - it will still pad batches. You can expect better throughput by using parallel execution with a multi-instance configuration and setting the NUM_STREAMS parameter. That way you will not see CPU overloading; NUM_STREAMS handles thread management during parallel execution. To sum up, with the PR I mentioned you will be able to deploy models with shape [-1, 3, 224, 224] or [-1, 3, -1, -1]. If you want to improve throughput for parallel requests from many clients, I recommend using several instances together with a matching NUM_STREAMS value. Batch padding will probably be removed later, but a similar throughput gain is expected from parallel execution in the meantime.
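
A rough sketch of the multi-instance configuration described here; the instance count and the NUM_STREAMS value of 4 are arbitrary assumptions for illustration, not recommendations from the thread, and the input/output sections would be as in the earlier sketch:

```
# Hypothetical sketch: parallel execution with several model instances and NUM_STREAMS.
# Count and stream values below are assumptions, chosen only to show that they should match.
name: "my_openvino_model"
backend: "openvino"
max_batch_size: 0
instance_group [
  {
    count: 4                         # several model instances on CPU...
    kind: KIND_CPU
  }
]
parameters: {
  key: "NUM_STREAMS"
  value: { string_value: "4" }       # ...matched by the same number of OpenVINO streams
}
```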

mbahri commented 6 months ago

Thanks @dtrawins, so to confirm: with parallel model execution and NUM_STREAMS set, I would just use a batch size of 1 for each model instance?