isarsoft / yolov4-triton-tensorrt

This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server
http://www.isarsoft.com

Dynamic batching inference time #58

Closed MoaazAbdulrahman closed 2 years ago

MoaazAbdulrahman commented 2 years ago

Thank you for your effort building this repo. I am facing an issue with inference time when I run the model with a batch size larger than 1. When I set the batch size to 4 and pass 4 images to the model, it takes about 200 ms. However, when I set the batch size to 4 and pass only 1 image to the model, it still takes about 195 ms.

I care about inference time and want to use batching dynamically at run time by passing different batch sizes while keeping the inference time to a minimum.

Is it possible?
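For context, this is roughly what I mean by passing different batch sizes at run time (a minimal sketch; the tensor names "input" / "detections", the 608x608 input size, and the gRPC port are assumptions, not necessarily this repo's exact values):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed names/shapes -- adjust to your model's config.pbtxt
URL = "localhost:8001"
MODEL = "yolov4"
INPUT_NAME = "input"          # assumed input tensor name
OUTPUT_NAME = "detections"    # assumed output tensor name
C, H, W = 3, 608, 608         # assumed network input size

client = grpcclient.InferenceServerClient(url=URL)

def infer(batch):
    """Send one request with whatever batch size `batch` has."""
    inp = grpcclient.InferInput(INPUT_NAME, batch.shape, "FP32")
    inp.set_data_from_numpy(batch)
    out = grpcclient.InferRequestedOutput(OUTPUT_NAME)
    result = client.infer(model_name=MODEL, inputs=[inp], outputs=[out])
    return result.as_numpy(OUTPUT_NAME)

# Batch size varies per call; the model config must allow it (max_batch_size >= 4)
single = np.random.rand(1, C, H, W).astype(np.float32)
four = np.random.rand(4, C, H, W).astype(np.float32)
print(infer(single).shape)
print(infer(four).shape)
```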

philipp-schmidt commented 2 years ago

We all care about the inference time, that's why we made this repo ;)

You can have a look at dynamic batching in the Triton documentation; it's easy to set up.

You can tell Triton which batch sizes it should prefer to build dynamically and how long it should wait for multiple requests to be combined into a batch.
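For example, a minimal config.pbtxt sketch with dynamic batching enabled (the model name and the specific values are placeholders, not this repo's exact config):

```
name: "yolov4"
platform: "tensorrt_plan"
max_batch_size: 8

dynamic_batching {
  # Batch sizes Triton should try to build from queued requests
  preferred_batch_size: [ 4, 8 ]
  # How long to hold a request while waiting for others to batch with it
  max_queue_delay_microseconds: 100
}
```

With this, each client can keep sending individual requests and Triton groups requests that arrive within the queue delay into one batch on the server side, so you get the throughput of batched inference without building batches in the client.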