Hi;
There is a glitch in the timing code: it sets
torch.backends.cudnn.benchmark=True, which runs an internal autotuner on the GPU to find the fastest algorithm for the given input shape. However, since the input shape changes slightly from image to image, it keeps re-tuning for every unseen shape, and this tuning dominates the inference time of the main model.
This is also mentioned in https://discuss.pytorch.org/t/model-inference-very-slow-when-batch-size-changes-for-the-first-time/44911
and I have seen it in the NVIDIA Visual Profiler as well.
If you try it with torch.backends.cudnn.benchmark=False, the code runs faster. Alternatively,
if you would like to keep torch.backends.cudnn.benchmark=True, the inputs should be cropped to a fixed size to get a better feeling for how fast the model runs, because for a given size the autotuner runs once and caches the result internally. Or sweep the dataset twice, so that every image size has already been seen, and use the second sweep for the timing results.
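The two-sweep idea can be illustrated with a toy model of the per-shape autotuner cache. This is only a sketch of the caching behavior described above, not PyTorch's or cuDNN's actual implementation; the class name, costs, and shapes are made up for illustration:

```python
import time

# Toy stand-in for cuDNN benchmark mode (illustrative only, not the real
# implementation): the chosen algorithm is cached per input shape, so an
# unseen shape pays a tuning cost while a previously seen shape does not.
class ToyAutotuner:
    def __init__(self, tuning_cost=0.01, run_cost=0.001):
        self.cache = {}                 # shape -> "best algorithm"
        self.tuning_cost = tuning_cost  # cost of tuning a new shape
        self.run_cost = run_cost        # cost of one actual inference

    def infer(self, shape):
        if shape not in self.cache:
            time.sleep(self.tuning_cost)   # expensive tuning for a new shape
            self.cache[shape] = "algo"
        time.sleep(self.run_cost)          # the inference itself
        return self.cache[shape]

def timed_sweep(tuner, shapes):
    start = time.perf_counter()
    for s in shapes:
        tuner.infer(s)
    return time.perf_counter() - start

# Slightly varying image sizes, as in the dataset described above.
shapes = [(3, 224, 224 + i) for i in range(20)]
tuner = ToyAutotuner()
first = timed_sweep(tuner, shapes)   # every shape is new: tuning dominates
second = timed_sweep(tuner, shapes)  # all shapes cached: true inference time
print(f"first sweep:  {first:.3f}s")
print(f"second sweep: {second:.3f}s")
```

The second sweep hits only cached shapes, so it reflects the model's real inference speed, which is why timing should use the second pass over the dataset.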
Thanks