open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

TensorRT inference is slower than .pth. #809

Open jiaqizhang123-stack opened 2 years ago

jiaqizhang123-stack commented 2 years ago

python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py ../mmdetection/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py tu --model /home/zhang/checkpoints/epoch_36.pth --device cuda --shape 224x224 --num-iter 100
2022-07-25 18:43:56,272 - test - INFO - [forward]-30 times per count: 14.79 ms, 67.60 FPS
2022-07-25 18:43:56,612 - test - INFO - [forward]-50 times per count: 14.80 ms, 67.57 FPS
2022-07-25 18:43:56,949 - test - INFO - [forward]-70 times per count: 14.74 ms, 67.86 FPS
2022-07-25 18:43:57,289 - test - INFO - [forward]-90 times per count: 14.73 ms, 67.89 FPS
2022-07-25 18:43:57,695 - test - INFO - [forward]-110 times per count: 15.06 ms, 66.42 FPS

python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py ../mmdetection/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py tu --model /home/zhang/checkpoints/epoch_36.pth --device cuda --shape 224x224 --num-iter 100
2022-07-26 18:42:01,970 - mmdeploy - INFO - Found totally 5 image files in tu
2022-07-25 18:42:05,314 - test - INFO - [__trt_execute]-30 times per count: 51.84 ms, 19.29 FPS
2022-07-25 18:42:06,604 - test - INFO - [__trt_execute]-50 times per count: 53.18 ms, 18.81 FPS
2022-07-25 18:42:07,890 - test - INFO - [__trt_execute]-70 times per count: 53.71 ms, 18.62 FPS
2022-07-25 18:42:09,133 - test - INFO - [__trt_execute]-90 times per count: 53.43 ms, 18.72 FPS
2022-07-25 18:42:10,345 - test - INFO - [__trt_execute]-110 times per count: 53.27 ms, 18.77 FPS

As the logs show, the TensorRT engine is much slower than the PyTorch model. Can you see why?

RunningLeon commented 2 years ago

@jiaqizhang123-stack Hi, your TRT model input is supposed to be within 320x320-1344x1344, but your testing shape is 224x224. Isn't that a bit strange? Testing should fail in this case. BTW, could you post your env info by running python tools/check_env.py?
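For reference, the shape range in the filename comes from the deploy config's TensorRT backend section. A sketch of what instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py roughly contains (exact values may differ in your mmdeploy version; this is from memory, not copied from the repo):

# Sketch of the TensorRT backend config behind the 320x320-1344x1344 range.
# Field values are assumptions based on typical mmdeploy 0.x configs.
backend_config = dict(
    type='tensorrt',
    common_config=dict(fp16_mode=True, max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 320, 320],      # smallest allowed input
                    opt_shape=[1, 3, 800, 1344],     # shape TensorRT optimizes for
                    max_shape=[1, 3, 1344, 1344])))  # largest allowed input
    ])

A 224x224 input falls below min_shape, which is why the engine would normally reject it.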

jiaqizhang123-stack commented 2 years ago

(mmdeploy) zhang@zhang-QiTianM540-A739:~/mmdeploy$ python tools/check_env.py
2022-07-26 10:44:28,349 - mmdeploy - INFO - 

2022-07-26 10:44:28,349 - mmdeploy - INFO - **********Environmental information**********
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this: 'git [...] -- [...]'
2022-07-26 10:44:29,203 - mmdeploy - INFO - sys.platform: linux
2022-07-26 10:44:29,203 - mmdeploy - INFO - Python: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
2022-07-26 10:44:29,203 - mmdeploy - INFO - CUDA available: True
2022-07-26 10:44:29,203 - mmdeploy - INFO - GPU 0: NVIDIA GeForce GTX 1050 Ti
2022-07-26 10:44:29,203 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda-10.2
2022-07-26 10:44:29,203 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 10.2, V10.2.89
2022-07-26 10:44:29,203 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-07-26 10:44:29,203 - mmdeploy - INFO - PyTorch: 1.8.0
2022-07-26 10:44:29,204 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
2022-07-26 10:44:29,204 - mmdeploy - INFO - TorchVision: 0.9.0
2022-07-26 10:44:29,204 - mmdeploy - INFO - OpenCV: 4.5.3
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV: 1.4.0
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV CUDA Compiler: 10.2
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMDeploy: 0.5.0+HEAD
2022-07-26 10:44:29,204 - mmdeploy - INFO - 

2022-07-26 10:44:29,204 - mmdeploy - INFO - **********Backend information**********
2022-07-26 10:44:29,537 - mmdeploy - INFO - onnxruntime: 1.10.0 ops_is_avaliable : True
2022-07-26 10:44:29,551 - mmdeploy - INFO - tensorrt: 8.2.3.0 ops_is_avaliable : True
2022-07-26 10:44:29,593 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-07-26 10:44:29,602 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-07-26 10:44:29,611 - mmdeploy - INFO - openvino_is_avaliable: False
2022-07-26 10:44:29,611 - mmdeploy - INFO - 

2022-07-26 10:44:29,611 - mmdeploy - INFO - **********Codebase information**********
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmdet: 2.25.0
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmseg: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmcls: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmocr: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmedit: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmdet3d: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmpose: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmrotate: None

Because the "test_pipeline" is this, so it shouldn't be affected. When the shape is fixed, the time is consistent. test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(512, 192), flip=False, transforms=[ dict(type='Resize', keep_ratio=False), dict(type='RandomFlip'), dict(type='Normalize', **img_norm_cfg), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img']), ]) ]

RunningLeon commented 2 years ago

@jiaqizhang123-stack Hi, acceleration varies across NVIDIA cards. Test results on my side look OK.

env

2022-07-26 11:52:41,917 - mmdeploy - INFO - **********Environmental information**********
2022-07-26 11:52:42,901 - mmdeploy - INFO - sys.platform: linux
2022-07-26 11:52:42,901 - mmdeploy - INFO - Python: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
2022-07-26 11:52:42,901 - mmdeploy - INFO - CUDA available: True
2022-07-26 11:52:42,901 - mmdeploy - INFO - GPU 0: NVIDIA GeForce RTX 2080
2022-07-26 11:52:42,901 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-07-26 11:52:42,901 - mmdeploy - INFO - NVCC: Build cuda_11.1.TC455_06.29069683_0
2022-07-26 11:52:42,901 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - PyTorch: 1.8.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with
2022-07-26 11:52:42,901 - mmdeploy - INFO - TorchVision: 0.9.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - OpenCV: 4.5.2
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV: 1.4.8
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV CUDA Compiler: 11.1
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMDeploy: 0.6.0+72776dd
2022-07-26 11:52:42,902 - mmdeploy - INFO - 

2022-07-26 11:52:42,902 - mmdeploy - INFO - **********Backend information**********
2022-07-26 11:52:43,396 - mmdeploy - INFO - onnxruntime: 1.8.0  ops_is_avaliable : True
2022-07-26 11:52:43,415 - mmdeploy - INFO - tensorrt: 8.2.1.8   ops_is_avaliable : True
2022-07-26 11:52:43,432 - mmdeploy - INFO - ncnn: 1.0.20220722  ops_is_avaliable : True
2022-07-26 11:52:43,480 - mmdeploy - INFO - pplnn_is_avaliable: True
2022-07-26 11:52:43,493 - mmdeploy - INFO - openvino_is_avaliable: True
2022-07-26 11:52:43,494 - mmdeploy - INFO - 

2022-07-26 11:52:43,494 - mmdeploy - INFO - **********Codebase information**********
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmdet:  2.25.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmseg:  0.26.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmcls:  0.23.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmocr:  None
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmedit: 0.12.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmdet3d:    1.0.0rc3
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmpose: 0.26.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmrotate:   0.3.2

320x320

pytorch

+--------+------------+--------+
| Stats  | Latency/ms | FPS    |
+--------+------------+--------+
| Mean   | 29.395     | 34.019 |
| Median | 28.093     | 35.596 |
| Min    | 26.819     | 37.287 |
| Max    | 43.026     | 23.242 |
+--------+------------+--------+

tensorrt

+--------+------------+---------+
| Stats  | Latency/ms | FPS     |
+--------+------------+---------+
| Mean   | 9.421      | 106.146 |
| Median | 9.318      | 107.318 |
| Min    | 9.296      | 107.576 |
| Max    | 11.790     | 84.814  |
+--------+------------+---------+

1344x1344

pytorch

+--------+------------+-------+
| Stats  | Latency/ms | FPS   |
+--------+------------+-------+
| Mean   | 160.257    | 6.240 |
| Median | 159.477    | 6.271 |
| Min    | 158.115    | 6.324 |
| Max    | 168.718    | 5.927 |
+--------+------------+-------+

tensorrt

+--------+------------+--------+
| Stats  | Latency/ms | FPS    |
+--------+------------+--------+
| Mean   | 27.722     | 36.073 |
| Median | 27.570     | 36.271 |
| Min    | 27.300     | 36.629 |
| Max    | 30.620     | 32.659 |
+--------+------------+--------+

jiaqizhang123-stack commented 2 years ago

Are you testing Faster R-CNN? For Faster R-CNN I do see the latency decrease here, but for the Mask R-CNN above, the engine latency increases.

pth:
2022-07-26 12:44:51,862 - test - INFO - [forward]-30 times per count: 99.25 ms, 10.08 FPS
2022-07-26 12:44:53,952 - test - INFO - [forward]-50 times per count: 99.33 ms, 10.07 FPS
2022-07-26 12:44:56,044 - test - INFO - [forward]-70 times per count: 99.39 ms, 10.06 FPS
2022-07-26 12:44:58,137 - test - INFO - [forward]-90 times per count: 99.43 ms, 10.06 FPS
2022-07-26 12:45:00,258 - test - INFO - [forward]-110 times per count: 99.61 ms, 10.04 FPS

engine:
2022-07-26 12:43:59,524 - test - INFO - [__trt_execute]-30 times per count: 66.44 ms, 15.05 FPS
2022-07-26 12:44:00,965 - test - INFO - [__trt_execute]-50 times per count: 66.42 ms, 15.05 FPS
2022-07-26 12:44:02,405 - test - INFO - [__trt_execute]-70 times per count: 66.47 ms, 15.05 FPS
2022-07-26 12:44:03,846 - test - INFO - [__trt_execute]-90 times per count: 66.48 ms, 15.04 FPS
2022-07-26 12:44:05,289 - test - INFO - [__trt_execute]-110 times per count: 66.49 ms, 15.04 FPS

RunningLeon commented 2 years ago

Tested on Mask R-CNN, following exactly your settings.


jiaqizhang123-stack commented 2 years ago

I used the same settings, but the engine speed I get with Mask R-CNN is still slow. Even without modifying the Mask R-CNN config, the test speed in my environment is slow.

jiaqizhang123-stack commented 2 years ago

Hello, I would like to know why the Mask R-CNN engine is slower on my graphics card while Faster R-CNN is unaffected. Is it because Mask R-CNN has mask predictions? Which settings affect the speed? Thanks a lot for your answer.

RunningLeon commented 2 years ago

Maybe because of this case: if there is no bbox in some images, PyTorch skips running the mask-head part. But in TensorRT, because we padded a dummy bbox per here while exporting to ONNX, the mask-head part always runs once.
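A minimal sketch of that control-flow difference (illustrative only; the mask_head callable and shapes are placeholders, not the actual mmdet/mmdeploy code):

import torch

def eager_mask_branch(det_bboxes: torch.Tensor, mask_head) -> torch.Tensor:
    # Eager PyTorch can branch on data: with zero detections the mask head
    # is skipped entirely, so such images cost almost nothing here.
    if det_bboxes.shape[0] == 0:
        return det_bboxes.new_zeros((0, 28, 28))
    return mask_head(det_bboxes)

def exported_mask_branch(det_bboxes: torch.Tensor, mask_head) -> torch.Tensor:
    # The traced/ONNX graph has no data-dependent `if`: a dummy bbox is padded
    # in during export, so the mask head runs unconditionally every time.
    padded = torch.cat([det_bboxes, det_bboxes.new_zeros((1, det_bboxes.shape[-1]))])
    return mask_head(padded)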

jiaqizhang123-stack commented 2 years ago

Thank you very much. I also want to ask how to skip this step in ONNX too, because we want to reduce the engine time so it can be used in our project.

RunningLeon commented 2 years ago

Then you may have to cut the ONNX model into two parts and create two TensorRT engines.

jiaqizhang123-stack commented 2 years ago

Sorry, I'm a little unclear. Why not modify the original code to remove the dummy input when converting to ONNX? And how do I split the ONNX into two parts? Where do I start? Can you be more specific?

RunningLeon commented 2 years ago

Because TensorRT does not support the If op, and the padding makes sure the mask-head part does not fail even when there are no valid bboxes from NMS. If you want to cut the ONNX into parts, you could refer to this doc: https://mmdeploy.readthedocs.io/en/latest/06-developer-guide/partition_model.html
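Besides mmdeploy's partition mechanism, a generic way to split an exported graph is onnx.utils.extract_model. A hedged sketch follows; every tensor name here ('input', 'dets', 'labels', 'roi_feats', 'masks') is hypothetical and must be replaced with the real names from your own graph (inspect it with Netron, for example):

import onnx

# Split end2end.onnx at the tensors between the bbox branch and the mask head.
# All tensor names below are hypothetical; look up the real ones in your model.
onnx.utils.extract_model(
    'end2end.onnx', 'part1_detector.onnx',
    input_names=['input'],
    output_names=['dets', 'labels', 'roi_feats'])

onnx.utils.extract_model(
    'end2end.onnx', 'part2_maskhead.onnx',
    input_names=['roi_feats'],
    output_names=['masks'])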

jiaqizhang123-stack commented 2 years ago

OK, thank you. Why does this case speed up on your configuration but not on my graphics card? Is it because of my graphics card's performance?

If the ONNX is split into two parts, is the part below the one that gets separated from the rest?

mask_roi_extractor=dict(
    type='SingleRoIExtractor',
    roi_layer=dict(type='RoIAlign', output_size=28, sampling_ratio=0),
    out_channels=128,
    featmap_strides=[4, 8, 16, 32]),
mask_head=dict(
    type='FCNMaskHead',
    num_convs=4,
    in_channels=128,
    conv_out_channels=128,
    roi_feat_size=28,
    upsample_cfg=dict(type='bilinear', scale_factor=2),
    num_classes=80,
    loss_mask=dict(
        type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))

And how do I get the final result after creating the two engines?

RunningLeon commented 2 years ago

Maybe you could compare the two cards on NVIDIA's website.
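As for combining two engines, the glue logic could look like the hedged sketch below; the detector and mask_head callables stand in for your own TensorRT execution wrappers (hypothetical names, not an mmdeploy API):

from typing import Callable, Optional, Tuple
import torch

def two_stage_infer(
    image: torch.Tensor,
    detector: Callable,   # engine 1: image -> (dets [N, 5], labels [N], roi_feats)
    mask_head: Callable,  # engine 2: roi_feats -> masks
) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
    dets, labels, roi_feats = detector(image)
    keep = dets[:, 4] > 0  # padded dummy boxes have score 0
    if not bool(keep.any()):
        # No real detections: skip the mask engine entirely, like eager PyTorch.
        return dets[keep], labels[keep], None
    return dets[keep], labels[keep], mask_head(roi_feats[keep])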

jiaqizhang123-stack commented 2 years ago

Hello, I think I found the reason for the slow engine speed.

Through analysis, it seems that in PyTorch, although max_per_img for NMS is 90, only 2 boxes remain after NMS, so masks are predicted for only two detection boxes.

However, with the NMS rewritten for the engine, there are still 90 boxes after NMS. Is this to guarantee a fixed output size?

When testing the same image with the engine, the first two boxes were correct and the remaining 88 were all zeros. The engine spends time on these padded boxes, which is why it is slow.

In this case, can the time be reduced by modifying max_per_img?

tensor([[[  0.0000,  18.6011,  72.7970, 108.3448,   1.0000],
         [ 65.6038,  16.0514, 147.9273, 109.4419,   1.0000],
         [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000],
         ...
         [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000]]], device='cuda:0')
(88 all-zero padded rows in total)

test_cfg=dict(
    rpn=dict(
        nms_pre=200,
        max_per_img=200,
        nms=dict(type='nms', iou_threshold=0.7),
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.1,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=90,
        mask_thr_binary=0.5))

RunningLeon commented 2 years ago

@jiaqizhang123-stack Hi, for TensorRT the output shape must be predetermined, so the outputs are padded to a fixed size for the batchedNMS plugin.
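Since the padded output size tracks rcnn.max_per_img, lowering it before conversion should shrink the fixed-size output and the mask-head work with it. A hedged sketch of the config change (the value 20 is an assumption, pick whatever your dataset needs, and the model must be re-exported to ONNX/TensorRT afterwards):

# Hedged sketch: smaller rcnn.max_per_img means fewer padded boxes in the engine.
test_cfg = dict(
    rpn=dict(
        nms_pre=200,
        max_per_img=200,
        nms=dict(type='nms', iou_threshold=0.7),
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.1,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=20,  # was 90; assumption: 20 detections suffice here
        mask_thr_binary=0.5))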