@jiaqizhang123-stack Hi, your TRT model input is supposed to be 320x320-1344x1344, but your testing shape is 224x224. Isn't that a bit strange? The test should fail in this case. BTW, could you post your env info by running python tools/check_env.py?
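For context, the 320x320-1344x1344 range comes from the TensorRT optimization profile defined in the deploy config. A minimal sketch of what that section typically looks like in an mmdeploy deploy config (the opt_shape value below is an assumption; check the actual instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py file):

```python
# Sketch of the TensorRT backend config defining the dynamic input range.
# Inputs outside [min_shape, max_shape] are rejected by the engine, which is
# why a 224x224 test shape would be expected to fail.
backend_config = dict(
    type='tensorrt',
    common_config=dict(fp16_mode=True),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 320, 320],   # smallest accepted input
                    opt_shape=[1, 3, 800, 1344],  # shape TensorRT tunes for (assumed)
                    max_shape=[1, 3, 1344, 1344]  # largest accepted input
                )))])
```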
(mmdeploy) zhang@zhang-QiTianM540-A739:~/mmdeploy$ python tools/check_env.py
2022-07-26 10:44:28,349 - mmdeploy - INFO -
2022-07-26 10:44:28,349 - mmdeploy - INFO - **Environmental information**
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
2022-07-26 10:44:29,204 - mmdeploy - INFO - TorchVision: 0.9.0
2022-07-26 10:44:29,204 - mmdeploy - INFO - OpenCV: 4.5.3
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV: 1.4.0
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMCV CUDA Compiler: 10.2
2022-07-26 10:44:29,204 - mmdeploy - INFO - MMDeploy: 0.5.0+HEAD
2022-07-26 10:44:29,204 - mmdeploy - INFO -
2022-07-26 10:44:29,204 - mmdeploy - INFO - **Backend information**
2022-07-26 10:44:29,537 - mmdeploy - INFO - onnxruntime: 1.10.0 ops_is_avaliable : True
2022-07-26 10:44:29,551 - mmdeploy - INFO - tensorrt: 8.2.3.0 ops_is_avaliable : True
2022-07-26 10:44:29,593 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-07-26 10:44:29,602 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-07-26 10:44:29,611 - mmdeploy - INFO - openvino_is_avaliable: False
2022-07-26 10:44:29,611 - mmdeploy - INFO -
2022-07-26 10:44:29,611 - mmdeploy - INFO - **Codebase information**
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmdet: 2.25.0
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmseg: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmcls: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmocr: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmedit: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmdet3d: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmpose: None
2022-07-26 10:44:29,612 - mmdeploy - INFO - mmrotate: None
Because the "test_pipeline" is this, so it shouldn't be affected. When the shape is fixed, the time is consistent. test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(512, 192), flip=False, transforms=[ dict(type='Resize', keep_ratio=False), dict(type='RandomFlip'), dict(type='Normalize', **img_norm_cfg), dict(type='Pad', size_divisor=32), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img']), ]) ]
@jiaqizhang123-stack Hi, acceleration performance differs across NVIDIA cards. The test results on my side seemed OK.
2022-07-26 11:52:41,917 - mmdeploy - INFO - **********Environmental information**********
2022-07-26 11:52:42,901 - mmdeploy - INFO - sys.platform: linux
2022-07-26 11:52:42,901 - mmdeploy - INFO - Python: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
2022-07-26 11:52:42,901 - mmdeploy - INFO - CUDA available: True
2022-07-26 11:52:42,901 - mmdeploy - INFO - GPU 0: NVIDIA GeForce RTX 2080
2022-07-26 11:52:42,901 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-07-26 11:52:42,901 - mmdeploy - INFO - NVCC: Build cuda_11.1.TC455_06.29069683_0
2022-07-26 11:52:42,901 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - PyTorch: 1.8.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with
2022-07-26 11:52:42,901 - mmdeploy - INFO - TorchVision: 0.9.0
2022-07-26 11:52:42,901 - mmdeploy - INFO - OpenCV: 4.5.2
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV: 1.4.8
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMCV CUDA Compiler: 11.1
2022-07-26 11:52:42,901 - mmdeploy - INFO - MMDeploy: 0.6.0+72776dd
2022-07-26 11:52:42,902 - mmdeploy - INFO -
2022-07-26 11:52:42,902 - mmdeploy - INFO - **********Backend information**********
2022-07-26 11:52:43,396 - mmdeploy - INFO - onnxruntime: 1.8.0 ops_is_avaliable : True
2022-07-26 11:52:43,415 - mmdeploy - INFO - tensorrt: 8.2.1.8 ops_is_avaliable : True
2022-07-26 11:52:43,432 - mmdeploy - INFO - ncnn: 1.0.20220722 ops_is_avaliable : True
2022-07-26 11:52:43,480 - mmdeploy - INFO - pplnn_is_avaliable: True
2022-07-26 11:52:43,493 - mmdeploy - INFO - openvino_is_avaliable: True
2022-07-26 11:52:43,494 - mmdeploy - INFO -
2022-07-26 11:52:43,494 - mmdeploy - INFO - **********Codebase information**********
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmdet: 2.25.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmseg: 0.26.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmcls: 0.23.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmocr: None
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmedit: 0.12.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmdet3d: 1.0.0rc3
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmpose: 0.26.0
2022-07-26 11:52:45,623 - mmdeploy - INFO - mmrotate: 0.3.2
+--------+------------+--------+
| Stats  | Latency/ms | FPS    |
+--------+------------+--------+
| Mean   | 29.395     | 34.019 |
| Median | 28.093     | 35.596 |
| Min    | 26.819     | 37.287 |
| Max    | 43.026     | 23.242 |
+--------+------------+--------+
+--------+------------+---------+
| Stats  | Latency/ms | FPS     |
+--------+------------+---------+
| Mean   | 9.421      | 106.146 |
| Median | 9.318      | 107.318 |
| Min    | 9.296      | 107.576 |
| Max    | 11.790     | 84.814  |
+--------+------------+---------+
+--------+------------+-------+
| Stats  | Latency/ms | FPS   |
+--------+------------+-------+
| Mean   | 160.257    | 6.240 |
| Median | 159.477    | 6.271 |
| Min    | 158.115    | 6.324 |
| Max    | 168.718    | 5.927 |
+--------+------------+-------+
+--------+------------+--------+
| Stats  | Latency/ms | FPS    |
+--------+------------+--------+
| Mean   | 27.722     | 36.073 |
| Median | 27.570     | 36.271 |
| Min    | 27.300     | 36.629 |
| Max    | 30.620     | 32.659 |
+--------+------------+--------+
Are you testing Faster R-CNN? For Faster R-CNN, I get a decrease here, but for the Mask R-CNN above, the engine speed increases.

pth:
2022-07-26 12:44:51,862 - test - INFO - [forward]-30 times per count: 99.25 ms, 10.08 FPS
2022-07-26 12:44:53,952 - test - INFO - [forward]-50 times per count: 99.33 ms, 10.07 FPS
2022-07-26 12:44:56,044 - test - INFO - [forward]-70 times per count: 99.39 ms, 10.06 FPS
2022-07-26 12:44:58,137 - test - INFO - [forward]-90 times per count: 99.43 ms, 10.06 FPS
2022-07-26 12:45:00,258 - test - INFO - [forward]-110 times per count: 99.61 ms, 10.04 FPS

engine:
2022-07-26 12:43:59,524 - test - INFO - [trt_execute]-30 times per count: 66.44 ms, 15.05 FPS
2022-07-26 12:44:00,965 - test - INFO - [__trt_execute]-50 times per count: 66.42 ms, 15.05 FPS
2022-07-26 12:44:02,405 - test - INFO - [trt_execute]-70 times per count: 66.47 ms, 15.05 FPS
2022-07-26 12:44:03,846 - test - INFO - [__trt_execute]-90 times per count: 66.48 ms, 15.04 FPS
2022-07-26 12:44:05,289 - test - INFO - [__trt_execute]-110 times per count: 66.49 ms, 15.04 FPS
Tested on Mask R-CNN, just following your settings.
I used the same settings, but the engine speed I get with Mask R-CNN is still slow. Even when I don't modify the Mask R-CNN config, the test speed in my environment is also slow.
Hello, I would like to know why the Mask R-CNN engine slows down on my graphics card while Faster R-CNN is unaffected. Is it because Mask R-CNN has mask predictions, and which settings affect the speed? Thanks a lot for your answer.
Maybe it is because of this case: if there is no bbox in some images, PyTorch would skip running the mask-head part. But in TensorRT, because we padded a dummy bbox here while exporting to ONNX, the mask-head part would always run once.
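A minimal sketch of the difference (illustrative pseudologic under stated assumptions, not the actual mmdet/mmdeploy source; mask_head and feats stand in for the real modules and features):

```python
import torch

# PyTorch eager mode: the mask test step can early-return when NMS leaves no
# boxes, so the mask head is skipped entirely.
def mask_branch_pytorch(mask_head, feats, det_bboxes):
    if det_bboxes.shape[0] == 0:   # data-dependent branch: fine in eager mode
        return det_bboxes.new_empty((0, 28, 28))
    return mask_head(feats, det_bboxes)

# ONNX/TensorRT export: data-dependent control flow cannot be traced, so a
# dummy box is padded in to keep shapes valid; the mask head therefore runs
# even when there are no real detections.
def mask_branch_traced(mask_head, feats, det_bboxes):
    dummy = det_bboxes.new_zeros((1, det_bboxes.shape[1]))
    det_bboxes = torch.cat([det_bboxes, dummy], dim=0)  # always >= 1 box
    return mask_head(feats, det_bboxes)
```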
Thank you very much. I also want to ask how to skip this step in ONNX as well, because we want to reduce the engine time so that it can be used in the project.
Then maybe you would have to cut the ONNX model into two parts and create two TensorRT engines.
Sorry, I'm a little unclear. Why not modify the original code to remove the dummy input when converting to ONNX? And how do I split the ONNX model into two parts? Where do I start? Can you be more specific?
Because TensorRT does not support the IF op, and the padding makes sure the mask-head part does not fail even when there are no valid bboxes from NMS.
If you want to cut the ONNX model into parts, you could refer to this doc: https://mmdeploy.readthedocs.io/en/latest/06-developer-guide/partition_model.html
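As a rough illustration of the manual route (independent of mmdeploy's partition configs), ONNX itself can extract subgraphs. The tensor names below are hypothetical placeholders; inspect your exported graph (e.g. with Netron) for the real ones:

```python
import onnx.utils

# Split the exported model at the boundary between the detection part and the
# mask head. 'dets', 'labels', 'roi_feats', and 'masks' are assumed names.
onnx.utils.extract_model(
    'end2end.onnx', 'part1_detector.onnx',
    input_names=['input'],
    output_names=['dets', 'labels', 'roi_feats'])

onnx.utils.extract_model(
    'end2end.onnx', 'part2_mask_head.onnx',
    input_names=['roi_feats', 'dets'],
    output_names=['masks'])
```

Each part can then be converted to its own TensorRT engine, and at runtime the second engine only needs to run when the first one returns valid boxes.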
OK, thank you. Why does this case speed up with your configuration but not on my graphics card? Is it because of my graphics card's performance?
If the ONNX model is split into two parts, is the following part (the mask branch) what gets separated from the previous part?
```python
mask_roi_extractor=dict(
    type='SingleRoIExtractor',
    roi_layer=dict(type='RoIAlign', output_size=28, sampling_ratio=0),
    out_channels=128,
    featmap_strides=[4, 8, 16, 32]),
mask_head=dict(
    type='FCNMaskHead',
    num_convs=4,
    in_channels=128,
    conv_out_channels=128,
    roi_feat_size=28,
    upsample_cfg=dict(type='bilinear', scale_factor=2),
    num_classes=80,
    loss_mask=dict(
        type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))
```
How do I get the final result after obtaining the two engines?
Maybe you could compare the two cards on the NVIDIA website.
Hello, I think I found the reason for the slow engine speed.
Through analysis, it seems that in PyTorch, although the maximum number of detections after NMS is 90, only 2 boxes are actually left after NMS, so masks are predicted for only two detection boxes.
However, for the NMS rewritten for the engine, there are still 90 boxes after NMS. Is this to ensure the same output size?
When the engine was used to test the same image, I found that two boxes were correct and the remaining 88 were all zeros. The engine spends time here, so it is slow.
In this case, can the time be reduced by modifying max_per_img?

```
[tensor([[[  0.0000,  18.6011,  72.7970, 108.3448,   1.0000],
          [ 65.6038,  16.0514, 147.9273, 109.4419,   1.0000],
          [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000],
          ...
          [  0.0000,   0.0000,   0.0000,   0.0000,   0.0000]]],
        device='cuda:0'),
```
(the remaining 88 rows are all zeros)
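For what it's worth, a minimal sketch of stripping the zero-padded rows on the host side after inference (a hypothetical post-processing helper; it trims the output for downstream use, though it does not by itself recover the time spent running the mask head on padded boxes):

```python
import numpy as np

def strip_padding(dets: np.ndarray, masks: np.ndarray, score_thr: float = 0.0):
    """Drop the all-zero rows that the fixed-size TensorRT output pads in.

    dets: (N, 5) array of [x1, y1, x2, y2, score]; padded rows are all zeros.
    masks: (N, H, W) masks aligned with dets.
    """
    keep = dets[:, 4] > score_thr  # padded rows have score 0
    return dets[keep], masks[keep]
```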
```python
test_cfg=dict(
    rpn=dict(
        nms_pre=200,
        max_per_img=200,
        nms=dict(type='nms', iou_threshold=0.7),
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.1,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=90,
        mask_thr_binary=0.5)))
```
@jiaqizhang123-stack Hi, for TensorRT the output shape is predetermined, so the outputs are padded to a fixed size for the batchedNMS plugin.
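If the padded output is the bottleneck, one knob to try (an assumption to verify by re-converting and re-profiling, since it also caps how many objects can be detected per image) is lowering max_per_img in the rcnn test_cfg before export, so the fixed-size NMS output and the mask-head batch both shrink:

```python
# Sketch: shrink the fixed-size NMS output before converting to ONNX/TensorRT.
# With max_per_img=10 the engine pads to 10 boxes instead of 90, so the mask
# head processes far fewer padded RoIs per image.
test_cfg = dict(
    rpn=dict(
        nms_pre=200,
        max_per_img=200,
        nms=dict(type='nms', iou_threshold=0.7),
        min_bbox_size=0),
    rcnn=dict(
        score_thr=0.1,
        nms=dict(type='nms', iou_threshold=0.5),
        max_per_img=10,  # was 90; caps detections per image
        mask_thr_binary=0.5))
```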
python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py ../mmdetection/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py tu --model /home/zhang/checkpoints/epoch_36.pth --device cuda --shape 224x224 --num-iter 100
2022-07-25 18:43:56,272 - test - INFO - [forward]-30 times per count: 14.79 ms, 67.60 FPS
2022-07-25 18:43:56,612 - test - INFO - [forward]-50 times per count: 14.80 ms, 67.57 FPS
2022-07-25 18:43:56,949 - test - INFO - [forward]-70 times per count: 14.74 ms, 67.86 FPS
2022-07-25 18:43:57,289 - test - INFO - [forward]-90 times per count: 14.73 ms, 67.89 FPS
2022-07-25 18:43:57,695 - test - INFO - [forward]-110 times per count: 15.06 ms, 66.42 FPS
python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py ../mmdetection/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py tu --model /home/zhang/checkpoints/epoch_36.pth --device cuda --shape 224x224 --num-iter 100
2022-07-25 18:42:01,970 - mmdeploy - INFO - Found totally 5 image files in tu
2022-07-25 18:42:05,314 - test - INFO - [trt_execute]-30 times per count: 51.84 ms, 19.29 FPS
2022-07-25 18:42:06,604 - test - INFO - [__trt_execute]-50 times per count: 53.18 ms, 18.81 FPS
2022-07-25 18:42:07,890 - test - INFO - [trt_execute]-70 times per count: 53.71 ms, 18.62 FPS
2022-07-25 18:42:09,133 - test - INFO - [__trt_execute]-90 times per count: 53.43 ms, 18.72 FPS
2022-07-25 18:42:10,345 - test - INFO - [__trt_execute]-110 times per count: 53.27 ms, 18.77 FPS
As can be seen, TensorRT is very slow. Can you see why?