open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Mask R-CNN speed up #379

Closed IvDmNe closed 2 years ago

IvDmNe commented 2 years ago

I measured inference time for Mask R-CNN in MMDetection and in TensorRT after deployment. Using the TensorRT Python API with the obtained .engine file, inference time dropped from 0.16 s to 0.13 s. Is it normal that the speedup is so small? I use a GTX 1050 Ti and deployed with FP32 at 640x480 resolution. One more question: the deployed model outputs each mask as a 28x28 array. I resize it to fill the object's bounding box, but the mask is still coarse. In MMDetection the output masks already have the shape of the input image and are much more precise than after deployment. Am I doing something wrong?

tehkillerbee commented 2 years ago

@IvDmNe For me, TensorRT was approx. 3-6x faster than PyTorch, with much lower memory usage.

How do you run inference with your .engine file?

a227799770055 commented 2 years ago

@tehkillerbee Can you share how you run inference with your .engine file? I hit the same problem as @IvDmNe. My script is just the single_gpu_test function from script/test.py, modified as in the code below to measure time.

```python
import time

import mmcv
import torch


def single_gpu_test(model, data_loader, show=False, out_dir=None,
                    show_score_thr=0.3):
    model.eval()
    results = []
    dataset = data_loader.dataset
    PALETTE = getattr(dataset, 'PALETTE', None)
    prog_bar = mmcv.ProgressBar(len(dataset))
    for i, data in enumerate(data_loader):
        # run each sample 10 times to smooth out timing noise
        for _ in range(10):
            s = time.time()
            with torch.no_grad():
                result = model(return_loss=False, rescale=True, **data)
            e = time.time()
            print(e - s, 's')
```
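Note that CUDA kernels launch asynchronously, so time.time() around the forward pass can end up measuring kernel launch rather than execution. A more reliable pattern (a minimal sketch, assuming a CUDA device) synchronizes before reading the clock:

```python
import time

import torch


def timed_forward(model, data):
    """Time one forward pass, including the GPU work."""
    torch.cuda.synchronize()  # flush any pending kernels first
    start = time.time()
    with torch.no_grad():
        result = model(return_loss=False, rescale=True, **data)
    torch.cuda.synchronize()  # wait until this pass actually finishes
    return result, time.time() - start
```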

IvDmNe commented 2 years ago

@tehkillerbee, I use the init function from this sample: https://github.com/NVIDIA/TensorRT/blob/master/samples/python/efficientdet/infer.py. For each image I execute only lines 115-119 of the infer function. Moreover, testing on the dataset with the --speed-test option shows 8-10 FPS. In the deploy config file I didn't change anything except the resolution:

```python
_base_ = [
    '/root/workspace/mmdeploy/configs/mmdet/_base_/base_instance-seg_static.py',
    '/root/workspace/mmdeploy/configs/_base_/backends/tensorrt.py'
]

onnx_config = dict(input_shape=(640, 480))
backend_config = dict(
    common_config=dict(max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 480, 640],
                    opt_shape=[1, 3, 480, 640],
                    max_shape=[1, 3, 480, 640])))
    ])
```
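As a side note, a quick way to sanity-check what shapes the built engine actually expects is to deserialize it and print its bindings (a sketch using the TensorRT Python API as of v7/v8; 'end2end.engine' is a placeholder filename):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open('end2end.engine', 'rb') as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print every input/output binding with its shape (-1 marks a dynamic axis).
for i in range(engine.num_bindings):
    kind = 'input' if engine.binding_is_input(i) else 'output'
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))
```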

tehkillerbee commented 2 years ago

@a227799770055 I used the C API and adapted my tools from the example here.

@IvDmNe I have not used this sample code myself so I cannot say for sure if it is suitable. But you could try using the C API and see if it makes any difference.

Regarding the size of the output masks: the mask returned from TensorRT is the raw (28x28) mask and must be post-processed/upscaled. If you use Python, you can use the _do_paste_mask function from mmdet. When using the C API, this is handled automatically; I get masks almost identical to the PyTorch output.

Below is a snippet showing how to upscale the masks to the image size. This approach is not very fast, however.


```python
from mmdet.models.roi_heads.mask_heads.fcn_mask_head import _do_paste_mask
...
masks_chunk, spatial_inds = _do_paste_mask(
    mask_pred[inds],   # raw 28x28 mask predictions for the selected detections
    bboxes[inds],      # corresponding boxes in image coordinates
    img.shape[0],      # target image height
    img.shape[1],      # target image width
    skip_empty=True)   # only paste inside the region covered by the boxes
...
```
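For completeness, here is roughly how the returned masks_chunk and spatial_inds are consumed, modeled on mmdet's FCNMaskHead.get_seg_masks (a sketch; the 0.5 binarization threshold is the usual mask_thr_binary default):

```python
import torch

img_h, img_w = img.shape[0], img.shape[1]
im_masks = torch.zeros(len(bboxes), img_h, img_w, dtype=torch.bool)

# With skip_empty=True, spatial_inds is a (slice_y, slice_x) pair that
# locates the pasted region inside the full image; threshold the soft
# masks and write them into that region for the selected detections.
im_masks[(inds,) + spatial_inds] = masks_chunk >= 0.5
```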
IvDmNe commented 2 years ago

@tehkillerbee Thank you for your advice, I will try the C++ API. For now, with INT8 deployment I got a good boost over the original model: 0.07 s, down from 0.16 s. As you wrote, the _do_paste_mask function is quite slow, but it gives really precise masks, and it's not clear to me how the mask is upscaled. The cv.resize function gives sharp masks, though it works faster than _do_paste_mask.
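For comparison, the fast-but-coarse resize approach mentioned above looks roughly like this (a sketch; mask_28, box, and the 0.5 threshold are illustrative names, not from the original post):

```python
import cv2
import numpy as np


def paste_mask_resize(mask_28, box, img_h, img_w, thr=0.5):
    """Naively paste a raw 28x28 mask into its box via bilinear resize.

    Faster than _do_paste_mask, but ignores the sub-pixel box alignment
    that _do_paste_mask handles with grid sampling, so edges look coarser.
    """
    x1, y1, x2, y2 = box.round().astype(int)
    x1, y1 = max(x1, 0), max(y1, 0)
    x2, y2 = min(x2, img_w), min(y2, img_h)
    full = np.zeros((img_h, img_w), dtype=bool)
    if x2 > x1 and y2 > y1:
        # cv2.resize takes (width, height) as the target size
        resized = cv2.resize(mask_28, (x2 - x1, y2 - y1),
                             interpolation=cv2.INTER_LINEAR)
        full[y1:y2, x1:x2] = resized >= thr
    return full
```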

urmagicsmine commented 2 years ago

I met the same problem. I installed three different environments (including versions of CUDA, TensorRT, and ONNX) on three systems (1080 Ti, 2070, 3090), and then used the official MMDeploy and MMDetection code to evaluate the speedup. However, I found that the Faster R-CNN model gets a boost of about 50% less time per frame, while the Mask R-CNN model does not obtain a similar boost (8.3 FPS for the MMDetection model vs. 8.9 FPS for the corresponding TensorRT model on the 1080 Ti, and no significant boost on the other GPU platforms).

tehkillerbee commented 2 years ago

@IvDmNe One more thing: you could try using a different base config. I have used the following for Mask R-CNN:

```python
_base_ = ['../MMDeploy/configs/mmdet/instance-seg/instance-seg_tensorrt-<precision>_dynamic-320x320-1344x1344.py']
```

(Replace `<precision>` with int8 or fp16.) There is also a static config available.
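For context, the config name encodes the TensorRT optimization profile it sets up; the relevant part of the backend config looks roughly like this (a sketch; the opt_shape value is my recollection of the upstream config, so verify against your MMDeploy checkout):

```python
backend_config = dict(
    common_config=dict(max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 320, 320],      # smallest input the engine accepts
                    opt_shape=[1, 3, 800, 1344],     # shape TensorRT optimizes for
                    max_shape=[1, 3, 1344, 1344])))  # largest input the engine accepts
    ])
```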

@urmagicsmine What MMDeploy config have you used?

IvDmNe commented 2 years ago

@tehkillerbee The config I posted above is a modified version of the one you suggest. The GTX 1050 Ti that I use doesn't support FP16 at full rate, so that may be why there is no boost between FP32 and FP16 in TensorRT.

tehkillerbee commented 2 years ago

@IvDmNe Yes, I have the same issue with the P2000 in my work laptop: it does not support fast FP16 in hardware, so there are no significant speed improvements.
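A quick way to check whether a card has fast FP16 is its CUDA compute capability (a sketch; Pascal consumer parts such as the GTX 1050 Ti, 1080 Ti, and P2000 report 6.1 and run FP16 at a small fraction of the FP32 rate, while 7.0+ parts have full-rate FP16):

```python
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f'{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}')

# Heuristic: Volta (7.0) and newer expose fast FP16 / Tensor Cores;
# Pascal consumer GPUs (6.1) execute FP16 far slower than FP32.
if (major, minor) >= (7, 0):
    print('fast FP16 expected')
else:
    print('FP16 unlikely to be faster than FP32 on this GPU')
```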

urmagicsmine commented 2 years ago

I have noticed that the 1080 Ti doesn't support fast FP16, so I tried the FP32 config on the 1080 Ti, and FP32/FP16 on the 2070 and 3090. Faster R-CNN gets nearly a 2x speed boost on the 2070 and 3090 for both FP32 and FP16 models, while no speed boost is observed for the Mask R-CNN FP32/FP16 models on these three kinds of GPU cards. Can you get a speed improvement on your P2000 card, and what config do you use? Thanks! @tehkillerbee

tehkillerbee commented 2 years ago

@urmagicsmine I have not tested the performance on higher-end cards recently, as most of the development takes place on a Jetson AGX Xavier platform. On the Xavier I see:

- FP32: ~785 ms avg. per frame
- FP16: ~270 ms avg. per frame
- INT8: ~199 ms avg. per frame

All tests use the same input images, with the number of detections varying from 50 to 500 per image.

I will try to repeat the tests on a few other platforms and get back to you. I will also test on P2000 as a comparison.

urmagicsmine commented 2 years ago

@tehkillerbee Thanks. Do you have speed test results for these models before conversion to TRT (tested on MMDetection)? That would help a lot.

urmagicsmine commented 2 years ago

btw, I tested the Mask R-CNN R50 FP32 model on MMDetection and MMDeploy (TRT version) on a 2080 Ti GPU, and got 11.0 FPS / 11.8 FPS respectively.

zsw360720347 commented 2 years ago

> btw, I tested the Mask R-CNN R50 FP32 model on MMDetection and MMDeploy (TRT version) on a 2080 Ti GPU, and got 11.0 FPS / 11.8 FPS respectively.

On Windows? Linux?

urmagicsmine commented 2 years ago

> On Windows? Linux?

Ubuntu 18.04 and CUDA 11.3.

RunningLeon commented 2 years ago

Closing since there has been no activity for more than 4 weeks. Please reopen if you still have questions, thanks!