open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework
https://mmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Slow inference speed Swin-Transformer: Pytorch to ONNX and TensorRT #800

Closed manhtd98 closed 2 years ago

manhtd98 commented 2 years ago

I converted Swin-Transformer from PyTorch to ONNX and TensorRT and got slow inference speed: PyTorch: 0.6 s, TensorRT: 0.59 s, ONNX: 0.92 s. Here is my config:

_base_ = [
    '../_base_/base_instance-seg_dynamic.py',
    '../../_base_/backends/tensorrt-int8.py'
]

backend_config = dict(
    common_config=dict(max_workspace_size=1 << 60),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 320, 320],
                    opt_shape=[1, 3, 800, 1344],
                    max_shape=[3, 3, 1344, 1344])))
    ])
python3 ./tools/deploy.py ./configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py ../mmdetection/configs/insurance/cascade_mask_rcnn_swinB.py ../mmdetection/pretrained/cascade_mask_rcnn_swinB.pth ./demo/demo.jpg --device cuda:0
AllentDan commented 2 years ago

Please use a static shape config and fp16 if you want to accelerate inference. Dynamic TensorRT models contain many shape-inference operators.
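
For context, a minimal sketch of what choosing the fp16 backend base amounts to (the exact contents of configs/_base_/backends/tensorrt-fp16.py are an assumption here; verify against your checkout):

# Rough sketch of a TensorRT fp16 backend base config
# (assumption: mmdeploy 0.x layout; verify against
# configs/_base_/backends/tensorrt-fp16.py).
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=True,               # build the engine with fp16 kernels
        max_workspace_size=1 << 30))  # 1 GiB builder workspace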

manhtd98 commented 2 years ago

@AllentDan do you get the same speed?

AllentDan commented 2 years ago

> @AllentDan do you get the same speed?

I haven't tested it yet, and since we have different NVIDIA cards, I don't think my results would be meaningful to you.

manhtd98 commented 2 years ago

I tried your suggestion and got the same speed. Maybe some layers slow down TensorRT?

AllentDan commented 2 years ago

In my testing, fp16 is faster than fp32.

manhtd98 commented 2 years ago

Yes. PyTorch = int8 < fp16 < fp32 < ONNX in latency. It seems there is not much optimization here.

AllentDan commented 2 years ago

> I tried your suggestion and got the same speed. Maybe some layers slow down TensorRT?

What config did you use? And how did you test the speed?

manhtd98 commented 2 years ago

Here is the config:

_base_ = [
    '../_base_/base_instance-seg_static.py',
    '../../_base_/backends/tensorrt-fp16.py'
]

onnx_config = dict(input_shape=(1344, 800))
backend_config = dict(
    common_config=dict(max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 800, 1344],
                    opt_shape=[1, 3, 800, 1344],
                    max_shape=[1, 3, 800, 1344])))
    ])

I changed '../../_base_/backends/tensorrt-fp16.py' to the trt, fp16, and int8 variants. I tested after a 10-request warmup.
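
For what it's worth, a minimal sketch of that kind of warmup-then-measure loop (run_inference is a hypothetical stand-in for whichever backend is being timed):

import time

import numpy as np


def benchmark(run_inference, img, warmup=10, iters=100):
    # Warm up: the first requests include CUDA context and engine
    # initialization, so they are excluded from timing.
    for _ in range(warmup):
        run_inference(img)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        run_inference(img)
        latencies.append(time.perf_counter() - start)
    lat_ms = np.asarray(latencies) * 1000.0
    print(f'mean {lat_ms.mean():.2f} ms, '
          f'median {np.median(lat_ms):.2f} ms, '
          f'{1000.0 / lat_ms.mean():.2f} FPS')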

AllentDan commented 2 years ago

> max_shape

So, did you change the input resolution of the PyTorch model for a fair comparison?

manhtd98 commented 2 years ago

Where do I change it? I didn't get that.

AllentDan commented 2 years ago

How did you test the speed of the PyTorch model?

manhtd98 commented 2 years ago

I resized the image and loaded it into MMDeploy, so it has the same input shape. The input to all three models is the same; I just swapped the model and reported the difference in request time.

AllentDan commented 2 years ago

Were you using the script https://github.com/open-mmlab/mmdeploy/blob/master/tools/profile.py ?

manhtd98 commented 2 years ago

I just created my own test for this.

AllentDan commented 2 years ago

> I just created my own test for this.

And are the results the same as in your previous testing?

manhtd98 commented 2 years ago

Yes. Can you reproduce the test? I converted to TensorRT and got a 368 MB engine with INT8. It is so large because the layers are so complex.

AllentDan commented 2 years ago

> Yes. Can you reproduce the test? I converted to TensorRT and got a 368 MB engine with INT8. It is so large because the layers are so complex.

I will check it later.

manhtd98 commented 2 years ago

@AllentDan did you test it?

AllentDan commented 2 years ago

With fp32:

python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py data/coco/test --model ../work_dirs/tensorrt/det/swin/end2end.engine --shape 800x800

2022-07-29 16:30:03,635 - test - INFO - [tensorrt]-30 times per count: 124.53 ms, 8.03 FPS
2022-07-29 16:30:06,644 - test - INFO - [tensorrt]-50 times per count: 124.57 ms, 8.03 FPS
2022-07-29 16:30:09,609 - test - INFO - [tensorrt]-70 times per count: 124.72 ms, 8.02 FPS
2022-07-29 16:30:12,523 - test - INFO - [tensorrt]-90 times per count: 124.58 ms, 8.03 FPS
2022-07-29 16:30:15,561 - test - INFO - [tensorrt]-110 times per count: 124.72 ms, 8.02 FPS
----- Settings:
+------------+---------+
| batch size |    1    |
|   shape    | 800x800 |
| iterations |   100   |
|   warmup   |    10   |
+------------+---------+
----- Results:
+--------+------------+-------+
| Stats  | Latency/ms |  FPS  |
+--------+------------+-------+
|  Mean  |  124.722   | 8.018 |
| Median |  124.266   | 8.047 |
|  Min   |  121.694   | 8.217 |
|  Max   |  129.085   | 7.747 |
+--------+------------+-------+

python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py data/coco/test --model https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco/mask_rcnn_swin-t-p4-w7_fpn_1x_coco_20210902_120937-9d6b7cfa.pth --shape 800x800

2022-07-29 16:29:20,989 - test - INFO - [pytorch]-30 times per count: 176.57 ms, 5.66 FPS
2022-07-29 16:29:24,934 - test - INFO - [pytorch]-50 times per count: 178.63 ms, 5.60 FPS
2022-07-29 16:29:28,538 - test - INFO - [pytorch]-70 times per count: 174.35 ms, 5.74 FPS
2022-07-29 16:29:32,027 - test - INFO - [pytorch]-90 times per count: 170.76 ms, 5.86 FPS
2022-07-29 16:29:35,762 - test - INFO - [pytorch]-110 times per count: 170.71 ms, 5.86 FPS
----- Settings:
+------------+---------+
| batch size |    1    |
|   shape    | 800x800 |
| iterations |   100   |
|   warmup   |    10   |
+------------+---------+
----- Results:
+--------+------------+-------+
| Stats  | Latency/ms |  FPS  |
+--------+------------+-------+
|  Mean  |  170.715   | 5.858 |
| Median |  149.991   | 6.667 |
|  Min   |  135.108   | 7.402 |
|  Max   |  313.081   | 3.194 |
+--------+------------+-------+

And the TensorRT model can be boosted further with fp16 mode.
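
For example, after converting with the fp16 dynamic config, the same profiling command can be rerun against the new engine (the config file name and engine path below are assumptions; check configs/mmdet/instance-seg/ for the exact name):

python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_dynamic-320x320-1344x1344.py ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py data/coco/test --model ../work_dirs/tensorrt/det/swin_fp16/end2end.engine --shape 800x800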

AllentDan commented 2 years ago

More details about my results: I used the dynamic TensorRT config rather than the static one. Here is my environment if you want to reproduce it.

2022-07-29 16:33:57,110 - mmdeploy - INFO - 

2022-07-29 16:33:57,111 - mmdeploy - INFO - **********Environmental information**********
2022-07-29 16:33:57,255 - mmdeploy - INFO - sys.platform: linux
2022-07-29 16:33:57,256 - mmdeploy - INFO - Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
2022-07-29 16:33:57,256 - mmdeploy - INFO - CUDA available: True
2022-07-29 16:33:57,256 - mmdeploy - INFO - GPU 0: NVIDIA GeForce GTX 1660 SUPER
2022-07-29 16:33:57,256 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-07-29 16:33:57,256 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.58
2022-07-29 16:33:57,256 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-07-29 16:33:57,256 - mmdeploy - INFO - PyTorch: 1.10.2
2022-07-29 16:33:57,256 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

2022-07-29 16:33:57,256 - mmdeploy - INFO - TorchVision: 0.11.3
2022-07-29 16:33:57,256 - mmdeploy - INFO - OpenCV: 4.5.4
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV: 1.5.0
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV Compiler: GCC 7.5
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMDeploy: 0.6.0+5b31d7a
2022-07-29 16:33:57,256 - mmdeploy - INFO - 

2022-07-29 16:33:57,256 - mmdeploy - INFO - **********Backend information**********
2022-07-29 16:33:57,602 - mmdeploy - INFO - onnxruntime: 1.8.1  ops_is_avaliable : True
2022-07-29 16:33:57,629 - mmdeploy - INFO - tensorrt: 8.4.1.5   ops_is_avaliable : True
2022-07-29 16:33:57,644 - mmdeploy - INFO - ncnn: None  ops_is_avaliable : True
2022-07-29 16:33:57,646 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-07-29 16:33:57,666 - mmdeploy - INFO - openvino_is_avaliable: True
2022-07-29 16:33:57,666 - mmdeploy - INFO - 

2022-07-29 16:33:57,666 - mmdeploy - INFO - **********Codebase information**********
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmdet:      2.19.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmseg:      0.25.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmcls:      0.23.1
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmocr:      None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmedit:     0.12.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmdet3d:    None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmpose:     None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmrotate:   None

manhtd98 commented 2 years ago

I only got 5.2 FPS with TensorRT INT8 and 2.82 FPS with PyTorch. I use a 1660 Ti. But anyway, thanks for the nice support!