Please use a static config and fp16 if you want to accelerate the inference. Dynamic TensorRT models contain many shape-inference operators, which slow them down.
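Concretely, a static fp16 deploy config pins one input shape so the engine can drop those operators. A minimal sketch, using the same base files that appear in the config posted later in this thread:

```python
# Static + fp16 deploy config (sketch; base file paths as used later in this thread).
_base_ = [
    '../_base_/base_instance-seg_static.py',   # fixed-shape export, no dynamic axes
    '../../_base_/backends/tensorrt-fp16.py'   # TensorRT backend with fp16 enabled
]
onnx_config = dict(input_shape=(1344, 800))    # one fixed shape for the whole pipeline
```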
@AllentDan, do you have the same speed?
I haven't tested it yet, and since we have different NVIDIA cards, I don't think my results would be meaningful for you.
I tried your suggestion and got the same speed. Maybe some layer slows down the TRT speed?
In my testing, fp16 is faster than fp32.
Yes: PyTorch = INT8 < FP16 < FP32 < ONNX. It seems there is not much optimization here.
What config did you use? And how did you test the speed?
Here is the config:

```python
_base_ = [
    '../_base_/base_instance-seg_static.py',
    '../../_base_/backends/tensorrt-fp16.py'
]
onnx_config = dict(input_shape=(1344, 800))
backend_config = dict(
    common_config=dict(max_workspace_size=1 << 30),
    model_inputs=[
        dict(
            input_shapes=dict(
                input=dict(
                    min_shape=[1, 3, 800, 1344],
                    opt_shape=[1, 3, 800, 1344],
                    max_shape=[1, 3, 800, 1344])))
    ])
```
I changed '../../_base_/backends/tensorrt-fp16.py' to the trt (fp32), fp16, and int8 variants. I tested after a 10-request warmup.
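For reference, each precision means rebuilding the engine with the conversion tool, swapping only the deploy config. A sketch with placeholder paths (the config, checkpoint, and work-dir names are illustrative, not the exact ones from this thread):

```shell
# Rebuild the TensorRT engine for one precision; repeat with the trt / fp16 / int8
# deploy configs to compare them. All paths below are placeholders.
python tools/deploy.py \
    configs/mmdet/instance-seg/instance-seg_tensorrt-fp16_static-800x1344.py \
    ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py \
    checkpoints/mask_rcnn_swin-t.pth \
    demo/demo.jpg \
    --work-dir work_dirs/swin_trt_fp16 \
    --device cuda:0
```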
So, did you change the input resolution of the PyTorch model for a fair comparison?
Where do I change it? I did not get it.
How did you test the speed of the PyTorch model?
I resized the input and loaded it into MMDeploy, so it has the same input shape. The input to all three models is the same; I just changed the model and reported the difference in request time.
Were you using the script https://github.com/open-mmlab/mmdeploy/blob/master/tools/profile.py?
I just created my own test for this.
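For what it's worth, hand-rolled GPU benchmarks are easy to skew; here is a minimal sketch of a fair harness, where `infer` is a hypothetical callable wrapping one forward pass of whichever backend is being timed:

```python
import time

import torch


def benchmark(infer, inputs, warmup=10, iters=100):
    """Time `infer`, a hypothetical one-forward-pass callable for the
    TensorRT engine, ONNX Runtime session, or PyTorch model."""
    for _ in range(warmup):
        infer(inputs)             # warm-up: CUDA context init, autotuning, caches
    torch.cuda.synchronize()      # GPU launches are async; sync before timing
    start = time.perf_counter()
    for _ in range(iters):
        infer(inputs)
    torch.cuda.synchronize()      # sync again so every queued kernel is counted
    latency_ms = (time.perf_counter() - start) / iters * 1000
    print(f'mean latency: {latency_ms:.2f} ms, {1000 / latency_ms:.2f} FPS')
```

Without the two synchronize calls, asynchronous kernel launches can make one backend look faster than it really is.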
And are the results the same as in your previous testing?
Yes. Can you reproduce the test? I converted to TRT and got a 368 MB engine with INT8. It is so large because the layers are so complex.
I will check it later.
@AllentDan, did you test it?
With fp32:

```shell
python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py data/coco/test --model ../work_dirs/tensorrt/det/swin/end2end.engine --shape 800x800
```

```
2022-07-29 16:30:03,635 - test - INFO - [tensorrt]-30 times per count: 124.53 ms, 8.03 FPS
2022-07-29 16:30:06,644 - test - INFO - [tensorrt]-50 times per count: 124.57 ms, 8.03 FPS
2022-07-29 16:30:09,609 - test - INFO - [tensorrt]-70 times per count: 124.72 ms, 8.02 FPS
2022-07-29 16:30:12,523 - test - INFO - [tensorrt]-90 times per count: 124.58 ms, 8.03 FPS
2022-07-29 16:30:15,561 - test - INFO - [tensorrt]-110 times per count: 124.72 ms, 8.02 FPS
----- Settings:
+------------+---------+
| batch size | 1       |
| shape      | 800x800 |
| iterations | 100     |
| warmup     | 10      |
+------------+---------+
----- Results:
+--------+------------+-------+
| Stats  | Latency/ms | FPS   |
+--------+------------+-------+
| Mean   | 124.722    | 8.018 |
| Median | 124.266    | 8.047 |
| Min    | 121.694    | 8.217 |
| Max    | 129.085    | 7.747 |
+--------+------------+-------+
```
With the PyTorch model:

```shell
python tools/profile.py configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py ../mmdetection/configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py data/coco/test --model https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco/mask_rcnn_swin-t-p4-w7_fpn_1x_coco_20210902_120937-9d6b7cfa.pth --shape 800x800
```

```
2022-07-29 16:29:20,989 - test - INFO - [pytorch]-30 times per count: 176.57 ms, 5.66 FPS
2022-07-29 16:29:24,934 - test - INFO - [pytorch]-50 times per count: 178.63 ms, 5.60 FPS
2022-07-29 16:29:28,538 - test - INFO - [pytorch]-70 times per count: 174.35 ms, 5.74 FPS
2022-07-29 16:29:32,027 - test - INFO - [pytorch]-90 times per count: 170.76 ms, 5.86 FPS
2022-07-29 16:29:35,762 - test - INFO - [pytorch]-110 times per count: 170.71 ms, 5.86 FPS
----- Settings:
+------------+---------+
| batch size | 1       |
| shape      | 800x800 |
| iterations | 100     |
| warmup     | 10      |
+------------+---------+
----- Results:
+--------+------------+-------+
| Stats  | Latency/ms | FPS   |
+--------+------------+-------+
| Mean   | 170.715    | 5.858 |
| Median | 149.991    | 6.667 |
| Min    | 135.108    | 7.402 |
| Max    | 313.081    | 3.194 |
+--------+------------+-------+
```
And the TensorRT model can be boosted further if we use fp16 mode.
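For reference, fp16 can also be enabled on the dynamic config, either by inheriting the tensorrt-fp16 base or, roughly, by flipping the flag in the backend config. A sketch; the key names follow MMDeploy 0.x:

```python
# Enable fp16 engine building on the dynamic TensorRT config (sketch).
backend_config = dict(
    type='tensorrt',
    common_config=dict(
        fp16_mode=True,               # let TensorRT select fp16 kernels
        max_workspace_size=1 << 30))
```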
More details about my results: I used the dynamic TensorRT config rather than the static one. Here is my environment if you want to reproduce it.
2022-07-29 16:33:57,110 - mmdeploy - INFO -
2022-07-29 16:33:57,111 - mmdeploy - INFO - **********Environmental information**********
2022-07-29 16:33:57,255 - mmdeploy - INFO - sys.platform: linux
2022-07-29 16:33:57,256 - mmdeploy - INFO - Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
2022-07-29 16:33:57,256 - mmdeploy - INFO - CUDA available: True
2022-07-29 16:33:57,256 - mmdeploy - INFO - GPU 0: NVIDIA GeForce GTX 1660 SUPER
2022-07-29 16:33:57,256 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-07-29 16:33:57,256 - mmdeploy - INFO - NVCC: Cuda compilation tools, release 11.3, V11.3.58
2022-07-29 16:33:57,256 - mmdeploy - INFO - GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
2022-07-29 16:33:57,256 - mmdeploy - INFO - PyTorch: 1.10.2
2022-07-29 16:33:57,256 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
2022-07-29 16:33:57,256 - mmdeploy - INFO - TorchVision: 0.11.3
2022-07-29 16:33:57,256 - mmdeploy - INFO - OpenCV: 4.5.4
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV: 1.5.0
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV Compiler: GCC 7.5
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
2022-07-29 16:33:57,256 - mmdeploy - INFO - MMDeploy: 0.6.0+5b31d7a
2022-07-29 16:33:57,256 - mmdeploy - INFO -
2022-07-29 16:33:57,256 - mmdeploy - INFO - **********Backend information**********
2022-07-29 16:33:57,602 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-07-29 16:33:57,629 - mmdeploy - INFO - tensorrt: 8.4.1.5 ops_is_avaliable : True
2022-07-29 16:33:57,644 - mmdeploy - INFO - ncnn: None ops_is_avaliable : True
2022-07-29 16:33:57,646 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-07-29 16:33:57,666 - mmdeploy - INFO - openvino_is_avaliable: True
2022-07-29 16:33:57,666 - mmdeploy - INFO -
2022-07-29 16:33:57,666 - mmdeploy - INFO - **********Codebase information**********
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmdet: 2.19.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmseg: 0.25.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmcls: 0.23.1
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmocr: None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmedit: 0.12.0
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmdet3d: None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmpose: None
2022-07-29 16:33:57,671 - mmdeploy - INFO - mmrotate: None
I only got 5.2 FPS with TRT INT8 and 2.82 with PyTorch. I use a 1660 Ti. But anyway, thanks for the nice support!
I converted Swin-Transformer from PyTorch to ONNX and TensorRT and got slow speeds: PyTorch: 0.6 s, TensorRT: 0.59 s, ONNX: 0.92 s. Here is my config: