open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

RTMDet-Ins TRT-FP16 is much slower than the documentation says. #10204

Open chenxinfeng4 opened 1 year ago

chenxinfeng4 commented 1 year ago

Describe the issue

RTMDet-Ins FP16 inference is slower than the documentation says. In my case I only get a throughput of 104.89 qps (about 9.5 ms per query), not the documented 1.93 ms latency.

Reproduction

  1. What command or script did you run?
$ cd /openmmlab/mmdeploy

# create the tensorrt-fp16 deploy for rtmdet
$ sed  's/tensorrt.py/tensorrt-fp16.py/' \
   configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_static-640x640.py  \
   > configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_fp16_static-640x640.py

$ python ./tools/deploy.py \
    configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_fp16_static-640x640.py \
    /openmmlab/mmdetection/work_dirs/rtmdet-ins_s_8xb32-300e_coco/rtmdet-ins_s_8xb32-300e_coco.py \
    /openmmlab/mmdetection/data/checkpoint/rtmdet-ins_m_8xb32-300e_coco_20221123_001039-6eba602e.pth \
    /openmmlab/mmdetection/demo/demo.jpg \
    --work-dir work_dir \
    --show \
    --device cuda:0
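
For reference, the sed step above only swaps the inherited TensorRT backend config for its FP16 counterpart. A minimal sketch of what the generated deploy config effectively amounts to is shown below; the fp16_mode key follows the usual mmdeploy TensorRT backend config layout, and the workspace size is an assumption, not copied from the repository:

# instance-seg_rtmdet-ins_tensorrt_fp16_static-640x640.py  (illustrative sketch only)
_base_ = ['./instance-seg_rtmdet-ins_tensorrt_static-640x640.py']

# The only functional change versus the FP32 config is enabling TensorRT's
# fp16 mode in the backend's common_config (workspace size is assumed).
backend_config = dict(
    common_config=dict(fp16_mode=True, max_workspace_size=1 << 30))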

Check the output

$ polygraphy inspect model work_dir/end2end.onnx

[I] Loading model: /openmmlab/mmdeploy/work_dir/end2end.onnx
[I] ==== ONNX Model ====
    Name: torch_jit | Opset: 11

    ---- 1 Graph Input(s) ----
    {input [dtype=float32, shape=(1, 3, 640, 640)]}

    ---- 3 Graph Output(s) ----
    {dets [dtype=float32, shape=('Reshapedets_dim_0', 'Reshapedets_dim_1', 'Reshapedets_dim_2')],
     labels [dtype=int64, shape=('Reshapelabels_dim_0', 'Reshapelabels_dim_1')],
     masks [dtype=float32, shape=('Sigmoidmasks_dim_0', 'Sigmoidmasks_dim_1', 'Sigmoidmasks_dim_2', 'Sigmoidmasks_dim_3')]}

    ---- 268 Initializer(s) ----

    ---- 475 Node(s) ----
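
If polygraphy is not at hand, the same output information can be read with a few lines of Python. This is an illustrative sketch that only assumes the onnx package and the exported file path above:

# Print the graph outputs of the exported model; the masks output's dimensions
# are symbolic here but are materialized as 1x100x640x640 by trtexec below.
import onnx

model = onnx.load('work_dir/end2end.onnx')
for out in model.graph.output:
    dims = [d.dim_value if d.dim_value else d.dim_param
            for d in out.type.tensor_type.shape.dim]
    print(out.name, dims)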

Test the speed of RTMDet-Ins-s-tensorrt-fp16

trtexec --loadEngine=work_dir/end2end.engine --plugins=build/lib/libmmdeploy_tensorrt_ops.so
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # trtexec --loadEngine=work_dir/end2end.engine --plugins=build/lib/libmmdeploy_tensorrt_ops.so
[04/22/2023-07:28:35] [I] === Model Options ===
[04/22/2023-07:28:35] [I] Format: *
[04/22/2023-07:28:35] [I] Model: 
[04/22/2023-07:28:35] [I] Output:
[04/22/2023-07:28:35] [I] === Build Options ===
[04/22/2023-07:28:35] [I] Max batch: 1
[04/22/2023-07:28:35] [I] Workspace: 16 MiB
[04/22/2023-07:28:35] [I] minTiming: 1
[04/22/2023-07:28:35] [I] avgTiming: 8
[04/22/2023-07:28:35] [I] Precision: FP32
[04/22/2023-07:28:35] [I] Calibration: 
[04/22/2023-07:28:35] [I] Refit: Disabled
[04/22/2023-07:28:35] [I] Sparsity: Disabled
[04/22/2023-07:28:35] [I] Safe mode: Disabled
[04/22/2023-07:28:35] [I] DirectIO mode: Disabled
[04/22/2023-07:28:35] [I] Restricted mode: Disabled
[04/22/2023-07:28:35] [I] Save engine: 
[04/22/2023-07:28:35] [I] Load engine: work_dir/end2end.engine
[04/22/2023-07:28:35] [I] Profiling verbosity: 0
[04/22/2023-07:28:35] [I] Tactic sources: Using default tactic sources
[04/22/2023-07:28:35] [I] timingCacheMode: local
[04/22/2023-07:28:35] [I] timingCacheFile: 
[04/22/2023-07:28:35] [I] Input(s)s format: fp32:CHW
[04/22/2023-07:28:35] [I] Output(s)s format: fp32:CHW
[04/22/2023-07:28:35] [I] Input build shapes: model
[04/22/2023-07:28:35] [I] Input calibration shapes: model
[04/22/2023-07:28:35] [I] === System Options ===
[04/22/2023-07:28:35] [I] Device: 0
[04/22/2023-07:28:35] [I] DLACore: 
[04/22/2023-07:28:35] [I] Plugins: build/lib/libmmdeploy_tensorrt_ops.so
[04/22/2023-07:28:35] [I] === Inference Options ===
[04/22/2023-07:28:35] [I] Batch: 1
[04/22/2023-07:28:35] [I] Input inference shapes: model
[04/22/2023-07:28:35] [I] Iterations: 10
[04/22/2023-07:28:35] [I] Duration: 3s (+ 200ms warm up)
[04/22/2023-07:28:35] [I] Sleep time: 0ms
[04/22/2023-07:28:35] [I] Idle time: 0ms
[04/22/2023-07:28:35] [I] Streams: 1
[04/22/2023-07:28:35] [I] ExposeDMA: Disabled
[04/22/2023-07:28:35] [I] Data transfers: Enabled
[04/22/2023-07:28:35] [I] Spin-wait: Disabled
[04/22/2023-07:28:35] [I] Multithreading: Disabled
[04/22/2023-07:28:35] [I] CUDA Graph: Disabled
[04/22/2023-07:28:35] [I] Separate profiling: Disabled
[04/22/2023-07:28:35] [I] Time Deserialize: Disabled
[04/22/2023-07:28:35] [I] Time Refit: Disabled
[04/22/2023-07:28:35] [I] Skip inference: Disabled
[04/22/2023-07:28:35] [I] Inputs:
[04/22/2023-07:28:35] [I] === Reporting Options ===
[04/22/2023-07:28:35] [I] Verbose: Disabled
[04/22/2023-07:28:35] [I] Averages: 10 inferences
[04/22/2023-07:28:35] [I] Percentile: 99
[04/22/2023-07:28:35] [I] Dump refittable layers:Disabled
[04/22/2023-07:28:35] [I] Dump output: Disabled
[04/22/2023-07:28:35] [I] Profile: Disabled
[04/22/2023-07:28:35] [I] Export timing to JSON file: 
[04/22/2023-07:28:35] [I] Export output to JSON file: 
[04/22/2023-07:28:35] [I] Export profile to JSON file: 
[04/22/2023-07:28:35] [I] 
[04/22/2023-07:28:35] [I] === Device Information ===
[04/22/2023-07:28:35] [I] Selected Device: NVIDIA GeForce RTX 3090
[04/22/2023-07:28:35] [I] Compute Capability: 8.6
[04/22/2023-07:28:35] [I] SMs: 82
[04/22/2023-07:28:35] [I] Compute Clock Rate: 1.695 GHz
[04/22/2023-07:28:35] [I] Device Global Memory: 24268 MiB
[04/22/2023-07:28:35] [I] Shared Memory per SM: 100 KiB
[04/22/2023-07:28:35] [I] Memory Bus Width: 384 bits (ECC disabled)
[04/22/2023-07:28:35] [I] Memory Clock Rate: 9.751 GHz
[04/22/2023-07:28:35] [I] 
[04/22/2023-07:28:35] [I] TensorRT version: 8.2.4
[04/22/2023-07:28:35] [I] Loading supplied plugin library: build/lib/libmmdeploy_tensorrt_ops.so
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] Init CUDA: CPU +457, GPU +0, now: CPU 495, GPU 3206 (MiB)
[04/22/2023-07:28:36] [I] [TRT] Loaded engine size: 25 MiB
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +850, GPU +368, now: CPU 1359, GPU 3598 (MiB)
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +127, GPU +60, now: CPU 1486, GPU 3658 (MiB)
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +21, now: CPU 0, GPU 21 (MiB)
[04/22/2023-07:28:36] [I] Engine loaded in 1.55282 sec.
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1460, GPU 3650 (MiB)
[04/22/2023-07:28:36] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1460, GPU 3658 (MiB)
[04/22/2023-07:28:37] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +192, now: CPU 0, GPU 213 (MiB)
[04/22/2023-07:28:37] [I] Using random values for input input
[04/22/2023-07:28:37] [I] Created input binding for input with dimensions 1x3x640x640
[04/22/2023-07:28:37] [I] Using random values for output dets
[04/22/2023-07:28:37] [I] Created output binding for dets with dimensions 1x100x5
[04/22/2023-07:28:37] [I] Using random values for output labels
[04/22/2023-07:28:37] [I] Created output binding for labels with dimensions 1x100
[04/22/2023-07:28:37] [I] Using random values for output masks
[04/22/2023-07:28:37] [I] Created output binding for masks with dimensions 1x100x640x640
[04/22/2023-07:28:37] [I] Starting inference
[04/22/2023-07:28:40] [I] Warmup completed 19 queries over 200 ms
[04/22/2023-07:28:40] [I] Timing trace has 318 queries over 3.03174 s
[04/22/2023-07:28:40] [I] 
[04/22/2023-07:28:40] [I] === Trace details ===
[04/22/2023-07:28:40] [I] Trace averages of 10 runs:
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.5402 ms - Host latency: 16.4676 ms (end to end 18.9873 ms, enqueue 1.30154 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54306 ms - Host latency: 16.471 ms (end to end 18.992 ms, enqueue 1.34883 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54009 ms - Host latency: 16.4699 ms (end to end 18.9961 ms, enqueue 1.02559 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54132 ms - Host latency: 16.4694 ms (end to end 18.9849 ms, enqueue 1.37072 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.5408 ms - Host latency: 16.4686 ms (end to end 18.9876 ms, enqueue 1.34462 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54213 ms - Host latency: 16.4719 ms (end to end 18.9886 ms, enqueue 1.36243 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54327 ms - Host latency: 16.4703 ms (end to end 18.8484 ms, enqueue 1.33591 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.53734 ms - Host latency: 16.4651 ms (end to end 18.9783 ms, enqueue 1.37103 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.54008 ms - Host latency: 16.4713 ms (end to end 18.982 ms, enqueue 1.28795 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.53604 ms - Host latency: 16.4628 ms (end to end 18.9804 ms, enqueue 1.304 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.57103 ms - Host latency: 16.4979 ms (end to end 19.0143 ms, enqueue 1.32554 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48518 ms - Host latency: 16.4099 ms (end to end 18.9133 ms, enqueue 1.37776 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48109 ms - Host latency: 16.4035 ms (end to end 18.8697 ms, enqueue 1.29844 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47885 ms - Host latency: 16.4053 ms (end to end 18.8591 ms, enqueue 1.4662 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47958 ms - Host latency: 16.4048 ms (end to end 18.8671 ms, enqueue 1.30724 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48488 ms - Host latency: 16.4072 ms (end to end 18.8807 ms, enqueue 1.34695 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48153 ms - Host latency: 16.4004 ms (end to end 18.8663 ms, enqueue 1.30504 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.4802 ms - Host latency: 16.4054 ms (end to end 18.7374 ms, enqueue 1.34243 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48224 ms - Host latency: 16.4067 ms (end to end 18.7836 ms, enqueue 1.33301 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47925 ms - Host latency: 16.4049 ms (end to end 18.8622 ms, enqueue 1.48417 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.4821 ms - Host latency: 16.4087 ms (end to end 18.8655 ms, enqueue 1.49673 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48032 ms - Host latency: 16.4039 ms (end to end 18.8653 ms, enqueue 1.32368 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47856 ms - Host latency: 16.4031 ms (end to end 18.8656 ms, enqueue 1.3196 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47346 ms - Host latency: 16.3982 ms (end to end 18.8478 ms, enqueue 1.3748 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47429 ms - Host latency: 16.3958 ms (end to end 18.8591 ms, enqueue 1.29553 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48191 ms - Host latency: 16.4084 ms (end to end 18.8641 ms, enqueue 1.4228 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47817 ms - Host latency: 16.4033 ms (end to end 18.8665 ms, enqueue 1.29167 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47803 ms - Host latency: 16.4042 ms (end to end 18.8507 ms, enqueue 1.41506 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.48198 ms - Host latency: 16.4017 ms (end to end 18.8285 ms, enqueue 1.33625 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47283 ms - Host latency: 16.3947 ms (end to end 18.7606 ms, enqueue 1.30344 ms)
[04/22/2023-07:28:40] [I] Average on 10 runs - GPU latency: 9.47659 ms - Host latency: 16.3991 ms (end to end 18.8492 ms, enqueue 1.51799 ms)
[04/22/2023-07:28:40] [I] 
[04/22/2023-07:28:40] [I] === Performance summary ===
[04/22/2023-07:28:40] [I] Throughput: 104.89 qps
[04/22/2023-07:28:40] [I] Latency: min = 16.2476 ms, max = 16.7673 ms, mean = 16.4264 ms, median = 16.4114 ms, percentile(99%) = 16.4906 ms
[04/22/2023-07:28:40] [I] End-to-End Host Latency: min = 17.8994 ms, max = 19.2955 ms, mean = 18.8924 ms, median = 18.8774 ms, percentile(99%) = 19.0158 ms
[04/22/2023-07:28:40] [I] Enqueue Time: min = 0.652069 ms, max = 2.1123 ms, mean = 1.34596 ms, median = 1.297 ms, percentile(99%) = 2.09131 ms
[04/22/2023-07:28:40] [I] H2D Latency: min = 0.207581 ms, max = 0.228638 ms, mean = 0.211843 ms, median = 0.211182 ms, percentile(99%) = 0.224854 ms
[04/22/2023-07:28:40] [I] GPU Compute Time: min = 9.44946 ms, max = 9.84778 ms, mean = 9.50146 ms, median = 9.48535 ms, percentile(99%) = 9.55801 ms
[04/22/2023-07:28:40] [I] D2H Latency: min = 6.58667 ms, max = 6.73633 ms, mean = 6.71313 ms, median = 6.71295 ms, percentile(99%) = 6.72925 ms
[04/22/2023-07:28:40] [I] Total Host Walltime: 3.03174 s
[04/22/2023-07:28:40] [I] Total GPU Compute Time: 3.02146 s
[04/22/2023-07:28:40] [I] Explanations of the performance metrics are printed in the verbose logs.
[04/22/2023-07:28:40] [I] 

My platform is also an RTX 3090, but I got 104.89 qps, i.e. about 9.5 ms latency per query, whereas it should be 1.93 ms according to the documentation. [screenshot of the documented RTMDet-Ins benchmark table]
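
A quick sanity check over the numbers reported by trtexec (pure arithmetic, nothing assumed) shows they are internally consistent and that the GPU compute itself is the bottleneck:

# Values taken from the "Performance summary" above (all in ms).
gpu_compute = 9.50146   # mean GPU Compute Time
h2d = 0.211843          # mean H2D Latency
d2h = 6.71313           # mean D2H Latency

print(1000.0 / 104.89)           # ~9.53 ms per query, i.e. the "about 9.5 ms" above
print(1000.0 / gpu_compute)      # ~105 qps, matching "Throughput: 104.89 qps"
print(h2d + gpu_compute + d2h)   # ~16.43 ms, matching "Latency: mean = 16.4264 ms"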

  1. What configs did you run?
## mmdet config
/openmmlab/mmdetection/work_dirs/rtmdet-ins_s_8xb32-300e_coco/rtmdet-ins_s_8xb32-300e_coco.py

## modified mmdeploy config 
/openmmlab/mmdeploy/configs/mmdet/instance-seg/instance-seg_rtmdet-ins_tensorrt_fp16_static-640x640.py
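
For context, the part of the instance-seg deploy config that gets baked into the engine is the post-processing section. A rough sketch of its shape is below; the field names follow typical mmdeploy detection deploy configs, and the concrete values are assumptions rather than the contents of the file above:

# Sketch of the post-processing block in a typical mmdeploy instance-seg deploy config.
codebase_config = dict(
    post_processing=dict(
        score_threshold=0.05,
        iou_threshold=0.5,
        max_output_boxes_per_class=200,
        pre_top_k=5000,
        keep_top_k=100))  # 100 kept detections would match the 1x100 dets/labels/masks bindings above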

Environment

sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.0a0+bd13bc6
PyTorch compiling details: PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash N/A)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.6
  - NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
  - CuDNN 8.4
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.4.0, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.13.0a0
OpenCV: 4.5.5
MMEngine: 0.7.2
MMDetection: 3.0.0+ecac3a7
mchaniotakis commented 1 year ago

I am actually getting a 64 ms latency (end2end.engine) with the same model as above (s), running from the Release Dockerfile on the dev-1.x branch. I have tried other configurations as well. Not sure if it's my graphics card, drivers, or some other issue. mmdetection was installed through mim install mmdet.

Python: 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:

TorchVision: 0.11.0+cu113
OpenCV: 4.5.4
MMEngine: 0.7.4
MMDetection: 3.0.0+ecac3a

mchaniotakis commented 1 year ago

Please disregard my last comment. The large latency was due to the GPU failing because of a hardware issue. After re-running everything I get the following for the small model at 640x640:

+------------+---------+
| batch size | 1       |
| shape      | 640x640 |
| iterations | 100     |
| warmup     | 10      |
+------------+---------+
----- Results:
+--------+------------+---------+
| Stats  | Latency/ms | FPS     |
+--------+------------+---------+
| Mean   | 8.102      | 123.432 |
| Median | 8.095      | 123.531 |
| Min    | 8.000      | 125.007 |
| Max    | 9.525      | 104.983 |
+--------+------------+---------+

ryanalexmartin commented 1 year ago

@mchaniotakis That's better than 64 ms, but still much, much slower than the 1.22 ms latency for the small model shown in Table 3 of the original research paper.

I'd like to know more about the machine used for inference. Did they super-cool it with liquid nitrogen? Was it running some custom Linux kernel? What was the CPU?

Or, if possible, could anybody provide some suggestions for getting the inference speed a bit closer to the claims made in Table 3 of the RTMDet research paper?

chenxinfeng4 commented 1 year ago

I think a possible explanation is that the RTMDet-Ins deployment exports the mask post-processing into the TensorRT model, whereas the documented numbers are for the version without that post-processing.

Evidence: Mask R-CNN predicts masks at a 64x64 feature size on the GPU and only then resizes them to the HxW image shape on the CPU as post-processing. The RTMDet-Ins TensorRT model, however, returns the masks already at HxW.
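
A back-of-the-envelope check supports this: the full-resolution masks binding alone accounts for essentially all of the 6.7 ms D2H latency in the trtexec run above. The PCIe bandwidth figure below is an assumption used purely for illustration:

masks_bytes = 1 * 100 * 640 * 640 * 4    # float32 masks binding reported by trtexec
pcie_bw = 25e9                           # assumed ~25 GB/s effective host<->device bandwidth
print(masks_bytes / 1e6)                 # 163.84 MB copied back per query
print(masks_bytes / pcie_bw * 1e3)       # ~6.6 ms, close to the measured D2H mean of 6.71 ms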

Dumbldore commented 1 year ago

I also tried benchmarking on an RTX 3090 Ti and couldn't get below 8 ms, even with a bigger batch size.