lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

TensorRT Inference #47

Closed lebionick closed 1 year ago

lebionick commented 1 year ago

Hello, I have an issue launching a converted TRT model.
I converted the default model step by step like this:

python tools/export_model.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml \
              -o weights=https://bj.bcebos.com/v1/paddledet/models/rtdetr_r50vd_6x_coco.pdparams trt=True \
              --output_dir=output_inference
paddle2onnx --model_dir=./output_inference/rtdetr_r50vd_6x_coco/ \
            --model_filename model.pdmodel  \
            --params_filename model.pdiparams \
            --opset_version 16 \
            --save_file rtdetr_r50vd_6x_coco.onnx

The last step was tricky: I downloaded the TensorRT GA archive and built trtexec inside it.

LD_LIBRARY_PATH=TensorRT-8.6.1.6/lib/ TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes=image:1x3x640x640 --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16
Convert Log
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes=image:1x3x640x640 --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16
[08/29/2023-19:15:08] [W] --workspace flag has been deprecated by --memPoolSize flag.
[08/29/2023-19:15:08] [I] === Model Options ===
[08/29/2023-19:15:08] [I] Format: ONNX
[08/29/2023-19:15:08] [I] Model: ./rtdetr_r50vd_6x_coco.onnx
[08/29/2023-19:15:08] [I] Output:
[08/29/2023-19:15:08] [I] === Build Options ===
[08/29/2023-19:15:08] [I] Max batch: explicit batch
[08/29/2023-19:15:08] [I] Memory Pools: workspace: 4096 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/29/2023-19:15:08] [I] minTiming: 1
[08/29/2023-19:15:08] [I] avgTiming: 8
[08/29/2023-19:15:08] [I] Precision: FP32+FP16
[08/29/2023-19:15:08] [I] LayerPrecisions: 
[08/29/2023-19:15:08] [I] Layer Device Types: 
[08/29/2023-19:15:08] [I] Calibration: 
[08/29/2023-19:15:08] [I] Refit: Disabled
[08/29/2023-19:15:08] [I] Version Compatible: Disabled
[08/29/2023-19:15:08] [I] TensorRT runtime: full
[08/29/2023-19:15:08] [I] Lean DLL Path: 
[08/29/2023-19:15:08] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/29/2023-19:15:08] [I] Exclude Lean Runtime: Disabled
[08/29/2023-19:15:08] [I] Sparsity: Disabled
[08/29/2023-19:15:08] [I] Safe mode: Disabled
[08/29/2023-19:15:08] [I] Build DLA standalone loadable: Disabled
[08/29/2023-19:15:08] [I] Allow GPU fallback for DLA: Disabled
[08/29/2023-19:15:08] [I] DirectIO mode: Disabled
[08/29/2023-19:15:08] [I] Restricted mode: Disabled
[08/29/2023-19:15:08] [I] Skip inference: Disabled
[08/29/2023-19:15:08] [I] Save engine: rtdetr_r50vd_6x_coco.trt
[08/29/2023-19:15:08] [I] Load engine: 
[08/29/2023-19:15:08] [I] Profiling verbosity: 0
[08/29/2023-19:15:08] [I] Tactic sources: Using default tactic sources
[08/29/2023-19:15:08] [I] timingCacheMode: local
[08/29/2023-19:15:08] [I] timingCacheFile: 
[08/29/2023-19:15:08] [I] Heuristic: Disabled
[08/29/2023-19:15:08] [I] Preview Features: Use default preview flags.
[08/29/2023-19:15:08] [I] MaxAuxStreams: -1
[08/29/2023-19:15:08] [I] BuilderOptimizationLevel: -1
[08/29/2023-19:15:08] [I] Input(s)s format: fp32:CHW
[08/29/2023-19:15:08] [I] Output(s)s format: fp32:CHW
[08/29/2023-19:15:08] [I] Input build shape: image=1x3x640x640+1x3x640x640+1x3x640x640
[08/29/2023-19:15:08] [I] Input calibration shapes: model
[08/29/2023-19:15:08] [I] === System Options ===
[08/29/2023-19:15:08] [I] Device: 0
[08/29/2023-19:15:08] [I] DLACore: 
[08/29/2023-19:15:08] [I] Plugins:
[08/29/2023-19:15:08] [I] setPluginsToSerialize:
[08/29/2023-19:15:08] [I] dynamicPlugins:
[08/29/2023-19:15:08] [I] ignoreParsedPluginLibs: 0
[08/29/2023-19:15:08] [I] 
[08/29/2023-19:15:08] [I] === Inference Options ===
[08/29/2023-19:15:08] [I] Batch: Explicit
[08/29/2023-19:15:08] [I] Input inference shape: image=1x3x640x640
[08/29/2023-19:15:08] [I] Iterations: 10
[08/29/2023-19:15:08] [I] Duration: 3s (+ 200ms warm up)
[08/29/2023-19:15:08] [I] Sleep time: 0ms
[08/29/2023-19:15:08] [I] Idle time: 0ms
[08/29/2023-19:15:08] [I] Inference Streams: 1
[08/29/2023-19:15:08] [I] ExposeDMA: Disabled
[08/29/2023-19:15:08] [I] Data transfers: Enabled
[08/29/2023-19:15:08] [I] Spin-wait: Disabled
[08/29/2023-19:15:08] [I] Multithreading: Disabled
[08/29/2023-19:15:08] [I] CUDA Graph: Disabled
[08/29/2023-19:15:08] [I] Separate profiling: Disabled
[08/29/2023-19:15:08] [I] Time Deserialize: Disabled
[08/29/2023-19:15:08] [I] Time Refit: Disabled
[08/29/2023-19:15:08] [I] NVTX verbosity: 0
[08/29/2023-19:15:08] [I] Persistent Cache Ratio: 0
[08/29/2023-19:15:08] [I] Inputs:
[08/29/2023-19:15:08] [I] === Reporting Options ===
[08/29/2023-19:15:08] [I] Verbose: Disabled
[08/29/2023-19:15:08] [I] Averages: 10 inferences
[08/29/2023-19:15:08] [I] Percentiles: 90,95,99
[08/29/2023-19:15:08] [I] Dump refittable layers:Disabled
[08/29/2023-19:15:08] [I] Dump output: Disabled
[08/29/2023-19:15:08] [I] Profile: Disabled
[08/29/2023-19:15:08] [I] Export timing to JSON file: 
[08/29/2023-19:15:08] [I] Export output to JSON file: 
[08/29/2023-19:15:08] [I] Export profile to JSON file: 
[08/29/2023-19:15:08] [I] 
[08/29/2023-19:15:08] [I] === Device Information ===
[08/29/2023-19:15:08] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[08/29/2023-19:15:08] [I] Compute Capability: 7.5
[08/29/2023-19:15:08] [I] SMs: 68
[08/29/2023-19:15:08] [I] Device Global Memory: 11011 MiB
[08/29/2023-19:15:08] [I] Shared Memory per SM: 64 KiB
[08/29/2023-19:15:08] [I] Memory Bus Width: 352 bits (ECC disabled)
[08/29/2023-19:15:08] [I] Application Compute Clock Rate: 1.65 GHz
[08/29/2023-19:15:08] [I] Application Memory Clock Rate: 7 GHz
[08/29/2023-19:15:08] [I] 
[08/29/2023-19:15:08] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/29/2023-19:15:08] [I] 
[08/29/2023-19:15:08] [I] TensorRT version: 8.6.1
[08/29/2023-19:15:08] [I] Loading standard plugins
[08/29/2023-19:15:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 18, GPU 488 (MiB)
[08/29/2023-19:15:13] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +896, GPU +174, now: CPU 991, GPU 662 (MiB)
[08/29/2023-19:15:13] [I] Start parsing network model.
[08/29/2023-19:15:13] [I] [TRT] ----------------------------------------------------------------
[08/29/2023-19:15:13] [I] [TRT] Input filename:   ./rtdetr_r50vd_6x_coco.onnx
[08/29/2023-19:15:13] [I] [TRT] ONNX IR version:  0.0.8
[08/29/2023-19:15:13] [I] [TRT] Opset version:    16
[08/29/2023-19:15:13] [I] [TRT] Producer name:    
[08/29/2023-19:15:13] [I] [TRT] Producer version: 
[08/29/2023-19:15:13] [I] [TRT] Domain:           
[08/29/2023-19:15:13] [I] [TRT] Model version:    0
[08/29/2023-19:15:13] [I] [TRT] Doc string:       
[08/29/2023-19:15:13] [I] [TRT] ----------------------------------------------------------------
[08/29/2023-19:15:13] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/29/2023-19:15:13] [I] Finished parsing network model. Parse time: 0.339291
[08/29/2023-19:15:13] [W] Dynamic dimensions required for input: im_shape, but no shapes were provided. Automatically overriding shape to: 1x2
[08/29/2023-19:15:13] [W] Dynamic dimensions required for input: scale_factor, but no shapes were provided. Automatically overriding shape to: 1x2
[08/29/2023-19:15:13] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/29/2023-19:15:13] [W] [TRT] Detected layernorm nodes in FP16: p2o.ReduceMean.10, p2o.Sub.0, p2o.Pow.0, p2o.Add.44, p2o.Sqrt.0, p2o.Div.0, p2o.Mul.2, p2o.Add.46, p2o.Sub.2, p2o.Pow.2, p2o.Add.56, p2o.Sqrt.2, p2o.Div.4, p2o.Mul.7, p2o.Add.58, p2o.Sub.4, p2o.Pow.4, p2o.Add.104, p2o.Sqrt.4, p2o.Div.6, p2o.Mul.57, p2o.Add.106, p2o.Sub.6, p2o.Pow.6, p2o.ReduceMean.14, p2o.Add.134, p2o.Sqrt.6, p2o.Div.8, p2o.Mul.61, p2o.Add.136, p2o.Sub.8, p2o.Pow.8, p2o.ReduceMean.18, p2o.Add.154, p2o.Sqrt.8, p2o.Div.10, p2o.Mul.81, p2o.Add.156, p2o.Sub.10, p2o.Pow.10, p2o.ReduceMean.22, p2o.Add.164, p2o.Sqrt.10, p2o.Div.12, p2o.Mul.83, p2o.Add.166, p2o.Sub.12, p2o.Pow.12, p2o.ReduceMean.26, p2o.Add.194, p2o.Sqrt.12, p2o.Div.16, p2o.Mul.89, p2o.Add.196, p2o.Sub.14, p2o.Pow.14, p2o.ReduceMean.30, p2o.Add.214, p2o.Sqrt.14, p2o.Div.18, p2o.Mul.109, p2o.Add.216, p2o.Sub.16, p2o.Pow.16, p2o.ReduceMean.34, p2o.Add.224, p2o.Sqrt.16, p2o.Div.20, p2o.Mul.111, p2o.Add.226, p2o.Sub.18, p2o.Pow.18, p2o.ReduceMean.38, p2o.Add.254, p2o.Sqrt.18, p2o.Div.24, p2o.Mul.117, p2o.Add.256, p2o.Sub.20, p2o.Pow.20, p2o.ReduceMean.42, p2o.Add.274, p2o.Sqrt.20, p2o.Div.26, p2o.Mul.137, p2o.Add.276, p2o.Sub.22, p2o.Pow.22, p2o.ReduceMean.46, p2o.Add.284, p2o.Sqrt.22, p2o.Div.28, p2o.Mul.139, p2o.Add.286, p2o.Sub.24, p2o.Pow.24, p2o.ReduceMean.50, p2o.Add.314, p2o.Sqrt.24, p2o.Div.32, p2o.Mul.145, p2o.Add.316, p2o.Sub.26, p2o.Pow.26, p2o.ReduceMean.54, p2o.Add.334, p2o.Sqrt.26, p2o.Div.34, p2o.Mul.165, p2o.Add.336, p2o.Sub.28, p2o.Pow.28, p2o.ReduceMean.58, p2o.Add.344, p2o.Sqrt.28, p2o.Div.36, p2o.Mul.167, p2o.Add.346, p2o.Sub.30, p2o.Pow.30, p2o.ReduceMean.62, p2o.Add.374, p2o.Sqrt.30, p2o.Div.40, p2o.Mul.173, p2o.Add.376, p2o.Sub.32, p2o.Pow.32, p2o.ReduceMean.66, p2o.Add.394, p2o.Sqrt.32, p2o.Div.42, p2o.Mul.193, p2o.Add.396, p2o.Sub.34, p2o.Pow.34, p2o.ReduceMean.70, p2o.Add.404, p2o.Sqrt.34, p2o.Div.44, p2o.Mul.195, p2o.Add.406, p2o.Sub.36, p2o.Pow.36, p2o.ReduceMean.74, p2o.Add.434, p2o.Sqrt.36, 
p2o.Div.48, p2o.Mul.201, p2o.Add.436, p2o.Sub.38, p2o.Pow.38, p2o.ReduceMean.78, p2o.Add.454, p2o.Sqrt.38, p2o.Div.50, p2o.Mul.221, p2o.Add.456, p2o.Sub.40, p2o.Pow.40, p2o.ReduceMean.82, p2o.Add.464, p2o.Sqrt.40, p2o.Div.52, p2o.Mul.223, p2o.Add.466, p2o.ReduceMean.2, p2o.ReduceMean.6
[08/29/2023-19:15:13] [W] [TRT] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/29/2023-19:15:14] [I] [TRT] Graph optimization time: 0.432647 seconds.
[08/29/2023-19:15:14] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/29/2023-19:15:14] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/29/2023-19:20:31] [I] [TRT] Detected 3 inputs and 2 output network tensors.
[08/29/2023-19:20:31] [I] [TRT] Total Host Persistent Memory: 443840
[08/29/2023-19:20:31] [I] [TRT] Total Device Persistent Memory: 833536
[08/29/2023-19:20:31] [I] [TRT] Total Scratch Memory: 14330880
[08/29/2023-19:20:31] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 172 MiB, GPU 73 MiB
[08/29/2023-19:20:31] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 141 steps to complete.
[08/29/2023-19:20:31] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 5.75286ms to assign 9 blocks to 141 nodes requiring 34204160 bytes.
[08/29/2023-19:20:31] [I] [TRT] Total Activation Memory: 34204160
[08/29/2023-19:20:32] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[08/29/2023-19:20:32] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[08/29/2023-19:20:32] [W] [TRT] Check verbose logs for the list of affected weights.
[08/29/2023-19:20:32] [W] [TRT] - 1 weights are affected by this issue: Detected FP32 infinity values and converted them to corresponding FP16 infinity.
[08/29/2023-19:20:32] [W] [TRT] - 223 weights are affected by this issue: Detected subnormal FP16 values.
[08/29/2023-19:20:32] [W] [TRT] - 63 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[08/29/2023-19:20:32] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +65, GPU +81, now: CPU 65, GPU 81 (MiB)
[08/29/2023-19:20:32] [I] Engine built in 324.144 sec.
[08/29/2023-19:20:32] [I] [TRT] Loaded engine size: 85 MiB
[08/29/2023-19:20:32] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +81, now: CPU 0, GPU 81 (MiB)
[08/29/2023-19:20:32] [I] Engine deserialized in 0.0416484 sec.
[08/29/2023-19:20:32] [I] [TRT] [MS] Running engine with multi stream info
[08/29/2023-19:20:32] [I] [TRT] [MS] Number of aux streams is 2
[08/29/2023-19:20:32] [I] [TRT] [MS] Number of total worker streams is 3
[08/29/2023-19:20:32] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[08/29/2023-19:20:32] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +33, now: CPU 0, GPU 114 (MiB)
[08/29/2023-19:20:32] [I] Setting persistentCacheLimit to 0 bytes.
[08/29/2023-19:20:32] [I] Using random values for input im_shape
[08/29/2023-19:20:32] [I] Input binding for im_shape with dimensions 1x2 is created.
[08/29/2023-19:20:32] [I] Using random values for input image
[08/29/2023-19:20:32] [I] Input binding for image with dimensions 1x3x640x640 is created.
[08/29/2023-19:20:32] [I] Using random values for input scale_factor
[08/29/2023-19:20:32] [I] Input binding for scale_factor with dimensions 1x2 is created.
[08/29/2023-19:20:32] [I] Output binding for tile_3.tmp_0 with dimensions  is created.
[08/29/2023-19:20:32] [I] Output binding for reshape2_95.tmp_0 with dimensions 300x6 is created.
[08/29/2023-19:20:32] [I] Starting inference
[08/29/2023-19:20:35] [I] Warmup completed 45 queries over 200 ms
[08/29/2023-19:20:35] [I] Timing trace has 671 queries over 3.01199 s
[08/29/2023-19:20:35] [I] 
[08/29/2023-19:20:35] [I] === Trace details ===
[08/29/2023-19:20:35] [I] Trace averages of 10 runs:
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45691 ms - Host latency: 5.00052 ms (enqueue 2.07189 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.44643 ms - Host latency: 4.98667 ms (enqueue 2.07825 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4878 ms - Host latency: 5.02755 ms (enqueue 2.06254 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46815 ms - Host latency: 5.01013 ms (enqueue 2.06558 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45726 ms - Host latency: 4.99639 ms (enqueue 2.07379 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45436 ms - Host latency: 4.99146 ms (enqueue 2.06379 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46643 ms - Host latency: 5.0048 ms (enqueue 2.06562 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48486 ms - Host latency: 5.02398 ms (enqueue 2.06143 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45832 ms - Host latency: 4.99994 ms (enqueue 2.0717 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46133 ms - Host latency: 4.99756 ms (enqueue 2.0851 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45362 ms - Host latency: 4.99407 ms (enqueue 2.0751 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48015 ms - Host latency: 5.0215 ms (enqueue 2.07388 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48145 ms - Host latency: 5.02426 ms (enqueue 2.07272 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45563 ms - Host latency: 4.99669 ms (enqueue 2.07729 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46913 ms - Host latency: 5.01012 ms (enqueue 2.07491 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.52659 ms - Host latency: 5.0699 ms (enqueue 2.08041 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.52965 ms - Host latency: 5.07217 ms (enqueue 1.68873 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.5099 ms - Host latency: 5.04764 ms (enqueue 1.00558 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.49135 ms - Host latency: 5.03224 ms (enqueue 2.07975 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48328 ms - Host latency: 5.02284 ms (enqueue 2.07551 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46193 ms - Host latency: 5.00437 ms (enqueue 2.07556 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45951 ms - Host latency: 4.9968 ms (enqueue 2.0771 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4526 ms - Host latency: 4.99493 ms (enqueue 2.08126 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48311 ms - Host latency: 5.02161 ms (enqueue 2.07546 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4835 ms - Host latency: 5.02534 ms (enqueue 2.07667 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46216 ms - Host latency: 5.00035 ms (enqueue 2.07687 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45424 ms - Host latency: 4.99752 ms (enqueue 2.07772 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48363 ms - Host latency: 5.02688 ms (enqueue 2.07367 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48403 ms - Host latency: 5.02856 ms (enqueue 2.083 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47053 ms - Host latency: 5.01046 ms (enqueue 2.15315 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46604 ms - Host latency: 5.00835 ms (enqueue 2.06006 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46427 ms - Host latency: 5.00811 ms (enqueue 2.08953 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50837 ms - Host latency: 5.0422 ms (enqueue 1.79258 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48506 ms - Host latency: 5.02858 ms (enqueue 2.10857 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48263 ms - Host latency: 5.02445 ms (enqueue 2.0649 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47365 ms - Host latency: 5.01277 ms (enqueue 2.09479 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47847 ms - Host latency: 5.01809 ms (enqueue 2.04818 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50176 ms - Host latency: 5.04219 ms (enqueue 2.0748 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50762 ms - Host latency: 5.04794 ms (enqueue 2.06871 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50122 ms - Host latency: 5.04116 ms (enqueue 2.07521 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48833 ms - Host latency: 5.02892 ms (enqueue 2.07524 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4823 ms - Host latency: 5.02545 ms (enqueue 2.06674 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4752 ms - Host latency: 5.0144 ms (enqueue 2.07634 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47554 ms - Host latency: 5.0187 ms (enqueue 2.07507 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47976 ms - Host latency: 5.01873 ms (enqueue 2.07749 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47539 ms - Host latency: 5.01572 ms (enqueue 2.07463 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47756 ms - Host latency: 5.01377 ms (enqueue 2.0769 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47544 ms - Host latency: 5.01692 ms (enqueue 2.07109 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47598 ms - Host latency: 5.0145 ms (enqueue 2.07634 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47068 ms - Host latency: 5.01055 ms (enqueue 2.06948 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4782 ms - Host latency: 5.01492 ms (enqueue 2.06895 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47981 ms - Host latency: 5.02217 ms (enqueue 2.06775 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47961 ms - Host latency: 5.02263 ms (enqueue 2.07134 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48474 ms - Host latency: 5.02625 ms (enqueue 2.09255 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.52488 ms - Host latency: 5.06416 ms (enqueue 2.07371 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50693 ms - Host latency: 5.04375 ms (enqueue 2.07349 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.49211 ms - Host latency: 5.03394 ms (enqueue 2.07925 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48464 ms - Host latency: 5.02498 ms (enqueue 2.07437 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47805 ms - Host latency: 5.02173 ms (enqueue 2.07852 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.47354 ms - Host latency: 5.01377 ms (enqueue 2.07297 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50503 ms - Host latency: 5.04609 ms (enqueue 2.06426 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50857 ms - Host latency: 5.04917 ms (enqueue 2.07549 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.4821 ms - Host latency: 5.02339 ms (enqueue 2.07751 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.48254 ms - Host latency: 5.02136 ms (enqueue 2.07803 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.46255 ms - Host latency: 5.003 ms (enqueue 2.06277 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.45852 ms - Host latency: 4.99438 ms (enqueue 2.08047 ms)
[08/29/2023-19:20:35] [I] Average on 10 runs - GPU latency: 4.50474 ms - Host latency: 5.04731 ms (enqueue 2.0748 ms)
[08/29/2023-19:20:35] [I] 
[08/29/2023-19:20:35] [I] === Performance summary ===
[08/29/2023-19:20:35] [I] Throughput: 222.776 qps
[08/29/2023-19:20:35] [I] Latency: min = 4.95123 ms, max = 5.09808 ms, mean = 5.02032 ms, median = 5.01892 ms, percentile(90%) = 5.04999 ms, percentile(95%) = 5.0614 ms, percentile(99%) = 5.0824 ms
[08/29/2023-19:20:35] [I] Enqueue Time: min = 0.812134 ms, max = 2.21973 ms, mean = 2.04985 ms, median = 2.07434 ms, percentile(90%) = 2.09802 ms, percentile(95%) = 2.11633 ms, percentile(99%) = 2.16772 ms
[08/29/2023-19:20:35] [I] H2D Latency: min = 0.506348 ms, max = 0.560791 ms, mean = 0.533722 ms, median = 0.53418 ms, percentile(90%) = 0.541504 ms, percentile(95%) = 0.543945 ms, percentile(99%) = 0.549316 ms
[08/29/2023-19:20:35] [I] GPU Compute Time: min = 4.41754 ms, max = 4.55377 ms, mean = 4.47985 ms, median = 4.47876 ms, percentile(90%) = 4.51099 ms, percentile(95%) = 4.51831 ms, percentile(99%) = 4.53955 ms
[08/29/2023-19:20:35] [I] D2H Latency: min = 0.00415039 ms, max = 0.0128174 ms, mean = 0.00676096 ms, median = 0.0065918 ms, percentile(90%) = 0.00817871 ms, percentile(95%) = 0.00878906 ms, percentile(99%) = 0.00970459 ms
[08/29/2023-19:20:35] [I] Total Host Walltime: 3.01199 s
[08/29/2023-19:20:35] [I] Total GPU Compute Time: 3.00598 s
[08/29/2023-19:20:35] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/29/2023-19:20:35] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes=image:1x3x640x640 --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16
    

To run the engine I'm using the TRTInference class from the benchmark directory:

inf = trtinfer.TRTInference(trt_engine_path, backend="torch", max_batch_size=32, verbose=True)

and get this error:

File ~/Projects/sandbox/unvalidated/RT-DETR/benchmark/trtinfer.py:73, in TRTInference.get_bindings(self, engine, context, max_batch_size, device)
     70 shape = engine.get_tensor_shape(name)
     71 dtype = trt.nptype(engine.get_tensor_dtype(name))
---> 73 if shape[0] == -1:
     74     dynamic = True 
     75     shape[0] = max_batch_size

IndexError: Out of bounds

So one of the shapes is an empty tuple ().

I wonder if you could help me; maybe I am doing something wrong during conversion.

P.S. paddlepaddle-gpu==2.5.1.post117
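
One possible workaround for the crash is a defensive guard in get_bindings that tolerates zero-rank (scalar) tensor shapes before indexing shape[0]. A minimal sketch (the helper name normalize_shape is hypothetical, not part of the repo):

```python
def normalize_shape(shape, max_batch_size):
    """Return (shape, dynamic) while tolerating zero-rank (scalar) tensors.

    A scalar tensor reports shape (), so indexing shape[0] raises
    IndexError; checking the rank first avoids the crash.
    """
    shape = list(shape)
    if not shape:              # zero-rank tensor: nothing to override
        return shape, False
    dynamic = shape[0] == -1   # -1 marks a dynamic batch dimension
    if dynamic:
        shape[0] = max_batch_size
    return shape, dynamic
```

Whether a zero-rank output is actually usable downstream is a separate question, but this at least lets binding setup finish instead of dying with IndexError.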

lyuwenyu commented 1 year ago

Can you print the name and shape after this line: shape = engine.get_tensor_shape(name)?


The model includes three inputs; perhaps you should declare all of their shapes explicitly in --shapes:

LD_LIBRARY_PATH=TensorRT-8.6.1.6/lib/ TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes=image:1x3x640x640 --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16
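
A minimal sketch of the suggested debug printout, assuming the TensorRT 8.5+ named-tensor Python API (engine is a deserialized ICudaEngine; the function name is illustrative):

```python
def dump_io_tensors(engine):
    """Print (name, shape) for every I/O tensor so empty shapes stand out."""
    report = []
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        shape = tuple(engine.get_tensor_shape(name))
        print(f"name={name!r} shape={shape}")
        report.append((name, shape))
    return report
```

Any tensor whose shape prints as () is zero-rank and will trip code that indexes shape[0].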
lebionick commented 1 year ago
name='im_shape' shape=(1, 2)
name='image' shape=(1, 3, 640, 640)
name='scale_factor' shape=(1, 2)
name='tile_3.tmp_0' shape=()

It crashes at tile_3.tmp_0, which seems to be the last layer(?)

I tried passing shapes like this:

LD_LIBRARY_PATH=TensorRT-8.6.1.6/lib/ TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes="image:1x3x640x640,scale_factor:1x2,im_shape:1x2" --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16

and got no warnings, but tile_3.tmp_0 still has an empty shape.

Log
[08/31/2023-14:55:01] [W] --workspace flag has been deprecated by --memPoolSize flag.
[08/31/2023-14:55:01] [I] === Model Options ===
[08/31/2023-14:55:01] [I] Format: ONNX
[08/31/2023-14:55:01] [I] Model: ./rtdetr_r50vd_6x_coco.onnx
[08/31/2023-14:55:01] [I] Output:
[08/31/2023-14:55:01] [I] === Build Options ===
[08/31/2023-14:55:01] [I] Max batch: explicit batch
[08/31/2023-14:55:01] [I] Memory Pools: workspace: 4096 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/31/2023-14:55:01] [I] minTiming: 1
[08/31/2023-14:55:01] [I] avgTiming: 8
[08/31/2023-14:55:01] [I] Precision: FP32+FP16
[08/31/2023-14:55:01] [I] LayerPrecisions: 
[08/31/2023-14:55:01] [I] Layer Device Types: 
[08/31/2023-14:55:01] [I] Calibration: 
[08/31/2023-14:55:01] [I] Refit: Disabled
[08/31/2023-14:55:01] [I] Version Compatible: Disabled
[08/31/2023-14:55:01] [I] TensorRT runtime: full
[08/31/2023-14:55:01] [I] Lean DLL Path: 
[08/31/2023-14:55:01] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/31/2023-14:55:01] [I] Exclude Lean Runtime: Disabled
[08/31/2023-14:55:01] [I] Sparsity: Disabled
[08/31/2023-14:55:01] [I] Safe mode: Disabled
[08/31/2023-14:55:01] [I] Build DLA standalone loadable: Disabled
[08/31/2023-14:55:01] [I] Allow GPU fallback for DLA: Disabled
[08/31/2023-14:55:01] [I] DirectIO mode: Disabled
[08/31/2023-14:55:01] [I] Restricted mode: Disabled
[08/31/2023-14:55:01] [I] Skip inference: Disabled
[08/31/2023-14:55:01] [I] Save engine: rtdetr_r50vd_6x_coco.trt
[08/31/2023-14:55:01] [I] Load engine: 
[08/31/2023-14:55:01] [I] Profiling verbosity: 0
[08/31/2023-14:55:01] [I] Tactic sources: Using default tactic sources
[08/31/2023-14:55:01] [I] timingCacheMode: local
[08/31/2023-14:55:01] [I] timingCacheFile: 
[08/31/2023-14:55:01] [I] Heuristic: Disabled
[08/31/2023-14:55:01] [I] Preview Features: Use default preview flags.
[08/31/2023-14:55:01] [I] MaxAuxStreams: -1
[08/31/2023-14:55:01] [I] BuilderOptimizationLevel: -1
[08/31/2023-14:55:01] [I] Input(s)s format: fp32:CHW
[08/31/2023-14:55:01] [I] Output(s)s format: fp32:CHW
[08/31/2023-14:55:01] [I] Input build shape: image=1x3x640x640+1x3x640x640+1x3x640x640
[08/31/2023-14:55:01] [I] Input build shape: scale_factor=1x2+1x2+1x2
[08/31/2023-14:55:01] [I] Input build shape: im_shape=1x2+1x2+1x2
[08/31/2023-14:55:01] [I] Input calibration shapes: model
[08/31/2023-14:55:01] [I] === System Options ===
[08/31/2023-14:55:01] [I] Device: 0
[08/31/2023-14:55:01] [I] DLACore: 
[08/31/2023-14:55:01] [I] Plugins:
[08/31/2023-14:55:01] [I] setPluginsToSerialize:
[08/31/2023-14:55:01] [I] dynamicPlugins:
[08/31/2023-14:55:01] [I] ignoreParsedPluginLibs: 0
[08/31/2023-14:55:01] [I] 
[08/31/2023-14:55:01] [I] === Inference Options ===
[08/31/2023-14:55:01] [I] Batch: Explicit
[08/31/2023-14:55:01] [I] Input inference shape: im_shape=1x2
[08/31/2023-14:55:01] [I] Input inference shape: scale_factor=1x2
[08/31/2023-14:55:01] [I] Input inference shape: image=1x3x640x640
[08/31/2023-14:55:01] [I] Iterations: 10
[08/31/2023-14:55:01] [I] Duration: 3s (+ 200ms warm up)
[08/31/2023-14:55:01] [I] Sleep time: 0ms
[08/31/2023-14:55:01] [I] Idle time: 0ms
[08/31/2023-14:55:01] [I] Inference Streams: 1
[08/31/2023-14:55:01] [I] ExposeDMA: Disabled
[08/31/2023-14:55:01] [I] Data transfers: Enabled
[08/31/2023-14:55:01] [I] Spin-wait: Disabled
[08/31/2023-14:55:01] [I] Multithreading: Disabled
[08/31/2023-14:55:01] [I] CUDA Graph: Disabled
[08/31/2023-14:55:01] [I] Separate profiling: Disabled
[08/31/2023-14:55:01] [I] Time Deserialize: Disabled
[08/31/2023-14:55:01] [I] Time Refit: Disabled
[08/31/2023-14:55:01] [I] NVTX verbosity: 0
[08/31/2023-14:55:01] [I] Persistent Cache Ratio: 0
[08/31/2023-14:55:01] [I] Inputs:
[08/31/2023-14:55:01] [I] === Reporting Options ===
[08/31/2023-14:55:01] [I] Verbose: Disabled
[08/31/2023-14:55:01] [I] Averages: 10 inferences
[08/31/2023-14:55:01] [I] Percentiles: 90,95,99
[08/31/2023-14:55:01] [I] Dump refittable layers:Disabled
[08/31/2023-14:55:01] [I] Dump output: Disabled
[08/31/2023-14:55:01] [I] Profile: Disabled
[08/31/2023-14:55:01] [I] Export timing to JSON file: 
[08/31/2023-14:55:01] [I] Export output to JSON file: 
[08/31/2023-14:55:01] [I] Export profile to JSON file: 
[08/31/2023-14:55:01] [I] 
[08/31/2023-14:55:01] [I] === Device Information ===
[08/31/2023-14:55:01] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[08/31/2023-14:55:01] [I] Compute Capability: 7.5
[08/31/2023-14:55:01] [I] SMs: 68
[08/31/2023-14:55:01] [I] Device Global Memory: 11011 MiB
[08/31/2023-14:55:01] [I] Shared Memory per SM: 64 KiB
[08/31/2023-14:55:01] [I] Memory Bus Width: 352 bits (ECC disabled)
[08/31/2023-14:55:01] [I] Application Compute Clock Rate: 1.65 GHz
[08/31/2023-14:55:01] [I] Application Memory Clock Rate: 7 GHz
[08/31/2023-14:55:01] [I] 
[08/31/2023-14:55:01] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/31/2023-14:55:01] [I] 
[08/31/2023-14:55:01] [I] TensorRT version: 8.6.1
[08/31/2023-14:55:01] [I] Loading standard plugins
[08/31/2023-14:55:01] [I] [TRT] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 18, GPU 488 (MiB)
[08/31/2023-14:55:06] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +896, GPU +174, now: CPU 991, GPU 662 (MiB)
[08/31/2023-14:55:06] [I] Start parsing network model.
[08/31/2023-14:55:06] [I] [TRT] ----------------------------------------------------------------
[08/31/2023-14:55:06] [I] [TRT] Input filename:   ./rtdetr_r50vd_6x_coco.onnx
[08/31/2023-14:55:06] [I] [TRT] ONNX IR version:  0.0.8
[08/31/2023-14:55:06] [I] [TRT] Opset version:    16
[08/31/2023-14:55:06] [I] [TRT] Producer name:    
[08/31/2023-14:55:06] [I] [TRT] Producer version: 
[08/31/2023-14:55:06] [I] [TRT] Domain:           
[08/31/2023-14:55:06] [I] [TRT] Model version:    0
[08/31/2023-14:55:06] [I] [TRT] Doc string:       
[08/31/2023-14:55:06] [I] [TRT] ----------------------------------------------------------------
[08/31/2023-14:55:06] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/31/2023-14:55:07] [I] Finished parsing network model. Parse time: 0.34223
[08/31/2023-14:55:07] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/31/2023-14:55:07] [W] [TRT] Detected layernorm nodes in FP16: p2o.Sub.0, p2o.Pow.0, p2o.Add.44, p2o.Sqrt.0, p2o.Div.0, p2o.Mul.2, p2o.Add.46, p2o.Sub.2, p2o.Pow.2, p2o.Add.56, p2o.Sqrt.2, p2o.Div.4, p2o.Mul.7, p2o.Add.58, p2o.Sub.4, p2o.Pow.4, p2o.Add.104, p2o.Sqrt.4, p2o.Div.6, p2o.Mul.57, p2o.Add.106, p2o.Sub.6, p2o.Pow.6, p2o.ReduceMean.14, p2o.Add.134, p2o.Sqrt.6, p2o.Div.8, p2o.Mul.61, p2o.Add.136, p2o.Sub.8, p2o.Pow.8, p2o.ReduceMean.18, p2o.Add.154, p2o.Sqrt.8, p2o.Div.10, p2o.Mul.81, p2o.Add.156, p2o.Sub.10, p2o.Pow.10, p2o.ReduceMean.22, p2o.Add.164, p2o.Sqrt.10, p2o.Div.12, p2o.Mul.83, p2o.Add.166, p2o.Sub.12, p2o.Pow.12, p2o.ReduceMean.26, p2o.Add.194, p2o.Sqrt.12, p2o.Div.16, p2o.Mul.89, p2o.Add.196, p2o.Sub.14, p2o.Pow.14, p2o.ReduceMean.30, p2o.Add.214, p2o.Sqrt.14, p2o.Div.18, p2o.Mul.109, p2o.Add.216, p2o.Sub.16, p2o.Pow.16, p2o.ReduceMean.34, p2o.Add.224, p2o.Sqrt.16, p2o.Div.20, p2o.Mul.111, p2o.Add.226, p2o.Sub.18, p2o.Pow.18, p2o.ReduceMean.38, p2o.Add.254, p2o.Sqrt.18, p2o.Div.24, p2o.Mul.117, p2o.Add.256, p2o.Sub.20, p2o.Pow.20, p2o.ReduceMean.42, p2o.Add.274, p2o.Sqrt.20, p2o.Div.26, p2o.Mul.137, p2o.Add.276, p2o.Sub.22, p2o.Pow.22, p2o.ReduceMean.46, p2o.Add.284, p2o.Sqrt.22, p2o.Div.28, p2o.Mul.139, p2o.Add.286, p2o.Sub.24, p2o.Pow.24, p2o.ReduceMean.50, p2o.Add.314, p2o.Sqrt.24, p2o.Div.32, p2o.Mul.145, p2o.Add.316, p2o.Sub.26, p2o.Pow.26, p2o.ReduceMean.54, p2o.Add.334, p2o.Sqrt.26, p2o.Div.34, p2o.Mul.165, p2o.Add.336, p2o.Sub.28, p2o.Pow.28, p2o.ReduceMean.58, p2o.Add.344, p2o.Sqrt.28, p2o.Div.36, p2o.Mul.167, p2o.Add.346, p2o.Sub.30, p2o.Pow.30, p2o.ReduceMean.62, p2o.Add.374, p2o.Sqrt.30, p2o.Div.40, p2o.Mul.173, p2o.Add.376, p2o.Sub.32, p2o.Pow.32, p2o.ReduceMean.66, p2o.Add.394, p2o.Sqrt.32, p2o.Div.42, p2o.Mul.193, p2o.Add.396, p2o.Sub.34, p2o.Pow.34, p2o.ReduceMean.70, p2o.Add.404, p2o.Sqrt.34, p2o.Div.44, p2o.Mul.195, p2o.Add.406, p2o.Sub.36, p2o.Pow.36, p2o.ReduceMean.74, p2o.Add.434, p2o.Sqrt.36, p2o.Div.48, p2o.Mul.201, 
p2o.Add.436, p2o.Sub.38, p2o.Pow.38, p2o.ReduceMean.78, p2o.Add.454, p2o.Sqrt.38, p2o.Div.50, p2o.Mul.221, p2o.Add.456, p2o.Sub.40, p2o.Pow.40, p2o.ReduceMean.82, p2o.Add.464, p2o.Sqrt.40, p2o.Div.52, p2o.Mul.223, p2o.Add.466, p2o.ReduceMean.10, p2o.ReduceMean.2, p2o.ReduceMean.6
[08/31/2023-14:55:07] [W] [TRT] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[08/31/2023-14:55:07] [I] [TRT] Graph optimization time: 0.403732 seconds.
[08/31/2023-14:55:07] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/31/2023-14:55:07] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/31/2023-15:00:26] [I] [TRT] Detected 3 inputs and 2 output network tensors.
[08/31/2023-15:00:27] [I] [TRT] Total Host Persistent Memory: 439056
[08/31/2023-15:00:27] [I] [TRT] Total Device Persistent Memory: 834560
[08/31/2023-15:00:27] [I] [TRT] Total Scratch Memory: 14330880
[08/31/2023-15:00:27] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 172 MiB, GPU 73 MiB
[08/31/2023-15:00:27] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 143 steps to complete.
[08/31/2023-15:00:27] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 5.89313ms to assign 9 blocks to 143 nodes requiring 34204160 bytes.
[08/31/2023-15:00:27] [I] [TRT] Total Activation Memory: 34204160
[08/31/2023-15:00:27] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[08/31/2023-15:00:27] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[08/31/2023-15:00:27] [W] [TRT] Check verbose logs for the list of affected weights.
[08/31/2023-15:00:27] [W] [TRT] - 1 weights are affected by this issue: Detected FP32 infinity values and converted them to corresponding FP16 infinity.
[08/31/2023-15:00:27] [W] [TRT] - 223 weights are affected by this issue: Detected subnormal FP16 values.
[08/31/2023-15:00:27] [W] [TRT] - 63 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[08/31/2023-15:00:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +65, GPU +81, now: CPU 65, GPU 81 (MiB)
[08/31/2023-15:00:27] [I] Engine built in 326.066 sec.
[08/31/2023-15:00:27] [I] [TRT] Loaded engine size: 85 MiB
[08/31/2023-15:00:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +81, now: CPU 0, GPU 81 (MiB)
[08/31/2023-15:00:27] [I] Engine deserialized in 0.0433515 sec.
[08/31/2023-15:00:27] [I] [TRT] [MS] Running engine with multi stream info
[08/31/2023-15:00:27] [I] [TRT] [MS] Number of aux streams is 2
[08/31/2023-15:00:27] [I] [TRT] [MS] Number of total worker streams is 3
[08/31/2023-15:00:27] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[08/31/2023-15:00:27] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +33, now: CPU 0, GPU 114 (MiB)
[08/31/2023-15:00:27] [I] Setting persistentCacheLimit to 0 bytes.
[08/31/2023-15:00:27] [I] Using random values for input im_shape
[08/31/2023-15:00:27] [I] Input binding for im_shape with dimensions 1x2 is created.
[08/31/2023-15:00:27] [I] Using random values for input image
[08/31/2023-15:00:28] [I] Input binding for image with dimensions 1x3x640x640 is created.
[08/31/2023-15:00:28] [I] Using random values for input scale_factor
[08/31/2023-15:00:28] [I] Input binding for scale_factor with dimensions 1x2 is created.
[08/31/2023-15:00:28] [I] Output binding for tile_3.tmp_0 with dimensions  is created.
[08/31/2023-15:00:28] [I] Output binding for reshape2_95.tmp_0 with dimensions 300x6 is created.
[08/31/2023-15:00:28] [I] Starting inference
[08/31/2023-15:00:31] [I] Warmup completed 45 queries over 200 ms
[08/31/2023-15:00:31] [I] Timing trace has 669 queries over 3.01153 s
[08/31/2023-15:00:31] [I] 
[08/31/2023-15:00:31] [I] === Trace details ===
[08/31/2023-15:00:31] [I] Trace averages of 10 runs:
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.45693 ms - Host latency: 5.00163 ms (enqueue 1.44989 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.45289 ms - Host latency: 4.99629 ms (enqueue 1.27771 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49351 ms - Host latency: 5.03987 ms (enqueue 1.50723 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48447 ms - Host latency: 5.03333 ms (enqueue 1.45529 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.46169 ms - Host latency: 5.00724 ms (enqueue 1.81611 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.46317 ms - Host latency: 5.00313 ms (enqueue 1.16371 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.44719 ms - Host latency: 4.99411 ms (enqueue 1.92591 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47023 ms - Host latency: 5.01824 ms (enqueue 2.03996 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49032 ms - Host latency: 5.03456 ms (enqueue 2.03104 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49633 ms - Host latency: 5.04483 ms (enqueue 1.82703 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48669 ms - Host latency: 5.02964 ms (enqueue 1.8278 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48203 ms - Host latency: 5.02427 ms (enqueue 1.39008 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48318 ms - Host latency: 5.02858 ms (enqueue 1.37376 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48325 ms - Host latency: 5.02468 ms (enqueue 1.27521 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.5146 ms - Host latency: 5.06036 ms (enqueue 1.08528 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51545 ms - Host latency: 5.06132 ms (enqueue 1.78871 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49631 ms - Host latency: 5.04066 ms (enqueue 1.78268 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.4944 ms - Host latency: 5.04268 ms (enqueue 1.63381 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48693 ms - Host latency: 5.02889 ms (enqueue 1.44833 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.4881 ms - Host latency: 5.03709 ms (enqueue 1.91923 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48943 ms - Host latency: 5.03877 ms (enqueue 1.9224 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47816 ms - Host latency: 5.02123 ms (enqueue 1.57825 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48765 ms - Host latency: 5.03352 ms (enqueue 1.05079 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48776 ms - Host latency: 5.02734 ms (enqueue 1.0729 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49819 ms - Host latency: 5.0441 ms (enqueue 1.86301 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.52578 ms - Host latency: 5.07155 ms (enqueue 1.98652 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.5193 ms - Host latency: 5.06476 ms (enqueue 1.92823 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48706 ms - Host latency: 5.03068 ms (enqueue 1.09467 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47623 ms - Host latency: 5.02465 ms (enqueue 1.70804 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.46544 ms - Host latency: 5.00983 ms (enqueue 2.03997 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50546 ms - Host latency: 5.04865 ms (enqueue 2.05879 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51282 ms - Host latency: 5.05753 ms (enqueue 1.87915 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50448 ms - Host latency: 5.04973 ms (enqueue 1.43376 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48164 ms - Host latency: 5.01892 ms (enqueue 1.56187 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47012 ms - Host latency: 5.00725 ms (enqueue 1.41846 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50891 ms - Host latency: 5.05251 ms (enqueue 1.05018 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.5177 ms - Host latency: 5.05718 ms (enqueue 1.60492 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48661 ms - Host latency: 5.022 ms (enqueue 1.25336 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48765 ms - Host latency: 5.0371 ms (enqueue 1.88019 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48152 ms - Host latency: 5.02531 ms (enqueue 2.06493 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51396 ms - Host latency: 5.05807 ms (enqueue 2.06946 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51018 ms - Host latency: 5.06058 ms (enqueue 2.05632 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48259 ms - Host latency: 5.03088 ms (enqueue 1.86101 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47998 ms - Host latency: 5.02954 ms (enqueue 2.07688 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49255 ms - Host latency: 5.03833 ms (enqueue 2.05876 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.52056 ms - Host latency: 5.06689 ms (enqueue 2.0646 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50173 ms - Host latency: 5.04526 ms (enqueue 1.44507 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48713 ms - Host latency: 5.03086 ms (enqueue 1.45964 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47297 ms - Host latency: 5.0217 ms (enqueue 1.98777 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.4876 ms - Host latency: 5.03291 ms (enqueue 1.72166 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50876 ms - Host latency: 5.05713 ms (enqueue 2.04985 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51277 ms - Host latency: 5.05891 ms (enqueue 2.04622 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.47832 ms - Host latency: 5.02549 ms (enqueue 2.03547 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48232 ms - Host latency: 5.02056 ms (enqueue 1.15129 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.46726 ms - Host latency: 5.01082 ms (enqueue 1.07268 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51665 ms - Host latency: 5.06235 ms (enqueue 1.45144 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51802 ms - Host latency: 5.06448 ms (enqueue 2.05554 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49619 ms - Host latency: 5.04353 ms (enqueue 2.02344 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.49478 ms - Host latency: 5.03818 ms (enqueue 1.64333 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48669 ms - Host latency: 5.03503 ms (enqueue 2.04768 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50999 ms - Host latency: 5.0573 ms (enqueue 2.04263 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.51379 ms - Host latency: 5.05249 ms (enqueue 1.92029 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.54653 ms - Host latency: 5.09473 ms (enqueue 2.05488 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.52471 ms - Host latency: 5.07537 ms (enqueue 2.03877 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.50703 ms - Host latency: 5.04866 ms (enqueue 2.06172 ms)
[08/31/2023-15:00:31] [I] Average on 10 runs - GPU latency: 4.48103 ms - Host latency: 5.02456 ms (enqueue 2.02048 ms)
[08/31/2023-15:00:31] [I] 
[08/31/2023-15:00:31] [I] === Performance summary ===
[08/31/2023-15:00:31] [I] Throughput: 222.146 qps
[08/31/2023-15:00:31] [I] Latency: min = 4.97229 ms, max = 5.25537 ms, mean = 5.03733 ms, median = 5.03564 ms, percentile(90%) = 5.07141 ms, percentile(95%) = 5.07983 ms, percentile(99%) = 5.10303 ms
[08/31/2023-15:00:31] [I] Enqueue Time: min = 0.800537 ms, max = 2.9259 ms, mean = 1.70667 ms, median = 2.02332 ms, percentile(90%) = 2.09106 ms, percentile(95%) = 2.12305 ms, percentile(99%) = 2.23798 ms
[08/31/2023-15:00:31] [I] H2D Latency: min = 0.498535 ms, max = 0.581665 ms, mean = 0.537984 ms, median = 0.53833 ms, percentile(90%) = 0.547852 ms, percentile(95%) = 0.551025 ms, percentile(99%) = 0.557373 ms
[08/31/2023-15:00:31] [I] GPU Compute Time: min = 4.42865 ms, max = 4.71533 ms, mean = 4.49246 ms, median = 4.48999 ms, percentile(90%) = 4.52344 ms, percentile(95%) = 4.53076 ms, percentile(99%) = 4.55444 ms
[08/31/2023-15:00:31] [I] D2H Latency: min = 0.00488281 ms, max = 0.0146484 ms, mean = 0.00689398 ms, median = 0.00683594 ms, percentile(90%) = 0.00805664 ms, percentile(95%) = 0.00842285 ms, percentile(99%) = 0.009552 ms
[08/31/2023-15:00:31] [I] Total Host Walltime: 3.01153 s
[08/31/2023-15:00:31] [I] Total GPU Compute Time: 3.00545 s
[08/31/2023-15:00:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/31/2023-15:00:31] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # TensorRT-8.6.1.6/bin/trtexec --onnx=./rtdetr_r50vd_6x_coco.onnx --workspace=4096 --shapes=image:1x3x640x640,scale_factor:1x2,im_shape:1x2 --saveEngine=rtdetr_r50vd_6x_coco.trt --avgRuns=10 --fp16
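Regarding the FP16 layernorm overflow warning in the log above: one possible mitigation, without retraining or re-exporting, is to pin the reported layernorm arithmetic to FP32 while keeping the rest of the network in FP16. This is only a sketch assuming TensorRT 8.6's `--precisionConstraints`/`--layerPrecisions` trtexec flags; the wildcard layer-name patterns are an illustration and may need adjusting to the actual node names from the warning:

```
# Sketch: build with FP16 overall, but force layernorm math to FP32.
# The p2o.* patterns are assumptions based on the warning in the log.
LD_LIBRARY_PATH=TensorRT-8.6.1.6/lib/ TensorRT-8.6.1.6/bin/trtexec \
    --onnx=./rtdetr_r50vd_6x_coco.onnx \
    --shapes=image:1x3x640x640,scale_factor:1x2,im_shape:1x2 \
    --saveEngine=rtdetr_r50vd_6x_coco.trt \
    --fp16 \
    --precisionConstraints=obey \
    --layerPrecisions="p2o.Pow.*:fp32,p2o.Sqrt.*:fp32,p2o.Div.*:fp32,p2o.ReduceMean.*:fp32"
```

Alternatively, as the warning itself suggests, re-exporting with ONNX opset 17+ lets TensorRT fuse these nodes into `INormalizationLayer`, which handles the accumulation in higher precision.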
    
lyuwenyu commented 1 year ago

Yes, that should be the output name. It doesn't have a shape, likely due to a version issue.

You can try onnxsim (onnx-simplifier) to process the ONNX file, then check that the output has a shape:

onnxsim rtdetr_r50vd_6x_coco.onnx rtdetr_r50vd_6x_coco_new.onnx  --overwrite-input-shape im_shape:1,2 image:1,3,640,640 scale_factor:1,2
lebionick commented 1 year ago

I've installed paddlepaddle 2.4.2 and TRT inference now works, as does onnxsim (python3 -m pip install paddlepaddle-gpu==2.4.2.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html).
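For anyone wiring the engine up afterwards: the trtexec log shows the engine binds three inputs (image 1x3x640x640, im_shape 1x2, scale_factor 1x2). Below is a minimal NumPy preprocessing sketch; the `[scale_y, scale_x]` convention for `scale_factor` is an assumption based on PaddleDetection's deploy pipeline, so verify it against your own export:

```python
import numpy as np


def resize_nearest(img, size):
    # Cheap nearest-neighbour resize; swap in cv2.resize for real use.
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[ys][:, xs]


def preprocess(img_hwc_uint8, target=640):
    # Produce the three tensors the engine binds (see the trtexec log):
    #   image: 1x3x640x640, im_shape: 1x2, scale_factor: 1x2
    h, w = img_hwc_uint8.shape[:2]
    img = resize_nearest(img_hwc_uint8, target).astype(np.float32) / 255.0
    img = np.transpose(img, (2, 0, 1))[None]  # HWC -> NCHW with batch dim
    im_shape = np.array([[target, target]], dtype=np.float32)
    scale_factor = np.array([[target / h, target / w]], dtype=np.float32)
    return img, im_shape, scale_factor
```

The reshape2_95.tmp_0 output then comes back as 300x6 rows of [class_id, score, x1, y1, x2, y2], which can be divided by the scale factors to map boxes back to the original image.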