laugh12321 / TensorRT-YOLO

🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
https://github.com/laugh12321/TensorRT-YOLO

[Help]: Invalid access in BaseDet::postProcess() #55

Closed: STVHA closed this issue 2 weeks ago

STVHA commented 2 weeks ago

DetectionResult BaseDet::postProcess(const int idx) {
    int num = static_cast<int*>(tensorInfos[1].tensor.host())[idx];
    float* boxes = static_cast<float*>(tensorInfos[2].tensor.host()) + idx * tensorInfos[2].dims.d[1] * tensorInfos[2].dims.d[2];
    float* scores = static_cast<float*>(tensorInfos[3].tensor.host()) + idx * tensorInfos[3].dims.d[1];
    int* classes = static_cast<int*>(tensorInfos[4].tensor.host()) + idx * tensorInfos[4].dims.d[1];
    // ....
}

When I run the demo/detect, getNbIOTensors() returns 2, so the accesses to tensorInfos[2], tensorInfos[3], and tensorInfos[4] crash with an invalid access.
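
A minimal sketch of a guard that fails fast instead of reading past the end of tensorInfos, assuming the TensorRT 10 C++ API (this is illustrative, not TensorRT-YOLO's actual code); the five expected tensors are the ones a trtyolo export produces, as discussed below:

#include <iostream>
#include <NvInfer.h>

// Guard: a trtyolo export with the EfficientNMS plugin exposes 5 IO tensors
// (images + num_dets, det_boxes, det_scores, det_classes); a plain
// ultralytics export exposes only 2 (images, output0).
bool hasExpectedIOTensors(const nvinfer1::ICudaEngine& engine) {
    const int32_t nbIO = engine.getNbIOTensors();
    if (nbIO != 5) {
        std::cerr << "Expected 5 IO tensors, got " << nbIO
                  << "; re-export the ONNX with trtyolo so the NMS plugin is included.\n";
        return false;
    }
    for (int32_t i = 0; i < nbIO; ++i) {
        const char* name = engine.getIOTensorName(i);
        const bool isInput =
            engine.getTensorIOMode(name) == nvinfer1::TensorIOMode::kINPUT;
        std::cout << (isInput ? "input:  " : "output: ") << name << "\n";
    }
    return true;
}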

laugh12321 commented 2 weeks ago

@STVHA If you are exporting with ultralytics and getNbIOTensors() returns 2 (images and output0), refer to the Model Inference Example - Detection Model section in the documentation. To get the expected getNbIOTensors() value of 5, you need to export the ONNX model with the trtyolo CLI tool, which registers the EfficientNMS plugin.


Use ultralytics: [image of the exported model structure]

Use trtyolo: [image of the exported model structure]

STVHA commented 2 weeks ago

@laugh12321 Thank you for your advice. So does that mean the code does not support models exported by ultralytics?

laugh12321 commented 2 weeks ago

@STVHA Yes, TensorRT-YOLO does not support models exported by ultralytics. trtyolo modifies the ultralytics-exported model, incorporating an NMS plugin that significantly speeds up post-processing. This approach is faster than implementing NMS in CUDA kernels or on the CPU, making it the optimal solution for YOLO inference on NVIDIA devices.
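
A rough sketch of why this makes host-side post-processing cheap, assuming the EfficientNMS output layout (num_dets, det_boxes as x1/y1/x2/y2, det_scores, det_classes); the Box struct and function are illustrative, not TensorRT-YOLO's actual code:

#include <vector>

// Hypothetical detection record for illustration.
struct Box { float left, top, right, bottom, score; int cls; };

// With NMS already done inside the engine, the host just copies the
// detections that EfficientNMS kept: no confidence thresholding, box
// decoding, or NMS loop is needed here.
std::vector<Box> readDetections(const int* numDets, const float* boxes,
                                const float* scores, const int* classes) {
    const int num = numDets[0];  // number of detections kept by EfficientNMS
    std::vector<Box> out;
    out.reserve(num);
    for (int i = 0; i < num; ++i) {
        out.push_back({boxes[i * 4 + 0], boxes[i * 4 + 1],   // x1, y1
                       boxes[i * 4 + 2], boxes[i * 4 + 3],   // x2, y2
                       scores[i], classes[i]});
    }
    return out;
}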

STVHA commented 1 week ago

I followed the guide and exported the ONNX and engine files using trtyolo as below:

trtyolo export -w yolov10-models\yolov10m.pt -v yolov10 -o output
trtexec --onnx=output\yolov10m.onnx --saveEngine=output\yolov10m.engine --fp16

But at inference I always get zero objects, regardless of YOLOv10 or YOLOv11:

int num = static_cast<int*>(tensorInfos[1].tensor.host())[idx];

What are the possible reasons?

laugh12321 commented 1 week ago

@STVHA If you run inference with yolov10m.pt on the same images from the official repository and no targets are detected, then in theory TensorRT-YOLO should produce the same result: no targets detected. To confirm this, it is recommended to test with multiple images to verify the model's consistency and accuracy.

STVHA commented 1 week ago

@laugh12321 I tested the same image with the "yolo" CLI and the .pt weights, and the result did contain detected objects. Here is the engine export log; could you please point out some clues?

[11/19/2024-14:10:44] [I] === Model Options ===
[11/19/2024-14:10:44] [I] Format: ONNX
[11/19/2024-14:10:44] [I] Model: yolov10m.onnx
[11/19/2024-14:10:44] [I] Output:
[11/19/2024-14:10:44] [I] === Build Options ===
[11/19/2024-14:10:44] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[11/19/2024-14:10:44] [I] avgTiming: 8
[11/19/2024-14:10:44] [I] Precision: FP32
[11/19/2024-14:10:44] [I] LayerPrecisions:
[11/19/2024-14:10:44] [I] Layer Device Types:
[11/19/2024-14:10:44] [I] Calibration:
[11/19/2024-14:10:44] [I] Refit: Disabled
[11/19/2024-14:10:44] [I] Strip weights: Disabled
[11/19/2024-14:10:44] [I] Version Compatible: Disabled
[11/19/2024-14:10:44] [I] ONNX Plugin InstanceNorm: Disabled
[11/19/2024-14:10:44] [I] TensorRT runtime: full
[11/19/2024-14:10:44] [I] Lean DLL Path:
[11/19/2024-14:10:44] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[11/19/2024-14:10:44] [I] Exclude Lean Runtime: Disabled
[11/19/2024-14:10:44] [I] Sparsity: Disabled
[11/19/2024-14:10:44] [I] Safe mode: Disabled
[11/19/2024-14:10:44] [I] Build DLA standalone loadable: Disabled
[11/19/2024-14:10:44] [I] Allow GPU fallback for DLA: Disabled
[11/19/2024-14:10:44] [I] DirectIO mode: Disabled
[11/19/2024-14:10:44] [I] Restricted mode: Disabled
[11/19/2024-14:10:44] [I] Skip inference: Disabled
[11/19/2024-14:10:44] [I] Save engine: yolov10m.engine
[11/19/2024-14:10:44] [I] Load engine:
[11/19/2024-14:10:44] [I] Profiling verbosity: 0
[11/19/2024-14:10:44] [I] Tactic sources: Using default tactic sources
[11/19/2024-14:10:44] [I] timingCacheMode: local
[11/19/2024-14:10:44] [I] timingCacheFile:
[11/19/2024-14:10:44] [I] Enable Compilation Cache: Enabled
[11/19/2024-14:10:44] [I] Enable Monitor Memory: Disabled
[11/19/2024-14:10:44] [I] errorOnTimingCacheMiss: Disabled
[11/19/2024-14:10:44] [I] Preview Features: Use default preview flags.
[11/19/2024-14:10:44] [I] MaxAuxStreams: -1
[11/19/2024-14:10:44] [I] BuilderOptimizationLevel: -1
[11/19/2024-14:10:44] [I] MaxTactics: -1
[11/19/2024-14:10:44] [I] Calibration Profile Index: 0
[11/19/2024-14:10:44] [I] Weight Streaming: Disabled
[11/19/2024-14:10:44] [I] Runtime Platform: Same As Build
[11/19/2024-14:10:44] [I] Debug Tensors:
[11/19/2024-14:10:44] [I] Input(s)s format: fp32:CHW
[11/19/2024-14:10:44] [I] Output(s)s format: fp32:CHW
[11/19/2024-14:10:44] [I] Input build shapes: model
[11/19/2024-14:10:44] [I] Input calibration shapes: model
[11/19/2024-14:10:44] [I] === System Options ===
[11/19/2024-14:10:44] [I] Device: 0
[11/19/2024-14:10:44] [I] DLACore:
[11/19/2024-14:10:44] [I] Plugins:
[11/19/2024-14:10:44] [I] setPluginsToSerialize:
[11/19/2024-14:10:44] [I] dynamicPlugins:
[11/19/2024-14:10:44] [I] ignoreParsedPluginLibs: 0
[11/19/2024-14:10:44] [I]
[11/19/2024-14:10:44] [I] === Inference Options ===
[11/19/2024-14:10:44] [I] Batch: Explicit
[11/19/2024-14:10:44] [I] Input inference shapes: model
[11/19/2024-14:10:44] [I] Iterations: 10
[11/19/2024-14:10:44] [I] Duration: 3s (+ 200ms warm up)
[11/19/2024-14:10:44] [I] Sleep time: 0ms
[11/19/2024-14:10:44] [I] Idle time: 0ms
[11/19/2024-14:10:44] [I] Inference Streams: 1
[11/19/2024-14:10:44] [I] ExposeDMA: Disabled
[11/19/2024-14:10:44] [I] Data transfers: Enabled
[11/19/2024-14:10:44] [I] Spin-wait: Disabled
[11/19/2024-14:10:44] [I] Multithreading: Disabled
[11/19/2024-14:10:44] [I] CUDA Graph: Disabled
[11/19/2024-14:10:44] [I] Separate profiling: Disabled
[11/19/2024-14:10:44] [I] Time Deserialize: Disabled
[11/19/2024-14:10:44] [I] Time Refit: Disabled
[11/19/2024-14:10:44] [I] NVTX verbosity: 0
[11/19/2024-14:10:44] [I] Persistent Cache Ratio: 0
[11/19/2024-14:10:44] [I] Optimization Profile Index: 0
[11/19/2024-14:10:44] [I] Weight Streaming Budget: 100.000000%
[11/19/2024-14:10:44] [I] Inputs:
[11/19/2024-14:10:44] [I] Debug Tensor Save Destinations:
[11/19/2024-14:10:44] [I] === Reporting Options ===
[11/19/2024-14:10:44] [I] Verbose: Disabled
[11/19/2024-14:10:44] [I] Averages: 10 inferences
[11/19/2024-14:10:44] [I] Percentiles: 90,95,99
[11/19/2024-14:10:44] [I] Dump refittable layers:Disabled
[11/19/2024-14:10:44] [I] Dump output: Disabled
[11/19/2024-14:10:44] [I] Profile: Disabled
[11/19/2024-14:10:44] [I] Export timing to JSON file:
[11/19/2024-14:10:44] [I] Export output to JSON file:
[11/19/2024-14:10:44] [I] Export profile to JSON file:
[11/19/2024-14:10:44] [I]
[11/19/2024-14:10:44] [I] === Device Information ===
[11/19/2024-14:10:44] [I] Available Devices:
[11/19/2024-14:10:44] [I]   Device 0: "NVIDIA GeForce RTX 3060" UUID: GPU-703a0999-1e37-5b32-798a-f153bee4c3e7
[11/19/2024-14:10:44] [I] Selected Device: NVIDIA GeForce RTX 3060
[11/19/2024-14:10:44] [I] Selected Device ID: 0
[11/19/2024-14:10:44] [I] Selected Device UUID: GPU-703a0999-1e37-5b32-798a-f153bee4c3e7
[11/19/2024-14:10:44] [I] Compute Capability: 8.6
[11/19/2024-14:10:44] [I] SMs: 28
[11/19/2024-14:10:44] [I] Device Global Memory: 12287 MiB
[11/19/2024-14:10:44] [I] Shared Memory per SM: 100 KiB
[11/19/2024-14:10:44] [I] Memory Bus Width: 192 bits (ECC disabled)
[11/19/2024-14:10:44] [I] Application Compute Clock Rate: 1.777 GHz
[11/19/2024-14:10:44] [I] Application Memory Clock Rate: 7.501 GHz
[11/19/2024-14:10:44] [I]
[11/19/2024-14:10:44] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[11/19/2024-14:10:44] [I]
[11/19/2024-14:10:44] [I] TensorRT version: 10.6.0
[11/19/2024-14:10:44] [I] Loading standard plugins
[11/19/2024-14:10:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 14957, GPU 1040 (MiB)
[11/19/2024-14:10:47] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +3154, GPU +406, now: CPU 18437, GPU 1446 (MiB)
[11/19/2024-14:10:47] [I] Start parsing network model.
[11/19/2024-14:10:47] [I] [TRT] ----------------------------------------------------------------
[11/19/2024-14:10:47] [I] [TRT] Input filename:   yolov10m.onnx
[11/19/2024-14:10:47] [I] [TRT] ONNX IR version:  0.0.10
[11/19/2024-14:10:47] [I] [TRT] Opset version:    19
[11/19/2024-14:10:47] [I] [TRT] Producer name:    pytorch
[11/19/2024-14:10:47] [I] [TRT] Producer version: 2.5.1
[11/19/2024-14:10:47] [I] [TRT] Domain:
[11/19/2024-14:10:47] [I] [TRT] Model version:    0
[11/19/2024-14:10:47] [I] [TRT] Doc string:
[11/19/2024-14:10:47] [I] [TRT] ----------------------------------------------------------------
[11/19/2024-14:10:47] [I] Finished parsing network model. Parse time: 0.101279
[11/19/2024-14:10:47] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/19/2024-14:11:26] [I] [TRT] Compiler backend is used during engine build.
[11/19/2024-14:12:12] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[11/19/2024-14:12:13] [I] [TRT] Total Host Persistent Memory: 701344 bytes
[11/19/2024-14:12:13] [I] [TRT] Total Device Persistent Memory: 48640 bytes
[11/19/2024-14:12:13] [I] [TRT] Max Scratch Memory: 5580800 bytes
[11/19/2024-14:12:13] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 360 steps to complete.
[11/19/2024-14:12:13] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 34.1288ms to assign 12 blocks to 360 nodes requiring 64052736 bytes.
[11/19/2024-14:12:13] [I] [TRT] Total Activation Memory: 64051200 bytes
[11/19/2024-14:12:13] [I] [TRT] Total Weights Memory: 77523908 bytes
[11/19/2024-14:12:13] [I] [TRT] Compiler backend is used during engine execution.
[11/19/2024-14:12:13] [I] [TRT] Engine generation completed in 86.1752 seconds.
[11/19/2024-14:12:13] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 261 MiB
[11/19/2024-14:12:13] [I] Engine built in 86.3746 sec.
[11/19/2024-14:12:13] [I] Created engine with size: 78.8792 MiB
[11/19/2024-14:12:14] [I] [TRT] Loaded engine size: 78 MiB
[11/19/2024-14:12:14] [I] Engine deserialized in 0.103577 sec.
[11/19/2024-14:12:14] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +62, now: CPU 1, GPU 135 (MiB)
[11/19/2024-14:12:14] [I] Setting persistentCacheLimit to 0 bytes.
[11/19/2024-14:12:14] [I] Created execution context with device memory size: 61.084 MiB
[11/19/2024-14:12:14] [I] Using random values for input images
[11/19/2024-14:12:14] [I] Input binding for images with dimensions 1x3x640x640 is created.
[11/19/2024-14:12:14] [I] Output binding for output0 with dimensions 1x300x6 is created.
[11/19/2024-14:12:14] [I] Starting inference
[11/19/2024-14:12:18] [I] Warmup completed 11 queries over 200 ms
[11/19/2024-14:12:18] [I] Timing trace has 266 queries over 3.02548 s
[11/19/2024-14:12:18] [I]
[11/19/2024-14:12:18] [I] === Trace details ===
[11/19/2024-14:12:18] [I] Trace averages of 10 runs:
[11/19/2024-14:12:18] [I] Average on 10 runs - GPU latency: 11.2312 ms - Host latency: 11.8327 ms (enqueue 2.23381 ms)
.....
[11/19/2024-14:12:18] [I] Average on 10 runs - GPU latency: 10.6207 ms - Host latency: 11.1945 ms (enqueue 2.02371 ms)
[11/19/2024-14:12:18] [I]
[11/19/2024-14:12:18] [I] === Performance summary ===
[11/19/2024-14:12:18] [I] Throughput: 87.9199 qps
[11/19/2024-14:12:18] [I] Latency: min = 10.9751 ms, max = 12.3712 ms, mean = 11.2477 ms, median = 11.2025 ms, percentile(90%) = 11.4659 ms, percentile(95%) = 11.7123 ms, percentile(99%) = 11.9858 ms
[11/19/2024-14:12:18] [I] Enqueue Time: min = 1.48804 ms, max = 4.8717 ms, mean = 2.12565 ms, median = 2.04358 ms, percentile(90%) = 2.79187 ms, percentile(95%) = 3.01666 ms, percentile(99%) = 3.54584 ms
[11/19/2024-14:12:18] [I] H2D Latency: min = 0.538086 ms, max = 0.800293 ms, mean = 0.566996 ms, median = 0.558716 ms, percentile(90%) = 0.593994 ms, percentile(95%) = 0.619141 ms, percentile(99%) = 0.692383 ms
[11/19/2024-14:12:18] [I] GPU Compute Time: min = 10.3894 ms, max = 11.8128 ms, mean = 10.6732 ms, median = 10.625 ms, percentile(90%) = 10.8792 ms, percentile(95%) = 11.0919 ms, percentile(99%) = 11.4063 ms
[11/19/2024-14:12:18] [I] D2H Latency: min = 0.00610352 ms, max = 0.0175781 ms, mean = 0.00748931 ms, median = 0.00708008 ms, percentile(90%) = 0.00756836 ms, percentile(95%) = 0.0101318 ms, percentile(99%) = 0.0148926 ms
[11/19/2024-14:12:18] [I] Total Host Walltime: 3.02548 s
[11/19/2024-14:12:18] [I] Total GPU Compute Time: 2.83906 s
[11/19/2024-14:12:18] [W] * GPU compute time is unstable, with coefficient of variance = 1.79929%.
[11/19/2024-14:12:18] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/19/2024-14:12:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/19/2024-14:12:18] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100600] [b26] # trtexec --onnx=yolov10m.onnx --saveEngine=yolov10m.engine
laugh12321 commented 1 week ago

@STVHA

[11/19/2024-14:12:14] [I] Input binding for images with dimensions 1x3x640x640 is created.
[11/19/2024-14:12:14] [I] Output binding for output0 with dimensions 1x300x6 is created.

The log indicates that the model has an input node named "images" and an output node named "output0". However, this does not match the number of output nodes expected by TensorRT-YOLO inference, and therefore the model cannot be used directly in TensorRT-YOLO. Below is the structure of the YOLOv10 model exported using trtyolo.

[image: structure of the YOLOv10 model exported with trtyolo]

laugh12321 commented 1 week ago

@STVHA This is a bug; after testing, I encountered the same issue as you.

STVHA commented 1 week ago

@laugh12321 Thanks for your effort. I hope it will be fixed soon.

laugh12321 commented 1 week ago

@STVHA My apologies for the confusion; I mistakenly treated the images in the "images" folder as files under the "outputs" folder, which led to the false conclusion of no output results. In fact, the examples/detect sample runs normally on my end. Here are the inference results using the YOLOv10m model; you might want to try the ONNX model I exported.

yolov10m.pt exported with trtyolo:

[result image: 000000000036]

STVHA commented 1 week ago

@laugh12321 I tried with your ONNX but had no luck. Please take a look at the export log.

[11/19/2024-15:44:08] [I] Finished parsing network model. Parse time: 0.107279
[11/19/2024-15:44:08] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/19/2024-15:46:05] [I] [TRT] Compiler backend is used during engine build.
[11/19/2024-15:47:49] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[11/19/2024-15:47:52] [I] [TRT] Total Host Persistent Memory: 682256 bytes
[11/19/2024-15:47:52] [I] [TRT] Total Device Persistent Memory: 4096 bytes
[11/19/2024-15:47:52] [I] [TRT] Max Scratch Memory: 2857472 bytes
[11/19/2024-15:47:52] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 248 steps to complete.
[11/19/2024-15:47:52] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 20.8356ms to assign 11 blocks to 248 nodes requiring 30951424 bytes.
[11/19/2024-15:47:52] [I] [TRT] Total Activation Memory: 30950400 bytes
[11/19/2024-15:47:52] [I] [TRT] Total Weights Memory: 30873568 bytes
[11/19/2024-15:47:52] [I] [TRT] Compiler backend is used during engine execution.
[11/19/2024-15:47:52] [I] [TRT] Engine generation completed in 224.579 seconds.
[11/19/2024-15:47:52] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 261 MiB
[11/19/2024-15:47:52] [I] Engine built in 224.714 sec.
[11/19/2024-15:47:52] [I] Created engine with size: 33.9967 MiB
[11/19/2024-15:47:54] [I] [TRT] Loaded engine size: 33 MiB
[11/19/2024-15:47:54] [I] Engine deserialized in 0.0784137 sec.
[11/19/2024-15:47:54] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +29, now: CPU 1, GPU 58 (MiB)
[11/19/2024-15:47:54] [I] Setting persistentCacheLimit to 0 bytes.
[11/19/2024-15:47:54] [I] Created execution context with device memory size: 29.5166 MiB
[11/19/2024-15:47:54] [I] Using random values for input images
[11/19/2024-15:47:54] [I] Input binding for images with dimensions 1x3x640x640 is created.
[11/19/2024-15:47:54] [I] Output binding for num_dets with dimensions 1x1 is created.
[11/19/2024-15:47:54] [I] Output binding for det_boxes with dimensions 1x100x4 is created.
[11/19/2024-15:47:54] [I] Output binding for det_scores with dimensions 1x100 is created.
[11/19/2024-15:47:54] [I] Output binding for det_classes with dimensions 1x100 is created.
[11/19/2024-15:47:54] [I] Starting inference
[11/19/2024-15:47:57] [I] Warmup completed 12 queries over 200 ms
[11/19/2024-15:47:57] [I] Timing trace has 513 queries over 3.00887 s
[11/19/2024-15:47:57] [I]
[11/19/2024-15:47:57] [I] === Trace details ===
[11/19/2024-15:47:57] [I] Trace averages of 10 runs:
[11/19/2024-15:47:57] [I] Average on 10 runs - GPU latency: 9.09302 ms - Host latency: 9.70901 ms (enqueue 1.92417 ms)

[11/19/2024-15:47:57] [I] Average on 10 runs - GPU latency: 4.97632 ms - Host latency: 5.57754 ms (enqueue 3.86582 ms)
[11/19/2024-15:47:57] [I] Average on 10 runs - GPU latency: 5.00093 ms - Host latency: 5.63674 ms (enqueue 5.03738 ms)
[11/19/2024-15:47:57] [I]
[11/19/2024-15:47:57] [I] === Performance summary ===
[11/19/2024-15:47:57] [I] Throughput: 170.496 qps
[11/19/2024-15:47:57] [I] Latency: min = 5.21002 ms, max = 11.5681 ms, mean = 5.72919 ms, median = 5.50769 ms, percentile(90%) = 5.92151 ms, percentile(95%) = 6.34888 ms, percentile(99%) = 11.3956 ms
[11/19/2024-15:47:57] [I] Enqueue Time: min = 1.25708 ms, max = 12.449 ms, mean = 3.65122 ms, median = 2.54761 ms, percentile(90%) = 6.11572 ms, percentile(95%) = 6.72803 ms, percentile(99%) = 9.24976 ms
[11/19/2024-15:47:57] [I] H2D Latency: min = 0.5625 ms, max = 0.759521 ms, mean = 0.58865 ms, median = 0.581665 ms, percentile(90%) = 0.615479 ms, percentile(95%) = 0.638184 ms, percentile(99%) = 0.689453 ms
[11/19/2024-15:47:57] [I] GPU Compute Time: min = 4.61823 ms, max = 10.9517 ms, mean = 5.12365 ms, median = 4.90283 ms, percentile(90%) = 5.29504 ms, percentile(95%) = 5.66357 ms, percentile(99%) = 10.7827 ms
[11/19/2024-15:47:57] [I] D2H Latency: min = 0.0129395 ms, max = 0.162354 ms, mean = 0.0168824 ms, median = 0.0141602 ms, percentile(90%) = 0.0197754 ms, percentile(95%) = 0.0262451 ms, percentile(99%) = 0.0639343 ms
[11/19/2024-15:47:57] [I] Total Host Walltime: 3.00887 s
[11/19/2024-15:47:57] [I] Total GPU Compute Time: 2.62843 s
[11/19/2024-15:47:57] [W] * GPU compute time is unstable, with coefficient of variance = 17.1638%.
[11/19/2024-15:47:57] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/19/2024-15:47:57] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/19/2024-15:47:57] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100600] [b26] # trtexec --onnx=output\yolov10m.onnx --saveEngine=output\yolov10m.engine --fp16
laugh12321 commented 1 week ago

@STVHA, the export seems correct, but there's still no output? Try using this command instead:

trtyolo infer -e models/yolov10m.engine -m 0 -i images -o output -l labels.txt --cudaGraph

laugh12321 commented 1 week ago

@STVHA Below is the result I obtained by using the yolov10m.pt model provided by Ultralytics and running the example code in the examples/detect directory of the TensorRT-YOLO project.

[result image: b3ec812b-db84-4794-b65d-7b1a2385bb19]

STVHA commented 1 week ago

@laugh12321 The result I showed above is from the C++ demo\detect inference. Here is what I got when running "trtyolo infer" (with the latest code):

trtyolo infer -e output\yolov10m.engine --mode 0 --input TestData\input --output TestData\ --labels  examples\detect\labels_det.txt --cudaGraph
[I] Successfully found necessary library paths:
{
    "cudart": "D:\\AccelSDK\\CudaSDK\\12.6\\bin",
    "nvinfer": "D:\\AccelSDK\\TensorRT-10.6.0.26\\bin",
    "cudnn": "C:\\Python\\Python312\\Lib\\site-packages\\torch\\lib"
}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Python312\Scripts\trtyolo.exe\__main__.py", line 7, in <module>
  File "C:\Python\Python312\Lib\site-packages\rich_click\rich_command.py", line 367, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\rich_click\rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Python312\Lib\site-packages\tensorrt_yolo\cli.py", line 130, in infer
    from .infer import generate_labels_with_colors
  File "C:\Python\Python312\Lib\site-packages\tensorrt_yolo\infer\__init__.py", line 1, in <module>
    from .inference import DeployCGDet, DeployCGOBB, DeployCGSeg, DeployDet, DeployOBB, DeploySeg
  File "C:\Python\Python312\Lib\site-packages\tensorrt_yolo\infer\inference.py", line 28, in <module>
    from .result import DetResult, OBBResult, SegResult
  File "C:\Python\Python312\Lib\site-packages\tensorrt_yolo\infer\result.py", line 30, in <module>
    RotatedBox = C.result.RotatedBox
                 ^^^^^^^^^^^^^^^^^^^
AttributeError: module 'tensorrt_yolo.libs.pydeploy.result' has no attribute 'RotatedBox'

Regarding the C++ demo\detect inference, my guess is that the CUDA and TensorRT versions might not be supported.

laugh12321 commented 1 week ago

@STVHA To resolve the above issue, update your project's code with a git pull, then follow the build_and_install.md guide to reconfigure your environment.

As for the concern about CUDA and TensorRT version incompatibility, verify that the versions of CUDA and TensorRT you are using are supported on your GPU.

STVHA commented 1 week ago

@laugh12321 I got trtyolo infer working, but it often has very high latency.

trtyolo infer --engine models\yolov10m.trt --mode 0 --input input\036.jpg --output TestData\ --labels examples\detect\labels.txt --cudaGraph
[I] Successfully found necessary library paths:
{
    "cudart": "D:\\AccelSDK\\CudaSDK\\12.6\\bin",
    "nvinfer": "D:\\AccelSDK\\TensorRT-10.6.0.26\\bin",
    "cudnn": "C:\\Python\\Python312\\Lib\\site-packages\\torch\\lib"
}
[I] Infering data in D:\Devel\AIOD\TestData\input\036.jpg
Processing batches ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[S] Finished Inference.

I tested with the same "umbrella" image as yours above and got detected objects in the output, but it took about 15 seconds.

I tried the same engine file with the C++ detect sample, and it still gives no objects. Given that trtyolo does produce detections, I don't think this is caused by the CUDA SDK or TensorRT version.

laugh12321 commented 1 week ago

@STVHA, it seems there might be an issue with the environment configuration on your Windows system. You can refer to the document "Windows Development Environment Configuration - NVIDIA", chapter "Installing CUDA, cuDNN, and TensorRT", to reconfigure your setup. Alternatively, you might try a different computer, switch to a Linux system, or set up your environment using Docker.

STVHA commented 1 week ago

@laugh12321 Thanks to your advice, I finally made it work with the latest version 5.0.0. The causes I found for the earlier issues are:

But I am still confused that the Python sample infers faster than the C++ sample (4.8 ms vs 6.5 ms on average), even though they use the same data and arguments.

STVHA commented 1 week ago

I found the C++ sample was built in debug mode, which is why its inference latency was higher. With a release build, the C++ sample runs a little faster than the Python one, which matches my expectation. Thank you @laugh12321 for the support.
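
For reference, if the samples are built with CMake (check build_and_install.md for the project's actual build steps; the commands below are generic CMake usage, not taken from this repo), a release build looks like:

cmake -S . -B build
cmake --build build --config Release

With a multi-config generator such as Visual Studio, --config Release selects the release configuration at build time; with a single-config generator, configure with -DCMAKE_BUILD_TYPE=Release instead.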

laugh12321 commented 1 week ago

@STVHA Thank you very much for your feedback. Indeed, the C++ version is theoretically expected to outperform the Python version. Going forward, I will compile the sample code in release mode.