--half is not compatible with --dynamic, i.e. use either --half or --dynamic but not both. I am adding an assert there. Thanks for pointing this out :smile:
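Roughly something like this, as a minimal sketch with assumed flag names (the actual reid_export.py parser may differ):

```python
import argparse

# Sketch of the planned check (assumed flag names, not the final code)
parser = argparse.ArgumentParser()
parser.add_argument('--half', action='store_true', help='FP16 half-precision export')
parser.add_argument('--dynamic', action='store_true', help='dynamic axes for ONNX/TensorRT export')
args = parser.parse_args()

# --half and --dynamic are mutually exclusive: the dynamic path is FP32-only
assert not (args.half and args.dynamic), \
    '--half is not compatible with --dynamic, use one or the other'
```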
Thank you for pointing that out as well. However, the stack trace I shared is from --half only. I think that is different from the --dynamic and --half compatibility issue, right?
I cannot reproduce this error @berkay-karlik when copy-pasting your command. Notice that my TRT builder has no FP16 option, so the export is FP32:
(export) ➜ Yolov5_StrongSORT_OSNet git:(master) ✗ python3 reid_export.py --weights ./weights/osnet_x0_25_msmt17.pt --include onnx engine --device 0 --half --batch-size 8
/home/mikel.brostrom/Yolov5_StrongSORT_OSNet
YOLOv5 🚀 2022-10-21 Python-3.8.13 torch-1.9.0+cu102 CUDA:0 (Quadro P2000, 4032MiB)
Successfully loaded pretrained weights from "weights/osnet_x0_25_msmt17.pt"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']
/home/mikel.brostrom/venvs/import_test2/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
PyTorch: starting from weights/osnet_x0_25_msmt17.pt with output shape (8, 512) (9.3 MB)
ONNX: starting export with onnx 1.12.0...
ONNX: export success, saved as weights/osnet_x0_25_msmt17.onnx (0.4 MB)
TensorRT: starting export with TensorRT 8.4.3.1...
[11/03/2022-07:59:57] [TRT] [I] [MemUsageChange] Init CUDA: CPU +191, GPU +0, now: CPU 1913, GPU 1786 (MiB)
[11/03/2022-07:59:58] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +6, GPU +0, now: CPU 1936, GPU 1786 (MiB)
reid_export.py:201: DeprecationWarning: Use set_memory_pool_limit instead.
config.max_workspace_size = workspace * 1 << 30
[11/03/2022-07:59:58] [TRT] [I] ----------------------------------------------------------------
[11/03/2022-07:59:58] [TRT] [I] Input filename: weights/osnet_x0_25_msmt17.onnx
[11/03/2022-07:59:58] [TRT] [I] ONNX IR version: 0.0.6
[11/03/2022-07:59:58] [TRT] [I] Opset version: 12
[11/03/2022-07:59:58] [TRT] [I] Producer name: pytorch
[11/03/2022-07:59:58] [TRT] [I] Producer version: 1.9
[11/03/2022-07:59:58] [TRT] [I] Domain:
[11/03/2022-07:59:58] [TRT] [I] Model version: 0
[11/03/2022-07:59:58] [TRT] [I] Doc string:
[11/03/2022-07:59:58] [TRT] [I] ----------------------------------------------------------------
[11/03/2022-07:59:58] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT: Network Description:
TensorRT: input "images" with shape (8, 3, 256, 128) and dtype DataType.HALF
TensorRT: output "output" with shape (8, 512) and dtype DataType.HALF
TensorRT: building FP32 engine in weights/osnet_x0_25_msmt17.engine
reid_export.py:229: DeprecationWarning: Use build_serialized_network instead.
with builder.build_engine(network, config) as engine, open(f, 'wb') as t:
[11/03/2022-07:59:59] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +271, GPU +112, now: CPU 2210, GPU 1907 (MiB)
[11/03/2022-07:59:59] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +114, GPU +42, now: CPU 2324, GPU 1949 (MiB)
[11/03/2022-07:59:59] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0
[11/03/2022-07:59:59] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/03/2022-08:00:17] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[11/03/2022-08:00:17] [TRT] [I] Total Host Persistent Memory: 226544
[11/03/2022-08:00:17] [TRT] [I] Total Device Persistent Memory: 376320
[11/03/2022-08:00:17] [TRT] [I] Total Scratch Memory: 17920
[11/03/2022-08:00:17] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[11/03/2022-08:00:17] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 78.0308ms to assign 11 blocks to 211 nodes requiring 13635072 bytes.
[11/03/2022-08:00:17] [TRT] [I] Total Activation Memory: 13635072
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2635, GPU 2075 (MiB)
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2636, GPU 2085 (MiB)
[11/03/2022-08:00:17] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[11/03/2022-08:00:17] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[11/03/2022-08:00:17] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
TensorRT: export success, saved as weights/osnet_x0_25_msmt17.engine (1.9 MB)
ONNX: starting export with onnx 1.12.0...
ONNX: export success, saved as weights/osnet_x0_25_msmt17.onnx (0.4 MB)
Export complete (37.3s)
Results saved to /home/mikel.brostrom/Yolov5_StrongSORT_OSNet/weights
Visualize: https://netron.app
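On a side note, the two DeprecationWarnings in the log point at the newer TensorRT 8.x builder API. A rough sketch of the replacement calls (paths and workspace size are placeholders for illustration, not the exact reid_export.py code):

```python
import tensorrt as trt

# Placeholder values for illustration only
onnx_path = 'weights/osnet_x0_25_msmt17.onnx'
engine_path = 'weights/osnet_x0_25_msmt17.engine'
workspace_gib = 4

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open(onnx_path, 'rb') as f:
    parser.parse(f.read())

config = builder.create_builder_config()
# replaces: config.max_workspace_size = workspace * 1 << 30
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_gib << 30)

# replaces: builder.build_engine(network, config)
serialized_engine = builder.build_serialized_network(network, config)
with open(engine_path, 'wb') as t:
    t.write(serialized_engine)
```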
This seems to be resolved @berkay-karlik?
The issue still exists for the platform I'm working on: an NVIDIA Jetson Orin running Linux ubuntu 5.10.104-tegra. As you pointed out, my platform seems to use FP16 by default. If we could go as low as INT8, that would be great for performance. It's okay to close the issue if there is nothing that can be done. If you have any pointers or guidance on how I can fix this myself, I can try to fix it and submit a PR. Thanks for the help so far.
So the system is half-ing the models automatically, and FP16 is not an issue? I don't think I am following.
https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix Jetson Orin supports INT8 precision too, so for this platform I expected --half to give INT8, since the reid_export.py prints suggest that FP16 is already the default.
Aha, now I get it.
I am on a laptop with an old Quadro GPU which only supports FP32. So when half-ing the models I get FP32 again, as it has no FP16 support. The fact that your GPU supports FP16 also makes it possible for you to export to FP16, as your TensorRT engine builder can handle it. This does not mean that you get INT8 when using --half for export. In order to achieve that you need other methods: QAT or PTQ. Check them out here: https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/. Feel free to leave a PR :smile:
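For reference, a rough sketch of what the builder-config side of PTQ could look like (this is an assumption about a possible change, not current reid_export.py behavior; a real PTQ setup also needs a trt.IInt8EntropyCalibrator2 implementation fed with representative ReID crops via config.int8_calibrator):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Pick the lowest precision the platform reports fast support for.
if builder.platform_has_fast_int8:        # True on Jetson Orin
    config.set_flag(trt.BuilderFlag.INT8)
    # PTQ additionally requires: config.int8_calibrator = <IInt8EntropyCalibrator2 subclass>
elif builder.platform_has_fast_fp16:      # most recent discrete GPUs
    config.set_flag(trt.BuilderFlag.FP16)
# otherwise the engine stays FP32, as in the Quadro log above
```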
Search before asking
Yolov5_StrongSORT_OSNet Component
Other
Bug
The following export works:

However, I want to use the half parameter to gain further performance. When I try, I get the following error:

If I use dynamic, it complains that the CPU does not support half; however, I am trying to generate the engine for the GPU of the device, and it has nothing to do with the CPU.

Environment
YOLOv5 🚀 2022-11-1 Python-3.8.10 torch-1.12.0a0+2c916ef.nv22.3 CUDA:0 (Orin, 30536MiB)
osnet_x0_25_msmt17.pt
OS: Linux ubuntu 5.10.104-tegra
Python 3.8.10
Minimal Reproducible Example