mikel-brostrom / boxmot

BoxMOT: pluggable SOTA tracking modules for segmentation, object detection and pose estimation models

How to get int8 ReID models #584

Closed berkay-karlik closed 1 year ago

berkay-karlik commented 1 year ago

Search before asking

Yolov5_StrongSORT_OSNet Component: Other

Bug

The following export works:

python3 reid_export.py --weights ./weights/osnet_x0_25_msmt17.pt --include onnx engine --device 0 --dynamic --batch-size 8 

However, I want to use the --half parameter to gain further performance. When I try, I get the following error:

orin@ubuntu:~/berkay_monitor/Yolov5_StrongSORT_OSNet$ python3 reid_export.py --weights ./weights/osnet_x0_25_msmt17.pt --include onnx engine --device 0 --half --batch-size 8 
/home/orin/.local/lib/python3.8/site-packages/torchvision-0.13.0-py3.8-linux-aarch64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
/home/orin/berkay_monitor/Yolov5_StrongSORT_OSNet/strong_sort/deep/reid/torchreid/metrics/rank.py:11: UserWarning: Cython evaluation (very fast so highly recommended) is unavailable, now use python evaluation.
  warnings.warn(
YOLOv5 🚀 2022-11-1 Python-3.8.10 torch-1.12.0a0+2c916ef.nv22.3 CUDA:0 (Orin, 30536MiB)

weights/osnet_x0_25_msmt17.pt
Model: osnet_x0_25
- params: 203,568
- flops: 82,316,000
Successfully loaded pretrained weights from "weights/osnet_x0_25_msmt17.pt"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']

PyTorch: starting from weights/osnet_x0_25_msmt17.pt with output shape (8, 512) (9.3 MB)

starting export with onnx 1.12.0...
export success, saved as weights/osnet_x0_25_msmt17.onnx (0.5 MB)
run --dynamic ONNX model inference with: 'python detect.py --weights weights/osnet_x0_25_msmt17.onnx'

starting export with onnx 1.12.0...
export success, saved as weights/osnet_x0_25_msmt17.onnx (0.5 MB)
run --dynamic ONNX model inference with: 'python detect.py --weights weights/osnet_x0_25_msmt17.onnx'

TensorRT: starting export with TensorRT 8.4.1.5...
[11/02/2022-18:50:45] [TRT] [I] [MemUsageChange] Init CUDA: CPU +213, GPU +0, now: CPU 2056, GPU 8458 (MiB)
[11/02/2022-18:50:51] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +351, GPU +411, now: CPU 2426, GPU 8879 (MiB)
reid_export.py:270: DeprecationWarning: Use set_memory_pool_limit instead.
  config.max_workspace_size = workspace * 1 << 30
[11/02/2022-18:50:51] [TRT] [I] ----------------------------------------------------------------
[11/02/2022-18:50:51] [TRT] [I] Input filename:   weights/osnet_x0_25_msmt17.onnx
[11/02/2022-18:50:51] [TRT] [I] ONNX IR version:  0.0.7
[11/02/2022-18:50:51] [TRT] [I] Opset version:    13
[11/02/2022-18:50:51] [TRT] [I] Producer name:    pytorch
[11/02/2022-18:50:51] [TRT] [I] Producer version: 1.12.0
[11/02/2022-18:50:51] [TRT] [I] Domain:           
[11/02/2022-18:50:51] [TRT] [I] Model version:    0
[11/02/2022-18:50:51] [TRT] [I] Doc string:       
[11/02/2022-18:50:51] [TRT] [I] ----------------------------------------------------------------
[11/02/2022-18:50:51] [TRT] [W] onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT: Network Description:
TensorRT:       input "images" with shape (8, 3, 256, 128) and dtype DataType.HALF
TensorRT:       output "output" with shape (8, 512) and dtype DataType.HALF
TensorRT: building FP16 engine in weights/osnet_x0_25_msmt17.engine
reid_export.py:298: DeprecationWarning: Use build_serialized_network instead.
  with builder.build_engine(network, config) as engine, open(f, 'wb') as t:
[11/02/2022-18:50:51] [TRT] [E] 4: [network.cpp::operator()::3018] Error Code 4: Internal Error (images: kMIN dimensions in profile 0 are [1,3,256,128] but input has static dimensions [8,3,256,128].)

TensorRT: export failure: __enter__

If I use --dynamic, it complains that the CPU does not support half precision. However, I am trying to generate the engine for the GPU of the device, so the CPU should not be involved at all.
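For reference, the kMIN mismatch above is the kind of error TensorRT raises when the optimization profile still describes dynamic shapes while the network input is static. Below is a minimal sketch of a profile whose min/opt/max all match the static batch; it is illustrative only and not the exact reid_export.py code, with the input name "images" and the shapes taken from the log above:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

# With a static (8, 3, 256, 128) input, min/opt/max must all cover exactly
# that shape, otherwise the kMIN-dimensions error above appears at build time.
profile = builder.create_optimization_profile()
profile.set_shape("images",
                  (8, 3, 256, 128),   # min
                  (8, 3, 256, 128),   # opt
                  (8, 3, 256, 128))   # max
config.add_optimization_profile(profile)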

Environment

YOLOv5 🚀 2022-11-1 Python-3.8.10 torch-1.12.0a0+2c916ef.nv22.3 CUDA:0 (Orin, 30536MiB)
Weights: osnet_x0_25_msmt17.pt
OS: Linux ubuntu 5.10.104-tegra
Python: 3.8.10

Minimal Reproducible Example

 python3 reid_export.py --weights ./weights/osnet_x0_25_msmt17.pt --include onnx engine --device 0 --half --batch-size 8
mikel-brostrom commented 1 year ago

--half is not compatible with --dynamic, i.e. use either --half or --dynamic, but not both. I am adding an assert there. Thanks for pointing this out :smile:
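Something along these lines (a sketch only; the actual argument handling in reid_export.py may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--half', action='store_true')
parser.add_argument('--dynamic', action='store_true')
args = parser.parse_args()

# fail fast when both flags are requested, since FP16 + dynamic shapes is not supported here
assert not (args.half and args.dynamic), '--half is not compatible with --dynamic, use one or the other'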

berkay-karlik commented 1 year ago

Thank you for pointing that out as well. However, the stack trace I shared is from --half only. I think that is different from the --dynamic and --half compatibility issue, right?

mikel-brostrom commented 1 year ago

I cannot reproduce this error, @berkay-karlik, when copy-pasting your command. Notice that my TRT builder has no FP16 support, so the export falls back to FP32.

(export) ➜  Yolov5_StrongSORT_OSNet git:(master) ✗  python3 reid_export.py --weights ./weights/osnet_x0_25_msmt17.pt --include onnx engine --device 0 --half --batch-size 8
/home/mikel.brostrom/Yolov5_StrongSORT_OSNet
YOLOv5 🚀 2022-10-21 Python-3.8.13 torch-1.9.0+cu102 CUDA:0 (Quadro P2000, 4032MiB)

Successfully loaded pretrained weights from "weights/osnet_x0_25_msmt17.pt"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']
/home/mikel.brostrom/venvs/import_test2/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)

PyTorch: starting from weights/osnet_x0_25_msmt17.pt with output shape (8, 512) (9.3 MB)

ONNX: starting export with onnx 1.12.0...
ONNX: export success, saved as weights/osnet_x0_25_msmt17.onnx (0.4 MB)

TensorRT: starting export with TensorRT 8.4.3.1...
[11/03/2022-07:59:57] [TRT] [I] [MemUsageChange] Init CUDA: CPU +191, GPU +0, now: CPU 1913, GPU 1786 (MiB)
[11/03/2022-07:59:58] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +6, GPU +0, now: CPU 1936, GPU 1786 (MiB)
reid_export.py:201: DeprecationWarning: Use set_memory_pool_limit instead.
  config.max_workspace_size = workspace * 1 << 30
[11/03/2022-07:59:58] [TRT] [I] ----------------------------------------------------------------
[11/03/2022-07:59:58] [TRT] [I] Input filename:   weights/osnet_x0_25_msmt17.onnx
[11/03/2022-07:59:58] [TRT] [I] ONNX IR version:  0.0.6
[11/03/2022-07:59:58] [TRT] [I] Opset version:    12
[11/03/2022-07:59:58] [TRT] [I] Producer name:    pytorch
[11/03/2022-07:59:58] [TRT] [I] Producer version: 1.9
[11/03/2022-07:59:58] [TRT] [I] Domain:           
[11/03/2022-07:59:58] [TRT] [I] Model version:    0
[11/03/2022-07:59:58] [TRT] [I] Doc string:       
[11/03/2022-07:59:58] [TRT] [I] ----------------------------------------------------------------
[11/03/2022-07:59:58] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT: Network Description:
TensorRT:       input "images" with shape (8, 3, 256, 128) and dtype DataType.HALF
TensorRT:       output "output" with shape (8, 512) and dtype DataType.HALF
TensorRT: building FP32 engine in weights/osnet_x0_25_msmt17.engine
reid_export.py:229: DeprecationWarning: Use build_serialized_network instead.
  with builder.build_engine(network, config) as engine, open(f, 'wb') as t:
[11/03/2022-07:59:59] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +271, GPU +112, now: CPU 2210, GPU 1907 (MiB)
[11/03/2022-07:59:59] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +114, GPU +42, now: CPU 2324, GPU 1949 (MiB)
[11/03/2022-07:59:59] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0
[11/03/2022-07:59:59] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/03/2022-08:00:17] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[11/03/2022-08:00:17] [TRT] [I] Total Host Persistent Memory: 226544
[11/03/2022-08:00:17] [TRT] [I] Total Device Persistent Memory: 376320
[11/03/2022-08:00:17] [TRT] [I] Total Scratch Memory: 17920
[11/03/2022-08:00:17] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 0 MiB
[11/03/2022-08:00:17] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 78.0308ms to assign 11 blocks to 211 nodes requiring 13635072 bytes.
[11/03/2022-08:00:17] [TRT] [I] Total Activation Memory: 13635072
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2635, GPU 2075 (MiB)
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2636, GPU 2085 (MiB)
[11/03/2022-08:00:17] [TRT] [W] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.4.0
[11/03/2022-08:00:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[11/03/2022-08:00:17] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[11/03/2022-08:00:17] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
TensorRT: export success, saved as weights/osnet_x0_25_msmt17.engine (1.9 MB)

ONNX: starting export with onnx 1.12.0...
ONNX: export success, saved as weights/osnet_x0_25_msmt17.onnx (0.4 MB)

Export complete (37.3s)
Results saved to /home/mikel.brostrom/Yolov5_StrongSORT_OSNet/weights
Visualize:       https://netron.app
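For context, whether the engine ends up FP16 or FP32 is normally decided by a builder capability check. A sketch of the usual TensorRT pattern (not the exact reid_export.py code):

import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.INFO))
config = builder.create_builder_config()

# On GPUs without fast FP16 (e.g. an old Quadro) the flag is simply not set,
# so the engine is built in FP32 even though --half was requested.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)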
mikel-brostrom commented 1 year ago

This seems to be resolved @berkay-karlik?

berkay-karlik commented 1 year ago

The issue still exists on the platform I'm working on. I'm trying this on an Nvidia Jetson Orin with Linux ubuntu 5.10.104-tegra. As you pointed out, my platform seems to use FP16 by default. If we could go as low as INT8, that would be great for performance. It's okay to close the issue if nothing can be done. If you have any pointers or guidance on how I can fix this myself, I can try to fix it and submit a PR. Thanks for the help so far.

mikel-brostrom commented 1 year ago

So the system is half-ing the models automatically? FP16 is not an issue? I don't think I am following.

berkay-karlik commented 1 year ago

https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#hardware-precision-matrix The Jetson Orin supports INT8 precision too, so on this platform I expected --half to produce INT8, since the reid_export prints suggest that FP16 is already the default.
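A quick way to check what the local builder reports on the Orin (a small sketch using the TensorRT Python API):

import tensorrt as trt

# query which reduced precisions this GPU/TensorRT build considers fast
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
print('fast FP16 supported:', builder.platform_has_fast_fp16)
print('fast INT8 supported:', builder.platform_has_fast_int8)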

mikel-brostrom commented 1 year ago

Aha, now I get it.

I am on a laptop with an old Quadro GPU which only supports FP32, so when half-ing the models I get FP32 again, as it has no FP16 support. The fact that your GPU supports FP16 makes it possible for you to export to FP16, since your TensorRT builder can handle it. This does not mean that you get INT8 when using --half for export. To achieve that, you need other methods: QAT or PTQ. Check them out here: https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/. Feel free to leave a PR :smile:
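For anyone who wants to try PTQ with the TensorRT Python API directly, here is a minimal sketch of the INT8 calibration hook. Names, batch size and data loading are assumed for illustration; this is not part of reid_export.py:

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


class ReIDCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds batches of pre-processed person crops to TensorRT for INT8 calibration."""

    def __init__(self, batches, cache_file='calibration.cache'):
        super().__init__()
        self.batches = iter(batches)   # iterable of (8, 3, 256, 128) float32 arrays (assumed)
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 8

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                # no more calibration data
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)


# hooked into the builder config before building the engine:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = ReIDCalibrator(calibration_batches)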