Closed: GiorgosBetsos closed this issue 2 years ago
👋 Hello @GiorgosBetsos, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.
Python>=3.7.0 with all requirements.txt dependencies installed, including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@GiorgosBetsos 👋 Hello! Thanks for asking about benchmarks. YOLOv5 🚀 inference is officially supported in 11 formats:
💡 ProTip: Export to ONNX or OpenVINO for up to 3x CPU speedup. See CPU Benchmarks.
💡 ProTip: Export to TensorRT for up to 5x GPU speedup. See GPU Benchmarks.
Format | export.py --include | Model
---|---|---
PyTorch | - | yolov5s.pt
TorchScript | torchscript | yolov5s.torchscript
ONNX | onnx | yolov5s.onnx
OpenVINO | openvino | yolov5s_openvino_model/
TensorRT | engine | yolov5s.engine
CoreML | coreml | yolov5s.mlmodel
TensorFlow SavedModel | saved_model | yolov5s_saved_model/
TensorFlow GraphDef | pb | yolov5s.pb
TensorFlow Lite | tflite | yolov5s.tflite
TensorFlow Edge TPU | edgetpu | yolov5s_edgetpu.tflite
TensorFlow.js | tfjs | yolov5s_web_model/
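For example, a single format can be exported with export.py and the --include argument from the table above (a typical invocation; exact behaviour may vary slightly by YOLOv5 version):

python export.py --weights yolov5s.pt --include onnx               # ONNX export
python export.py --weights yolov5s.pt --include engine --device 0  # TensorRT export (requires a CUDA GPU)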
Benchmarks below run on a Colab Pro instance with the YOLOv5 tutorial notebook. To reproduce:
python utils/benchmarks.py --weights yolov5s.pt --imgsz 640 --device 0
benchmarks: weights=/content/yolov5/yolov5s.pt, imgsz=640, batch_size=1, data=/content/yolov5/data/coco128.yaml, device=0, half=False, test=False
Checking setup...
YOLOv5 🚀 v6.1-135-g7926afc torch 1.10.0+cu111 CUDA:0 (Tesla V100-SXM2-16GB, 16160MiB)
Setup complete ✅ (8 CPUs, 51.0 GB RAM, 46.7/166.8 GB disk)
Benchmarks complete (458.07s)
Format mAP@0.5:0.95 Inference time (ms)
0 PyTorch 0.4623 10.19
1 TorchScript 0.4623 6.85
2 ONNX 0.4623 14.63
3 OpenVINO NaN NaN
4 TensorRT 0.4617 1.89
5 CoreML NaN NaN
6 TensorFlow SavedModel 0.4623 21.28
7 TensorFlow GraphDef 0.4623 21.22
8 TensorFlow Lite NaN NaN
9 TensorFlow Edge TPU NaN NaN
10 TensorFlow.js NaN NaN
benchmarks: weights=/content/yolov5/yolov5s.pt, imgsz=640, batch_size=1, data=/content/yolov5/data/coco128.yaml, device=cpu, half=False, test=False
Checking setup...
YOLOv5 🚀 v6.1-135-g7926afc torch 1.10.0+cu111 CPU
Setup complete ✅ (8 CPUs, 51.0 GB RAM, 41.5/166.8 GB disk)
Benchmarks complete (241.20s)
Format mAP@0.5:0.95 Inference time (ms)
0 PyTorch 0.4623 127.61
1 TorchScript 0.4623 131.23
2 ONNX 0.4623 69.34
3 OpenVINO 0.4623 66.52
4 TensorRT NaN NaN
5 CoreML NaN NaN
6 TensorFlow SavedModel 0.4623 123.79
7 TensorFlow GraphDef 0.4623 121.57
8 TensorFlow Lite 0.4623 316.61
9 TensorFlow Edge TPU NaN NaN
10 TensorFlow.js NaN NaN
Good luck 🍀 and let us know if you have any other questions!
I have seen these benchmarks; that is the reason I opened this issue. According to the benchmarks, TensorRT inference time is 1.89 ms vs 10.19 ms for PyTorch. In contrast, I get 5.3 ms vs 6.9 ms.
This is what I get when I execute:
python utils/benchmarks.py --weights yolov5s.pt --imgsz 640 --device 0
on my box:
benchmarks: weights=yolov5s.pt, imgsz=640, batch_size=1, data=/home/giorgos/repos/third-party/yolov5/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 v6.1-324-g0b5ac22 Python-3.8.13 torch-1.12.0+cu102 CUDA:0 (NVIDIA GeForce GTX 1080, 8111MiB)
Setup complete ✅ (12 CPUs, 31.2 GB RAM, 90.1/227.7 GB disk)
Benchmarks complete (139.15s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4716 6.77
1 TorchScript 28.1 0.4716 7.14
2 ONNX 28.0 0.4716 66.00
3 OpenVINO NaN NaN NaN
4 TensorRT 31.9 0.4716 4.62
@GiorgosBetsos sure, results will vary with hardware, software, firmware, etc. The results I provided were run in a reproducible environment, Colab Pro.
Hey @GiorgosBetsos, I face the same issue. I wanted to speed up inference with the new TensorRT format in my application, but in my benchmarks the improvement of TensorRT over PyTorch only holds for a batch size of 1. For a batch size of 15, as I would use in my application, the TensorRT format was actually detrimental to inference time.
Could this be unintended behaviour that can be fixed by adjusting export parameters, @glenn-jocher?
I ran the benchmarks on a local PC with an RTX 2060 and on a server with a GTX 1080 Ti.
On the RTX 2060 I benchmarked with an image size of 320 and batch sizes of [1, 15].
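The runs below were invoked along these lines (reconstructed from the args dumps; the exact flag spellings on utils/benchmarks.py are assumed):

python utils/benchmarks.py --weights models/yolov5/yolov5s.pt --imgsz 320 --batch-size 1 --device 0
python utils/benchmarks.py --weights models/yolov5/yolov5s.pt --imgsz 320 --batch-size 15 --device 0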
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=1, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-5 Python-3.7.7 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 2060, 5932MiB)
Setup complete ✅ (6 CPUs, 15.5 GB RAM, 86.0/93.6 GB disk)
Benchmarks complete (95.81s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4114 6.21
1 TorchScript 27.9 0.4114 4.15
2 ONNX 27.7 0.4114 34.92
3 OpenVINO NaN NaN NaN
4 TensorRT 35.0 0.4114 2.71
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=15, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-5 Python-3.7.7 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 2060, 5932MiB)
Setup complete ✅ (6 CPUs, 15.5 GB RAM, 86.0/93.6 GB disk)
Benchmarks complete (139.08s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4114 2.07
1 TorchScript 27.9 0.4114 1.45
2 ONNX 27.7 0.4114 30.42
3 OpenVINO NaN NaN NaN
4 TensorRT 35.0 0.4114 2.68
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
On the GTX 1080 Ti I benchmarked with image sizes of [320, 640] and batch sizes of [1, 15].
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=640, batch_size=15, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-17 Python-3.8.12 torch-1.11.0a0+17540c5 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11179MiB)
Setup complete ✅ (40 CPUs, 62.8 GB RAM, 754.7/915.8 GB disk)
Benchmarks complete (119.54s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4716 3.87
1 TorchScript 28.0 0.4716 3.60
2 ONNX 28.0 0.4716 12.60
3 OpenVINO NaN NaN NaN
4 TensorRT 42.7 0.4716 4.07
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=640, batch_size=1, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-17 Python-3.8.12 torch-1.11.0a0+17540c5 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11179MiB)
Setup complete ✅ (40 CPUs, 62.8 GB RAM, 754.7/915.8 GB disk)
Benchmarks complete (107.81s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4716 12.91
1 TorchScript 28.0 0.4716 6.93
2 ONNX 28.0 0.4716 13.30
3 OpenVINO NaN NaN NaN
4 TensorRT 42.7 0.4716 4.05
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=15, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-17 Python-3.8.12 torch-1.11.0a0+17540c5 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11179MiB)
Setup complete ✅ (40 CPUs, 62.8 GB RAM, 754.7/915.8 GB disk)
Benchmarks complete (101.88s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4114 1.36
1 TorchScript 27.9 0.4114 1.14
2 ONNX 27.7 0.4114 6.87
3 OpenVINO NaN NaN NaN
4 TensorRT 42.3 0.4114 2.16
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=1, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-17 Python-3.8.12 torch-1.11.0a0+17540c5 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11179MiB)
Setup complete ✅ (40 CPUs, 62.8 GB RAM, 754.7/915.8 GB disk)
Benchmarks complete (103.19s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4114 10.98
1 TorchScript 27.9 0.4114 7.27
2 ONNX 27.7 0.4114 5.91
3 OpenVINO NaN NaN NaN
4 TensorRT 42.3 0.4114 2.13
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
@lennartmoritz torch inference will normally see speedups at larger batch sizes (see https://community.ultralytics.com/t/yolov5-study-batch-size-vs-speed/31); I'm not sure about TRT as I only have batch-size-1 experience there.
BTW you'll get significantly better speedup at FP16 with TRT.
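For example, an FP16 TensorRT export or benchmark run would look roughly like this (a sketch; --half is the existing export.py flag, and the half=True field in the args dumps suggests a matching benchmark flag):

python export.py --weights yolov5s.pt --include engine --half --device 0        # FP16 TensorRT engine
python utils/benchmarks.py --weights yolov5s.pt --imgsz 320 --half --device 0   # benchmark all formats at FP16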
@lennartmoritz ah, I just realized val.py is forcing batch-size 1 for most formats on L149. You might want to experiment with expanding this to dynamic TRT and higher-batch-size models (but not batch-size-1 models) and then submit a PR with your updates. https://github.com/ultralytics/yolov5/blob/4a8ab3bc42d32f3e2e9c026b87dc29fba6143064/val.py#L139-L151
Thank you for the hint. I've created a workaround for the benchmark, but I don't have the time right now to look into dynamic batch sizes and add the correct fixed-batch-size handling needed for a PR.
With the workaround, the TensorRT "engine" model is built with the selected batch size, but that batch size must divide the number of images in the benchmark without remainder (128 images for the default COCO128 set) to avoid an error, since batch sizes smaller than the one the TensorRT model is built for are not supported. Custom handling that pads non-full input batches with "placeholder" images would be required.
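A minimal sketch of what that placeholder padding could look like (a hypothetical helper, not part of the repo):

import torch

def pad_to_batch(imgs, batch_size):
    # Pad a partial batch with zero "placeholder" images so a fixed-batch TensorRT engine accepts it.
    # Returns the padded tensor and the count of real images, so padded outputs can be dropped afterwards.
    n = imgs.shape[0]
    if n == batch_size:
        return imgs, n
    pad = torch.zeros((batch_size - n, *imgs.shape[1:]), dtype=imgs.dtype, device=imgs.device)
    return torch.cat((imgs, pad), 0), n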
I did this by adding an elif f == "engine" case like this:
# Export
if f == "-":
    w = weights  # PyTorch format
elif f == "engine":
    w = export.run(weights=weights, imgsz=[imgsz], batch_size=batch_size, include=[f], device=device, half=half)[-1]  # set correct batch size for TensorRT
else:
    w = export.run(weights=weights, imgsz=[imgsz], include=[f], device=device, half=half)[-1]  # all others
assert suffix in str(w), "export failed"
to the original code: https://github.com/ultralytics/yolov5/blob/4a8ab3bc42d32f3e2e9c026b87dc29fba6143064/utils/benchmarks.py#L71-L76
@lennartmoritz ah got it. You probably just want --dynamic TRT export then for the last batch problem.
Does the TRT batched inference show speedup?
@glenn-jocher yes, the batched inference speed has improved notably compared to the earlier benchmarks; I'll add benchmarks with the RTX 2060 below. I agree that a dynamic TRT export should solve the problem. Since benchmarks.py offers no --dynamic flag in its ArgumentParser, I could add a hardcoded dynamic=True to the elif case from my earlier comment.
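A minimal sketch of that hardcoded change (assuming export.run accepts a dynamic keyword mirroring the --dynamic CLI flag):

elif f == "engine":
    # Hypothetical tweak: build the TensorRT engine with dynamic shapes for the selected batch size
    w = export.run(weights=weights, imgsz=[imgsz], batch_size=batch_size, dynamic=True,
                   include=[f], device=device, half=half)[-1]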
For some reason I could not yet replicate the half-precision improvement shown in the benchmarks below in my own application, but I can probably figure that out on my own. Since I only compared total detection times, including input resizing, halving, and upload to and download from the GPU, this might well be a "me problem".
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=16, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=False, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-5 Python-3.7.7 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 2060, 5932MiB)
Setup complete ✅ (6 CPUs, 15.5 GB RAM, 85.8/93.6 GB disk)
Benchmarks complete (87.86s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4114 1.51
1 TorchScript 27.9 0.4114 1.41
2 ONNX NaN NaN NaN
3 OpenVINO NaN NaN NaN
4 TensorRT 30.2 0.4114 1.31
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
benchmarks: weights=models/yolov5/yolov5s.pt, imgsz=320, batch_size=16, data=/usr/src/app/detector/yolov5_61/data/coco128.yaml, device=0, half=True, test=False, pt_only=False, hard_fail=False
Checking setup...
YOLOv5 🚀 2022-8-5 Python-3.7.7 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 2060, 5932MiB)
Setup complete ✅ (6 CPUs, 15.5 GB RAM, 85.8/93.6 GB disk)
Benchmarks complete (248.03s)
Format Size (MB) mAP@0.5:0.95 Inference time (ms)
0 PyTorch 14.1 0.4116 1.29
1 TorchScript 14.1 0.4117 0.99
2 ONNX NaN NaN NaN
3 OpenVINO NaN NaN NaN
4 TensorRT 15.6 0.4111 0.43
5 CoreML NaN NaN NaN
6 TensorFlow SavedModel NaN NaN NaN
7 TensorFlow GraphDef NaN NaN NaN
8 TensorFlow Lite NaN NaN NaN
9 TensorFlow Edge TPU NaN NaN NaN
10 TensorFlow.js NaN NaN NaN
@lennartmoritz this looks good. I looked at the val.py code again, and it seems like TensorRT models at different batch sizes should be handled automatically, though as you mentioned the last batch will produce an error if the engine is not dynamic. If you export TRT with --dynamic, can you run benchmarks at different batch sizes without any code modifications?
On L144: https://github.com/ultralytics/yolov5/blob/fc8758a49bd30526fb21d0683359e86be3a292a8/val.py#L139-L151
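For reference, that check would look roughly like this (a sketch; the --dynamic and --batch-size arguments are assumed to behave as described above and are not verified for every YOLOv5 version):

python export.py --weights yolov5s.pt --include engine --dynamic --device 0            # dynamic-shape TensorRT engine
python val.py --weights yolov5s.engine --data coco128.yaml --batch-size 16 --device 0  # validate at a larger batch size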
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcome!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Search before asking
Question
My system specs are the following:
OS: Ubuntu 20.04.4 LTS
GPU: Nvidia GTX 1080 with 8 GB RAM
CPU: Core i5-11500
RAM: 32 GB DDR4
I have cloned the yolov5 repository and installed all requirements, including the ones necessary for ONNX and TensorRT export, in a new conda environment. I then followed the instructions from the TFLite, ONNX, CoreML, TensorRT Export guide and successfully produced a yolov5s.engine file.
I then used detect.py to perform inference on about 20 JPEG images. The results I got are far from the YOLOv5 Export Benchmarks for GPU.
Part of the output I got using the yolov5s.engine file is as follows:
And this is the output from using yolov5s.pt:
As you can see, the speedup in inference time is nowhere near the one reported in the link at the beginning of this post: using the .engine file I got a 5.3 ms inference time vs 6.9 ms with the PyTorch model.
Is there something I could be doing wrong?
Additional
No response