stark-t / PAI

Pollination_Artificial_Intelligence

Results P1: Inference speed GPU vs. CPU #61

Closed stark-t closed 1 year ago

stark-t commented 1 year ago

valentinitnelav commented 1 year ago

It looks like exporting the weights to ONNX or OpenVINO format and running detection on those can give up to a 3x CPU speedup; similarly, exporting to TensorRT can give up to a 5x GPU speedup: https://github.com/ultralytics/yolov5/issues/6736#issuecomment-1047510928

Then there is the option of half-precision (FP16) inference, but I am not sure whether it only speeds up inference on a GPU or whether it helps on a CPU as well.
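For what it's worth, a minimal sketch of how the exported CPU path could be timed, assuming an ONNX export produced with YOLOv5's export.py (the input name "images" and the 640x640 shape follow the default YOLOv5 export; the file name best.onnx is a placeholder):

import time

import numpy as np
import onnxruntime as ort

# Load the exported model on the CPU execution provider.
sess = ort.InferenceSession("best.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 640, 640).astype(np.float32)  # dummy input batch

sess.run(None, {"images": x})  # warm-up run, not timed
n = 100
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, {"images": x})
print(f"{(time.perf_counter() - t0) / n * 1000:.1f} ms/img")

# Note: as far as I know, FP16 mainly pays off on GPUs; most CPUs have
# no fast FP16 path, so the CPU export is typically kept at FP32.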

EDIT1: Forgot to add the question: should I invest time in trying these options, or just go with a simple detect script set to confidence 0.3 and IoU 0.1 thresholds, running on the test dataset both on a GPU and on a CPU?

EDIT2: Actually, I just realised that YOLOv5 ships a benchmarks.py (either in the root folder or in utils), but YOLOv7 has none. Moreover, there can be problems converting YOLOv7 to other formats (e.g. https://github.com/WongKinYiu/yolov7/issues/1269). So if YOLOv7 does not get the same export support as YOLOv5, I will stay with the simpler approach.

valentinitnelav commented 1 year ago

Hi @stark-t , I just realised that YOLOv7 and YOLOv5 differ in the maximum number of detections per image, max_det. While YOLOv5 lets the user adjust this in detect.py with a default of 1000, YOLOv7 does not expose it and hard-codes max_det = 300 in utils/general.py

I think that for a fair comparison I need to rerun YOLOv5's detect.py with --max-det 300, although I do not expect this to change the results. What are your thoughts about this? This affects #54
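As a side note, a minimal sketch of how the settings could be matched when loading YOLOv5 via torch.hub (the conf, iou, and max_det attributes belong to the ultralytics/yolov5 AutoShape wrapper; paths are placeholders), equivalent to passing --conf-thres, --iou-thres, and --max-det 300 to detect.py:

import torch

# Load a custom checkpoint through the YOLOv5 hub entry point.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.3     # confidence threshold
model.iou = 0.1      # NMS IoU threshold
model.max_det = 300  # cap detections per image, matching YOLOv7's hard-coded value

results = model("image.jpg")  # run inference on one image
results.print()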

valentinitnelav commented 1 year ago

For YOLOv5, the GPU detection speed can be taken from the *.err files obtained from running the scripts yolov5_detect_n_640_rtx.sh & yolov5_detect_s_640_rtx.sh. These scripts run on a GPU, looping through various values of conf and IoU.

The results for GPU are:

- YOLOv5 nano, Job 3273403, file 3273403.err contains:

Fusing layers... 
Model summary: 213 layers, 1769989 parameters, 0 gradients, 4.2 GFLOPs
.
.
.
Speed: 0.3ms pre-process, 9.4ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/job_3273403_loop_detect_on_3219882_yolov5_n_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1
1538 labels saved to runs/detect/job_3273403_loop_detect_on_3219882_yolov5_n_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1/labels

~~Email notification with run time: 2022-09-02T16:46:25: Slurm Job_id=3273403 Name=detect_yolov5_gpu Ended, Run time 02:01:26, COMPLETED, ExitCode 0~~ Warning! This actually refers to the duration of the entire loop cluster job!

- YOLOv5 small, Job 3273410, file 3273410.err contains:

detect: weights=['/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/yolov5/runs/train/3219884_yolov5_s_img640_b8_e300_hyp_custom/weights/best.pt'], source=/home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled/test/images, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.3, iou_thres=0.1, max_det=1000, device=, view_img=False, save_txt=True, save_conf=True, save_crop=False, nosave=True, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom, name=results_at_conf_0.3_iou_0.1, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5 🚀 2022-7-11 Python-3.9.6 torch-1.11.0+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019MiB)

Fusing layers... 
Model summary: 213 layers, 7031701 parameters, 0 gradients, 15.8 GFLOPs
.
.
.
Speed: 0.3ms pre-process, 9.5ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1
1626 labels saved to runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1/labels


~~Email notification with run time: 2022-09-02T16:53:29: Slurm Job_id=3273410 Name=detect_yolov5_gpu Ended, Run time 02:01:14, COMPLETED, ExitCode 0~~ Warning! This actually refers to the duration of the entire loop cluster job!
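As a quick sanity check, the per-image pipeline times reported in the Speed: lines above can be summed (these exclude image loading and result writing, so the wall-clock sec/img figures further below are expected to be higher):

# Per-image pipeline time from the "Speed:" lines, in ms,
# at shape (1, 3, 640, 640): pre-process + inference + NMS.
nano = 0.3 + 9.4 + 1.2
small = 0.3 + 9.5 + 1.2
print(f"nano: {nano:.1f} ms/img, small: {small:.1f} ms/img")
# nano: 10.9 ms/img, small: 11.0 ms/img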
valentinitnelav commented 1 year ago

YOLOv7 writes the inference speed information to the *.log files. However, the info is not as detailed as for YOLOv5: after listing the time needed for each image, it only prints the total time at the very end (see below). The results obtained from running the script yolov7_detect_tiny_640_rtx.sh are:

YOLOv7 tiny, Job 191860, file 191860.err contains (search for "conf_thres=0.3, iou_thres=0.1"):

Namespace(weights=['/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/yolov7/runs/train/191623_yolov7_tiny_img640_b8_e300_hyp_custom/weights/best.pt'], source='/home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled/test/images', img_size=640, conf_thres=0.3, iou_thres=0.1, device='', view_img=False, save_txt=True, save_conf=True, nosave=True, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect/job_191860_loop_detect_on_191623_yolov7_tiny_img640_b8_e300_hyp_custom', name='results_at_conf_0.3_iou_0.1', exist_ok=False, no_trace=False)
Fusing layers... 
 Convert model to Traced-model... 
 traced_script_module saved! 
 model is traced! 
.
.
.
Done. (100.852s)

The 191860.err file contains this info:

Model Summary: 200 layers, 6025525 parameters, 0 gradients, 13.1 GFLOPS
YOLOR 🚀 v0.1-115-g072f76c torch 1.11.0+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019.5625MB)

Unfortunately, due to cluster updates, I didn't get the total run time of cluster job 191860 as an email notification (this was fixed later by IT).
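As a rough cross-check, dividing the printed total by the number of test images gives a ballpark per-image figure (the total also includes model tracing and I/O, so it is only a ballpark):

total_s = 100.852   # "Done. (100.852s)" from the log above
n_images = 1680     # 210 * 8 test images
print(f"{total_s / n_images:.4f} s/img")  # ~0.0600 s/img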

valentinitnelav commented 1 year ago

Note also the parameter counts for each model (taken from the model summaries above):

- YOLOv5 nano: 1,769,989 parameters (4.2 GFLOPs)
- YOLOv5 small: 7,031,701 parameters (15.8 GFLOPs)
- YOLOv7 tiny: 6,025,525 parameters (13.1 GFLOPS)
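A minimal sketch of how such counts can be reproduced, assuming the checkpoints are loaded through the YOLOv5 torch.hub entry point (the path is a placeholder; for YOLOv7 the count is printed in its model summary):

import torch

# Load a checkpoint and sum the sizes of all parameter tensors.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # e.g. 1,769,989 for YOLOv5 nano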

valentinitnelav commented 1 year ago

Here are some first results for YOLOv5 nano CPU vs GPU:

YOLOv5 nano, CPU; 5 iterations with detect.py over the test dataset (210*8 = 1680 images); values in seconds:

296.905122231
297.216807024
297.304537163
296.560027893
296.715367913

average = 296.9404 sec

Roughly, that means:
1680 img = 296.9404 sec
1 img = 296.9404/1680 = 0.17675 sec/img average CPU time for inference


YOLOv5 nano, GPU; 5 iterations with detect.py over the test dataset (210*8 = 1680 images); values in seconds:

89.853897407
90.387605006
90.272118806
89.911479118
90.141063244

average = 90.1132 sec

Roughly, that means:
1680 img = 90.1132 sec
1 img = 90.1132/1680 = 0.05364 sec/img average GPU time for inference
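The averages and per-image times can be reproduced as follows (values taken from the runs above):

# Five iteration times in seconds over the 1680 test images.
cpu = [296.905122231, 297.216807024, 297.304537163, 296.560027893, 296.715367913]
gpu = [89.853897407, 90.387605006, 90.272118806, 89.911479118, 90.141063244]

for name, runs in (("CPU", cpu), ("GPU", gpu)):
    avg = sum(runs) / len(runs)
    print(f"{name}: {avg:.4f} sec total, {avg / 1680:.5f} sec/img")
# CPU: 296.9404 sec total, 0.17675 sec/img
# GPU: 90.1132 sec total, 0.05364 sec/img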

stark-t commented 1 year ago

New best thresholds for the three models to run the speed test (@valentinitnelav ):

- YOLOv5 nano: conf 0.2, IoU 0.5
- YOLOv5 small: conf 0.3, IoU 0.6
- YOLOv7 tiny: conf 0.1, IoU 0.3

valentinitnelav commented 1 year ago

I have the script now, but somehow I can only get access to CPUs, not GPUs. The CPU jobs are running, so I'll get results for those. I think this is an issue on the cluster side, because yesterday I could still get GPUs. This will have to wait until the cluster is available again.

stark-t commented 1 year ago

ok no problem

valentinitnelav commented 1 year ago

CPU detection time results for 5 iterations for each model. These were run on the test dataset.

CPU time, YOLOv5 nano

Job ID 883890, which ran the script yolov5_detect_n_640_cpu_speed_test.sh

Time results extracted from the file job_883890_yolov5_nano_cpu_results_at_0.2_iou_0.5.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

mean(c(310.145018077,
       317.129052547,
       320.828178548,
       321.441730295,
       320.760053670))
# [1] 318.0608

That means: 1680 img = 318.0608 sec on average
1 img = 318.0608/1680 = 0.1893219 sec/img average time for inference (detection)

CPU time, YOLOv5 small

Job ID 883892, which ran the script yolov5_detect_s_640_cpu_speed_test.sh

Time results extracted from the file job_883892_yolov5_small_cpu_results_at_0.3_iou_0.6.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

mean(c(823.321014627,
       815.641425552,
       803.712151891,
       810.795042564,
       806.412835093))
# [1] 811.9765

That means: 1680 img = 811.9765 sec on average
1 img = 811.9765/1680 = 0.4833193 sec/img average time for inference (detection)

CPU time, YOLOv7 tiny

Job ID 883893, which ran the script yolov7_detect_tiny_640_cpu_speed_test.sh

Time results extracted from the file job_883893_yolov7_tiny_cpu_results_at_0.1_iou_0.3.txt (in PAI/detectors/yolov7/runs/detect/detect_speed_jobs on the cluster):

mean(c(687.549698531,
       683.905298020,
       678.930599692,
       674.795639732,
       674.587100398))
# [1] 679.9537

That means: 1680 img = 679.9537 sec on average
1 img = 679.9537/1680 = 0.4047343 sec/img average time for inference (detection)
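Summarising the three CPU averages above as per-image time and throughput:

# Mean total time in seconds for the 1680 test images, per model.
means = {"YOLOv5 nano": 318.0608, "YOLOv5 small": 811.9765, "YOLOv7 tiny": 679.9537}
for name, total in means.items():
    per_img = total / 1680
    print(f"{name}: {per_img:.4f} sec/img, {1 / per_img:.1f} img/sec")
# YOLOv5 nano: 0.1893 sec/img, 5.3 img/sec
# YOLOv5 small: 0.4833 sec/img, 2.1 img/sec
# YOLOv7 tiny: 0.4047 sec/img, 2.5 img/sec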

valentinitnelav commented 1 year ago

GPU results.

Note that the first iteration can take up to twice as long as the other iterations. Perhaps there is some GPU "warm-up" taking place? Possibly related to this: https://github.com/ultralytics/yolov5/issues/5806

To address this, I ran 6 iterations and dropped the results of the first iteration (a sketch of the pattern follows below). See commit https://github.com/stark-t/PAI/commit/b120ad9b07d0e3b1140d421a2ccdf2e95bf2fff6
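In pseudocode terms, the pattern is simply to run one extra iteration and discard the first timing; a minimal sketch (run_detection is a hypothetical stand-in for one detect.py pass over the test set):

import time

def timed_iterations(run_detection, n=5):
    """Run n+1 iterations and drop the first (GPU warm-up) timing."""
    times = []
    for _ in range(n + 1):
        t0 = time.perf_counter()
        run_detection()  # one full pass over the test images
        times.append(time.perf_counter() - t0)
    return times[1:]  # keep only the n post-warm-up timings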

GPU time, YOLOv5 nano

Job ID 1268065, which ran the script yolov5_detect_n_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:11:01: Slurm Job_id=1268065 Name=detect_speed Ended, Run time 00:11:40, COMPLETED, ExitCode 0

Time results extracted from the file job_1268065_yolov5_nano_gpurtx_results_at_0.2_iou_0.5.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

231.833884053 # this will be dropped
93.032718445
92.409938917
93.398442854
92.190620597
92.786358883

# Average the last 5
mean(c(93.032718445,
       92.409938917,
       93.398442854,
       92.190620597,
       92.786358883))
# [1] 92.76362

That means: 1680 img = 92.76362 sec on average
1 img = 92.76362/1680 = 0.05521644 sec/img average time for inference (detection)

It is a bit strange that this is so similar to the YOLOv5 small results. This might be because I do not use the exact same GPU or the exact same node: for each job I get a GPU from a different node (or the same node, whatever is available on the cluster), but always an RTX 2080 Ti with 11 GB of RAM. I have run this several times now and I get very similar results between YOLOv5 nano and small.

GPU time, YOLOv5 small

Job ID 1268064, which ran the script yolov5_detect_s_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:11:03: Slurm Job_id=1268064 Name=detect_speed Ended, Run time 00:13:32, COMPLETED, ExitCode 0

Time results extracted from the file job_1268064_yolov5_small_gpurtx_results_at_0.3_iou_0.6.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

319.840739573 # this will be dropped
93.036818359
93.183861865
92.507176553
92.128141101
94.300841453

# Average the last 5
mean(c(93.036818359,
       93.183861865,
       92.507176553,
       92.128141101,
       94.300841453))
# [1] 93.03137

That means: 1680 img = 93.03137 sec on average
1 img = 93.03137/1680 = 0.05537582 sec/img average time for inference (detection)

GPU time, YOLOv7 tiny

Job ID 1268059, which ran the script yolov7_detect_tiny_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:03:11: Slurm Job_id=1268059 Name=detect_speed Ended, Run time 00:12:16, COMPLETED, ExitCode 0

Time results extracted from the file job_1268059_yolov7_tiny_gpurtx_results_at_0.1_iou_0.3.txt (in PAI/detectors/yolov7/runs/detect/detect_speed_jobs on the cluster):

115.640858734 # this will be dropped; usually the first iteration takes longer (see Job ID 1268047 with 455 sec & 1268032 with 289 sec)
117.910907591
115.626886750
119.962117953
135.876173902
126.866752428

# Average the last 5
mean(c(117.910907591,
       115.626886750,
       119.962117953,
       135.876173902,
       126.866752428))
# [1] 123.2486

That means: 1680 img = 123.2486 sec on average
1 img = 123.2486/1680 = 0.07336226 sec/img average time for inference (detection)
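Putting the per-image averages together, the CPU vs. GPU speedup per model works out as:

# (CPU sec/img, GPU sec/img) from the averages reported above.
sec_per_img = {
    "YOLOv5 nano":  (0.1893219, 0.05521644),
    "YOLOv5 small": (0.4833193, 0.05537582),
    "YOLOv7 tiny":  (0.4047343, 0.07336226),
}
for name, (cpu, gpu) in sec_per_img.items():
    print(f"{name}: {cpu / gpu:.1f}x faster on GPU")
# YOLOv5 nano: 3.4x faster on GPU
# YOLOv5 small: 8.7x faster on GPU
# YOLOv7 tiny: 5.5x faster on GPU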

valentinitnelav commented 1 year ago

I'll close this issue now. I have put the results in the Overleaf manuscript (Table 2).