jkjung-avt / tensorrt_demos

TensorRT MODNet, YOLOv4, YOLOv3, SSD, MTCNN, and GoogLeNet
https://jkjung-avt.github.io/
MIT License
1.75k stars 547 forks

Low FPS after TensorRT optimization #403

Closed Wassssim closed 3 years ago

Wassssim commented 3 years ago

Hello @jkjung-avt,

I recently flashed my new Jetson Nano's SD card with a fresh JetPack-4.5 image, and I followed your JetPack-4.5 guide to the letter. Yet when I run the trt_yolo.py script with a pretrained YOLOv4 model on a video, I only get 11-12 FPS (instead of 22-23), and when I run it on an image I get 12-13 FPS.

And yes, I made sure that the power mode is 10W and that the jetson clocks are ON. If I turn both of those OFF, I get an unbearable 5 FPS.

I would really appreciate your feedback on this issue; it has been bugging me for a while. I'm happy to provide any other information you need.

Best,

Wassim

jkjung-avt commented 3 years ago

Which yolo model are you testing with? Why do you expect to get 22~23 FPS?

Wassssim commented 3 years ago

I'm using a pretrained YOLOv4-tiny with 416x416 input size and FP16 precision. I downloaded the model using the download_yolo.sh script.

In the README of this repo you said that you got 25.5 FPS running this model on the Nano, so I figured I should be getting around the same number, yet I'm only getting 11-12 FPS.

jkjung-avt commented 3 years ago

I just verified the code on my Jetson Nano DevKit with JetPack-4.5.1. I'm able to get 24~25 FPS with the "yolov4-tiny-416" TensorRT engine.

Could you run sudo tegrastats in another terminal (while "trt_yolo.py" is running) and see if you get similar numbers as me (as shown below)?

$ sudo tegrastats
......
RAM 2018/3956MB (lfb 164x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [20%@1479,33%@1479,28%@1479,28%@1479] EMC_FREQ 29%@1600 GR3D_FREQ 99%@921 VIC_FREQ 0%@140 APE 25 PLL@62C CPU@67.5C PMIC@100C GPU@64C AO@73.5C thermal@65.5C POM_5V_IN 7320/7108 POM_5V_GPU 3261/2912 POM_5V_CPU 1141/1402
RAM 2018/3956MB (lfb 164x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [21%@1479,16%@1479,38%@1479,31%@1479] EMC_FREQ 29%@1600 GR3D_FREQ 30%@921 VIC_FREQ 0%@140 APE 25 PLL@62C CPU@67.5C PMIC@100C GPU@64C AO@73C thermal@65.75C POM_5V_IN 7045/7107 POM_5V_GPU 3104/2916 POM_5V_CPU 1180/1397
RAM 2018/3956MB (lfb 164x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [35%@1479,22%@1479,27%@1479,20%@1479] EMC_FREQ 29%@1600 GR3D_FREQ 51%@921 VIC_FREQ 0%@140 APE 25 PLL@62C CPU@67C PMIC@100C GPU@63.5C AO@73.5C thermal@65.75C POM_5V_IN 6938/7103 POM_5V_GPU 2912/2916 POM_5V_CPU 1259/1394
RAM 2018/3956MB (lfb 164x4MB) SWAP 0/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [26%@1479,21%@1479,34%@1479,32%@1479] EMC_FREQ 29%@1600 GR3D_FREQ 41%@921 VIC_FREQ 0%@140 APE 25 PLL@62C CPU@67C PMIC@100C GPU@64C AO@73.5C thermal@65.75C POM_5V_IN 6752/7095 POM_5V_GPU 2641/2910 POM_5V_CPU 1379/1394
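
If it helps to eyeball the numbers, the fields of interest in a tegrastats line (GR3D_FREQ for GPU load, the CPU [...] list for per-core load) can be pulled out with a short script. This is just an illustrative sketch, not part of this repo:

```python
import re

def parse_tegrastats(line):
    """Extract GPU (GR3D) utilization/clock and per-core CPU
    utilization/clock from a single line of `tegrastats` output."""
    gpu = re.search(r"GR3D_FREQ (\d+)%@(\d+)", line)
    cpu_list = re.search(r"CPU \[([^\]]*)\]", line).group(1)
    cores = re.findall(r"(\d+)%@(\d+)", cpu_list)
    return {
        "gpu_util": int(gpu.group(1)),          # percent
        "gpu_mhz": int(gpu.group(2)),           # MHz
        "cpu_util": [int(u) for u, _ in cores], # percent per core
        "cpu_mhz": [int(f) for _, f in cores],  # MHz per core
    }

sample = ("RAM 2018/3956MB (lfb 164x4MB) "
          "CPU [20%@1479,33%@1479,28%@1479,28%@1479] "
          "EMC_FREQ 29%@1600 GR3D_FREQ 99%@921 VIC_FREQ 0%@140")
stats = parse_tegrastats(sample)
print(stats["gpu_util"], max(stats["cpu_util"]))  # prints: 99 33
```
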

Wassssim commented 3 years ago

So this is what I got:

RAM 2864/3956MB (lfb 6x2MB) SWAP 30/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [66%@1479,74%@1479,62%@1479,67%@1479] EMC_FREQ 21%@1600 GR3D_FREQ 8%@921 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@45.5C PMIC@100C GPU@40C AO@48.5C thermal@42.25C POM_5V_IN 6459/5828 POM_5V_GPU 1116/1081 POM_5V_CPU 2866/2262
RAM 2864/3956MB (lfb 6x2MB) SWAP 30/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [73%@1479,69%@1479,71%@1479,61%@1479] EMC_FREQ 21%@1600 GR3D_FREQ 13%@921 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@44C PMIC@100C GPU@40C AO@49C thermal@42.5C POM_5V_IN 6140/5842 POM_5V_GPU 1156/1084 POM_5V_CPU 2432/2270
RAM 2864/3956MB (lfb 6x2MB) SWAP 30/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [67%@1479,66%@1479,64%@1479,65%@1479] EMC_FREQ 21%@1600 GR3D_FREQ 99%@921 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@44C PMIC@100C GPU@40C AO@50C thermal@42.25C POM_5V_IN 6539/5872 POM_5V_GPU 1870/1118 POM_5V_CPU 1950/2256
RAM 2864/3956MB (lfb 6x2MB) SWAP 30/1978MB (cached 0MB) IRAM 0/252kB(lfb 252kB) CPU [76%@1479,60%@1479,81%@1479,62%@1479] EMC_FREQ 21%@1600 GR3D_FREQ 71%@921 VIC_FREQ 0%@192 APE 25 PLL@40C CPU@44.5C PMIC@100C GPU@40C AO@49C thermal@42C POM_5V_IN 6419/5895 POM_5V_GPU 1594/1138 POM_5V_CPU 2308/2258

This time I got an average of 9.2 FPS, which is even worse than before. You can find the full log file here if you want.
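
(Note: the FPS figure printed by trt_yolo.py is, as far as I can tell, a decayed running average rather than a per-frame value, so a few slow frames drag it down gradually. A minimal standalone sketch of that kind of smoothing, written from scratch rather than taken from the repo:)

```python
def smooth_fps(frame_times, decay=0.95):
    """Exponentially decayed running-average FPS over a sequence of
    per-frame processing times (in seconds). Similar in spirit to
    the smoothed FPS that trt_yolo.py displays on screen."""
    fps = 0.0
    for dt in frame_times:
        curr = 1.0 / dt
        fps = curr if fps == 0.0 else fps * decay + curr * (1.0 - decay)
    return fps

# 30 frames at ~110 ms each -> a bit over 9 FPS, like the run above:
print(round(smooth_fps([0.110] * 30), 1))  # prints: 9.1
```
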

jkjung-avt commented 3 years ago

CPU [76%@1479,60%@1479,81%@1479,62%@1479]

This means your Jetson Nano's CPUs are running at the maximum clock speed (1479 MHz), but your CPU utilization rates are higher than mine on average. Are you running any additional applications in the background?

GR3D_FREQ 71%@921

This means the GPU is also running at the maximum clock speed (921 MHz). The GPU utilization rate looks normal.

Wassssim commented 3 years ago

I haven't run any applications in the background myself; I ran the trt_yolo.py script immediately after booting the Jetson up. This is really weird.

I'm currently downloading JetPack 4.5.1, and I will install it using the NVIDIA SDK Manager this time (instead of directly flashing an image with Etcher). I doubt this will make a difference, but who knows.

jkjung-avt commented 3 years ago

It should not matter whether you use SDK Manager or flash the image with Etcher directly. I myself used Etcher to flash the JetPack image onto the MicroSD card.

It might be somewhat of a long shot. But please make sure:

  1. You are using a MicroSD card with decent read/write speed.
  2. You are using a suitable power adapter for the Jetson Nano DevKit. (I'm using a 5V-4A adapter.)

Wassssim commented 3 years ago

So I installed JetPack 4.5.1 using SDK Manager and, as expected, nothing changed for the video. However, when I ran trt_yolo.py on an image, I got 22 FPS. So I think this could be a codec-related issue, which could also explain the higher CPU usage.

As for the SD card and power supply:

  1. I tried flashing different SD cards from different manufacturers, yet the FPS problem still persists, so I don't think it's an SD card issue.
  2. I'm using a 12V battery and a 5V regulator, so the power supply should be fine.

Codyuzumaki commented 3 years ago

Hello!

I'm having the same problem. I'm achieving only 11.5 FPS with yolov4-tiny-416.

Actually, I've also tested the Tiny-608, YOLOv4-608, and YOLOv4-288 models, and these are the FPS averages compared to yours:

[image: FPS comparison table]

As you can see, the YOLOv4-288 barely hits half the FPS of your run, while the YOLOv4-608 FPS is quite similar.

Could there be a problem with the shape reconfiguration (I only change the width and height values in the CFG file), or maybe with the Leaky ReLU of the Tiny model?

I should mention that I'm analyzing separate images extracted from a stored video with a custom Python script.

jkjung-avt commented 3 years ago

I'm analyzing separate images extracted from a stored video with a custom Python script.

What FPS do you get if you use the "trt_yolo.py" in this repo?

Codyuzumaki commented 3 years ago

Hi. I get 10.8 ± 0.2 FPS with trt_yolo.py running inference with YOLOv4-tiny-416.

PS: Thank you for the development and for your help ;)

jkjung-avt commented 3 years ago

I have no idea why you get a much lower FPS. I'm not able to reproduce the problem here.

If you want to dig further, please run "trt_yolo.py" under "cProfile" as shown below. (Keep it running for about 1 minute, then quit and collect the log.)

$ python3 -m cProfile -s cumtime trt_yolo.py --image dog.jpg -m yolov4-tiny-416

Then send me the log so I can analyze which part of the program is taking too long. The log will look like the following:

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    499/1    0.029    0.000   78.272   78.272 {built-in method builtins.exec}
        1    0.000    0.000   78.272   78.272 trt_yolo.py:5(<module>)
        1    0.000    0.000   76.971   76.971 trt_yolo.py:83(main)
        1    0.167    0.167   72.671   72.671 trt_yolo.py:48(loop_and_detect)
     1793    0.383    0.000   62.591    0.035 yolo_with_plugins.py:306(detect)
     1793   40.850    0.023   43.164    0.024 yolo_with_plugins.py:247(do_inference_v2)
     1793    5.640    0.003   11.932    0.007 yolo_with_plugins.py:25(_preprocess_yolo)
    18036    5.133    0.000    5.133    0.000 {built-in method numpy.array}
     1793    5.104    0.003    5.104    0.003 {resize}
     1793    4.568    0.003    4.568    0.003 {waitKey}
        1    0.029    0.029    4.175    4.175 yolo_with_plugins.py:274(__init__)
        1    3.791    3.791    4.142    4.142 yolo_with_plugins.py:269(_load_engine)
     1793    0.016    0.000    3.920    0.002 _asarray.py:139(ascontiguousarray)
     1793    1.077    0.001    2.610    0.001 yolo_with_plugins.py:100(_postprocess_yolo)
     1793    2.037    0.001    2.037    0.001 yolo_with_plugins.py:255(<listcomp>)
     1793    0.198    0.000    1.892    0.001 visualization.py:91(draw_bboxes)
32276/25104    0.239    0.000    1.857    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     1793    1.520    0.001    1.520    0.001 {imshow}
     ......
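
(For reference, the same "Ordered by: cumulative time" report can also be generated programmatically with the standard pstats module. This generic snippet is just an illustration and is not part of the repo:)

```python
import cProfile
import io
import pstats

def top_cumulative(fn, n=5):
    """Profile a callable and return its top-n entries sorted by
    cumulative time -- the same ordering as the log above."""
    pr = cProfile.Profile()
    pr.enable()
    fn()
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumtime").print_stats(n)
    return buf.getvalue()

report = top_cumulative(lambda: sum(i * i for i in range(100_000)))
print("Ordered by: cumulative time" in report)  # prints: True
```
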

Codyuzumaki commented 3 years ago

Hi. I swear I have no idea why (maybe a reboot, or maybe a rest day), but today I'm getting better FPS than yesterday in all models except the YOLOv4-608 (where I'm getting the same). This is the updated table:

[image: updated FPS comparison table]

On the other hand, I ran the test you suggested, both with a video and with a single image, getting 18 FPS with the video and 26.4 FPS with the single-image analysis.

Here are the results:

Image loop analysis:

[image: cProfile output for the image loop]

Video analysis:

[image: cProfile output for the video]

Looking at the time reports, it seems the difference lies in the read method (line 237 of camera.py): reading a frame from the cv2.VideoCapture object costs 21 ms, compared to 3 ms for reading a single image.

I guess the tests where you achieved 25.5 FPS with yolov4-tiny-416 were image analysis, not video, so we can conclude that the top FPS reachable with that model on a video is around 18 FPS.
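
One common way to hide that per-frame read cost is to grab frames in a background thread, so video decoding overlaps with inference. A generic sketch of the idea (not code from this repo; FakeCapture is a stand-in for cv2.VideoCapture):

```python
import queue
import threading

class ThreadedReader:
    """Read frames from a capture object (e.g. cv2.VideoCapture) in a
    background thread, so the ~21 ms read cost overlaps with inference."""
    def __init__(self, cap, maxsize=4):
        self.cap = cap
        self.q = queue.Queue(maxsize=maxsize)
        self.t = threading.Thread(target=self._loop, daemon=True)
        self.t.start()

    def _loop(self):
        # Keep reading until the source signals end-of-stream.
        while True:
            ok, frame = self.cap.read()
            self.q.put((ok, frame))
            if not ok:
                break

    def read(self):
        # Blocks only if the reader thread has fallen behind.
        return self.q.get()

class FakeCapture:
    """Stand-in for cv2.VideoCapture: yields 3 frames, then EOF."""
    def __init__(self):
        self.n = 0
    def read(self):
        self.n += 1
        return (self.n <= 3, self.n if self.n <= 3 else None)

reader = ThreadedReader(FakeCapture())
frames = []
while True:
    ok, frame = reader.read()
    if not ok:
        break
    frames.append(frame)
print(frames)  # prints: [1, 2, 3]
```
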

On a separate issue, I would like to ask you a couple of questions:

Thank you.

jkjung-avt commented 3 years ago

  1. Yes, I always used dog.jpg to test the FPS of the yolo models. I think this makes the results more comparable, since the processing time is less skewed by image-capturing delays.

  2. "onnx_to_tensorrt.py" builds FP16 engines by default. You could find the relevant source code here: https://github.com/jkjung-avt/tensorrt_demos/blob/c818a53af38eb7fbf601c1b24d248c2787f624c5/yolo/onnx_to_tensorrt.py#L137

  3. For batched inference, you might refer to: https://github.com/jkjung-avt/tensorrt_demos/issues/366

Codyuzumaki commented 3 years ago

Awesome!

Thank you very much.