marcoslucianops / DeepStream-Yolo

NVIDIA DeepStream SDK 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models
MIT License

Low GPU utilization #379

Open lzylzylzy123456 opened 1 year ago

lzylzylzy123456 commented 1 year ago

We have a problem with low GPU utilization. We once achieved high GPU utilization and fast inference by running 50 pipelines on 4 T4s. However, after that environment was reset, the reinstalled environment could no longer achieve the same effect. We suspect it is an environmental issue, but we are not sure. Have you ever encountered such a situation? Please take a look at this post, which contains a detailed description: https://forums.developer.nvidia.com/t/low-gpu-utilization/255835

marcoslucianops commented 1 year ago

The limitation is the GPU decoder. You can check it with the command:

watch -n 0.5 "nvidia-smi -a | grep Decoder"
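For longer runs, the same check can be scripted. The sketch below is an assumption-laden alternative (not from this thread): it shells out to `nvidia-smi -q -d UTILIZATION` and pulls out the per-GPU `Decoder` percentage with a regex, so it depends on that output format staying stable across driver versions.

```python
import re
import subprocess
import time

# Matches lines like "        Decoder        : 68 %" in `nvidia-smi -q -d UTILIZATION`
# output; "N/A" values are simply skipped.
DECODER_RE = re.compile(r"Decoder\s*:\s*(\d+)\s*%")

def parse_decoder_util(nvidia_smi_output: str) -> list:
    """Return one decoder-utilization percentage per GPU found in the output."""
    return [int(m) for m in DECODER_RE.findall(nvidia_smi_output)]

def poll(interval_s: float = 0.5, samples: int = 10) -> None:
    """Print per-GPU decoder utilization every `interval_s` seconds."""
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "-q", "-d", "UTILIZATION"],
            capture_output=True, text=True, check=True,
        ).stdout
        print("decoder % per GPU:", parse_decoder_util(out))
        time.sleep(interval_s)

# poll()  # uncomment on a machine with nvidia-smi available
```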

lzylzylzy123456 commented 1 year ago

I have conducted some experiments on my own server using two RTX 3090 graphics cards. The rest of the environment was configured for DeepStream 6.2, and the inference model was YOLOv7. I measured the inference time with two probes, one before and one after the detector bin. In my case, an inference time under 80 ms is considered acceptable.
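The two-probe timing described above could be sketched as follows. This is a hypothetical reconstruction, not the poster's actual code: the timing logic is kept in a plain Python class, and the commented part shows how it might be driven from GStreamer pad probes in a DeepStream Python app (element names like `pgie` and the probe wiring are assumptions).

```python
import time

class LatencyTracker:
    """Pairs enter/exit timestamps keyed by buffer PTS and reports latency in ms."""

    def __init__(self):
        self._start = {}

    def enter(self, pts, now=None):
        # Record when a buffer enters the detector bin.
        self._start[pts] = time.monotonic() if now is None else now

    def exit_ms(self, pts, now=None):
        # Return elapsed milliseconds for this buffer, or None if unseen.
        t1 = time.monotonic() if now is None else now
        t0 = self._start.pop(pts, None)
        return None if t0 is None else (t1 - t0) * 1000.0

tracker = LatencyTracker()

# In a DeepStream Python app the tracker would be driven from pad probes
# (everything below is an assumed sketch, commented out so this file runs
# without GStreamer installed):
#
#   def before_probe(pad, info):
#       tracker.enter(info.get_buffer().pts)
#       return Gst.PadProbeReturn.OK
#
#   def after_probe(pad, info):
#       ms = tracker.exit_ms(info.get_buffer().pts)
#       if ms is not None and ms > 80.0:  # the 80 ms budget from this post
#           print(f"slow frame: {ms:.1f} ms")
#       return Gst.PadProbeReturn.OK
#
#   pgie.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, before_probe)
#   pgie.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, after_probe)
```

Keying on the buffer PTS avoids holding references to the buffers themselves between the two probes.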

Following your method, I made the following attempts: Firstly, I used the program itself to set the GPU ID and tested both with a single card and with two cards. With a single 3090, I was able to run up to 23 pipelines, and the GPU decoder utilization reached around 70%. With two 3090 cards, I could run up to 30 pipelines, and the GPU decoder utilization reached around 50%.

Then, I conducted another experiment by running two instances of the program, each specifying a different GPU for processing. This allowed me to handle 32 pipelines, meaning each program could handle 16 pipelines, and the GPU decoder reached around 50% utilization. In these scenarios, the CPU usage was between 55% and 60%.
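The two-instance setup above (one process per GPU) is usually done by restricting each process to one device. A minimal launcher sketch, assuming `CUDA_VISIBLE_DEVICES` pinning works for the app in question; the `deepstream-app` command and config path in the usage comment are placeholders:

```python
import os
import subprocess

def launch_per_gpu(cmd, gpu_ids):
    """Start one copy of `cmd` per GPU id, each seeing only its own GPU
    via CUDA_VISIBLE_DEVICES. Returns the list of Popen handles."""
    procs = []
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(cmd, env=env))
    return procs

# Placeholder usage; adapt the command and config to your setup:
# procs = launch_per_gpu(["deepstream-app", "-c", "config.txt"], [0, 1])
# for p in procs:
#     p.wait()
```

Inside each process the pinned GPU then appears as device 0, so per-process GPU IDs in the DeepStream config can stay at 0.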

Based on the above experiments, I wonder if the performance of the two 3090 cards is not fully utilized. Since a single card can handle up to 23 pipelines, and it seems that the GPU decoder is not the bottleneck when testing with two cards, I am curious if two RTX 3090 cards can handle 40 pipelines. Do you have any experience or insights to share on this matter?

marcoslucianops commented 1 year ago

In my experience, I can run 2 pipelines on 2 A6000 GPUs with almost the same performance as running 1 pipeline on 1 GPU each. When I ran one process with 2 GPUs, it gave only a ~10-20% improvement. In my benchmarking tests, I saw that the T4/V100 GPU can process only ~650 FPS in the decoder, but the RTX cards can achieve more than 1000 FPS.

Things that you need to check:

lzylzylzy123456 commented 1 year ago

Thank you for your suggestion. Another question has arisen, and I have asked it under this issue: https://github.com/marcoslucianops/DeepStream-Yolo/issues/346. The difference is that I am using YOLOv7.

marcoslucianops commented 1 year ago

I replied there.