jkjung-avt / tensorrt_demos

TensorRT MODNet, YOLOv4, YOLOv3, SSD, MTCNN, and GoogLeNet
https://jkjung-avt.github.io/
MIT License

Concurrent use of GPU and two DLA’s #364

Closed · ghost closed this 3 years ago

ghost commented 3 years ago

NVIDIA provides benchmarks in which a set of networks runs concurrently on the GPU (INT8) and the two DLAs (FP16) with TensorRT on a Jetson AGX Xavier, see: https://developer.nvidia.com/embedded/jetson-agx-xavier-dl-inference-benchmarks

Would this also be possible with this repo, and specifically with the yolov4 model?

Would you consider adding an example in the future, or do you have any suggestions on how to implement this for yolov4 with your repo?

jkjung-avt commented 3 years ago

For concurrent use of DLA cores and GPU, you could run 3 separate processes of "trt_yolo.py": one for DLA0, one for DLA1, and the other for the GPU (INT8).
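As an illustration only, here is a minimal sketch of launching the three processes from Python. It assumes you have already built three separate engines (one targeting each DLA core, plus an INT8 engine for the GPU); the model names (`yolov4-416-dla0`, etc.) and the camera indices are placeholders, not names the repo actually ships with:

```python
# Hypothetical sketch: run 3 trt_yolo.py processes concurrently,
# one per DLA core plus one on the GPU (INT8).
# Model names and camera indices are placeholders -- adapt to your setup.
import subprocess

commands = [
    # engine built for DLA core 0 (FP16)
    ["python3", "trt_yolo.py", "-m", "yolov4-416-dla0", "--usb", "0"],
    # engine built for DLA core 1 (FP16)
    ["python3", "trt_yolo.py", "-m", "yolov4-416-dla1", "--usb", "1"],
    # engine built for the GPU (INT8)
    ["python3", "trt_yolo.py", "-m", "yolov4-416-int8", "--usb", "2"],
]

procs = [subprocess.Popen(cmd) for cmd in commands]
for p in procs:
    p.wait()
```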

However, based on my own test, a lot of layers/operations in the yolov4 model are not supported by DLA core (would fall back to GPU). So you don't get much of a speed-up by doing the above (at least not for yolov4 models).

Tetsujinfr commented 3 years ago

Same observations on my side on my Xavier NX (I am not on max clocks, just the regular 15W profile):

The memory usage for INT8 or DLA0 seems pretty much the same, a bit less for DLA in my experience. So it is not clear that DLA0 is practical for Yolov4 at this stage. I need to try it with Yolov3 to see if it is more useful.

Btw @jkjung-avt, really nice job on this new version of the repo; the boost on FP16 is quite significant vs. the previous version. And INT8 is a really nice piece of work too. You are amazing!

Tetsujinfr commented 3 years ago

I have been testing the DLA vs. INT8 GPU usage and speed for yolov4, yolov3 and yolov3-tiny. The GPU is used just as much with the DLA version as with the INT8 version, for yolov4 as well as yolov3 and yolov3-tiny. INT8 perf is always better than DLA perf, although the difference is bigger for yolov4 than for yolov3 and yolov3-tiny (there might be some bias I do not account for when the framerate is high, though). These are quick perf measurements based on the jtop GPU trend graph.

-> So ultimately I am not sure whether the DLA is used at all. Do you know how I can check that for sure?

If it is used, it looks like it is not of much benefit, since it burns as much GPU as a non-DLA inference, at least with the current state of the repo. It has been really tricky so far to get any use out of the NX DLAs in my experience. I would love NVIDIA to provide more support/demos for those resources.

jkjung-avt commented 3 years ago

@Tetsujinfr Thanks for sharing your test results, and for the compliment.

You might read through the NVIDIA DLA documentation to see which NN layers/operations are supported by the DLA core. Based on the verbose logs thrown out by TensorRT, many of the layers/operations in the yolov4 model are not supported by the DLA and fall back to the GPU.

In summary, if you'd really like to use the NVDLA to speed up inference of your model, you have to design the model so that all of its layers can run on the NVDLA.
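For reference, here is a minimal sketch of how one might build a DLA engine with the TensorRT Python API (API names as of TensorRT 7.x; the ONNX file name is a placeholder). Building with a VERBOSE logger and GPU_FALLBACK enabled makes the build log show which layers are placed on the DLA and which fall back to the GPU:

```python
# Sketch: build an engine targeting DLA core 0 with GPU fallback,
# using a VERBOSE logger so the build log reports which layers
# actually run on the DLA and which fall back to the GPU.
# The ONNX file name is a placeholder.
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("yolov4-416.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30            # 1 GB
config.set_flag(trt.BuilderFlag.FP16)          # DLA requires FP16 or INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # unsupported layers run on GPU
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

engine = builder.build_engine(network, config)
```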

Tetsujinfr commented 3 years ago

Ok, yes, understood. The NVDLA has limitations similar to an Edge TPU's: the limited set of supported operations seems to rule out models more recent than, say, 5 years old. It is not straightforward to use at this stage for me.

Do you know whether the NVIDIA cloud-native demos use the NX NVDLA? https://youtu.be/uS4A0tBFLao If so, do you know which model would have used it (body parts, face, BERT, object detection)? There is no mention of the NVDLA in the demo video.

jkjung-avt commented 3 years ago

@Tetsujinfr Please refer to https://github.com/NVIDIA-AI-IOT/jetson-cloudnative-demo. The 4 cloud-native demos are all containerized. The documentation does not specify whether they use the DLA cores or not. However, it is mentioned that the pose estimation container only works on the Jetson Xavier NX and AGX Xavier. So I guess that model might be running on the DLA cores.