Hi,
About the speed of DeepStream: it's faster than using PyTorch, Darknet, or other frameworks. In a previous version of the repository I included a benchmark, but I removed it from the current version because it's outdated and I need to redo the tests.
Your comparison between models is incorrect, because you can't compare the official DetectNet model with a YOLOv3 or YOLOv4 model. The YOLOv4 model has more layers, is heavier, and requires more computational resources to run. You need to compare it with tiny models (YOLOv4-Tiny, for example) to have a fair comparison. More accurate models will be heavier to run.
For better performance, you need to use FP16 or INT8 mode.
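As a rough sketch (assuming the config_infer_primary_yoloV5.txt layout used in this repo), the precision is selected with the network-mode key in the nvinfer config file:

# network-mode: 0 = FP32, 1 = INT8, 2 = FP16
network-mode=2
# INT8 (network-mode=1) also needs int8-calib-file pointing to a calibration table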
About benchmarks, I will add them soon.
About GPU NMS, I'm still writing the code for it; that's why it's not available in this repo yet.
Hi @marcoslucianops, thank you for your answer. I definitely agree that I should not expect the same performance as DetectNet when running YOLOv3. But 30 FPS on a T4 with 100% GPU utilization seems a little weird. I hope Nvidia will help me understand what the issue is.
I'll try this repository tomorrow and I will let you know how it performs compared to the official Nvidia implementation.
Hi @marcoslucianops, I'd like to try YoloV5 but I can't find the file gen_wts_yoloV5 inside the YoloV5 repository. Previously I used another repository ( https://github.com/wang-xinyu/tensorrtx.git ) to generate the YoloV5 wts file, but it does not seem to generate the cfg file. Would they be the yaml files in the model directory?
It seems I am having an error at build time:
CUDA_VER=11.4 make -C nvdsinfer_custom_impl_Yolo
make: Entering directory '/src/models/yolov5/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo'
g++ -c -o nvdsinfer_yolo_engine.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include nvdsinfer_yolo_engine.cpp
g++ -c -o nvdsparsebbox_Yolo.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include nvdsparsebbox_Yolo.cpp
g++ -c -o yoloPlugins.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include yoloPlugins.cpp
g++ -c -o layers/convolutional_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/convolutional_layer.cpp
g++ -c -o layers/implicit_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/implicit_layer.cpp
g++ -c -o layers/channels_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/channels_layer.cpp
g++ -c -o layers/dropout_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/dropout_layer.cpp
layers/dropout_layer.cpp: In function 'nvinfer1::ILayer* dropoutLayer(float, nvinfer1::ITensor*, nvinfer1::INetworkDefinition*)':
layers/dropout_layer.cpp:14:12: warning: 'output' is used uninitialized in this function [-Wuninitialized]
14 | return output;
| ^~~~~~
g++ -c -o layers/shortcut_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/shortcut_layer.cpp
g++ -c -o layers/route_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/route_layer.cpp
g++ -c -o layers/upsample_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/upsample_layer.cpp
g++ -c -o layers/maxpool_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/maxpool_layer.cpp
g++ -c -o layers/activation_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/activation_layer.cpp
g++ -c -o utils.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include utils.cpp
g++ -c -o yolo.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include yolo.cpp
yolo.cpp: In member function 'nvinfer1::ICudaEngine* Yolo::createEngine(nvinfer1::IBuilder*)':
yolo.cpp:118:85: warning: 'nvinfer1::ICudaEngine* nvinfer1::IBuilder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&)' is deprecated [-Wdeprecated-declarations]
118 | nvinfer1::ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
| ^
In file included from layers/convolutional_layer.h:12,
from yolo.h:29,
from yolo.cpp:26:
/usr/include/x86_64-linux-gnu/NvInfer.h:7990:43: note: declared here
7990 | TRT_DEPRECATED nvinfer1::ICudaEngine* buildEngineWithConfig(
| ^~~~~~~~~~~~~~~~~~~~~
yolo.cpp: In member function 'NvDsInferStatus Yolo::buildYoloNetwork(std::vector<float>&, nvinfer1::INetworkDefinition&)':
yolo.cpp:395:48: error: 'createReorgPlugin' was not declared in this scope; did you mean 'reorgPlugin'?
395 | nvinfer1::IPluginV2* reorgPlugin = createReorgPlugin(2);
| ^~~~~~~~~~~~~~~~~
| reorgPlugin
make: *** [Makefile:83: yolo.o] Error 1
make: Leaving directory '/src/models/yolov5/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo'
Are you using the Triton DeepStream container?
Yes, here's my Dockerfile:
FROM nvcr.io/nvidia/deepstream:6.0-triton
ENV GIT_SSL_NO_VERIFY=1
RUN sh docker_python_setup.sh
RUN update-alternatives --set python3 /usr/bin/python3.8
RUN apt install --fix-broken -y
RUN apt -y install python3-gi python3-gst-1.0 python-gi-dev git python3 python3-pip cmake g++ build-essential \
libglib2.0-dev python3-dev python3.8-dev libglib2.0-dev-bin python-gi-dev libtool m4 autoconf automake
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps && \
git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps.git
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps && \
git submodule update --init
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps/3rdparty/gst-python/ && \
./autogen.sh && \
make && \
make install
RUN pip3 install --upgrade pip
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps/bindings && \
mkdir build && \
cd build && \
cmake -DPYTHON_MAJOR_VERSION=3 -DPYTHON_MINOR_VERSION=8 -DPIP_PLATFORM=linux_x86_64 -DDS_PATH=/opt/nvidia/deepstream/deepstream-6.0 .. && \
make && \
pip3 install pyds-1.1.0-py3-none-linux_x86_64.whl
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps && \
mv apps/* ./
RUN pip3 install --upgrade pip
RUN pip3 install numpy opencv-python
# RTSP
RUN apt update && \
apt install -y python3-gi python3-dev python3-gst-1.0
RUN apt update && \
apt install -y libgstrtspserver-1.0-0 gstreamer1.0-rtsp && \
apt install -y libgirepository1.0-dev && \
apt-get install -y gobject-introspection gir1.2-gst-rtsp-server-1.0
# DEVELOPMENT TOOLS
RUN apt install -y ipython3 graphviz
Change the Triton image to the devel image, or comment out lines 393 to 408 in the yolo.cpp file before compiling it.
Thank you. I commented out the lines and now it works. I have a question about dynamic vs. static batch size: since the batch size used is static, am I obligated to specify it in the config_infer_primary_yolov5.txt file, or can I specify it at runtime like in the following example?
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("config-file-path", "src_deepstream/components/models/yolov5/config_infer_primary_yoloV5.txt")
pgie.set_property("batch-size", batch_size)
You can set it on the nvinfer element, but it can't be changed dynamically. If you set batch-size > 1, it's better to also set the model-engine-file property, with the batch size specified in the engine filename too.
Thank you. Do you mean the file config_infer_primary_yolov5.txt or another one?
I mean adding:
pgie.set_property("model-engine-file", "model....")
The b1 in the filename means batch-size = 1.
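For example, with batch-size 2 it would look roughly like this (a sketch only; the engine filename is hypothetical and must match the engine file that is actually generated, where the b2 suffix corresponds to batch-size = 2):

pgie.set_property("config-file-path", "config_infer_primary_yoloV5.txt")
pgie.set_property("batch-size", 2)
# hypothetical engine filename; the b2 suffix must match the batch-size above
pgie.set_property("model-engine-file", "model_b2_gpu0_fp16.engine")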
Thank you for your help. I really appreciate it!
I am currently running the YOLOv5s model with DeepStream 6.0 on a machine with a Tesla T4, 64 GB of RAM, and a 16-core CPU. The pipeline is pretty simple: it decodes RTSP streams, creates batches, and runs the primary model. I am currently seeing 80-100 FPS. I was able to achieve 200-400 FPS with the same YOLOv5s using custom Python code and TensorRT (and the custom Python code was also doing more than this simple DeepStream pipeline). I am wondering whether there are issues related to DeepStream 6.0, as I'm reading about many people on the forum reporting low FPS.
I also created a gist https://gist.github.com/mfoglio/46fc6ab9015153ddac233c20187f09a6 that you could use to reproduce the issue if you are willing to do that. In order to run the script you just need an RTSP stream, but you can try to use a file too. Other than that, the commands to run the gist are in the readme.txt file. Any tip / help / hint is highly appreciated!
Note: I also created a similar gist for the official Nvidia YOLOv3 implementation here: https://gist.github.com/mfoglio/5dca2c6cc82fc17c71742d6d3c3aaf92 . I created it to ask for help on the Nvidia forum, but it might be useful for comparison here too. With YOLOv3 I only get 32 FPS with full GPU utilization.
This might be related: https://forums.developer.nvidia.com/t/deepstream-6-yolo-performance-issue/194238/31?page=2
I will do some tests and compare it with DeepStream 5.1. About benchmarks (FPS and mAP comparison), I will add them to the repository this week.
@mfoglio Added model benchmarks: https://github.com/marcoslucianops/DeepStream-Yolo#benchmarks
Hi @marcoslucianops, thank you for your reply. Do you know if the code is CPU or GPU bound? All the NMS implementations I found for DeepStream are usually CPU bound, as they do the YOLO post-processing on a single CPU core.
The NMS is done by the CPU in DeepStream. In the future, I want to implement GPU NMS.
I think NMS on the CPU is fine as long as it runs on multiple cores. Do you know whether that's a bottleneck right now?
It's only a bottleneck when you have too many objects in the frames.
I updated the repo, changing the YOLO decoder from CPU to GPU.
DeepStream was using a single CPU core to decode the YOLO output and generate the bboxes. Changing it to the GPU significantly increased the performance on the AGX.
Results: https://github.com/marcoslucianops/DeepStream-Yolo/issues/138
@marcoslucianops @mfoglio How do you measure running performance, i.e., FPS? Is it measured over the whole pipeline or just the pgie (nvinfer) element?
DeepStream officially has a performance benchmark, but it also doesn't say how the FPS is measured.
Thanks!
By the way, I found a tool which works: https://github.com/RidgeRun/gst-perf
It's measured over the whole pipeline. This is the implementation in deepstream-app: https://forums.developer.nvidia.com/t/deepstream-sdk-faq/80236/13
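If you want a rough equivalent in Python (just a sketch, not the actual deepstream-app code), you can attach a buffer probe to the sink pad of the last element and count buffers per second:

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

frame_count = 0
start_time = time.time()

def fps_probe(pad, info, user_data):
    # count every buffer reaching this pad and print the average rate once per second
    global frame_count, start_time
    frame_count += 1
    elapsed = time.time() - start_time
    if elapsed >= 1.0:
        print(f"FPS: {frame_count / elapsed:.1f}")
        frame_count = 0
        start_time = time.time()
    return Gst.PadProbeReturn.OK

# 'sink' is assumed to be the last element of your pipeline (e.g. a fakesink);
# note that downstream of nvstreammux each buffer is a batch, so multiply by the
# batch size if you want frames per second instead of batches per second.
sink.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, fps_probe, None)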
Thanks! I'll go through the official deepstream-app source code. ^_^
Hello,
I just started working with DeepStream 6.0 and I am looking for the best real-time object detector capable of running at about 200-400 FPS on a Tesla T4. So far, I have been a little disappointed by the speed of the official models. I am working with a Tesla T4 on a machine with 64 GB of RAM and a 16-core CPU. I tested the blazing-fast DetectNet, which seems to achieve 1000 FPS (incredible!), but I am looking for a model with higher accuracy.
I am a bit surprised by the slowness of the models above. I have always been able to achieve better performance than this, even when using PyTorch and custom Python code. On a T4 I could always get a few hundred frames per second when using TensorRT, and I suspect even a CPU could achieve faster FPS than the numbers above. I really hope I am doing something wrong, but it is surprising that just switching DetectNet for the Nvidia YOLOv3 gives such poor performance. The GPU utilization at 100% is extremely suspicious.
To sum up, looking for alternatives and answers, I found this repository!
Leaving aside the SSD model (I'm not really interested in it) and the official YoloV3 implementation (it might be bugged), I was wondering whether the slowness of the official Nvidia YoloV4 implementation could be caused by the NMS being done on the GPU.
My questions for you are:
Thank you for your help!