Hi,
About the speed of DeepStream: it's faster than using PyTorch, Darknet, or other frameworks. In a previous version of the repository I included a benchmark, but I removed it from the current version because it's outdated and I need to redo the tests.
Your comparison between models is incorrect, because you can't compare the official DetectNet model with a YOLOv3 or YOLOv4 model. The YOLOv4 model has more layers, is heavier, and requires more computational resources to run. You need to compare it with tiny models (YOLOv4-Tiny, for example) to have a fair comparison. More accurate models will be heavier to run.
For better performance, you need to use FP16 or INT8 mode.
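As a rough sketch (assuming the config_infer_primary_yoloV5.txt layout used in this repo), the precision is selected with the network-mode key in the nvinfer config file:

# network-mode: 0 = FP32, 1 = INT8, 2 = FP16
network-mode=2
# INT8 (network-mode=1) also needs int8-calib-file pointing to a calibration table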
About benchmarks, I will add them soon.
About GPU NMS, I'm still writing the code for it; that's why it's not available in this repo yet.
Hi @marcoslucianops, thank you for your answer. I definitely agree that I should not expect the same performance as DetectNet when running YOLOv3. But 30 FPS on a T4 with 100% GPU utilization seems a little weird. I hope Nvidia will help me understand what the issue is.
I'll try this repository tomorrow and I will let you know how it performs compared to the official Nvidia implementation.
Hi @marcoslucianops, I'd like to try YoloV5 but I can't find the file gen_wts_yoloV5 inside the YoloV5 repository. Previously I used another repository ( https://github.com/wang-xinyu/tensorrtx.git ) to generate the YoloV5 wts file, but it does not seem to generate the cfg file. Would they be the yaml files in the model directory?
It seems I am having an error at build time:
CUDA_VER=11.4 make -C nvdsinfer_custom_impl_Yolo
make: Entering directory '/src/models/yolov5/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo'
g++ -c -o nvdsinfer_yolo_engine.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include nvdsinfer_yolo_engine.cpp
g++ -c -o nvdsparsebbox_Yolo.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include nvdsparsebbox_Yolo.cpp
g++ -c -o yoloPlugins.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include yoloPlugins.cpp
g++ -c -o layers/convolutional_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/convolutional_layer.cpp
g++ -c -o layers/implicit_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/implicit_layer.cpp
g++ -c -o layers/channels_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/channels_layer.cpp
g++ -c -o layers/dropout_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/dropout_layer.cpp
layers/dropout_layer.cpp: In function 'nvinfer1::ILayer* dropoutLayer(float, nvinfer1::ITensor*, nvinfer1::INetworkDefinition*)':
layers/dropout_layer.cpp:14:12: warning: 'output' is used uninitialized in this function [-Wuninitialized]
14 | return output;
| ^~~~~~
g++ -c -o layers/shortcut_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/shortcut_layer.cpp
g++ -c -o layers/route_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/route_layer.cpp
g++ -c -o layers/upsample_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/upsample_layer.cpp
g++ -c -o layers/maxpool_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/maxpool_layer.cpp
g++ -c -o layers/activation_layer.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include layers/activation_layer.cpp
g++ -c -o utils.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include utils.cpp
g++ -c -o yolo.o -Wall -std=c++11 -shared -fPIC -Wno-error=deprecated-declarations -I/opt/nvidia/deepstream/deepstream/sources/includes -I/usr/local/cuda-11.4/include yolo.cpp
yolo.cpp: In member function 'nvinfer1::ICudaEngine* Yolo::createEngine(nvinfer1::IBuilder*)':
yolo.cpp:118:85: warning: 'nvinfer1::ICudaEngine* nvinfer1::IBuilder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&)' is deprecated [-Wdeprecated-declarations]
118 | nvinfer1::ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
| ^
In file included from layers/convolutional_layer.h:12,
from yolo.h:29,
from yolo.cpp:26:
/usr/include/x86_64-linux-gnu/NvInfer.h:7990:43: note: declared here
7990 | TRT_DEPRECATED nvinfer1::ICudaEngine* buildEngineWithConfig(
| ^~~~~~~~~~~~~~~~~~~~~
yolo.cpp: In member function 'NvDsInferStatus Yolo::buildYoloNetwork(std::vector<float>&, nvinfer1::INetworkDefinition&)':
yolo.cpp:395:48: error: 'createReorgPlugin' was not declared in this scope; did you mean 'reorgPlugin'?
395 | nvinfer1::IPluginV2* reorgPlugin = createReorgPlugin(2);
| ^~~~~~~~~~~~~~~~~
| reorgPlugin
make: *** [Makefile:83: yolo.o] Error 1
make: Leaving directory '/src/models/yolov5/DeepStream-Yolo/nvdsinfer_custom_impl_Yolo'
Are you using the Triton DeepStream container?
Yes, here's my Dockerfile:
FROM nvcr.io/nvidia/deepstream:6.0-triton
ENV GIT_SSL_NO_VERIFY=1
RUN sh docker_python_setup.sh
RUN update-alternatives --set python3 /usr/bin/python3.8
RUN apt install --fix-broken -y
RUN apt -y install python3-gi python3-gst-1.0 python-gi-dev git python3 python3-pip cmake g++ build-essential \
libglib2.0-dev python3-dev python3.8-dev libglib2.0-dev-bin python-gi-dev libtool m4 autoconf automake
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps && \
git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps.git
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps && \
git submodule update --init
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps/3rdparty/gst-python/ && \
./autogen.sh && \
make && \
make install
RUN pip3 install --upgrade pip
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps/bindings && \
mkdir build && \
cd build && \
cmake -DPYTHON_MAJOR_VERSION=3 -DPYTHON_MINOR_VERSION=8 -DPIP_PLATFORM=linux_x86_64 -DDS_PATH=/opt/nvidia/deepstream/deepstream-6.0 .. && \
make && \
pip3 install pyds-1.1.0-py3-none-linux_x86_64.whl
RUN cd /opt/nvidia/deepstream/deepstream-6.0/sources/apps/deepstream_python_apps && \
mv apps/* ./
RUN pip3 install --upgrade pip
RUN pip3 install numpy opencv-python
# RTSP
RUN apt update && \
apt install -y python3-gi python3-dev python3-gst-1.0
RUN apt update && \
apt install -y libgstrtspserver-1.0-0 gstreamer1.0-rtsp && \
apt install -y libgirepository1.0-dev && \
apt-get install -y gobject-introspection gir1.2-gst-rtsp-server-1.0
# DEVELOPMENT TOOLS
RUN apt install -y ipython3 graphviz
Change the Triton image to the devel image, or comment out lines 393 to 408 in the yolo.cpp file before compiling it.
Thank you. I commented out the lines and now it works. I have a question about dynamic vs. static batch size: since the batch size used is static, am I obligated to specify it in the config_infer_primary_yolov5.txt file, or can I specify it at runtime like in the following example?
pgie = Gst.ElementFactory.make("nvinfer", "primary-inference")
pgie.set_property("config-file-path", "src_deepstream/components/models/yolov5/config_infer_primary_yoloV5.txt")
pgie.set_property("batch-size", batch_size)
You can set it on the nvinfer element, but it can't be changed dynamically. If you set batch-size > 1, it's better to also set the model-engine-file property, with the batch size specified in the engine filename too.
Thank you. Do you mean the file config_infer_primary_yolov5.txt or another one?
I mean adding:
pgie.set_property("model-engine-file", "model....")
The b1 in the filename means batch-size = 1.
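For example, with batch-size 2 it would look roughly like this (a sketch only; the engine filename is hypothetical and must match the engine file that is actually generated, where the b2 suffix corresponds to batch-size = 2):

pgie.set_property("config-file-path", "config_infer_primary_yoloV5.txt")
pgie.set_property("batch-size", 2)
# hypothetical engine filename; the b2 suffix must match the batch-size above
pgie.set_property("model-engine-file", "model_b2_gpu0_fp16.engine")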
Thank you for your help. I really appreciate it!
I am currently running the YOLOv5s model with DeepStream 6.0 on a machine with a Tesla T4, 64 GB of RAM, and a 16-core CPU. The pipeline is pretty simple: it decodes RTSP streams, creates batches, and runs the primary model. I am currently seeing 80-100 FPS. I was able to achieve 200-400 FPS with the same YOLOv5s using custom Python code and TensorRT (and the custom Python code was also doing more than this simple DeepStream pipeline). I am wondering whether there are issues related to DeepStream 6.0, as I'm reading about many people on the forum reporting low FPS.
I also created a gist https://gist.github.com/mfoglio/46fc6ab9015153ddac233c20187f09a6 that you could use to reproduce the issue if you are willing to do that. In order to run the script you just need an RTSP stream, but you can try to use a file too. Other than that, the commands to run the gist are in the readme.txt file. Any tip / help / hint is highly appreciated!
Note: I also created a similar gist for the official Nvidia YOLOv3 implementation here: https://gist.github.com/mfoglio/5dca2c6cc82fc17c71742d6d3c3aaf92 . I created it to ask for help on the Nvidia forum, but it might be useful for comparison here too. With YOLOv3 I only get 32 FPS with full GPU utilization.
This might be related: https://forums.developer.nvidia.com/t/deepstream-6-yolo-performance-issue/194238/31?page=2
I will do some tests and compare it with DeepStream 5.1. About benchmarks (FPS and mAP comparison), I will add them to the repository this week.
@mfoglio Added model benchmarks: https://github.com/marcoslucianops/DeepStream-Yolo#benchmarks
Hi @marcoslucianops, thank you for your reply. Do you know if the code is CPU or GPU bound? All the NMS implementations I found for DeepStream are usually CPU bound, as they do the YOLO post-processing on a single CPU core.
The NMS is done by the CPU in DeepStream. In the future, I want to implement GPU NMS.
I think NMS on the CPU is fine as long as it runs on multiple cores. Do you know whether that's a bottleneck right now?
It's only a bottleneck when you have too many objects in the frames.
I updated the repo, changing the YOLO decoder from CPU to GPU.
DeepStream was using a single CPU core to decode the YOLO output and generate the bboxes. Changing it to the GPU significantly increased the performance on the AGX.
Results: https://github.com/marcoslucianops/DeepStream-Yolo/issues/138
@marcoslucianops @mfoglio How do you measure running performance, i.e., FPS? Is it measured over the whole pipeline or just the pgie (nvinfer) element?
DeepStream officially has a performance benchmark, but it also doesn't say how the FPS is measured.
Thanks!
By the way, I found a tool which works: https://github.com/RidgeRun/gst-perf
It's measured over the whole pipeline. This is the implementation in deepstream-app: https://forums.developer.nvidia.com/t/deepstream-sdk-faq/80236/13
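If you want a rough equivalent in Python (just a sketch, not the actual deepstream-app code), you can attach a buffer probe to the sink pad of the last element and count buffers per second:

import time
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

frame_count = 0
start_time = time.time()

def fps_probe(pad, info, user_data):
    # count every buffer reaching this pad and print the average rate once per second
    global frame_count, start_time
    frame_count += 1
    elapsed = time.time() - start_time
    if elapsed >= 1.0:
        print(f"FPS: {frame_count / elapsed:.1f}")
        frame_count = 0
        start_time = time.time()
    return Gst.PadProbeReturn.OK

# 'sink' is assumed to be the last element of your pipeline (e.g. a fakesink);
# note that downstream of nvstreammux each buffer is a batch, so multiply by the
# batch size if you want frames per second instead of batches per second.
sink.get_static_pad("sink").add_probe(Gst.PadProbeType.BUFFER, fps_probe, None)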
Thanks! I'll go through the official deepstream-app source code. ^_^
Hello,
I just started working with DeepStream 6.0 and I am looking for the best real-time object detector capable of running at about 200-400 FPS on a Tesla T4. So far, I have been a little disappointed by the speed of the official models. I am working with a Tesla T4 on a machine with 64 GB of RAM and a 16-core CPU. I tested the blazing-fast DetectNet, which seems to achieve 1000 FPS (incredible!), but I am looking for a model with higher accuracy.
I am a bit surprised by the slowness of the models above. I have always been able to achieve better performance than this, even when using PyTorch and custom Python code. On a T4 I could always get a few hundred frames per second when using TensorRT, and I suspect even a CPU could achieve faster FPS than the numbers above. I really hope I am doing something wrong, but it is surprising that just switching DetectNet for the Nvidia YOLOv3 gives such poor performance. The GPU utilization at 100% is extremely suspicious.
To sum up, looking for alternatives and answers, I found this repository!
Leaving aside the SSD model (I'm not really interested in it) and the official YoloV3 implementation (it might be bugged), I was wondering whether the slowness of the official Nvidia YoloV4 implementation could be caused by the NMS being done on the GPU.
My questions for you are:
Thank you for your help!