triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Triton server with python backend slow for YOLO inferencing #4959

Closed rahul1728jha closed 2 years ago

rahul1728jha commented 2 years ago

Objective:

Running YOLOv5 with Triton server to perform inference. The input source is a real-time video stream via an RTSP URL.

Setup: Followed the template below to run my own custom code: https://github.com/triton-inference-server/python_backend/tree/main/examples/add_sub

Server Image: nvcr.io/nvidia/tritonserver:22.08-pyt-python-py3

Code Change:

Changes to examples/custom_yolo/model.py

Running the model.py:

```
cd python_backend
mkdir -p models/custom_yolo/1/
cp examples/custom_yolo/model.py models/custom_yolo/1/model.py
cp examples/custom_yolo/config.pbtxt models/custom_yolo/config.pbtxt
tritonserver --model-repository `pwd`/models
```
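For reference, a minimal sketch of what models/custom_yolo/1/model.py could look like, following the add_sub template structure. The tensor names ("INPUT"/"OUTPUT"), the torch.hub load of YOLOv5, and the preprocessing layout are illustrative assumptions, not the actual code from this issue:

```python
import numpy as np
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Assumption: load YOLOv5s via torch.hub once, at model load time.
        self.model = torch.hub.load("ultralytics/yolov5", "yolov5s")
        self.model.to("cuda").eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumed input: preprocessed frames as FP32 NCHW.
            frames = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            with torch.no_grad():
                preds = self.model(torch.from_numpy(frames).to("cuda"))
            # YOLOv5 returns raw predictions for tensor inputs; send the
            # detection tensor back as the model output.
            det = preds[0] if isinstance(preds, (list, tuple)) else preds
            out = pb_utils.Tensor("OUTPUT", det.cpu().numpy().astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.model = None
```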

Client Image: nvcr.io/nvidia/tritonserver:22.08-py3-sdk (started with /bin/bash)

Code Change:

Changes to client.py -- the input source is an RTSP stream

Running client: python3 triton_inference_server_python_backend/examples/custom_yolo/client.py
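For reference, a minimal sketch of what examples/custom_yolo/client.py might look like with an RTSP source, using the gRPC client from the SDK container. The RTSP URL, model/tensor names, and preprocessing are assumptions; the original client code is not shown in this issue:

```python
import cv2
import numpy as np
import tritonclient.grpc as grpcclient

RTSP_URL = "rtsp://example.com/stream"  # hypothetical stream URL

client = grpcclient.InferenceServerClient(url="localhost:8001")
cap = cv2.VideoCapture(RTSP_URL)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Assumed preprocessing: resize to 640x640, BGR->RGB, NCHW, FP32 in [0,1].
    img = cv2.cvtColor(cv2.resize(frame, (640, 640)), cv2.COLOR_BGR2RGB)
    img = img.transpose(2, 0, 1)[None].astype(np.float32) / 255.0

    inp = grpcclient.InferInput("INPUT", list(img.shape), "FP32")
    inp.set_data_from_numpy(img)
    result = client.infer(model_name="custom_yolo", inputs=[inp])
    detections = result.as_numpy("OUTPUT")
    # ... postprocess/draw detections here ...

cap.release()
```

Note that a synchronous infer() call per frame, as sketched here, adds network and serialization overhead on top of the model itself, which is one reason end-to-end FPS can drop relative to running YOLO in-process.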

Results: Ran the Triton server on an Azure GPU VM (Standard NC6s v3: 6 vCPUs, 112 GiB memory).

The Triton server performs at about 15 FPS, which is very slow. Without Triton, running YOLO alone achieves >= 30 FPS.

Question

My final inferencing pipeline is:

krishung5 commented 2 years ago

Hi @rahul1728jha, there are several factors that can affect performance, such as the model configuration and HTTP/gRPC latency. I would suggest using Perf Analyzer and Model Analyzer to better understand the possible bottlenecks and find a better configuration for the custom model. For video streaming workloads, we would recommend Nvidia DeepStream, which has a Triton plugin that you might be interested in.
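For example, a typical Perf Analyzer run from the SDK container could look like the following; the model name and endpoint are assumptions and have to match the deployed model, and --shape is only needed if the model's input has variable dimensions:

```
perf_analyzer -m custom_yolo -u localhost:8001 -i grpc --concurrency-range 1:4
```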

rahul1728jha commented 2 years ago

@krishung5 Thanks for your reply.

rahul1728jha commented 2 years ago

@krishung5 Thanks a lot for your reply. I have one final question about the use case below.

My final inferencing pipeline is:

Is DeepStream enough for the entire architecture, or is Triton needed?

My objective is:

Any help would be appreciated.

krishung5 commented 2 years ago

@tanmayv25 Are you familiar enough with Nvidia DeepStream to provide more context here?

rahul1728jha commented 2 years ago

@krishung5 I do not have much familiarity with Nvidia DeepStream. I have read that DeepStream is a suitable framework for implementing my use case, so I was wondering if DeepStream is the correct choice for the use case mentioned above.

tanmayv25 commented 2 years ago

Unfortunately, I too don't have any hands-on experience with Nvidia DeepStream. From their documentation, it definitely looks like they support most of the use cases. For the Triton plugin within DeepStream, they say TensorFlow and PyTorch backends are supported, so I am not sure whether custom Triton backends would be supported. The Python backend suffers from extra data copies, which will have an adverse effect on performance. You can write your custom logic as a C++ backend to get more performance. See example backends here: https://github.com/triton-inference-server/backend/tree/main/examples

It looks like there are lots of webinars and technical blog posts here: https://developer.nvidia.com/deepstream-getting-started#introduction

rahul1728jha commented 2 years ago

@tanmayv25 Thanks a lot for your help. Will go through that.
