wb666greene / AI-Person-Detector

Python AI "person detector" using Coral TPU or Movidius NCS/NCS2

Performance stats from readme.md: TPU.py with Nano 3 fps HD and UHD rtsp streams. #3

Open ozett opened 4 years ago

ozett commented 4 years ago

Could you clarify whether the FPS from your last-mentioned benchmark on the Nano vs. the i7 PC is with AI processing?

Are the framerates measured with RTSP grabbing AND inference on the Nano (which is capable of both)? But what about the i7 PC? That could only be CPU inference, not GPU, right?

wb666greene commented 4 years ago

The i7-4500 "mini-PC" is running a single Coral TPU, same as the Nano. It's much more efficient at decoding the rtsp streams; some of that is Intel vs. ARM for these kinds of I/O-intensive "multimedia" workloads. The exact same Python code is running on each system.

The framerate is the observed framerate -- the number of images processed by the TPU thread divided by the elapsed time, calculated when the thread exits.
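To illustrate that bookkeeping, here is a minimal sketch; tpu_worker and the sleep-based detect stand-in are illustrative only, not the actual project code:

import queue
import threading
import time

def tpu_worker(frames, detect):
    # count every frame run through the detector; the observed rate is
    # frames processed divided by elapsed time, reported at thread exit
    start = time.time()
    processed = 0
    while True:
        frame = frames.get()
        if frame is None:          # sentinel: producer is done
            break
        detect(frame)              # stand-in for the TPU inference call
        processed += 1
    elapsed = time.time() - start
    print("observed fps: %.1f" % (processed / elapsed))

frames = queue.Queue(maxsize=10)
t = threading.Thread(target=tpu_worker,
                     args=(frames, lambda f: time.sleep(0.01)))
t.start()
for i in range(100):
    frames.put(i)                  # would be camera frames in practice
frames.put(None)
t.join()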

All these tests were done with local rtsp stream decoding.

Again, I'm not looking for the highest possible framerate on a single camera; I'm looking to support as many cameras as practical with round-robin sampling of the cameras. I've set the camera frame rates to 3-5 fps (5 is the minimum on some camera models). This saves decoding a lot of frames that would only end up being dropped, and tends to reduce the rtsp latency, which unfortunately is typically 2-4 seconds.
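A minimal sketch of what round-robin sampling can look like (illustrative only; cameras is assumed to be a list of cv2.VideoCapture objects):

import itertools

def round_robin(cameras):
    # visit each camera in turn, skipping any that has no frame ready,
    # so every camera gets an equal share of the single TPU
    for i in itertools.cycle(range(len(cameras))):
        ok, frame = cameras[i].read()
        if ok:
            yield i, frame

Each (index, frame) pair would then be handed to the TPU thread's queue.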

ozett commented 4 years ago

Thanks for commenting. I am still trying out some pieces of your architecture. Now I am wondering about the possible framerates and loop time - without inference & without decoding!

I can only get 24 fps with a nearly empty loop in Python. Does this match your observation? Does this depend only on the CPU? Do you have any hints or insights on this?

I am trying to get a suitable framerate on inference with some of my cams, going step by step towards this. But only 20 fps with an empty loop is disappointing. Maybe an old CPU? Does this depend on the CPU?

Any advice to bring the framerate up?
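For reference, a truly empty Python loop runs at millions of iterations per second even on old hardware, so a 20-24 fps ceiling almost certainly means the "nearly empty" loop still contains the frame grab and/or display call. A minimal timing harness of the kind presumably shown in the screenshots (a reconstruction, not the exact code):

import time

frames = 0
tic = time.time()
while frames < 200:
    # "nearly empty" body; in practice the cap.read()/imshow() calls
    # that remain here are what dominate the per-iteration time
    frames += 1
print("FPS: %.1f" % (frames / (time.time() - tic)))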


Edit: my impression is the CPU is too old and therefore too weak even for a nearly empty Python loop...

ozett commented 4 years ago

I tried the same nearly empty loop on the Jetson Nano and got only 10 FPS (exactly the same measurement code), so the poor FPS times for an empty Python loop must have something to do with CPU power!?

(I did not try to measure and include rtsp/inference on the PC vs. the Nano, but I think I will try that next.)


ozett commented 4 years ago

On my way to finding my bearings on raising plain Python performance I found some specs. As I am sticking with SSD-MobileNet v2 (I get good detection results with it), this becomes interesting later for comparing benchmark times...

[Q-Engineering benchmark table] https://qengineering.eu/deep-learning-with-raspberry-pi-and-alternatives.html

ozett commented 4 years ago

You have gone the right way with multiprocessing... I guess.


--- Edit: just for comparison, results out of a virtual machine on my Dell PowerEdge server: [screenshot]

ozett commented 4 years ago

STRANGE! On the Nano I just had at hand (where I tested the CPU performance) I found jetson.inference missing. I wanted to run the TensorFlow 2.2 PC script without further code changes. I installed jetson.inference from Adrian's guide here and got a loss of 4 fps in my empty loop.


wb666greene commented 4 years ago

Your code appears to be cut off so I can't see the whole client.publish call, but unless you've initialized mqtt to use its own thread, it's a synchronous call which will likely dominate the loop timing.
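With paho-mqtt, for example, moving the network loop to its own thread is one call; a minimal sketch (the broker address and topic are hypothetical):

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("192.168.14.2", 1883)   # hypothetical broker address
client.loop_start()                    # network I/O now runs in a background thread

# publish() now just hands the message to the background thread
client.publish("ai/detect", "person", qos=0)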

You are never going to get the best frame rate with everything in a single main thread. Multi-threading or multi-processing is not Python's strongest suit, but it helps a lot anytime there is an I/O operation in a thread, since other threads can run while the I/O happens.
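For instance, moving the rtsp read into its own thread keeps the main thread free for inference while cap.read() blocks on network I/O; a minimal sketch (illustrative, not the project's actual grabber):

import threading
import cv2

class RtspGrabber(threading.Thread):
    # read frames in a dedicated thread; the GIL is released during the
    # blocking network I/O, so inference can run concurrently
    def __init__(self, url):
        super().__init__(daemon=True)
        self.cap = cv2.VideoCapture(url)
        self.frame = None
        self.lock = threading.Lock()

    def run(self):
        while True:
            ok, img = self.cap.read()
            if ok:
                with self.lock:
                    self.frame = img   # keep only the newest frame

    def latest(self):
        with self.lock:
            return self.frame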

Part of the reason I've moved the file saving to node-red is to get "true" multiprocessing for the file writes and sending notifications.

Adrian's example only got ~5 fps on the Nano. You have to realize that the AI typically runs on 300x300 pixel images, but useful cameras start at D1 (704x480) resolution, and frames need to be resized before doing the inference.

It's amazing, and a mystery to me, how a 3840x2160 image frame resized to 300x300 can be so effective at detecting a person!
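The resize itself is a single OpenCV call; a minimal sketch ("snapshot.jpg" is a placeholder for a grabbed frame):

import cv2

frame = cv2.imread("snapshot.jpg")     # e.g. a 3840x2160 UHD frame
inp = cv2.resize(frame, (300, 300))    # what the SSD detector actually sees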

My tests during development found Python multithreading to be better than Python multiprocessing for everything but the rtsp2mqtt utility. I suspect it's the large amount of binary data needing to be exchanged among processes and the overhead of Python's "pickling" of binary data; multithreading avoids this overhead. I started with multithreading as it seemed better documented and expected multiprocessing to be an improvement, but it generally wasn't.

Running a variation of my AI where I read frames (approx. D1 size) from a 30 fps mp4 video file, my i7-6700K desktop gets nearly 80 frames per second from the TPU and processes the file ~3X faster than real time. For this variation the queue put and get were made blocking calls to ensure every frame is processed.
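A minimal sketch of that blocking hand-off between the file reader and the TPU thread ("clip.mp4" is a placeholder; within one process a queue.Queue passes object references, so no pickling is involved):

import queue
import threading
import cv2

frames = queue.Queue(maxsize=8)        # small buffer between reader and TPU thread

def reader(path):
    cap = cv2.VideoCapture(path)
    while True:
        ok, img = cap.read()
        if not ok:
            break
        frames.put(img, block=True)    # blocks when full: no frame is dropped
    frames.put(None)                   # sentinel: end of file

threading.Thread(target=reader, args=("clip.mp4",), daemon=True).start()
while True:
    img = frames.get(block=True)       # blocks until a frame is available
    if img is None:
        break
    # run the TPU inference on img here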

As for your Q-Engineering table: I've never gotten anywhere near those framerate numbers with an NCS2 on a Pi4.

ozett commented 4 years ago

Yes, thanks for commenting. It will be a help. I should put mqtt into a separate thread and do "multi-threading".

I also found https://github.com/jkjung-avt/tensorrt_demos very helpful and have now tried out some code fragments.

SURPRISE: I got 20 FPS for displaying rtsp-grabbed images from my cam at 640x360 (it seems it must be a straight divisor for gstreamer), BUT I got only 10 FPS with only initializing inference. Not even using it: just loading the libs cuts the FPS in half on the Nano. (Maybe I should try to send grabbed images with mqtt and measure performance?) HUHHH!

import time
import subprocess

import cv2

# LOADING this means losing 10 FPS!!!
# TensorRT
#import jetson.inference
#import jetson.utils
#net = jetson.inference.detectNet("ssd-mobilenet-v2")

# FUNCTION: test for availability of an X11 display
def X_is_running():
    p = subprocess.Popen(["xset", "-q"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    return p.returncode == 0

# FUNCTION: toggle fullscreen (helper as in jkjung-avt/tensorrt_demos,
# defined here so the 'f' key handler below does not raise a NameError)
def set_display(window_name, full_scrn):
    prop = cv2.WINDOW_FULLSCREEN if full_scrn else cv2.WINDOW_NORMAL
    cv2.setWindowProperty(window_name, cv2.WND_PROP_FULLSCREEN, prop)

#grayscale
#cap1_url = "rtspsrc location=rtspt://usr:pwd@192.168.14.117/Streaming/Channels/101 ! decodebin ! nvvidconv ! video/x-raw,format=I420,width=1061,height=600 ! videoconvert ! appsink sync=0"
# working. ! https://github.com/jkjung-avt/tensorrt_demos/blob/master/trt_ssd_async.py
cap1_url = "rtspsrc location=rtspt://usr:pwd@192.168.14.117/Streaming/Channels/101 latency=200 ! rtph265depay ! h265parse ! omxh265dec ! nvvidconv ! video/x-raw,width=640,height=360,format=(string)BGRx ! videoconvert ! appsink sync=0"
cap1 = cv2.VideoCapture(cap1_url)

WINDOW_NAME = 'TrtSsdDemoAsync'
full_scrn = False
fps = 0.0

# check for X once, up front: spawning xset via subprocess on every
# frame would itself cost a noticeable amount of FPS
have_x = X_is_running()

while True:

    tic = time.time()

    ret1, img1 = cap1.read()
    if not ret1:
        print('not ret1 = unable to read from pipeline')
        # release, then re-open the stream and try again
        cap1.release()
        cap1 = cv2.VideoCapture(cap1_url)
        continue

    # show on desktop
    if have_x:
        cv2.imshow(WINDOW_NAME, img1)

        toc = time.time()
        curr_fps = 1.0 / (toc - tic)
        # calculate an exponentially decaying average of the fps number
        fps = curr_fps if fps == 0.0 else (fps * 0.95 + curr_fps * 0.05)
        print("FPS: {:.1f}".format(fps))

        key = cv2.waitKey(1)
        if key == 27:  # ESC key: quit program
            break
        elif key in (ord('F'), ord('f')):  # toggle fullscreen
            full_scrn = not full_scrn
            set_display(WINDOW_NAME, full_scrn)

cap1.release()
cv2.destroyAllWindows()
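Regarding the idea above of sending grabbed images over MQTT to measure the overhead, a minimal sketch reusing img1 from the loop above (paho-mqtt assumed; broker address and topic are hypothetical):

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("192.168.14.2", 1883)   # hypothetical broker address
client.loop_start()                    # keep the network I/O off the grab loop

ok, jpg = cv2.imencode(".jpg", img1)   # compress the grabbed frame first
if ok:
    client.publish("cam1/image", jpg.tobytes(), qos=0)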