naisy / realtime_object_detection

Plug and Play Real-Time Object Detection App with Tensorflow and OpenCV. No Bugs No Worries. Enjoy!
MIT License

Can you please give me some suggestions for my use case #63

Open varun-tangoit opened 5 years ago

varun-tangoit commented 5 years ago

Hi,

I'm very impressed with this approach, and I plan to apply the same multi-threading with a split model to my use case. I'm currently working on face detection with recognition, using both an MTCNN model and a FaceNet model. When I run both on a Jetson TX2, performance is around 7-8 FPS. Similarly, with YOLO object detection I get around 3-4 FPS. Can you please suggest how I should proceed with this problem? I'm stuck on it, and any help would be much appreciated.

Thanks

naisy commented 5 years ago

Hi @varun-tangoit,

I think the following repository is helpful for realtime face detection: https://github.com/naisy/realtime_face_detection After face detection with SSD, you can achieve recognition by replacing the PNG image part with a CNN face classifier. Please compare this SSD with your MTCNN.
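For illustration, here is a minimal sketch of that detect-then-classify flow. detect_faces() is a hypothetical stand-in for the SSD detection session (returning normalized boxes and scores) and face_classifier() is a hypothetical stand-in for the CNN recognition model; the 96x96 input size is an assumption, not a requirement.

import cv2
import numpy as np

def classify_detected_faces(frame, detect_faces, face_classifier, score_th=0.6, input_size=96):
    # crop each SSD face detection and feed it to the CNN classifier
    h, w = frame.shape[:2]
    boxes, scores = detect_faces(frame)   # normalized [ymin, xmin, ymax, xmax]
    results = []
    for box, score in zip(boxes, scores):
        if score < score_th:
            continue
        ymin, xmin, ymax, xmax = box
        y1, x1, y2, x2 = int(ymin*h), int(xmin*w), int(ymax*h), int(xmax*w)
        face = frame[y1:y2, x1:x2]
        if face.size == 0:
            continue
        face = cv2.resize(face, (input_size, input_size))
        face = np.expand_dims(face.astype(np.float32) / 255.0, axis=0)
        results.append(face_classifier(face))   # embedding or class scores
    return results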

If you really have to use MTCNN, you need to write code to parse and split the MTCNN graph. However, splitting may actually slow things down with complicated models.

varun-tangoit commented 5 years ago

Hey Hi,

Thanks for the response. I haven't seen any issues with detection alone, but when I try to run both the MTCNN model and recognition, it gets very slow. I think the FaceNet recognition model is the issue, so I need to split the graph and multi-thread both models. Can you please give me more detail on the recognition part? I don't know how to apply the same procedure there.

naisy commented 5 years ago

Hi @varun-tangoit,

Please refer to the following URL for splitting the graph: https://github.com/naisy/realtime_object_detection/blob/master/About_Split-Model.md View the graph diagram in TensorBoard and look for nodes that are easy to split at by eye. When loading the PB file, add new input nodes for the split target nodes. The inputs of the first graph are the default inputs, and its outputs are the split target nodes. The inputs of the second graph are the new input nodes, and its outputs are the default outputs.

before: default inputs -> (normal graph) -> default outputs
after:  default inputs -> (1st graph) -> split target nodes
        new split target nodes -> (2nd graph) -> default outputs

The principle is straightforward, but with complex models you will be fighting a lot of errors due to the unclear computation order of the nodes.
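As a rough illustration of the idea (not the exact code in About_Split-Model.md), assuming TF 1.x, a frozen SSD-style graph, and a hypothetical split node named 'Split/Target':

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# 1st graph: default inputs -> split target node
graph1 = tf.Graph()
with graph1.as_default():
    tf.import_graph_def(graph_def, name='')
    image_tensor = graph1.get_tensor_by_name('image_tensor:0')
    split_out = graph1.get_tensor_by_name('Split/Target:0')
sess1 = tf.Session(graph=graph1)

# 2nd graph: a new placeholder replaces the split target node -> default outputs
graph2 = tf.Graph()
with graph2.as_default():
    split_in = tf.placeholder(tf.float32, shape=None, name='Split/Target_new')
    tf.import_graph_def(graph_def, input_map={'Split/Target:0': split_in}, name='second')
    boxes = graph2.get_tensor_by_name('second/detection_boxes:0')
sess2 = tf.Session(graph=graph2)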

For threading, you can pass the 1st graph's outputs to the 2nd graph's inputs with a thread-safe queue. Of course, the 1st graph and 2nd graph each run on their own thread, and each loop checks whether a value is in the queue or not.
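A minimal threading sketch of that queue hand-off, reusing the sess1/sess2, image_tensor, split_out, split_in and boxes names assumed in the previous sketch; frames stands for any iterable of already batch-expanded input images, and Python 3's queue module is assumed. As a simplification, q.get() blocks instead of polling in a loop.

import queue
import threading

q = queue.Queue(maxsize=4)   # thread-safe queue between the two half-graphs

def first_stage(frames):
    for frame in frames:
        mid = sess1.run(split_out, feed_dict={image_tensor: frame})
        q.put(mid)
    q.put(None)              # sentinel: tells the second stage to stop

def second_stage(results):
    while True:
        mid = q.get()        # blocking get instead of an explicit polling loop
        if mid is None:
            break
        results.append(sess2.run(boxes, feed_dict={split_in: mid}))

results = []
t1 = threading.Thread(target=first_stage, args=(frames,))
t2 = threading.Thread(target=second_stage, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()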

BTW, is FaceNet slow even with only one face?

varun-tangoit commented 5 years ago

Yes @naisy. When I run a sample video with a group of people in it, recognition takes too much time even for one person per frame; I only get around 2-3 FPS with MTCNN and FaceNet together. I am referring to this GitHub repo: https://github.com/vudung45/FaceRec I have been working with both face recognition and darkflow. I don't get as much performance on the Jetson TX2, even though it reaches a higher FPS when I run it on my laptop. I am trying this approach for both problems.

naisy commented 5 years ago

Hi @varun-tangoit,

I don't use FaceNet, but it seems to be slow because it uses ResNet. If you split a model that passes around a large number of arrays, the overhead of copying values is too large and it becomes slower instead.

I think it is better to change this to a smaller CNN in order to increase speed. Alternatively, speed can be improved by using a model trained with a reduced number of nodes.

The input size of FaceNet seems to be 160x160; how about changing this to 80x80 and retraining? I think this is probably more realistic than splitting the graph.
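For reference, a minimal sketch of what a smaller embedding CNN at 80x80 could look like (tf.keras; the layer sizes and 128-d embedding are illustrative assumptions, not a drop-in replacement for a trained FaceNet):

import tensorflow as tf

def small_embedding_cnn(input_size=80, embedding_dim=128):
    inputs = tf.keras.layers.Input(shape=(input_size, input_size, 3))
    x = inputs
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(embedding_dim)(x)
    # L2-normalize so embeddings can be compared by cosine/Euclidean distance
    outputs = tf.keras.layers.Lambda(lambda t: tf.nn.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(inputs, outputs)

model = small_embedding_cnn()
model.summary()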

naisy commented 5 years ago

Hi @varun-tangoit,

When I did facial expression classification with a Keras Xception model instead of FaceNet, I set the input size to 48x48.

varun-tangoit commented 5 years ago

Great. Sorry, we have already completed our code with the FaceNet and MTCNN models; everything works fine on a local CPU and on an Alienware 1080 Ti desktop GPU. It is only when we try to integrate it on the Jetson TX2 that we have issues, but somehow we need to optimize it for the edge device.

varun-tangoit commented 5 years ago

Are there any other approaches to improve the performance? I found another solution, TensorRT, but I can't find any examples of face detection/recognition with a Python implementation.

naisy commented 5 years ago

Can you reduce the FaceNet input size and retrain it? That is probably the best option.

I have recently been trying to run DeepLab v3 on an RC car. That model does not work well with splitting or threading, and I have not looked into TensorRT/C++ yet.

Instead, by reducing the input size at training time from 513x513 to 160x120, I was able to speed it up from 40 FPS to 230 FPS on a GTX 1060. Accuracy depends on the training data and the weight ratio of each class, so acceptable accuracy can be reached by trial and error on the train/val data. On the TX2, it should improve from 4 FPS to 20 FPS. (I have not tried it on the TX2 yet.)

varun-tangoit commented 5 years ago

Sorry, we are using a pretrained model. In the same way, is darkflow object detection also expected to take this much time? I have two sets of weights, object detection and number-plate detection, running in a single project, and that performance is also around 2-3 FPS.

naisy commented 5 years ago

Indeed, I understand the situation. Look at CPU/GPU utilization with tegrastats. If the usage rate is about 20-30%, it may be possible to speed up with simple parallel processing.

This is how I am currently running Mask R-CNN.

before: camera -> detection -> visualization
after:  camera -> detection0 -> frame0
        camera -> detection1 -> frame1
        camera -> detection2 -> frame2
        camera -> detection3 -> frame3
        frame0 -> visualization
        frame1 -> visualization
        frame2 -> visualization
        frame3 -> visualization

I also tried implementing this method with ssd_mobilenet_v1, but because it was slower than the split version, I retired it. Since the CPU/GPU usage cannot be balanced well, it ends up slower than a perfectly split model, but it is still faster than doing nothing at all.

About ssd_mobilenet_v1:
non-split: 100 FPS
split: 160 FPS
mtdetection: 140 FPS

You can try multi-detection ssd_mobilenet_v1. Add the following code to my repo.

lib/mtdetection_nms_v2.py

import numpy as np
from tf_utils import visualization_utils_cv2 as vis_util
from lib.session_worker import SessionWorker
from lib.mtload_graph_nms_v2 import LoadFrozenGraph
from lib.load_label_map import LoadLabelMap
from lib.mpvariable import MPVariable
from lib.mpvisualizeworker import MPVisualizeWorker, visualization
from lib.mpio import start_sender

import time
import cv2
import tensorflow as tf
import os

import sys
PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3
if PY2:
    import Queue
elif PY3:
    import queue as Queue

class NMSV2():
    def __init__(self):
        return

    def start(self, cfg):
        """ """ """ """ """ """ """ """ """ """ """
        GET CONFIG
        """ """ """ """ """ """ """ """ """ """ """
        FORCE_GPU_COMPATIBLE = cfg['force_gpu_compatible']
        SAVE_TO_FILE         = cfg['save_to_file']
        VISUALIZE            = cfg['visualize']
        VIS_WORKER           = cfg['vis_worker']
        VIS_TEXT             = cfg['vis_text']
        MAX_FRAMES           = cfg['max_frames']
        WIDTH                = cfg['width']
        HEIGHT               = cfg['height']
        FPS_INTERVAL         = cfg['fps_interval']
        DET_INTERVAL         = cfg['det_interval']
        DET_TH               = cfg['det_th']
        SPLIT_MODEL          = cfg['split_model']
        WORKER_THREADS       = cfg['worker_threads']
        LOG_DEVICE           = cfg['log_device']
        ALLOW_MEMORY_GROWTH  = cfg['allow_memory_growth']
        SPLIT_SHAPE          = cfg['split_shape']
        DEBUG_MODE           = cfg['debug_mode']
        LABEL_PATH           = cfg['label_path']
        NUM_CLASSES          = cfg['num_classes']
        SRC_FROM             = cfg['src_from']
        CAMERA = 0
        MOVIE  = 1
        IMAGE  = 2
        if SRC_FROM == 'camera':
            SRC_FROM = CAMERA
            VIDEO_INPUT = cfg['camera_input']
        elif SRC_FROM == 'movie':
            SRC_FROM = MOVIE
            VIDEO_INPUT = cfg['movie_input']
        elif SRC_FROM == 'image':
            SRC_FROM = IMAGE
            VIDEO_INPUT = cfg['image_input']
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        LOAD FROZEN_GRAPH
        """ """ """ """ """ """ """ """ """ """ """
        load_frozen_graph = LoadFrozenGraph(cfg)
        graph = load_frozen_graph.load_graph()
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        LOAD LABEL MAP
        """ """ """ """ """ """ """ """ """ """ """
        llm = LoadLabelMap()
        category_index = llm.load_label_map(cfg)
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        PREPARE TF CONFIG OPTION
        """ """ """ """ """ """ """ """ """ """ """
        # Session Config: allow separate GPU/CPU addressing and limit memory allocation
        config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=LOG_DEVICE)
        config.gpu_options.allow_growth = ALLOW_MEMORY_GROWTH
        config.gpu_options.force_gpu_compatible = FORCE_GPU_COMPATIBLE
        #config.gpu_options.per_process_gpu_memory_fraction = 0.01 # 80MB memory is enough to run on TX2
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        PREPARE GRAPH I/O TO VARIABLE
        """ """ """ """ """ """ """ """ """ """ """
        # Define Input and Output tensors
        image_tensor = graph.get_tensor_by_name('image_tensor:0')
        detection_boxes = graph.get_tensor_by_name('detection_boxes:0')
        detection_scores = graph.get_tensor_by_name('detection_scores:0')
        detection_classes = graph.get_tensor_by_name('detection_classes:0')
        num_detections = graph.get_tensor_by_name('num_detections:0')

        """ """ """ """ """ """ """ """ """ """ """
        START WORKER THREAD
        """ """ """ """ """ """ """ """ """ """ """
        workers = []
        worker_tag = 'worker'
        # create session worker threads
        for i in range(WORKER_THREADS):
            workers += [SessionWorker(worker_tag, graph, config)]
        worker_opts = [detection_boxes, detection_scores, detection_classes, num_detections]
        """ """

        """
        START VISUALIZE WORKER
        """
        if VISUALIZE and VIS_WORKER:
            q_out = Queue.Queue()
            vis_worker = MPVisualizeWorker(cfg, MPVariable.vis_in_con)
            """ """ """ """ """ """ """ """ """ """ """
            START SENDER THREAD
            """ """ """ """ """ """ """ """ """ """ """
            start_sender(MPVariable.det_out_con, q_out)
        proc_frame_counter = 0
        vis_proc_time = 0

        """ """ """ """ """ """ """ """ """ """ """
        WAIT UNTIL THE FIRST DUMMY IMAGE DONE
        """ """ """ """ """ """ """ """ """ """ """
        print('Loading...')
        sleep_interval = 0.1
        """
        PUT DUMMY DATA INTO WORKERS
        """
        worker_feeds = {image_tensor:  [np.zeros((300, 300, 3))]}
        worker_extras = {}
        for i in range(WORKER_THREADS):
            workers[i].put_sess_queue(worker_opts, worker_feeds, worker_extras)
        """
        WAIT UNTIL JIT-COMPILE DONE
        """
        ready_counter = 0
        is_jit_done = False
        while not is_jit_done:
            for i in range(WORKER_THREADS):
                q = workers[i].get_result_queue()
                if q is None:
                    time.sleep(sleep_interval)
                else:
                    ready_counter += 1
                    if ready_counter == WORKER_THREADS:
                        is_jit_done = True
                        break
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        START CAMERA
        """ """ """ """ """ """ """ """ """ """ """
        if SRC_FROM == CAMERA:
            from lib.webcam import WebcamVideoStream as VideoReader
        elif SRC_FROM == MOVIE:
            from lib.video import VideoReader
        elif SRC_FROM == IMAGE:
            from lib.image import ImageReader as VideoReader
        video_reader = VideoReader()

        if SRC_FROM == IMAGE:
            video_reader.start(VIDEO_INPUT, save_to_file=SAVE_TO_FILE)
        else: # CAMERA, MOVIE
            video_reader.start(VIDEO_INPUT, WIDTH, HEIGHT, save_to_file=SAVE_TO_FILE)
            frame_cols, frame_rows = video_reader.getSize()
            """ STATISTICS FONT """
            fontScale = frame_rows/1000.0
            if fontScale < 0.4:
                fontScale = 0.4
            fontThickness = 1 + int(fontScale)
        fontFace = cv2.FONT_HERSHEY_SIMPLEX
        if SRC_FROM == MOVIE:
            dir_path, filename = os.path.split(VIDEO_INPUT)
            filepath_prefix = filename
        elif SRC_FROM == CAMERA:
            filepath_prefix = 'frame'
        """ """

        """ """ """ """ """ """ """ """ """ """ """
        DETECTION LOOP
        """ """ """ """ """ """ """ """ """ """ """
        print('Starting Detection')
        sleep_interval = 0.005
        top_in_time = None
        frame_in_processing_counter = 0
        current_in_worker_id = -1
        worker_id_queue = Queue.Queue()
        retry_worker_id_queue = Queue.Queue()

        try:
            if not video_reader.running:
                raise IOError(("Input src error."))
            while MPVariable.running.value:
                if top_in_time is None:
                    top_in_time = time.time()
                """
                SPLIT/NON-SPLIT MODEL CAMERA TO WORKER
                """
                if video_reader.running:
                    for i in range(WORKER_THREADS):
                        worker_in_id = i + current_in_worker_id + 1
                        worker_in_id %= WORKER_THREADS
                        if workers[worker_in_id].is_sess_empty(): # must need for speed
                            cap_in_time = time.time()
                            if SRC_FROM == IMAGE:
                                frame, filepath = video_reader.read()
                                if frame is not None:
                                    frame_in_processing_counter += 1
                                    frame = cv2.resize(frame, (frame_cols, frame_rows))
                            else:
                                frame = video_reader.read()
                                if frame is not None:
                                    filepath = filepath_prefix+'_'+str(proc_frame_counter)+'.png'
                                    frame_in_processing_counter += 1
                            if frame is not None:
                                image_expanded = np.expand_dims(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), axis=0) # np.expand_dims is faster than []
                                #image_expanded = np.expand_dims(frame, axis=0) # BGR image for input. Of course, bad accuracy in RGB trained model, but speed up.
                                cap_out_time = time.time()
                                # put new queue
                                worker_feeds = {image_tensor: image_expanded}
                                worker_extras = {'image':frame, 'top_in_time':top_in_time, 'cap_in_time':cap_in_time, 'cap_out_time':cap_out_time, 'filepath': filepath} # always image draw.
                                workers[worker_in_id].put_sess_queue(worker_opts, worker_feeds, worker_extras)
                                current_in_worker_id = worker_in_id
                                worker_id_queue.put(worker_in_id)
                                time.sleep(sleep_interval*10/WORKER_THREADS)
                            break
                elif frame_in_processing_counter <= 0:
                    MPVariable.running.value = False
                    break

                q = None
                if not retry_worker_id_queue.empty():
                    print("retry!")
                    worker_out_id = retry_worker_id_queue.get(block=False)
                    worker_out_id %= WORKER_THREADS
                    retry_worker_id_queue.task_done()
                    q = workers[worker_out_id].get_result_queue()
                    if q is None:
                        retry_worker_id_queue.put(worker_out_id)
                elif not worker_id_queue.empty():
                    worker_out_id = worker_id_queue.get(block=False)
                    worker_out_id %= WORKER_THREADS
                    worker_id_queue.task_done()
                    q = workers[worker_out_id].get_result_queue()
                    if q is None:
                        retry_worker_id_queue.put(worker_out_id)
                if q is None:
                    """
                    SPLIT/NON-SPLIT MODEL
                    """
                    # detection is not complete yet. ok nothing to do.
                    time.sleep(sleep_interval)
                    continue

                frame_in_processing_counter -= 1
                boxes, scores, classes, num, extras = q['results'][0], q['results'][1], q['results'][2], q['results'][3], q['extras']
                boxes, scores, classes = np.squeeze(boxes), np.squeeze(scores), np.squeeze(classes)
                det_out_time = time.time()

                """
                ALWAYS BOX DRAW ON IMAGE
                """
                vis_in_time = time.time()
                image = extras['image']
                if SRC_FROM == IMAGE:
                    filepath = extras['filepath']
                    frame_rows, frame_cols = image.shape[:2]
                    """ STATISTICS FONT """
                    fontScale = frame_rows/1000.0
                    if fontScale < 0.4:
                        fontScale = 0.4
                    fontThickness = 1 + int(fontScale)
                else:
                    filepath = extras['filepath']
                image = visualization(category_index, image, boxes, scores, classes, DEBUG_MODE, VIS_TEXT, FPS_INTERVAL,
                                      fontFace=fontFace, fontScale=fontScale, fontThickness=fontThickness)

                """
                VISUALIZATION
                """
                if VISUALIZE:
                    if (MPVariable.vis_skip_rate.value == 0) or (proc_frame_counter % MPVariable.vis_skip_rate.value < 1):
                        if VIS_WORKER:
                            q_out.put({'image':image, 'vis_in_time':vis_in_time})
                        else:
                            """
                            SHOW
                            """
                            cv2.imshow("Object Detection", image)
                            # Press q to quit
                            if cv2.waitKey(1) & 0xFF == 113: #ord('q'):
                                break
                            MPVariable.vis_frame_counter.value += 1
                            vis_out_time = time.time()
                            """
                            PROCESSING TIME
                            """
                            vis_proc_time = vis_out_time - vis_in_time
                            MPVariable.vis_proc_time.value += vis_proc_time
                else:
                    """
                    NO VISUALIZE
                    """
                    for box, score, _class in zip(boxes, scores, classes):
                        if proc_frame_counter % DET_INTERVAL == 0 and score > DET_TH:
                            label = category_index[_class]['name']
                            print("label: {}\nscore: {}\nbox: {}".format(label, score, box))

                    vis_out_time = time.time()
                    """
                    PROCESSING TIME
                    """
                    vis_proc_time = vis_out_time - vis_in_time

                if SAVE_TO_FILE:
                    if SRC_FROM == IMAGE:
                        video_reader.save(image, filepath)
                    else:
                        video_reader.save(image)

                proc_frame_counter += 1
                if proc_frame_counter > 100000:
                    proc_frame_counter = 0
                """
                PROCESSING TIME
                """
                top_in_time = extras['top_in_time']
                cap_proc_time = extras['cap_out_time'] - extras['cap_in_time']
                worker_proc_time = extras['worker_out_time'] - extras['worker_in_time']
                lost_proc_time = det_out_time - top_in_time - cap_proc_time - worker_proc_time
                total_proc_time = det_out_time - top_in_time
                MPVariable.cap_proc_time.value += cap_proc_time
                MPVariable.worker_proc_time.value += worker_proc_time
                MPVariable.lost_proc_time.value += lost_proc_time
                MPVariable.total_proc_time.value += total_proc_time

                if DEBUG_MODE:
                    sys.stdout.write('snapshot FPS:{: ^5.1f} total:{: ^10.5f} cap:{: ^10.5f} worker:{: ^10.5f} lost:{: ^10.5f} | vis:{: ^10.5f}\n'.format(
                        MPVariable.fps.value, total_proc_time, cap_proc_time, worker_proc_time, lost_proc_time, vis_proc_time))
                """
                EXIT WITHOUT GUI
                """
                if not VISUALIZE and MAX_FRAMES > 0:
                    if proc_frame_counter >= MAX_FRAMES:
                        MPVariable.running.value = False
                        break

                """
                CHANGE SLEEP INTERVAL
                """
                if MPVariable.frame_counter.value == 0 and MPVariable.fps.value > 0:
                    sleep_interval = 0.1 / MPVariable.fps.value
                    MPVariable.sleep_interval.value = sleep_interval
                MPVariable.frame_counter.value += 1
                top_in_time = None
            """
            END while
            """
        except KeyboardInterrupt:
            pass
        except:
            import traceback
            traceback.print_exc()
        finally:
            """ """ """ """ """ """ """ """ """ """ """
            CLOSE
            """ """ """ """ """ """ """ """ """ """ """
            if VISUALIZE and VIS_WORKER:
                q_out.put(None)
            MPVariable.running.value = False
            for i in range(WORKER_THREADS):
                workers[i].stop()
            video_reader.stop()

            if VISUALIZE:
                cv2.destroyAllWindows()
            """ """

        return

lib/mtload_graph_nms_v2.py

import tensorflow as tf
from tensorflow.core.framework import graph_pb2
import copy

class LoadFrozenGraph():
    """
    LOAD FROZEN GRAPH
    ssd_mobilenet_v2
    """
    def __init__(self, cfg):
        self.cfg = cfg
        return

    def load_graph(self):
        print('Building Graph')
        if not self.cfg['split_model']:
            return self.load_frozen_graph_without_split()
        else:
            return self.load_frozen_graph_with_split()

    def print_graph(self, graph):
        """
        PRINT GRAPH OPERATIONS
        """
        print("{:-^32}".format(" operations in graph "))
        for op in graph.get_operations():
            print(op.name,op.outputs)
        return

    def print_graph_def(self, graph_def):
        """
        PRINT GRAPHDEF NODE NAMES
        """
        print("{:-^32}".format(" nodes in graph_def "))
        for node in graph_def.node:
            print(node)
        return

    def print_graph_operation_by_name(self, graph, name):
        """
        PRINT GRAPH OPERATION DETAILS
        """
        op = graph.get_operation_by_name(name=name)
        print("{:-^32}".format(" operations in graph "))
        print("{:-^32}\n{}".format(" op ", op))
        print("{:-^32}\n{}".format(" op.name ", op.name))
        print("{:-^32}\n{}".format(" op.outputs ", op.outputs))
        print("{:-^32}\n{}".format(" op.inputs ", op.inputs))
        print("{:-^32}\n{}".format(" op.device ", op.device))
        print("{:-^32}\n{}".format(" op.graph ", op.graph))
        print("{:-^32}\n{}".format(" op.values ", op.values()))
        print("{:-^32}\n{}".format(" op.op_def ", op.op_def))
        print("{:-^32}\n{}".format(" op.colocation_groups ", op.colocation_groups))
        print("{:-^32}\n{}".format(" op.get_attr ", op.get_attr("T")))
        i = 0
        for output in op.outputs:
            op_tensor = output
            tensor_shape = op_tensor.get_shape().as_list()
            print("{:-^32}\n{}".format(" outputs["+str(i)+"] shape ", tensor_shape))
            i += 1
        return

    # helper function for split model
    def node_name(self, n):
        if n.startswith("^"):
            return n[1:]
        else:
            return n.split(":")[0]

    def load_frozen_graph_without_split(self):
        """
        Load frozen_graph.
        """
        model_path = self.cfg['model_path']

        tf.reset_default_graph()

        graph_def = tf.GraphDef()
        with tf.gfile.GFile(model_path, 'rb') as fid:
            serialized_graph = fid.read()
            graph_def.ParseFromString(serialized_graph)
            # force CPU device placement for NMS ops
            for node in graph_def.node:
                if 'BatchMultiClassNonMaxSuppression' in node.name:
                    node.device = '/device:CPU:0'
                else:
                    node.device = '/device:GPU:0'
            tf.import_graph_def(graph_def, name='')

        #self.print_graph_operation_by_name(detection_graph, "Postprocessor/Slice")
        #self.print_graph_operation_by_name(detection_graph, "Postprocessor/ExpandDims_1")
        #self.print_graph_operation_by_name(detection_graph, "Postprocessor/stack_1")
        """
        return
        """
        return tf.get_default_graph()

    def load_frozen_graph_with_split(self):
        """
        Load frozen_graph and split it into half of GPU and CPU.
        """
        model_path = self.cfg['model_path']
        split_shape = self.cfg['split_shape']
        num_classes = self.cfg['num_classes']

        """ SPLIT TARGET NAME """
        SPLIT_TARGET_NAME = ['Postprocessor/Slice', # Tensor
                             'Postprocessor/ExpandDims_1', # Tensor
                             'Postprocessor/stack_1', # Float array
        ]

        tf.reset_default_graph()

        graph_def = tf.GraphDef()
        with tf.gfile.GFile(model_path, 'rb') as fid:
            serialized_graph = fid.read()
            graph_def.ParseFromString(serialized_graph)

            """
            Check the connection of all nodes.
            edges[] variable has input information for all nodes.
            """
            edges = {}
            name_to_node_map = {}
            node_seq = {}
            seq = 0
            for node in graph_def.node:
                node.device = '/device:GPU:0'
                n = self.node_name(node.name)
                name_to_node_map[n] = node
                edges[n] = [self.node_name(x) for x in node.input]
                node_seq[n] = seq
                seq += 1

            """
            Alert if split target is not in the graph.
            """
            dest_nodes = SPLIT_TARGET_NAME
            for d in dest_nodes:
                assert d in name_to_node_map, "%s is not in graph" % d

            """
            Making GPU part.
            Follow all input nodes from the split point and add it into first_list.
            """
            nodes_to_first = set()
            next_to_visit = dest_nodes

            while next_to_visit:
                n = next_to_visit[0]
                del next_to_visit[0]
                if n in nodes_to_first:
                    continue
                nodes_to_first.add(n)
                next_to_visit += edges[n]

            nodes_to_first_list = sorted(list(nodes_to_first), key=lambda n: node_seq[n])

            graph = graph_pb2.GraphDef()
            for n in nodes_to_first_list:
                graph.node.extend([copy.deepcopy(name_to_node_map[n])])

            """
            Making CPU part.
            It removes GPU part from loaded graph and add new inputs.
            """
            nodes_to_second = set()
            for n in node_seq:
                if n in nodes_to_first_list: continue
                nodes_to_second.add(n)
            nodes_to_second_list = sorted(list(nodes_to_second), key=lambda n: node_seq[n])

            for n in nodes_to_second_list:
                name_to_node_map[n].device = '/device:CPU:0'
                graph.node.extend([copy.deepcopy(name_to_node_map[n])])

            """
            Import graph_def into default graph.
            """
            tf.import_graph_def(graph, name='')

        """
        return    
        """
        return tf.get_default_graph()

run_stream.py

        elif model_type == 'mtnms_v2':
            from lib.mtdetection_nms_v2 import NMSV2
            detection = NMSV2()
            detection.start(cfg)

config.yml

model_type: 'mtnms_v2'
model_path: 'models/ssd_mobilenet_v1_coco_2018_01_28/frozen_inference_graph.pb'

And run python run_stream.py.

The Mask R-CNN code is as follows.
before: https://github.com/naisy/realtime_object_detection/blob/master/lib/detection_mask_v1.py
after: https://github.com/naisy/realtime_object_detection/blob/master/lib/mtdetection_mask_v1.py

I think there will be some lag because the frame interval is not uniform, but this is a simple way to implement parallel processing.

naisy commented 5 years ago

Another possibility is to incorporate tracking. When I did facial expression classification after face detection, I created another version that used tracking to identify the object. In my case this turned out slower, but the cause seemed to be that the OpenCV multi-tracker API is slow. However, some people alternate detection and tracking to speed things up, so it may be faster depending on the tracking implementation.
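For illustration, a minimal sketch of alternating detection and tracking: run the slow detector only every N frames and update a cheap OpenCV tracker in between. detect_faces is a hypothetical stand-in for MTCNN/SSD detection returning (x, y, w, h) boxes, and the tracker constructor name depends on your OpenCV build (e.g. cv2.TrackerKCF_create in opencv-contrib 3.x/4.x, cv2.legacy.TrackerKCF_create in newer 4.5+ releases).

import cv2

DETECT_EVERY = 10   # re-detect every N frames, track in between

def detect_and_track(video_reader, detect_faces):
    trackers = []
    frame_idx = 0
    while True:
        frame = video_reader.read()
        if frame is None:
            break
        boxes = []
        if frame_idx % DETECT_EVERY == 0:
            # slow path: full detection, re-initialize one tracker per face
            trackers = []
            for (x, y, w, h) in detect_faces(frame):
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, (x, y, w, h))
                trackers.append(tracker)
                boxes.append((x, y, w, h))
        else:
            # fast path: cheap tracker update on intermediate frames
            for tracker in trackers:
                ok, box = tracker.update(frame)
                if ok:
                    boxes.append(box)
        # ... recognition / visualization on `boxes` goes here ...
        frame_idx += 1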

varun-tangoit commented 5 years ago

Sorry, I have been busy catching up on work. Yes, we are working on optimizing our code for the Jetson. Is it possible to achieve a higher FPS on the Jetson for YOLOv2 darkflow? I am referring to this GitHub repo: https://github.com/thtrieu/darkflow.

naisy commented 5 years ago

In my understanding, tiny-YOLO is fast. If you drop MTCNN/FaceNet and use tiny-YOLO only, it will be really fast.

For example, if you need to exceed 40 FPS, you would use ssd_mobilenet_v1 (split-model or TensorRT/C++) and discard MTCNN/FaceNet. (Of course, that needs training with your original datasets.) If 10 FPS is OK, you may see an improvement just by reviewing the MTCNN/FaceNet execution code.

I am aware that the current problem lies in the slow execution speed of FaceNet. If you cannot throw FaceNet away, please check FaceNet standalone (its timeline, graph, and CPU/GPU usage) and consider whether FaceNet has the potential to reach the FPS you expect.

varun-tangoit commented 5 years ago

@naisy, yes, tiny-YOLO would improve performance, but its accuracy won't be as good, right?

naisy commented 5 years ago

Yes, the accuracy of tiny-YOLO is not as good. However, other than ssd_mobilenet_v1, the only fast object detection option would be tiny-YOLO.