Over 20 FPS on TX2 with thread

naisy commented 6 years ago

Hi, GustavZ

Nice work! Splitting the model into a detection part and an NMS part seems like a very good idea. I changed the CPU part to run in a thread, and it now runs at over 20 FPS on the Jetson TX2.

Thank you.

gustavz commented 6 years ago

would you like to share the code modifications you did?

naisy commented 6 years ago

of course. here https://github.com/naisy/realtime_object_detection

gustavz commented 6 years ago

nice! Where did you find session_worker.py? I have not worked with TensorFlow's multithreading yet. Do you know any manual or tutorial on how to use it?

naisy commented 6 years ago

I wrote it myself. Here:
https://github.com/naisy/realtime_object_detection/blob/master/lib/session_worker.py
https://github.com/naisy/realtime_object_detection/blob/master/lib/__init__.py

make worker.

gpu_worker = SessionWorker("GPU",detection_graph,config)

set queue.

gpu_opts = [score_out, expand_out]
gpu_feeds = {image_tensor: image_expanded}
gpu_extras = image # for visualization frame
gpu_worker.put_sess_queue(gpu_opts,gpu_feeds,gpu_extras)

get result.

g = gpu_worker.get_result_queue()
score,expand,image = g["results"][0],g["results"][1],g["extras"]

Sorry, here is the simple usage:


# usage:
# before:
#     results = sess.run([opt1,opt2],feed_dict={input_x:x,input_y:y})
# after:
#     opts = [opt1,opt2]
#     feeds = {input_x:x,input_y:y}
#     worker = SessionWorker("TAG",graph,config)
#     worker.put_sess_queue(opts,feeds)
#     q = worker.get_result_queue()
#     if q is None:
#         continue
#     results = q['results']
#     extras = q['extras']
#
# extras: None, or the frame image data used for drawing. The GPU detection thread does not wait for the result. Therefore, keep the frame image data in extras if you want to draw the detection boxes on the image.
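
For readers who want to see the queue pattern without opening the repository, here is a minimal sketch of a worker with the same put/get shape. It is not the actual session_worker.py: the class name QueueWorker, the methods put_job, get_result and stop, and the use of a plain function in place of sess.run() are all assumptions made for illustration.

```python
import queue
import threading


class QueueWorker:
    """Hypothetical stand-in for SessionWorker: runs fn on a background thread."""

    def __init__(self, fn):
        self._fn = fn                       # stand-in for sess.run(opts, feed_dict=feeds)
        self._jobs = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            job = self._jobs.get()          # blocks until a job arrives
            if job is None:                 # sentinel: stop the thread
                break
            feeds, extras = job
            # extras (e.g. the frame) travels with the job so it can be drawn later
            self._results.put({"results": self._fn(feeds), "extras": extras})

    def put_job(self, feeds, extras=None):
        self._jobs.put((feeds, extras))

    def get_result(self):
        """Return the next finished result, or None if nothing is ready yet."""
        try:
            return self._results.get_nowait()
        except queue.Empty:
            return None

    def stop(self):
        """Finish the queued jobs, then stop and join the thread."""
        self._jobs.put(None)
        self._thread.join()


worker = QueueWorker(lambda feeds: [feeds["x"] * 2])
worker.put_job({"x": 21}, extras="frame-0")
worker.stop()
q = worker.get_result()                     # dict with "results" and "extras"
```

Because get_result() never blocks, the caller has to handle the None case, which is what the `if q is None: continue` check in the usage comment does.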

gustavz commented 6 years ago

Ah nice, so this is not TensorFlow code but your own? May I use it? I will credit and link you, of course.

But one more question: are you sure the fps calculation is not affected by this?

naisy commented 6 years ago

not tensorflow code. I wrote it myself. Anyone can use it freely.

About FPS, this time the flow is as follows:

IMAGE(main-thread) -> GPU(thread-1) -> GPU RESULT(main-thread) -> CPU(thread-2) -> CPU RESULT(main-thread) -> VISUALIZE(main-thread) -> FPS UPDATE(main-thread)

If CPU RESULT has not been set yet, FPS will not be updated.

                    c = cpu_worker.get_result_queue()
                    if c is None:
                        cpu_counter += 1
                        '''
                        cpu thread has no output queue. ok, nothing to do. continue
                        '''
                        time.sleep(0.005)
                        continue
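
The flow above can be sketched without TensorFlow as follows. This is only an illustration under assumptions: start_worker, poll and run are hypothetical names, and detect and postprocess are plain functions standing in for the GPU and CPU sess.run() calls. They must not return None, since None is used here to mean "no result yet".

```python
import queue
import threading
import time


def start_worker(fn):
    """Start a daemon thread applying fn to queued items; return (inbox, outbox)."""
    inbox, outbox = queue.Queue(), queue.Queue()

    def loop():
        while True:
            outbox.put(fn(inbox.get()))

    threading.Thread(target=loop, daemon=True).start()
    return inbox, outbox


def poll(q):
    """Non-blocking get: return an item, or None when the queue is empty."""
    try:
        return q.get_nowait()
    except queue.Empty:
        return None


def run(frames, detect, postprocess):
    """Push a list of frames through both workers; count a frame only when finished."""
    gpu_in, gpu_out = start_worker(detect)        # thread-1
    cpu_in, cpu_out = start_worker(postprocess)   # thread-2
    for frame in frames:
        gpu_in.put(frame)                         # main thread feeds thread-1
    finished, idle_polls = [], 0
    while len(finished) < len(frames):
        g = poll(gpu_out)
        if g is not None:
            cpu_in.put(g)                         # main thread relays the GPU result
        c = poll(cpu_out)
        if c is None:
            idle_polls += 1                       # nothing finished: no FPS update
            time.sleep(0.005)
            continue
        finished.append(c)                        # visualize and update FPS here
    return finished, idle_polls
```

In this sketch a frame only reaches the point where FPS would be updated after the second worker has produced its result, so idle polls do not inflate the count.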

gustavz commented 6 years ago

thanks for the explanation :) I read a lot that using multi threading in python is not such a good idea because of the global interpreter lock. Is this a problem here? Or is it not because the Threads are IO bound?

Are you familiar with using multiprocessing? Could this be even faster here?

naisy commented 6 years ago

ok, let's check about GIL slow down.

single thread code:

import time
def count(n):
    while n > 0:
        n -= 1

if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
    start_time,start_clock=time.time(),time.process_time()
    count(100000000)
    count(100000000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Single-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

multi thread code:

import time
from threading import Thread

def count(n):
    while n > 0:
        n -= 1

if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
    start_time,start_clock=time.time(),time.process_time()
    t1 = Thread(target=count, args=(100000000,))
    t1.start()
    t2 = Thread(target=count, args=(100000000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Multi-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

result on JetsonTX2:

Single-Thread time:23.624587, clock:23.624219
Multi-Thread time:122.06561, clock:123.2893

Normally, multithreading is much slower in Python for CPU-bound work like this.

How about TF? let's check.

tf single thread code:

import tensorflow as tf
import time

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
    start_time,start_clock = time.time(),time.process_time()
    work(sess,gpu,100000)
    work(sess,cpu,100000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Single-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

tf multi thread code:

import time
import tensorflow as tf
from threading import Thread

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # time.clock() was removed in Python 3.8; time.process_time() also reports CPU time
    start_time,start_clock = time.time(),time.process_time()
    t1 = Thread(target=work, args=(sess,gpu,100000,))
    t1.start()
    t2 = Thread(target=work, args=(sess,cpu,100000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Multi-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

in TF, multi thread is faster than single thread.

TF Single-Thread(GPU/CPU) time:79.972953, clock:173.22859
TF Multi-Thread(GPU/CPU) time:49.060713, clock:151.49888

I also checked running both threads on the same device.

TF Multi-Thread(CPU/CPU) time:49.992958, clock:156.02945
TF Multi-Thread(GPU/GPU) time:42.76184, clock:140.32868

This is also faster.

In TF, it seems that there is no need to worry about the bottleneck of multithreading in python.
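
One plausible reading of these numbers, stated as an assumption rather than something measured in this thread: sess.run() spends most of its time in native code that does not hold the GIL, so two threads can overlap. The sketch below uses time.sleep(), which does release the GIL, as a stand-in for such a call. The function names are made up for illustration.

```python
import threading
import time


def blocking_call(seconds):
    # Stand-in for a native call that releases the GIL while it runs.
    time.sleep(seconds)


def timed(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def sequential():
    blocking_call(0.2)
    blocking_call(0.2)


def threaded():
    threads = [threading.Thread(target=blocking_call, args=(0.2,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


print("sequential: {:.3f}s".format(timed(sequential)))
print("threaded:   {:.3f}s".format(timed(threaded)))
```

The threaded run should take roughly half as long as the sequential one, unlike the pure-Python count() loop, which holds the GIL the whole time.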

gustavz commented 6 years ago

Really interesting test and explanation, thank you!

Other Questions:

Have you already used TF's graph transform tool to decrease the network's size and speed it up on mobile devices like the Jetson (https://www.tensorflow.org/mobile/prepare_models)?
I saw you are using TensorRT. Did you manage to optimize ssd_mobilenet with it?

naisy commented 6 years ago

Multiprocessing cannot share objects between processes directly, so the data has to be passed some other way, such as file I/O. However, sess.run() takes much longer than any GIL overhead, so I think there is no big benefit in using multiprocessing.
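
To illustrate the point about sharing: objects sent through multiprocessing queues and pipes are pickled, so the receiver gets a copy rather than the same object. The sketch below mimics only that pickle round trip in a single process. send_through_pipe and the frame dictionary are hypothetical names used for illustration.

```python
import pickle


def send_through_pipe(obj):
    """Mimic what a multiprocessing queue does to an object: pickle, then unpickle."""
    return pickle.loads(pickle.dumps(obj))


# Hypothetical 640x480 RGB frame, zero-filled.
frame = {"id": 0, "pixels": bytearray(640 * 480 * 3)}
received = send_through_pipe(frame)

received["pixels"][0] = 255      # changing the copy ...
print(frame["pixels"][0])        # ... leaves the sender's frame untouched
print(len(pickle.dumps(frame)))  # bytes serialized per frame
```

Every frame would pay this serialization cost in each direction, which threads sharing one address space avoid.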

Sorry, I am not familiar with TensorRT and model tuning.

gustavz commented 6 years ago

@naisy maybe you have heard of or even used Mask R-CNN. My plan is to do a Mask SSD implementation, so that the SSD outputs not only a bounding box per class but also a segmentation mask. Would you be interested in joining?

naisy commented 6 years ago

I think that it is very interesting. I have seen TensorFlow's Mask R-CNN, but I have not used it yet. Since SSD seems to be faster than R-CNN, I am excited about Mask SSD. I would like to use it if it works with Jetson TX2.

naisy commented 6 years ago

I found my mistake in download_model(). please fix to 'frozen_inference_graph.pb'.

gustavz commented 6 years ago

You need to make sure that you set the right paths in config.yaml:

model_name: 'ssd_mobilenet_v11_coco'
model_path: 'models/ssd_mobilenet_v11_coco/frozen_inference_graph.pb'
label_path: 'object_detection/data/mscoco_label_map.pbtxt'
num_classes: 90

If your frozen graph is named differently, you can easily modify those lines.

naisy commented 6 years ago

I see. Thank you!

Do I need to change the download model name?

model_name: 'ssd_mobilenet_v1_coco_2017_11_17' # download model name

The download URL gives "HTTP Error 403: Forbidden" from my network: http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v11_coco.tar.gz

Kowasaki commented 6 years ago

@naisy I think you are right--the link in the model zoo shows the link as http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2017_11_17.tar.gz
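
Judging only from the two URLs quoted in this thread, the download URL appears to be the model zoo base path plus model_name plus ".tar.gz". A small sketch of that pattern follows. model_zoo_url is a hypothetical helper, not the repository's download_model().

```python
# Base path taken from the URLs quoted in this thread.
MODEL_ZOO_BASE = "http://download.tensorflow.org/models/object_detection/"


def model_zoo_url(model_name):
    """Build the archive URL for a model name (hypothetical helper)."""
    return MODEL_ZOO_BASE + model_name + ".tar.gz"


print(model_zoo_url("ssd_mobilenet_v1_coco_2017_11_17"))
```

With this pattern, a model_name that is not published under that base path produces a URL the server rejects, which matches the 403 reported above.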

I have a quick question about the thread implementation: does this work for a self-trained version of ssd-mobilenet instead of the frozen checkpoint based on COCO? From what I can tell they've altered the implementation of that model a bit in the repo and they've also added a new, even lighter embedded ssd mobilenet.

Also @GustavZ I am quite interested in implementing the mask. Were you able to get Mask-RCNN running on the TX2? I am also considering implementing a mask based on SSD, but from my limited understanding doing instance segmentation is pretty difficult without region proposals like in RCNN. I would be interested in working with you two on figuring this out.

naisy commented 6 years ago

@Kowasaki Thank you for your information.

The implementation of multi-threading is not dependent on the model implementation. It is effective for separate processing like Detection part and NMS part.

Mask-RCNN works with CPU on Jetson TX2. Add the next line to the code:

os.environ['CUDA_VISIBLE_DEVICES'] = ''
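
A short sketch of where that line usually goes. The assumption here is that the variable has to be set before TensorFlow initializes CUDA, so it is placed ahead of the TensorFlow import. hide_gpus is a hypothetical helper name.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries loaded after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()
# import tensorflow as tf   # import only after the variable is set
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```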

gustavz commented 6 years ago

@naisy, @Kowasaki "ssd_mobilenet_v1_coco_2017_11_17" is the original model from the model zoo. ssd_mobilenet_v11_coco is my own model, which I modified and re-exported based on the original. It is not available in the model zoo, only in the model folder of my repo, so of course trying to download it will fail. The automated model download only works for the model zoo. I think I wrote this in the readme.

And yes, you are right, the multithreading also works for self-trained models, but they must be based on ssd_mobilenet. Splitting and threading R-CNN will not work with this code.

About mask-ssd, i talked to a guy who did a first try of combining psp-net with ssd-net to be able to predict segmentation masks parallel to bounding boxes.

ghost commented 6 years ago

@GustavZ Thank you for the implementation. Is it possible to do the same in C++? I am working on real-time object detection with TensorFlow C++.

gustavz commented 6 years ago

@SANTHAKUMAR91 please open another issue for c++ related questions as this is another topic. But to give a brief answer: I have no idea as i have never worked with tensorflow in connection with c++.

Kowasaki commented 6 years ago

@naisy @GustavZ Thanks for the heads-up! I'll create another issue for my mask-related question.

ghost commented 6 years ago

Sure thanks.

gustavz commented 6 years ago

Closing this issue now

gustavz / realtime_object_detection

Over 20 FPS on TX2 with thread #4