tensorflow / models

Models and examples built with TensorFlow

Very Slow inference speed of object detection models #4355

Open shoma88 opened 6 years ago

shoma88 commented 6 years ago

System information

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

The processing time is very high. Using the CPU I get 1.5 seconds per image; using the GPU I get an even higher time (1.7 seconds). I get bad performance with the standard model (the dogs and people on the beach demo) and also with a retrained model. I have the same problem with the CIFAR-10 example: performance during training seems fine (7000 samples per second), but the processing time in the eval.py code is huge (9 seconds). Do you have any suggestions? I can find a lot of benchmarks for training time but not many for evaluation time, so I don't know how much time processing a single image should take.

thanks,

Adriano.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

Davidnet commented 6 years ago

See #3270 for discussion.

shoma88 commented 6 years ago

I have already checked that and searched other forums. My problem is a bit different: he gets a small loss of performance (0.06 s instead of 0.03 s), but I get a huge loss of performance (1.5 s instead of 0.03 s). I'm pretty new to TensorFlow, so I don't know if I need to configure something to get good performance with GPUs. My version of TensorFlow supports GPUs, and I can see in the cmd prompt that TensorFlow is using the GPU. With GPU Shark I can also see that it uses almost all of the free memory of the GPU.

dennywangtenk commented 6 years ago

@shoma88

  1. How about training? Did you see the same slowness there?
  2. How about trying this and seeing what you get? sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) (a sketch follows below)
  3. Try another, lighter model and see if anything changes.
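
For reference, a minimal sketch of point 2, assuming TF 1.x; the tiny graph is just a placeholder to trigger some placement output:

import numpy as np
import tensorflow as tf

# A tiny dummy graph, only there so that some ops get placed on a device.
a = tf.constant(np.ones((2, 2), dtype=np.float32))
b = tf.matmul(a, a)

# log_device_placement=True prints every op's device (CPU:0 / GPU:0)
# to the console when the graph is placed.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(b))
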
pwuertz commented 6 years ago

I'm experiencing the same thing. Running the object detection tutorial out of the box gives me 2-3 seconds per image instead of the suggested 20-30 ms, so it's a whopping 100x slower. My TensorFlow was built with GPU support, running on a GTX 980.

When I add log_device_placement=True, I can see additional output in the Jupyter console:

2018-06-16 08:28:04.870455: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack/range/delta: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/start: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-06-16 08:28:04.870463: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/start: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/delta: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-06-16 08:28:04.870472: I tensorflow/core/common_runtime/placer.cc:886] SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_1/range/delta: (Const)/job:localhost/replica:0/task:0/device:GPU:0
SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayStack_2/range/start: (Const): /job:localhost/replica:0/task:0/device:GPU:0

I'm confused about #3270. Is this really the same issue? There, only the first iteration was on the order of seconds. Could it be that the demo code also changed for the worse, now slowing down every iteration instead of only the first one?

karmel commented 6 years ago

This sounds like it could be related to general performance tuning, rather than object detection in particular, and you might have better luck getting help on Stack Overflow. That said, this looks related to #4266 -- @derekjchow , any ideas there or here?

dennywangtenk commented 6 years ago

The tutorial was written in a way that creates one new TF session for every single image inference. Not good.

I wrote a new function that can run inference on multiple images without creating a new TF session. Find it here. On a GTX 1060 I can reach 6~12 FPS, depending on which model I'm using.

But for general tuning tips, I could not find good articles for object detection, so I requested one (issue #4495).
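
The pattern boils down to something like the sketch below: import the frozen graph once, open a single tf.Session, and call sess.run repeatedly. The path, the tensor names, and the load_images() helper are placeholders, not the exact function linked above.

import numpy as np
import tensorflow as tf

# Import the frozen graph once, up front.
graph_def = tf.GraphDef()
with tf.gfile.GFile("PATH_TO_PB_FILE/model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

image_tensor = graph.get_tensor_by_name("image_tensor:0")
boxes = graph.get_tensor_by_name("detection_boxes:0")

# One session, reused for every image.
with tf.Session(graph=graph) as sess:
    for image_np in load_images():  # hypothetical generator of HxWx3 uint8 arrays
        out = sess.run(boxes, feed_dict={image_tensor: np.expand_dims(image_np, 0)})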

rusuvalentin commented 6 years ago

I have the same issue. Training a batch of 8 with a Faster R-CNN ResNet-50 takes 0.7 s/step, but performing inference and measuring the time for session.run(...) yields a consistent 2.2 s per image. It does load the graph onto the GPU! This was not an issue before! What is going on?

daviduarte commented 6 years ago

@dennywangtenk solution works for me.

I was creating a new session every time, like this:

import time

import numpy as np
import tensorflow as tf
from PIL import Image

graph_global = None
TENSOR = None

NUM_ITERACOES = 10
BATCH = 1
SOMA = 0
def loadModel():
    global graph_global

    # Load the protobuf file from the disk and parse it to retrieve the 
    # unserialized graph_def
    with tf.gfile.GFile("PATH_TO_PB_FILE/0123456789.pb", "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Then, we import the graph_def into a new Graph and returns it 
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="prefix")

    # Make the graph global
    graph_global = graph

# Do the inference
def inferencia():
    global BATCH
    global NUM_ITERACOES
    global SOMA

    for i in range(10):

        # Load a single image
        image = Image.open("PATH_TO_IMGS/" + str(i) + ".png")

        # Reshape and transform the image in np array
        (im_width, im_height) = image.size
        image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)

        # Run the inference and calculate the time
        x = graph_global.get_tensor_by_name('prefix/image_tensor:0')
        y = graph_global.get_tensor_by_name('prefix/detection_boxes:0')

        with tf.Session(graph=graph_global) as sess:        
            start = time.time()
            res = sess.run(y, feed_dict={x: np.expand_dims(image, 0)})
            end = time.time()

            print(end - start)

if __name__ == '__main__':

    loadModel()

    print("Inference time: ")
    inferencia()

And I got high processing time for each image:

3.3824450969696045
3.4309332370758057
3.384895086288452
3.354207754135132
3.3493900299072266
3.5104100704193115
3.4564030170440674
3.632082939147949
3.402231216430664
3.301783561706543

So I moved the with tf.Session(graph=graph_global) as sess: line out of the loop, like this:

import time

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from PIL import Image

graph_global = None
TENSOR = None

NUM_ITERACOES = 10
BATCH = 1
SOMA = 0

def loadModel():
    global graph_global

    # Load the protobuf file from the disk and parse it to retrieve the 
    # unserialized graph_def
    with tf.gfile.GFile("PATH_TO_PB_FILE/0123456789.pb", "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Then, we import the graph_def into a new Graph and returns it 
    with tf.Graph().as_default() as graph:
        # The name var will prefix every op/nodes in your graph
        # Since we load everything in a new graph, this is not needed
        tf.import_graph_def(graph_def, name="prefix")

    # Make the graph global
    graph_global = graph

# Do the inference
def inferencia(sess):
    global BATCH
    global NUM_ITERACOES
    global SOMA

    for i in range(10):
        #inserirImagens(NUM_IMAGES)

        # Load a single image
        image = Image.open("PATH_TO_IMGS/" + str(i) + ".png")

        # Reshape and transform the image in np array
        (im_width, im_height) = image.size
        image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)

        # Run the inference and calculate the time
        start = time.time()
        res = sess.run(y, feed_dict={x: np.expand_dims(image, 0)})
        end = time.time()

        total_time = end - start
        SOMA = SOMA + total_time
        print(end - start)

        plt.imshow(image)
        #plt.show()

if __name__ == '__main__':

    loadModel()

    x = graph_global.get_tensor_by_name('prefix/image_tensor:0')
    y = graph_global.get_tensor_by_name('prefix/detection_boxes:0')

    print("Time required: ")
    with tf.Session(graph=graph_global) as sess:
        print("Inference time: ")
        inferencia(sess)

And finally I got low time values:

3.255561351776123
0.08460521697998047
0.08266162872314453
0.08443999290466309
0.10549187660217285
0.08379626274108887
0.086212158203125
0.08907151222229004
0.10163354873657227
0.08518075942993164

197sh0ta commented 6 years ago

@daviduarte Do you know why only the first run takes so much longer?

ronykalfarisi commented 5 years ago

It's the time needed for the system to set up resources, in this case initializing CUDA and allocating GPU memory.

svencowart commented 5 years ago

Is there a TF trick to initialize the system resources before running the first inference, something like a model initialization step, so that the first image inference performs like the subsequent inferences?

usamahjundia commented 5 years ago

Is there a way to keep the session, with the graph already loaded, stored as an instance variable of a class? I implemented a class to wrap the whole inference using this approach:

import tensorflow as tf

class ObjectDetector(object):
    def __init__(self, model_name):
        self.model_name = model_name
        self.graph = tf.Graph()
        self.num_class = 1
        self.initialize_graph()    # defined elsewhere (elided)
        self.initialize_labels()   # defined elsewhere (elided)
        self.session = None

    def __del__(self):
        if self.session is not None:
            self.session.close()

    def run_inference_for_single_image(self, image):
        if self.session is None:
            with self.graph.as_default():
                self.session = tf.Session()
        # ......
        # use the session to predict
        # output_dict = self.session.run(.......

But after the session is created this way, the graph is not loaded. I had to move the rest of the body of run_inference_for_single_image under the with self.graph.as_default() block to make it work, and as a consequence I have to reload the graph every time, which hurts processing time.

Is there a way around this, or am I misunderstanding how sessions and graphs work?

TheAkiraxD commented 5 years ago

@daviduarte A simple solution that I'd never thought of: removing the session from the loop. Genius. Now I can use real-time detection. Thank you :)

tudordumitriu commented 5 years ago

Hi guys, I am also new to running TF pretrained models and even to ML, but I am experiencing something very weird. Running inference for one image takes as much as 30 seconds, so it must be my environment, since everyone else is talking about milliseconds. Even though I have a system with a GeForce GTX 1060 and the environment (Windows 10, cuDNN 7.5.0) seems to be correctly set up (name: "/device:GPU:0" device_type: "GPU" memory_limit: 2213819187), it is almost unusable.

Towards the end of the running session I see these log entries: "Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available." and "Adding visible gpu devices: 0". But the problem is that most of the time is spent before it even gets to these. I have attached the logs and I hope they help: TFLogs.txt
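
Not something from this thread, but a common knob to try when the BFC allocator warns that it ran out of memory is to let TensorFlow grow GPU memory on demand instead of pre-allocating nearly all of it. A minimal sketch, assuming TF 1.x; detection_graph is a placeholder for a graph loaded elsewhere:

import tensorflow as tf

# Let the allocator grow GPU memory as needed instead of grabbing
# (almost) all free memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(graph=detection_graph, config=config) as sess:
    pass  # run inference here, reusing this one session for every image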

Krissy93 commented 5 years ago

> Is there a way to keep the session, with the graph already loaded, stored as an instance variable of a class? I implemented a class to wrap the whole inference using this approach: [...] Is there a way around this, or am I misunderstanding how sessions and graphs work?

I managed to do what you asked, thanks to the useful insight of @daviduarte. You have to load the graph as a class variable upon initialization, like so:

def __init__(self):
        self.label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
        self.categories = label_map_util.convert_label_map_to_categories(self.label_map, max_num_classes=NUM_CLASSES,use_display_name=True)
        self.category_index = label_map_util.create_category_index(self.categories)
        # Load a (frozen) Tensorflow model into memory.
        print(color.BOLD + color.RED + 'INIT GRAPH' + color.END)
        with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(fid.read())
        with tf.Graph().as_default() as graph:
            tf.import_graph_def(graph_def, name='prefix')
        self.detection_graph = graph

I also defined a method to perform the detection, something like this:

def detect(self, sess, image):
        image_np_expanded = np.expand_dims(image, axis=0)
        # Extract image tensor
        image_tensor = self.detection_graph.get_tensor_by_name('prefix/image_tensor:0')
        # Extract detection boxes
        boxes = self.detection_graph.get_tensor_by_name('prefix/detection_boxes:0')
        # Extract detection scores
        scores = self.detection_graph.get_tensor_by_name('prefix/detection_scores:0')
        # Extract detection classes
        classes = self.detection_graph.get_tensor_by_name('prefix/detection_classes:0')
        # Extract number of detections
        num_detections = self.detection_graph.get_tensor_by_name('prefix/num_detections:0')
        # Actual detection.
        start = time.time()
        (boxes, scores, classes, num_detections) = sess.run(
            [boxes, scores, classes, num_detections],
            feed_dict={image_tensor: image_np_expanded})
        end = time.time()

        # Visualization of the results of a detection.
        vis_util.visualize_boxes_and_labels_on_image_array(
            image,
            np.squeeze(boxes),
            np.squeeze(classes).astype(np.int32),
            np.squeeze(scores),
            self.category_index,
            use_normalized_coordinates=True,
            line_thickness=8)

        print(str(end-start))
        cv2.imshow('object detection', image)

I actually get the frames from a Kinect camera, which is triggered inside the detect() method, but I think it's the same if you pass the image in from outside the method (whether it's a frame from a camera or a saved image doesn't matter much at this point).

The real trick is performed in the main():

def main():
    updater = ObjectDetector()
    with tf.Session(graph=updater.detection_graph) as sess:
        while True:
            updater.detect(sess)  # in this setup detect() grabs the Kinect frame itself

You have to create the session outside any kind of loop, and perform the checks that quit the program inside the inner while loop.

Hope it helps!

shoma88 commented 5 years ago

Thank you so much! I will try it very soon!

Best regards,

Adriano!


Adnan-annan commented 5 years ago

@daviduarte Do you have any idea how to code this with fast inference in C++?

yezifeiafei commented 5 years ago

> Is there a TF trick to initialize the system resources before running the first inference, something like a model initialization step, so that the first image inference performs like the subsequent inferences?

@svencowart Have you found a good solution to this question?

usamahjundia commented 5 years ago

> Is there a TF trick to initialize the system resources before running the first inference, something like a model initialization step, so that the first image inference performs like the subsequent inferences?
>
> @svencowart Have you found a good solution to this question?

An obvious hack would be to immediately perform a forward pass on an array of zeros right after loading the model. Ugly, but it works.
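
Something like this, as a rough sketch; it assumes a frozen graph imported with the "prefix" name used in the snippets earlier in the thread:

import numpy as np
import tensorflow as tf

# graph_global is assumed to be the frozen graph imported by loadModel() above.
image_tensor = graph_global.get_tensor_by_name('prefix/image_tensor:0')
boxes = graph_global.get_tensor_by_name('prefix/detection_boxes:0')

with tf.Session(graph=graph_global) as sess:
    # Warm-up: one throwaway run on zeros pays the CUDA/cuDNN start-up cost,
    # so the first real image is not many times slower than the rest.
    dummy = np.zeros((1, 300, 300, 3), dtype=np.uint8)
    sess.run(boxes, feed_dict={image_tensor: dummy})

    # ... real inference calls go here, reusing the same session ...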

qoo commented 5 years ago

The first run is slow, the rest are fine.

Inference time: 0 sec @ loop 28.139413833618164
Inference time: 1 sec @ loop 0.45125246047973633
Inference time: 2 sec @ loop 0.4427633285522461
Inference time: 3 sec @ loop 0.436007022857666
Inference time: 4 sec @ loop 0.413358211517334
Inference time: 5 sec @ loop 0.41444969177246094
Inference time: 6 sec @ loop 0.40434837341308594
Inference time: 7 sec @ loop 0.4182426929473877
Inference time: 8 sec @ loop 0.42828917503356934
Inference time: 9 sec @ loop 0.4369192123413086
Total Inference time: 31.987273454666138 sec @ 10 images.
faster_rcnn_inception_resnet_v2_atrous_oid_v4_2018_12_12

kzhang28 commented 5 years ago

The conversion of a PIL image object to a numpy array, image = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8), may be one performance bottleneck of inference. See my answer on Stack Overflow: https://stackoverflow.com/a/57716322/6437725
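
In other words, letting PIL hand its buffer straight to NumPy tends to be much faster than going through getdata(). A small sketch (the path is a placeholder; exact timings will vary):

import numpy as np
from PIL import Image

image = Image.open("PATH_TO_IMG.png").convert("RGB")
(im_width, im_height) = image.size

# Slow path used in the snippets above: getdata() builds a Python sequence
# of pixel tuples before NumPy ever sees the data.
slow = np.array(image.getdata()).reshape((im_height, im_width, 3)).astype(np.uint8)

# Faster path: NumPy reads the image buffer directly.
fast = np.asarray(image, dtype=np.uint8)

assert slow.shape == fast.shape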

ashwini-git commented 5 years ago

What if the model is deployed with TensorFlow Serving on a k8s cluster? How can the inference time be improved?

I am using a retrained SSD model, used the TensorFlow Serving image to containerize it, and then deployed it onto a k8s cluster. A CUDA 9 / K80 GPU environment is set up on the k8s node. The model is accessible at an IP endpoint, and it performs inference on a single image in 2-4 seconds. On the client side, I convert an image to a numpy array, pass it to the model using requests.post(), and print the model's response as JSON.

import numpy
import PIL.Image
import requests

image = PIL.Image.open('image.jpeg')
image_np = numpy.array(image)
payload = {"instances": [image_np.tolist()]}
res = requests.post("http://model IP_endpoint:predict", json=payload)
print(res.json())

Since there is no post-processing done, what can be improved here? Any help is much appreciated. Thanks.
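
Not an answer from within this thread, but the warm-up observation above applies to Serving as well: the first request after the container starts is typically much slower than steady state, so it is worth excluding it from timing. A sketch, with a purely hypothetical endpoint and model name:

import numpy as np
import requests

ENDPOINT = "http://SERVING_HOST:8501/v1/models/ssd_model:predict"  # hypothetical

# Send one dummy image at startup so CUDA/cuDNN and the model get
# initialized before any latency-sensitive request arrives.
dummy = np.zeros((300, 300, 3), dtype=np.uint8)
requests.post(ENDPOINT, json={"instances": [dummy.tolist()]})

# Time only the subsequent requests to measure steady-state latency.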