tensorflow / models

Models and examples built with TensorFlow

object_detection: Trained SSD-Inception-v2 Inference Errors on Jetson TX2 #3390

Closed ryan-summers closed 6 years ago

ryan-summers commented 6 years ago

System information

Describe the problem

Sporadically, image inference fails on the Jetson TX2 using a custom-trained SSD-Inception-v2 network. The same network runs normally on desktop machines with identical versions of TensorFlow, CUDA, and cuDNN, in both GPU- and CPU-only environments, but on the Jetson it will often begin throwing the error below. The network has worked a number of times in the past; the CUDA errors appear sporadically.

trained_models/ssd_inception_dice/label_map.pbtxt 
2018-02-16 00:07:21.351899: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-16 00:07:21.352047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.84GiB
2018-02-16 00:07:21.352110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-16 00:07:22.098411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2018-02-16 00:07:31.070486: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.070643: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.070684: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.070858: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2456] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "tools/infer_image.py", line 58, in <module>
    image_tensor: np.expand_dims(img_rgb, axis=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape([1,64,75,75]) filter shape([3,3,64,192])
         [[Node: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_2c_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/InceptionV2/InceptionV2/Conv2d_2b_1x1/Relu6, FeatureExtractor/InceptionV2/Conv2d_2c_3x3/weights)]]
         [[Node: Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_5/_69 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1105_Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'FeatureExtractor/InceptionV2/InceptionV2/Conv2d_2c_3x3/Conv2D', defined at:
  File "tools/infer_image.py", line 40, in <module>
    tf.import_graph_def(graph_def, name='')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 316, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 554, in import_graph_def
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cuDNN launch failure : input shape([1,64,75,75]) filter shape([3,3,64,192])
         [[Node: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_2c_3x3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/InceptionV2/InceptionV2/Conv2d_2b_1x1/Relu6, FeatureExtractor/InceptionV2/Conv2d_2c_3x3/weights)]]
         [[Node: Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_5/_69 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1105_Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_5", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

2018-02-16 00:07:31.494460: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.494583: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.494620: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.494671: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED
2018-02-16 00:07:31.494720: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0x7e376d0: CUDA_ERROR_LAUNCH_FAILED

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

Below is the infer_image.py script used to reproduce. I can also provide the pre-trained network and labels file if necessary.

#!/usr/bin/python
import argparse
import cv2
import numpy as np
import tensorflow as tf

from object_detection.utils import ops as utils_ops
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Infer labels in an image.')

    parser.add_argument('img', type=str, help='The image to infer.')
    parser.add_argument('model', type=str, help='The frozen inference graph to use')
    parser.add_argument('labels', type=str, help='The label map to use.')

    args = parser.parse_args()

    # Load the label map into TF to convert indices into labels.
    label_map = label_map_util.load_labelmap(args.labels)
    categories = label_map_util.convert_label_map_to_categories(
            label_map, max_num_classes=90, use_display_name=True)
    category_index = label_map_util.create_category_index(categories)

    # Load the model architecture into TF.
    graph = tf.Graph()

    with graph.as_default():
        graph_def = tf.GraphDef()

        with tf.gfile.GFile(args.model, 'rb') as f:
            serialized_graph = f.read()

        graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(graph_def, name='')

    image_tensor = graph.get_tensor_by_name('image_tensor:0')
    d_boxes = graph.get_tensor_by_name('detection_boxes:0')
    d_scores = graph.get_tensor_by_name('detection_scores:0')
    d_classes = graph.get_tensor_by_name('detection_classes:0')
    num_d = graph.get_tensor_by_name('num_detections:0')

    # Read the image into memory.
    img = cv2.imread(args.img)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Perform an inference on the image.
    with graph.as_default():
        with tf.Session() as sess:
            (boxes, scores, classes, num) = sess.run(
                [d_boxes, d_scores, d_classes, num_d],
                feed_dict={
                    image_tensor: np.expand_dims(img_rgb, axis=0)
                })

    vis_util.visualize_boxes_and_labels_on_image_array(
            img,
            boxes[0],
            classes[0].astype(int),
            scores[0],
            category_index,
            use_normalized_coordinates=True,
            line_thickness=3)

    cv2.imshow('Detected Image', img)
    cv2.waitKey()
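
For reference, the script takes the image, the frozen inference graph, and the label map as positional arguments, e.g. python tools/infer_image.py test.jpg trained_models/ssd_inception_dice/frozen_inference_graph.pb trained_models/ssd_inception_dice/label_map.pbtxt (the image and graph filenames here are illustrative; only the label-map path appears in the log above).
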
cy89 commented 6 years ago

@ryan-summers I'm not exactly sure how we can help, as it's not clear whether the issue stems from running on the Jetson TX2, from object_detection, or from something else, and I'm not sure who here could replicate your bug.

I'll ask @jch1 and @tombstone if they're aware of anything relevant, but I think we need help from the wider TF community on this one.

oneTimePad commented 6 years ago

@ryan-summers I believe this is due to the Jetson running out of memory. I had a similar problem when testing SSD Inception-v2-resnet and all versions of Faster R-CNN. Try to use MobileNet with SSD. That definitely works.
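
For anyone trying the SSD-MobileNet suggestion, a rough sketch of fetching a frozen graph from the detection model zoo and pointing the script above at it; the archive name and download URL are assumptions based on the 2017-era zoo listing and may need updating:

    import tarfile
    import six.moves.urllib as urllib

    # Archive name and base URL are assumptions; check the current model zoo listing.
    MODEL_FILE = 'ssd_mobilenet_v1_coco_2017_11_17.tar.gz'
    DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

    # Download and unpack; the extracted directory contains frozen_inference_graph.pb,
    # which can be passed to infer_image.py in place of the SSD-Inception graph.
    urllib.request.urlretrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
    with tarfile.open(MODEL_FILE) as tar:
        tar.extractall()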

ryan-summers commented 6 years ago

@oneTimePad I'm inclined to believe that you're correct. I'll try to run some more tests to see if MobileNet generates the same errors, but the initial work I have done indicates MobileNet will run successfully.

As a follow-up, I continued digging into the problem and noticed that the error never occurs when running TensorFlow in CPU-only mode (e.g. with CUDA_VISIBLE_DEVICES=-1).
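
For completeness, a minimal sketch of forcing CPU-only execution from inside the script (equivalent to exporting the variable in the shell before launching; it must be set before TensorFlow touches the GPU):

    import os
    # Hide all CUDA devices so TensorFlow falls back to CPU kernels.
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf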

I know the Jetson TX2 is slightly unusual in that the GPU and CPU share memory, but does anyone know of a way to dedicate a specific amount of memory to the TensorFlow session? Monitoring memory usage on the Jetson TX2 when SSD-Inception spins up properly doesn't show the majority of memory being used, and it only appears to be a problem at initialization: once the inference graph is loaded and the first inference completes, the graph performs inferences consistently.

oneTimePad commented 6 years ago

@ryan-summers

Fraction of memory:

    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
    config = tf.ConfigProto(gpu_options=gpu_options)

Dynamic growth:

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
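
Either way, the resulting config has to be passed to the session. A rough sketch of wiring it into the repro script above (0.5 is the fraction reported to work below; adjust as needed):

    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
    config = tf.ConfigProto(gpu_options=gpu_options)

    with graph.as_default():
        # Cap TensorFlow's share of the Jetson's unified CPU/GPU memory.
        with tf.Session(config=config) as sess:
            (boxes, scores, classes, num) = sess.run(
                [d_boxes, d_scores, d_classes, num_d],
                feed_dict={image_tensor: np.expand_dims(img_rgb, axis=0)})
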
ryan-summers commented 6 years ago

Thanks @oneTimePad - I set a static GPU memory fraction of 0.5 and initial tests show that the error disappears. I'm now heavily inclined to believe the error is related to GPU memory usage; that would also explain its sporadic nature (as the Linux system uses more memory for other processes, less of the shared memory is left for the GPU, which triggers the errors). I'm going to continue with some regression testing today and will close this issue if the error doesn't return within the next few hours.

Thanks for all of the help!

ryan-summers commented 6 years ago

I'm going to close this issue, as it appears to have been resolved by limiting the amount of GPU memory TensorFlow allocates on the Jetson.