TensorFlow 1.10 compatible issue

lucasjinreal commented 5 years ago

Hi, I think the code has some compatibility issues with TensorFlow 1.10, 1.11, or 1.12:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node Conv/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Conv/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, Conv/weights/read)]]
     [[{{node predictions_1/_635}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1450_predictions_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 142, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "inference.py", line 139, in main
    label_map, output_directory)
  File "inference.py", line 95, in run_inference_graph
    feed_dict={placeholder_tensor: image_raw})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node Conv/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057)  = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Conv/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, Conv/weights/read)]]
     [[{{node predictions_1/_635}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1450_predictions_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Conv/Conv2D', defined at:
  File "inference.py", line 142, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "inference.py", line 139, in main
    label_map, output_directory)
  File "inference.py", line 82, in run_inference_graph
    label_color_map=label_color_map)
  File "/media/jintian/netac/ai/home/fast-semantic-segmentation/libs/exporter.py", line 64, in deploy_segmentation_inference_graph
    outputs = _get_outputs_from_inputs(model, input_tensor)
  File "/media/jintian/netac/ai/home/fast-semantic-segmentation/libs/exporter.py", line 38, in _get_outputs_from_inputs
    outputs_dict = model.predict(preprocessed_inputs)
  File "/media/jintian/netac/ai/home/fast-semantic-segmentation/architectures/icnet_architecture.py", line 117, in predict
    full_res = self._third_feature_branch(preprocessed_inputs)
  File "/media/jintian/netac/ai/home/fast-semantic-segmentation/architectures/icnet_architecture.py", line 203, in _third_feature_branch
    stride=2, normalizer_fn=slim.batch_norm)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
    conv_dims=2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 817, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 374, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 868, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 204, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 957, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node Conv/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057)  = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Conv/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, Conv/weights/read)]]
     [[{{node predictions_1/_635}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1450_predictions_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

It would be better if the code could be upgraded to TensorFlow 1.11.

oandrienko commented 5 years ago

@jinfagang Hey, thanks for posting the issue. Originally I had been working with Tensorflow v1.8, but after your comment I tried v1.5 to v1.12 and all versions seem to work with the inference.py script on my end.

It looks like you might have a version mismatch between cuDNN and Tensorflow. What versions of CUDA and cuDNN do you have on your machine? If you have CUDA 9.0 and cuDNN 7.0, Tensorflow v1.5 and above should work. I am basing these questions on the information here
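One way to answer the cuDNN version question is to read the version macros out of the installed `cudnn.h`. The sketch below assumes a cuDNN 7.x install whose header defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL`; the two header paths are assumptions and will differ between installs, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed header locations; adjust these for your own CUDA/cuDNN install.
CANDIDATE_HEADERS = [
    "/usr/local/cuda/include/cudnn.h",
    "/usr/include/cudnn.h",
]

# Matches lines such as "#define CUDNN_MAJOR 7" at the start of a line.
_DEFINE = re.compile(
    r"^\s*#define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", re.MULTILINE
)


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cudnn.h text, or None."""
    found = {name: int(value) for name, value in _DEFINE.findall(header_text)}
    try:
        return found["MAJOR"], found["MINOR"], found["PATCHLEVEL"]
    except KeyError:
        # At least one of the three macros is missing from this text.
        return None


def find_installed_cudnn(paths=CANDIDATE_HEADERS):
    """Return (path, version) for the first readable header, or None."""
    for candidate in paths:
        header = Path(candidate)
        if header.is_file():
            version = parse_cudnn_version(header.read_text(errors="ignore"))
            if version is not None:
                return candidate, version
    return None
```

Calling `find_installed_cudnn()` returns `None` when no header is found at the assumed paths, so a `None` result does not by itself mean cuDNN is missing.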

lucasjinreal commented 5 years ago

@oandrienko Hi, I have CUDA 9 and cuDNN 7.2, which are exactly the dependencies TensorFlow 1.12 requires. The error seems to occur only when training the model; I haven't tested inference yet. My environment is the same as yours, but I still get this error.

oandrienko commented 5 years ago

@jinfagang Thanks for the extra details. Do other networks with Conv ops work with your setup? I have tested with CUDA 9.0 and cuDNN 7.0 and everything seems to work fine. I tested the train and eval scripts with Tensorflow v1.8 to v1.12 and had no issues.

I will try to experiment with cuDNN 7.2 when I have a little time, but in the meantime, have you tried downgrading cuDNN to version 7.0? I believe the issue is related to the cuDNN version, since the error message says the failure is probably because cuDNN failed to initialize.

lucasjinreal commented 5 years ago

@oandrienko Thank you for digging into this. I think it may well be a cuDNN 7.2 problem. But doesn't TensorFlow depend specifically on cuDNN 7.2?

oandrienko commented 5 years ago

@jinfagang The Tensorflow documentation indicates that the newer versions of Tensorflow should work with cuDNN 7.2, but it seems that cuDNN v7.2 for CUDA 9.0 has been removed from the cuDNN archive altogether and replaced with cuDNN v7.3. If you are building Tensorflow from source, I think it would be a good idea to upgrade to v7.3 and then see if that solves your problem. For simplicity, I just install Tensorflow and Tensorflow GPU using pip, which requires cuDNN v7.0. I think that would be the easiest way to solve your problem.
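To reason about which cuDNN versions can stand in for one another, here is a minimal sketch of the rule as I understand it for cuDNN 7: the library loaded at runtime should have the same major version as the one Tensorflow was compiled against, and an equal or higher minor version. This is my reading of the load-time check, not code taken from Tensorflow, and the function names and version strings are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '7.2.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def cudnn_runtime_ok(runtime, compiled):
    """Sketch of the assumed rule: same major, runtime minor >= compiled minor.

    `runtime` is the cuDNN version found on the machine and `compiled` is the
    version the Tensorflow binary was built against, both as dotted strings.
    """
    rt = parse_version(runtime)
    built = parse_version(compiled)
    if len(rt) < 2 or len(built) < 2:
        raise ValueError("expected at least 'major.minor' in both versions")
    return rt[0] == built[0] and rt[1] >= built[1]
```

Under this assumed rule, a runtime of `7.3.0` would satisfy a binary built against `7.2.1`, while a runtime of `7.0.5` would not.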

I will close this issue for now since it is not directly related to the project! Hope this helps.

AlekseyMalyshev commented 5 years ago

You can actually download libcudnn7_7.2.1.38-1+cuda9.0_amd64.deb from https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/ and then run:

sudo dpkg -i libcudnn7_7.2.1.38-1+cuda9.0_amd64.deb

It fixed the problem for me.
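After installing the package, it can help to confirm which cuDNN the dynamic loader actually picks up. The sketch below loads the shared library with `ctypes` and calls cuDNN's `cudnnGetVersion()`. The library name `libcudnn.so.7` is an assumption for a Linux cuDNN 7 install, the helper names are hypothetical, and the decoding assumes the cuDNN 7 encoding of major * 1000 + minor * 100 + patch.

```python
import ctypes


def decode_cudnn_version(number):
    """Split an integer such as 7201 into (major, minor, patch).

    Assumes the cuDNN 7 encoding: major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(number, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(library_name="libcudnn.so.7"):
    """Return the version of the cuDNN library the loader finds, or None."""
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        # The loader could not find or open the library.
        return None
    # cudnnGetVersion takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    lib.cudnnGetVersion.argtypes = []
    return decode_cudnn_version(lib.cudnnGetVersion())
```

If `loaded_cudnn_version()` returns `None`, the library is not on the loader's search path, which would be worth fixing before looking at Tensorflow itself.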

bleedingfight commented 5 years ago

@oandrienko Maybe cuDNN 7.3 doesn't work either. My environment:

Ubuntu 16.04
2x GTX 1080
E5
CUDA 10.0
cuDNN 7.3
TensorFlow 1.12

I have the same problem:


UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
 [[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /home/amax/anaconda3/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1057)  = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, InceptionV3/Conv2d_1a_3x3/weights/read/_5)]]
 [[{{node InceptionV3/InceptionV3/Mixed_7a/concat-0-1-TransposeNCHWToNHWC-LayoutOptimizer/_591}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2707_...tOptimizer", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

oandrienko / fast-semantic-segmentation

TensorFlow 1.10 compatible issue #8