Closed lucasjinreal closed 5 years ago
@jinfagang Hey, thanks for posting the issue. Originally I had been working with TensorFlow v1.8, but after your comment I tried v1.5 to v1.12, and all versions seem to work with the inference.py script on my end.
It looks like you might have a mismatch between your cuDNN and TensorFlow versions. What versions of CUDA and cuDNN do you have on your machine? If you have CUDA 9.0 and cuDNN 7.0, TensorFlow v1.5 and above should work. I'm basing these questions on the information here
@oandrienko Hi, I have CUDA 9 and cuDNN 7.2, which are the only dependencies for TensorFlow 1.12. This error seems to occur only when training the model; I haven't tested inference yet. My environment is the same as yours, but I get this error.
@jinfagang Thanks for the extra details. Do other networks with Conv ops work with your setup? I have tested with CUDA 9.0 and cuDNN 7.0 and everything seems to work fine. I tested the train and eval scripts with TensorFlow v1.8 to v1.12 and had no issues.
I will try to experiment with cuDNN 7.2 when I have a little time, but in the meantime, have you tried downgrading cuDNN to version 7.0? I believe the issue is related to this, since the error message indicates that the problem is probably occurring because cuDNN failed to initialize.
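To answer the "which cuDNN version do you have" question concretely, one option is to read the version macros from the cuDNN header. The following is only a minimal sketch: it assumes a cuDNN 7.x install whose `cudnn.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`, and the header path used at the bottom is an assumption that may differ on your machine. The function name `parse_cudnn_version` is made up for illustration.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 7"
        match = re.search(
            r"^\s*#define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed location; adjust it to wherever cuDNN is installed on your system.
    header_path = "/usr/local/cuda/include/cudnn.h"
    try:
        with open(header_path) as header:
            print("cuDNN version:", parse_cudnn_version(header.read()))
    except OSError:
        print("Could not read", header_path)
```

If the tuple printed is not the version your TensorFlow build expects, that would be consistent with the mismatch suspected above.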
@oandrienko Thank you for digging into this. I think it may well be a cuDNN 7.2 problem. But doesn't TensorFlow depend specifically on cuDNN 7.2?
@jinfagang The TensorFlow documentation indicates that the newer versions of TensorFlow should work with cuDNN 7.2, but it seems that cuDNN v7.2 for CUDA 9.0 has actually been removed from the cuDNN archive altogether and replaced with cuDNN v7.3. If you are building TensorFlow from source, I think it would be a good idea to upgrade to v7.3 and then see if that solves your problem. For simplicity, I just install TensorFlow and TensorFlow GPU using pip, which requires cuDNN v7.0. I think that would be the easiest way to solve your problem.
I will close this issue for now since it is not directly related to the project! Hope this helps.
You can actually download libcudnn7_7.2.1.38-1+cuda9.0_amd64.deb from here https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/ Then do:
sudo dpkg -i libcudnn7_7.2.1.38-1+cuda9.0_amd64.deb
It fixed the problem for me.
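Before installing a package like this, it can help to confirm that the cuDNN and CUDA versions encoded in the file name are the ones you intend to use. The sketch below assumes the naming pattern seen in the file name above (`libcudnn<major>_<version>-<rev>+cuda<version>_<arch>.deb`); the helper `parse_cudnn_deb_name` is hypothetical and not part of any NVIDIA tooling.

```python
import re

# Pattern inferred from the single file name quoted in this thread;
# other packages in the repository may be named differently.
_DEB_PATTERN = re.compile(
    r"libcudnn(\d+)_(\d+(?:\.\d+)*)-\d+\+cuda(\d+\.\d+)_(\w+)\.deb$"
)


def parse_cudnn_deb_name(filename):
    """Return the cuDNN version, CUDA version and architecture, or None."""
    match = _DEB_PATTERN.match(filename)
    if match is None:
        return None
    return {
        "cudnn": match.group(2),
        "cuda": match.group(3),
        "arch": match.group(4),
    }


if __name__ == "__main__":
    info = parse_cudnn_deb_name("libcudnn7_7.2.1.38-1+cuda9.0_amd64.deb")
    print(info)
```

For the file above, this reports cuDNN 7.2.1.38 built for CUDA 9.0 on amd64, which matches the CUDA 9 setup described earlier in the thread.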
@oandrienko Maybe cuDNN 7.3 doesn't work either. My env:
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /home/amax/anaconda3/lib/python3.5/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1057) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, InceptionV3/Conv2d_1a_3x3/weights/read/_5)]]
[[{{node InceptionV3/InceptionV3/Mixed_7a/concat-0-1-TransposeNCHWToNHWC-LayoutOptimizer/_591}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2707_...tOptimizer", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hi, I think the code has some compatibility issues with TensorFlow 1.10, 1.11, or 1.12:
It would be better if the code could be upgraded to TensorFlow 1.11.