lindawangg / COVID-Net

COVID-Net Open Source Initiative

cuDNN failed to initialize with TensorFlow 1.15 #67

Closed: Edwardsleo closed this issue 3 years ago.

Edwardsleo commented 4 years ago

Hi, TensorFlow versions 1.13 and 1.15 are listed as requirements for this code.

With TensorFlow 1.13, train_tf runs fine. With TensorFlow 1.15, I get the following error:

Traceback (most recent call last):
  File "train_tf1.py", line 96, in
    args.in_tensorname, args.out_tensorname, args.input_size)
  File "/home/jovyan/COVID-Net/eval.py", line 24, in eval
    pred.append(np.array(sess.run(pred_tensor, feed_dict={image_tensor: np.expand_dims(x, axis=0)})).argmax(axis=1))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_conv/convolution (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[norm_dense_1/Softmax/_1229]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv1_conv/convolution (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I need to run this on a GPU, and TensorFlow 1.15 supports GPUs, so I am trying this code on my GPUs. Looking forward to your comments :) Thanks in advance, Edward

haydengunraj commented 4 years ago

This error can happen for a number of reasons, most commonly a version mismatch between CUDA, TensorFlow, and/or your GPU drivers. It can also happen if you run out of GPU memory, so I would check that first. If memory isn't the issue, make sure your TF 1.15 installation is using the right CUDA and driver versions.
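The two checks suggested above (free GPU memory first, then CUDA/cuDNN/driver versions) can be scripted. What follows is a minimal sketch, not part of COVID-Net. It assumes a Linux machine with `nvidia-smi` and `nvcc` on PATH and cuDNN 7, whose `cudnn.h` defines `CUDNN_MAJOR`/`CUDNN_MINOR`. The expected versions (CUDA 10.0 and cuDNN 7.4 for TF 1.13 and 1.15, driver at least 410.48 for CUDA 10.0) are my reading of TensorFlow's tested-build table and NVIDIA's CUDA release notes, so treat them as assumptions to verify. All function names, the header path, and the 2 GiB threshold are hypothetical. Also note that `nvcc` reports the installed toolkit, which is not necessarily the `libcudart` that TensorFlow actually loads.

```python
import re
import subprocess

# Assumption: versions from TensorFlow's "tested build configurations" table;
# verify against the official table for your exact release.
TESTED_BUILDS = {
    "1.13": {"cuda": "10.0", "cudnn": "7.4"},
    "1.15": {"cuda": "10.0", "cudnn": "7.4"},
}
# Assumption: minimum Linux driver for CUDA 10.0 per NVIDIA's CUDA release notes.
MIN_DRIVER = {"10.0": (410, 48)}


def parse_nvidia_smi(csv_text):
    """Parse `nvidia-smi --query-gpu=driver_version,memory.used,memory.total
    --format=csv,noheader,nounits` into (driver, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        driver, used, total = (field.strip() for field in line.split(","))
        rows.append((driver, int(used), int(total)))
    return rows


def busy_gpus(rows, min_free_mib):
    """Indices of GPUs with less than min_free_mib MiB of free memory."""
    return [i for i, (_, used, total) in enumerate(rows) if total - used < min_free_mib]


def parse_nvcc_release(nvcc_output):
    """Extract the toolkit release (e.g. '10.0') from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def parse_cudnn_header(header_text):
    """Extract 'MAJOR.MINOR' from the #define lines in cuDNN 7's cudnn.h."""
    major = re.search(r"#define CUDNN_MAJOR\s+(\d+)", header_text)
    minor = re.search(r"#define CUDNN_MINOR\s+(\d+)", header_text)
    if not (major and minor):
        return None
    return "{}.{}".format(major.group(1), minor.group(1))


def as_tuple(version):
    """'440.33.01' -> (440, 33, 1), so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_problems(tf_version, driver, cuda, cudnn):
    """List mismatches between detected versions and the expected TF build."""
    expected = TESTED_BUILDS[tf_version]
    problems = []
    # The TF binary links against one CUDA runtime, so require an exact match.
    if cuda != expected["cuda"]:
        problems.append("CUDA {} found, TF {} expects {}".format(
            cuda, tf_version, expected["cuda"]))
    # Same rule TF 1.x prints at load time: cuDNN major must match and the
    # minor version must be equal or higher.
    want_major, want_minor = as_tuple(expected["cudnn"])
    found = as_tuple(cudnn) if cudnn else None
    if found is None or found[0] != want_major or found[1] < want_minor:
        problems.append("cuDNN {} found, TF {} expects {}.x with x >= {}".format(
            cudnn, tf_version, want_major, want_minor))
    min_driver = MIN_DRIVER.get(expected["cuda"])
    if min_driver and as_tuple(driver) < min_driver:
        problems.append("driver {} is older than {} required by CUDA {}".format(
            driver, ".".join(map(str, min_driver)), expected["cuda"]))
    return problems


def run(cmd):
    # Python 3.6-compatible: no capture_output/text keywords.
    return subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True,
                          check=True).stdout


def check_system(tf_version="1.15", cudnn_header="/usr/local/cuda/include/cudnn.h",
                 min_free_mib=2048):
    """Run both checks on this machine. The header path and the 2 GiB
    threshold are hypothetical defaults; adjust them to your setup."""
    gpus = parse_nvidia_smi(run([
        "nvidia-smi",
        "--query-gpu=driver_version,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]))
    cuda = parse_nvcc_release(run(["nvcc", "--version"]))
    with open(cudnn_header) as header:
        cudnn = parse_cudnn_header(header.read())
    # All GPUs share one driver, so the first row's driver version suffices.
    return busy_gpus(gpus, min_free_mib), find_problems(tf_version, gpus[0][0], cuda, cudnn)
```

On the GPU machine, `low_mem, problems = check_system()` returns the indices of GPUs that are short on memory and a list of version mismatches. An empty `problems` list only means these three versions line up with the assumed table; it does not guarantee that cuDNN will initialize.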