googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

cuDNN failed to initialize #3086

Open KJGithub2021 opened 2 years ago

KJGithub2021 commented 2 years ago

I am trying to run my deep learning code, which is available in this gist: https://colab.research.google.com/gist/KJGithub2021/705a27f8d42bf1f26b07b8e10eedacb0/imn.ipynb

- Python 3.6.0
- TensorFlow 2.6

The code runs without errors on my local CPU-only machine, although training takes hours. However, when I select GPU as the runtime in Colaboratory and run it, it gives the following errors:

===========
2022-09-21 12:42:37.434840: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 164634400 exceeds 10% of free system memory.
2022-09-21 12:42:37.507974: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 164634400 exceeds 10% of free system memory.
2022-09-21 12:42:37.789797: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 164634400 exceeds 10% of free system memory.
2022-09-21 12:42:37.868800: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 164634400 exceeds 10% of free system memory.
2022-09-21 12:42:40.532872: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.538008: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.541819: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.545662: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.549502: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.552936: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.556449: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.559686: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.563026: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.566285: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.569632: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2022-09-21 12:42:40.572676: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node CNN_char_emb/conv1d}}]]
     [[prediction_layer/add_2/_1705]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node CNN_char_emb/conv1d}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 262, in <module>
    train_step(x_utterances, x_response, x_utterances_len, x_response_len, x_utters_num, x_target, x_target_weight, id_pairs, x_u_char, x_u_char_len, x_r_char, x_r_char_len)
  File "train.py", line 204, in train_step
    feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node CNN_char_emb/conv1d (defined at /content/drive/MyDrive/IMN-master/Ubuntu_V2/model/model_IMN.py:98) ]]
     [[prediction_layer/add_2/_1705]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node CNN_char_emb/conv1d (defined at /content/drive/MyDrive/IMN-master/Ubuntu_V2/model/model_IMN.py:98) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'CNN_char_emb/conv1d':
  File "train.py", line 128, in <module>
    l2_reg_lambda=FLAGS.l2_reg_lambda)
  File "/content/drive/MyDrive/IMN-master/Ubuntu_V2/model/model_IMN.py", line 200, in __init__
    utterances_cnn_char_emb = cnn_layer(utterances_char_embedded, filter_sizes=[3, 4, 5], num_filters=50, scope="CNN_char_emb", scope_reuse=False)  # [batch_size*max_utter_num*max_utter_len, emb]
  File "/content/drive/MyDrive/IMN-master/Ubuntu_V2/model/model_IMN.py", line 98, in cnn_layer
    conv = tf.nn.conv1d(inputs, w, stride=1, padding="VALID")  # [num_words, num_chars - filter_size, num_filters]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 2094, in conv1d_v2
    dilations=dilations)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 617, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 617, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 2011, in conv1d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 973, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3569, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

============

I'm currently using Chrome to run my Colab files (ipynb).

I am looking forward to an early response from the Colab team!
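For readers following the log above: the error line states its own compatibility rule (matching major version, and a loaded minor version equal to or higher than the compiled one). The sketch below applies that rule to the versions quoted in the log. The helper names `parse_cudnn_versions` and `cudnn_compatible` are hypothetical and not part of TensorFlow or cuDNN; the regular expression assumes the message wording shown in this issue.

```python
import re

# Matches the version pair in the error line quoted in this issue.
# The exact wording is an assumption; other TensorFlow releases may phrase it differently.
_MISMATCH = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_versions(log_line):
    """Return (loaded, compiled) as (major, minor, patch) tuples, or None if no match."""
    match = _MISMATCH.search(log_line)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def cudnn_compatible(loaded, compiled):
    """Rule as worded in the log: same major version, loaded minor >= compiled minor."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


line = (
    "E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN "
    "library: 8.0.5 but source was compiled with: 8.1.0."
)
loaded, compiled = parse_cudnn_versions(line)
print(loaded, compiled, cudnn_compatible(loaded, compiled))
```

With the versions from this log (8.0.5 loaded, 8.1.0 compiled) the check fails on the minor version, which is consistent with cuDNN failing to initialize. In recent TensorFlow 2.x releases, `tf.sysconfig.get_build_info()` should report the cuDNN version the installed wheel was built against, which can be compared with the message in the same way.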

KJGithub2021 commented 2 years ago

Just for information, I have already tried the following, all in vain:

  1. Reset all runtimes and reconnected.
  2. Installed tensorflow-gpu.
  3. Added the following code right after the import statements in Python:

         # Let TensorFlow grow GPU memory on demand instead of reserving it all up front.
         physical_devices = tf.config.experimental.list_physical_devices('GPU')
         tf.config.experimental.set_memory_growth(physical_devices[0], True)

         # Same setting for the TF1-style session used by the training script.
         config = tf.compat.v1.ConfigProto()
         config.gpu_options.allow_growth = True
         sess = tf.compat.v1.Session(config=config)

mayankmalik-colab commented 2 years ago

Unfortunately, changing the Python version in Colab is not supported. Can you try using the preinstalled Python (3.7) and TensorFlow (2.8.2)?
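To see which versions a runtime actually provides before running anything, a small check like the one below may help. It is only a sketch: `installed_version` and `runtime_summary` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or newer, so it would not run on the Python 3.7 runtime mentioned above.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def runtime_summary(packages=("tensorflow",)):
    """Collect the interpreter version and the versions of the given packages."""
    summary = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        summary[name] = installed_version(name)
    return summary


print(runtime_summary())
```

Reading the package metadata avoids importing TensorFlow itself, so the check stays quick and does not touch the GPU.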

KJGithub2021 commented 2 years ago

OK, let me try that and get back to you.

KJGithub2021 commented 2 years ago

Thanks for the help. It worked, with no errors so far. However, model training on the Colab GPU runtime takes even longer than on my local CPU-only system. Why is that?

KJGithub2021 commented 2 years ago

So after 10 minutes of execution, the GPU device apparently ran out of RAM and the code terminated, with no error message.
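The allocation warnings in the first log may be related to this, although that is an assumption and not something the thread confirms. Each warning says a single allocation of 164634400 bytes exceeded 10% of free system memory, which puts an upper bound on how much host memory was free at that moment. A rough calculation, taking the 10% threshold from the warning text itself:

```python
ALLOCATION_BYTES = 164_634_400  # size reported in the warning lines
WARNING_FRACTION = 0.10         # "exceeds 10% of free system memory"

MIB = 2 ** 20
GIB = 2 ** 30


def implied_free_memory_ceiling(allocation_bytes, fraction=WARNING_FRACTION):
    """If an allocation exceeds `fraction` of free memory, free memory is below this bound."""
    return allocation_bytes / fraction


alloc_mib = ALLOCATION_BYTES / MIB
ceiling_gib = implied_free_memory_ceiling(ALLOCATION_BYTES) / GIB
print(f"allocation: {alloc_mib:.1f} MiB, free system memory below {ceiling_gib:.2f} GiB")
```

This works out to an allocation of about 157 MiB and less than roughly 1.53 GiB of free host memory when the warning was logged. That figure describes system RAM on the machine that produced the first log, not GPU memory on the later runtime.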