googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

cuDNN failed to initialize. #2427

Closed waghts95 closed 2 years ago

waghts95 commented 3 years ago

```
2021-11-15 06:23:02.895695: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1115 06:23:02.901168 140056836609920 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 5000
I1115 06:23:02.907187 140056836609920 config_util.py:552] Maybe overwriting train_steps: 5000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1115 06:23:02.907364 140056836609920 config_util.py:552] Maybe overwriting use_bfloat16: False
I1115 06:23:02.917712 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:143] EfficientDet EfficientNet backbone version: efficientnet-b0
I1115 06:23:02.917843 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:144] EfficientDet BiFPN num filters: 64
I1115 06:23:02.917987 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:146] EfficientDet BiFPN num iterations: 3
I1115 06:23:03.023633 140056836609920 efficientnet_model.py:147] round_filter input=32 output=32
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.046473 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.048448 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.050931 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.051964 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.060000 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.064206 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.070451 140056836609920 efficientnet_model.py:147] round_filter input=32 output=32
I1115 06:23:03.070622 140056836609920 efficientnet_model.py:147] round_filter input=16 output=16
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.085935 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.087064 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.089118 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.090175 140056836609920 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1115 06:23:03.189514 140056836609920 efficientnet_model.py:147] round_filter input=16 output=16
I1115 06:23:03.189718 140056836609920 efficientnet_model.py:147] round_filter input=24 output=24
I1115 06:23:03.503825 140056836609920 efficientnet_model.py:147] round_filter input=24 output=24
I1115 06:23:03.504052 140056836609920 efficientnet_model.py:147] round_filter input=40 output=40
I1115 06:23:03.829157 140056836609920 efficientnet_model.py:147] round_filter input=40 output=40
I1115 06:23:03.829411 140056836609920 efficientnet_model.py:147] round_filter input=80 output=80
I1115 06:23:04.307266 140056836609920 efficientnet_model.py:147] round_filter input=80 output=80
I1115 06:23:04.307478 140056836609920 efficientnet_model.py:147] round_filter input=112 output=112
I1115 06:23:04.789242 140056836609920 efficientnet_model.py:147] round_filter input=112 output=112
I1115 06:23:04.789535 140056836609920 efficientnet_model.py:147] round_filter input=192 output=192
I1115 06:23:05.393709 140056836609920 efficientnet_model.py:147] round_filter input=192 output=192
I1115 06:23:05.393937 140056836609920 efficientnet_model.py:147] round_filter input=320 output=320
I1115 06:23:05.552831 140056836609920 efficientnet_model.py:147] round_filter input=1280 output=1280
I1115 06:23:05.618357 140056836609920 efficientnet_model.py:458] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(BlockConfig(input_filters=32, output_filters=16, kernel_size=3, num_repeat=1, expand_ratio=1, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=16, output_filters=24, kernel_size=3, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=24, output_filters=40, kernel_size=5, num_repeat=2, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=40, output_filters=80, kernel_size=3, num_repeat=3, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=80, output_filters=112, kernel_size=5, num_repeat=3, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=112, output_filters=192, kernel_size=5, num_repeat=4, expand_ratio=6, strides=(2, 2), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise'), BlockConfig(input_filters=192, output_filters=320, kernel_size=3, num_repeat=1, expand_ratio=6, strides=(1, 1), se_ratio=0.25, id_skip=True, fused_conv=False, conv_type='depthwise')), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:558: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1115 06:23:05.674777 140056836609920 deprecation.py:345] From /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:558: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['/content/train/train.record']
I1115 06:23:05.685961 140056836609920 dataset_builder.py:163] Reading unweighted datasets: ['/content/train/train.record']
INFO:tensorflow:Reading record datasets for input file: ['/content/train/train.record']
I1115 06:23:05.686263 140056836609920 dataset_builder.py:80] Reading record datasets for input file: ['/content/train/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1115 06:23:05.686477 140056836609920 dataset_builder.py:81] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1115 06:23:05.686713 140056836609920 dataset_builder.py:88] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:105: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
W1115 06:23:05.689157 140056836609920 deprecation.py:345] From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:105: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE) instead. If sloppy execution is desired, use tf.data.Options.experimental_deterministic.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:237: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
W1115 06:23:05.712973 140056836609920 deprecation.py:345] From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:237: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data.Dataset.map()
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
W1115 06:23:14.798910 140056836609920 deprecation.py:345] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a tf.sparse.SparseTensor and use tf.sparse.to_dense instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W1115 06:23:19.895702 140056836609920 deprecation.py:345] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
/usr/local/lib/python3.7/dist-packages/keras/backend.py:401: UserWarning: tf.keras.backend.set_learning_phase is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the training argument of the __call__ method of your layer or model.
  warnings.warn('tf.keras.backend.set_learning_phase is deprecated and '
2021-11-15 06:23:58.659650: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-11-15 06:23:58.662947: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Traceback (most recent call last):
  File "/content/models/research/object_detection/model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/models/research/object_detection/model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 603, in train_loop
    train_input, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 394, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    labels,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1286, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2849, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 671, in _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 86, in call_for_each_replica
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node EfficientDet-D0/model/stem_conv2d/Conv2D (defined at usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_efficientnet_bifpn_feature_extractor.py:225) ]] [Op:__inference__dummy_computation_fn_27786]

Errors may have originated from an input operation.
Input Source operations connected to node EfficientDet-D0/model/stem_conv2d/Conv2D:
 args_1 (defined at usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:176)

Function call stack:
_dummy_computation_fn
```
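For anyone debugging the same mismatch: a quick way to compare the cuDNN version TensorFlow was built against with the one actually on the VM is the cell below. This is a minimal sketch, assuming TF 2.5+ (where `tf.sysconfig.get_build_info()` exists) and that the cuDNN header sits at the path mentioned later in this thread; it may differ on other images.

```
import tensorflow as tf

# Versions TensorFlow was *compiled* against (build-time, not runtime).
build = tf.sysconfig.get_build_info()
print('built for CUDA', build['cuda_version'], '/ cuDNN', build['cudnn_version'])

# cuDNN version actually installed on the VM (header path may vary by image).
!cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2
```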

tushar-wagh-bioenable commented 3 years ago

@tushar-wagh-bioenable

TeoNikolov commented 3 years ago

I had this issue myself as well. From what I can see, CUDA 11.2 (and cuDNN 8.1.0) is no longer installed on the Colab instances, for some odd reason. The workaround is to install the CUDA 11.2 and cuDNN 8.1.0 packages manually, which unfortunately means it will take even longer before you can train anything.

Tomorrow I'll share the commands I used to set it all up on Colab.


EDIT: The promised commands.

Setup

Mount your Google Drive (it should contain the cuDNN 8.1.0 x64 Linux library for CUDA 11.2 from the NVIDIA website).

```
from google.colab import drive
drive.mount('/content/drive')
```

Change to the directory containing the CUDA/cuDNN installation files.

```
%cd "/content/drive/MyDrive/Master Thesis Dev/"
```
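(A quick sanity check I'd add here, not part of the original steps: confirm the cuDNN tarball is actually in this folder before going further.)

```
!ls cudnn-11.2-linux-*.tgz
```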

I use this for cleanup. It's probably not needed, but it helped me once when the CUDA installation failed.

```
!sudo dpkg --configure -a
!sudo apt-get clean
```

Install CUDA 11.2

Download and install CUDA 11.2. The command below downloads a .deb file to your drive. Once you see it there, you can comment out the line (or guard it, as sketched after the command) to save time on repeated downloads and avoid filling up your Drive.

```
!wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb
```
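If you'd rather not comment the line in and out by hand, a small guard like this (my own variation, not from the original recipe) downloads only when the .deb isn't already present:

```
# Skip the download if the installer is already in the current Drive folder.
![ -f cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb ] || wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb
```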

Install the CUDA library

```
!sudo dpkg -i cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb
!sudo apt-key add /var/cuda-repo-ubuntu1804-11-2-local/7fa2af80.pub
!sudo apt-get update
!sudo apt-get -y install cuda-11.2 --fix-broken
```

Remove the symlink to the CUDA installation Colab ships with, and point it at the one you just installed.

```
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-11.2/ /usr/local/cuda
```
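(A quick check that the symlink now points where we want; my addition, not part of the original steps:)

```
# Expect: /usr/local/cuda -> /usr/local/cuda-11.2/
!ls -l /usr/local/cuda
```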

Install CUDNN 8.1.0

This was the trickiest part for me. You need to know where cuDNN is installed on Colab so that you can use your version instead of the one Colab comes with. There are probably better ways to do this, but I am not used to Linux, so I just used something that works.

First, check where cuDNN is installed. This determines which directories you copy the library into in the next commands.

```
!dpkg -L libcudnn8-dev
```
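If the `dpkg -L` output isn't conclusive, you can also ask the dynamic linker where it currently resolves cuDNN from (an alternative I'd suggest, not from the original write-up):

```
!ldconfig -p | grep libcudnn
```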

Install cuDNN.

```
!tar -xzvf cudnn-11.2-linux-*.tgz
!sudo cp cuda/include/cudnn*.h /usr/include/x86_64-linux-gnu
!sudo cp -P cuda/lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
!sudo chmod a+r /usr/include/x86_64-linux-gnu/cudnn*.h /usr/lib/x86_64-linux-gnu/libcudnn*
```
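After copying the libraries in, refreshing the dynamic linker cache can't hurt (a precautionary step I'm adding; /usr/lib/x86_64-linux-gnu is normally on the default search path already):

```
# Rebuild the linker cache so the freshly copied libcudnn .so files are found.
!sudo ldconfig
```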

And that should pretty much be it. You can check the current CUDA version with `!/usr/local/cuda/bin/nvcc --version` and the cuDNN version with `!cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2`, although the latter just reads the header file we installed above, so it proves nothing on its own. Best to run your training and see if it crashes (it shouldn't).
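A more direct end-to-end check is to run a tiny convolution on the GPU, since `tf.nn.conv2d` is backed by cuDNN there. This is my own smoke test, not part of the original recipe; if the version mismatch is still present, it fails with the same "Failed to get convolution algorithm" error:

```
import tensorflow as tf

# Force a convolution onto the GPU; on GPU this op goes through cuDNN.
with tf.device('/GPU:0'):
    x = tf.random.normal([1, 64, 64, 3])   # NHWC input
    k = tf.random.normal([3, 3, 3, 8])     # HWIO filter
    y = tf.nn.conv2d(x, k, strides=1, padding='SAME')

print(y.shape)  # expect (1, 64, 64, 8) if cuDNN initialized correctly
```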

tushar-wagh-bioenable commented 3 years ago

Thank you



nguaki commented 2 years ago

@tushar I also have this problem. Was it resolved with the above instructions? Why does Google Colab allow such a pain? Didn't they test for this obvious bug before rolling it out?

waghts95 commented 2 years ago

They have issues with other cloud products too.


nguaki commented 2 years ago

I was counting on Google Colab for my deep learning project, but it looks like all the hoopla and hype is very misleading. Buying an expensive NVIDIA GPU and doing the training on my local computer may be more cost-effective, considering all these unstable production issues.