Closed: waghts95 closed this issue 2 years ago
@tushar-wagh-bioenable
I had this issue myself as well. From what I can see, the CUDA 11.2 (and cuDNN 8.1.0) packages are no longer installed on the Colab instances, for some odd reason. The workaround is to install CUDA 11.2 and cuDNN 8.1.0 manually yourself, which unfortunately means it will take even longer before you can train anything.
Tomorrow I'll share the commands I used to set it all up on Colab.
EDIT: The promised commands.
Mount your Google Drive (it should contain the cuDNN 8.1.0 x64 Linux library for CUDA 11.2 from the NVIDIA website).
from google.colab import drive
drive.mount('/content/drive')
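Optional sanity check (my own addition, not strictly needed): confirm the cuDNN tarball is actually in the mounted folder before continuing. The path is the one from my setup below; adjust it to yours.
!ls "/content/drive/MyDrive/Master Thesis Dev/" | grep cudnn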
Change to the directory containing the CUDA/cuDNN installation files.
%cd "/content/drive/MyDrive/Master Thesis Dev/"
I use this for cleanup. It's probably not needed, but it helped me once when the CUDA installation failed.
!sudo dpkg --configure -a
!sudo apt-get clean
Download CUDA 11.2. The command downloads a .deb file to your drive. Once you see it there, you can comment out this line to avoid re-downloading it every time (and filling up your Drive).
!wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb
Install the CUDA library.
!sudo dpkg -i cuda-repo-ubuntu1804-11-2-local_11.2.0-460.27.04-1_amd64.deb
!sudo apt-key add /var/cuda-repo-ubuntu1804-11-2-local/7fa2af80.pub
!sudo apt-get update
!sudo apt-get -y install cuda-11.2 --fix-broken
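Not part of the original recipe, but a quick sanity check that the driver still responds and the new toolkit directory exists:
!nvidia-smi
!ls /usr/local/ | grep cuda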
De-reference the CUDA library used by Colab, and point the symlink at the one you just installed.
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-11.2/ /usr/local/cuda
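To double-check, the following should now show /usr/local/cuda pointing at /usr/local/cuda-11.2/:
!ls -l /usr/local/cuda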
This was the trickiest part for me. You need to know where cuDNN is installed on Colab so that you can use your version instead of the one Colab ships with. There are probably better ways to do this, but I'm not used to Linux, so I just used something that works.
First, check where cuDNN is installed. This determines the directories you copy the library into in the next commands.
!dpkg -L libcudnn8-dev
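It can also help to see which cuDNN package version Colab currently ships (per the log further down, this was 8.0.5, which is what conflicts with the 8.1.0 TensorFlow expects):
!dpkg -l | grep cudnn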
Install cuDNN.
!tar -xzvf cudnn-11.2-linux-*.tgz
!sudo cp cuda/include/cudnn*.h /usr/include/x86_64-linux-gnu
!sudo cp -P cuda/lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
!sudo chmod a+r /usr/include/x86_64-linux-gnu/cudnn*.h /usr/lib/x86_64-linux-gnu/libcudnn*
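One extra step that may help (an assumption on my part, not something I verified is required): refresh the dynamic linker cache so the freshly copied libraries are the ones that get loaded, and confirm they are visible.
!sudo ldconfig
!ldconfig -p | grep libcudnn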
And that should pretty much be it. You can check the current CUDA version with
!/usr/local/cuda/bin/nvcc --version
and the cuDNN version with
!cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2
although the latter simply reads the header file we just installed, so nothing too special. Best to run your training and see if it crashes (it shouldn't).
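If you want a more direct test than reading version strings, here is a minimal sketch (my addition, assuming the stock Colab TF 2.x) that runs a tiny convolution on the GPU, which is exactly the op that fails when cuDNN is mismatched:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # should list at least one GPU

# A small Conv2D forces cuDNN to initialize; on a version mismatch this
# raises the "Failed to get convolution algorithm" UnknownError instead.
with tf.device('/GPU:0'):
    x = tf.random.normal([1, 64, 64, 3])
    y = tf.keras.layers.Conv2D(filters=8, kernel_size=3)(x)
print(y.shape)  # (1, 62, 62, 8) if everything initialized correctly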
Thank you
@tushar I also have this problem. Was it resolved with the above instructions? Why does Google Colab allow such a pain? Didn't they test for this obvious bug before rolling it out?
They have issues with other cloud products too.
I was counting on Google Colab for a deep learning project, but it looks like all the hoopla and hype are very misleading. Buying an expensive Nvidia GPU and doing the training on my local computer would be more cost-effective, considering all these unstable production issues.
2021-11-15 06:23:02.895695: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: 5000
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1115 06:23:02.917712 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:143] EfficientDet EfficientNet backbone version: efficientnet-b0
I1115 06:23:02.917843 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:144] EfficientDet BiFPN num filters: 64
I1115 06:23:02.917987 140056836609920 ssd_efficientnet_bifpn_feature_extractor.py:146] EfficientDet BiFPN num iterations: 3
[... repeated "Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast ..." and "round_filter input=N output=N" INFO lines omitted ...]
I1115 06:23:05.618357 140056836609920 efficientnet_model.py:458] Building model efficientnet with params ModelConfig(width_coefficient=1.0, depth_coefficient=1.0, resolution=224, dropout_rate=0.2, blocks=(... seven BlockConfig entries omitted ...), stem_base_filters=32, top_base_filters=1280, activation='simple_swish', batch_norm='default', bn_momentum=0.99, bn_epsilon=0.001, weight_decay=5e-06, drop_connect_rate=0.2, depth_divisor=8, min_depth=None, use_se=True, input_channels=3, num_classes=1000, model_name='efficientnet', rescale_input=False, data_format='channels_last', dtype='float32')
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:558: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version. Instructions for updating: rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['/content/train/train.record']
INFO:tensorflow:Reading record datasets for input file: ['/content/train/train.record']
INFO:tensorflow:Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:105: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/object_detection/builders/dataset_builder.py:237: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.map()`
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py:464: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.cast` instead.
/usr/local/lib/python3.7/dist-packages/keras/backend.py:401: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
2021-11-15 06:23:58.659650: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-11-15 06:23:58.662947: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Traceback (most recent call last):
  File "/content/models/research/object_detection/model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/models/research/object_detection/model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 603, in train_loop
    train_input, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 394, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    labels,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1286, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2849, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 671, in _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 86, in call_for_each_replica
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node EfficientDet-D0/model/stem_conv2d/Conv2D (defined at usr/local/lib/python3.7/dist-packages/object_detection/models/ssd_efficientnet_bifpn_feature_extractor.py:225) ]] [Op:__inference__dummy_computation_fn_27786]

Errors may have originated from an input operation.
Input Source operations connected to node EfficientDet-D0/model/stem_conv2d/Conv2D:
 args_1 (defined at usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py:176)

Function call stack:
_dummy_computation_fn
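For anyone skimming this log: the decisive lines are the two cuda_dnn.cc errors ("Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0"); the UnknownError afterwards is just fallout. A quick way to see what your TensorFlow binary was built against (a sketch, assuming TF 2.4+ where get_build_info is available):
import tensorflow as tf

# Versions this TF binary was compiled against; compare the cuDNN value
# with the runtime library reported by `!ldconfig -p | grep libcudnn`.
info = tf.sysconfig.get_build_info()
print("CUDA:", info["cuda_version"], "cuDNN:", info["cudnn_version"])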