googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0
2.2k stars 720 forks

GPU device not found in Colab (likely due to most recent TF 2.0 update) #864

Closed bigmw closed 6 months ago

bigmw commented 4 years ago

My code worked well with the GPU in Colab yesterday, but this morning it became very slow. I suspect that the CPU is being used even though the hardware accelerator is explicitly set to GPU under “Change runtime type”. The following test code results in “SystemError: GPU device not found”.

code chunk:

%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

The same problem has been replicated by other users after I reported the issue to the TF team (https://github.com/tensorflow/tensorflow/issues/34385). I am posting it here because I think Colab can solve the problem by simply reverting the most recent TF 2.0 update on their end. Thank you.

ashsny commented 4 years ago

Thanks for the report. I can confirm TF2 is not working on some GPU types.

b/144378214

shkarupa-alex commented 4 years ago

Got the same issue starting today. Updating to tf-nightly-gpu helps in my case.

bigmw commented 4 years ago

thanks @jsnydes @shkarupa-alex

Someone suggests that this is due to the recent update from CUDA 10.0 to 10.1 at Google Colab. https://stackoverflow.com/questions/58926378/unable-to-run-even-basic-example-with-tf-2-0-and-gpu-support-in-colab

TF 2.0 is not yet ready for that update. Installing tf-nightly-gpu did help, as others and I have tested. But my results suggest that the GPU under tf-nightly-gpu (TF 2.1.0-dev20191119) is ~54% slower than under the original TF 2.0 (before this bug). Details are in the original report thread (https://github.com/tensorflow/tensorflow/issues/34385). I think a potential temporary solution is for Google Colab to revert to CUDA 10.0 and hold the CUDA update (if this is the cause) until TF 2.0 is fully ready. Hope this makes sense.

fortharrow commented 4 years ago

Same issue on cupy. It can't load libXXX.10.0 libraries. FYI.
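
One way to check the library-loading failure described here is to probe which CUDA runtime versions the dynamic loader can actually find. The sketch below is only an illustration: the helper name `probe_shared_libraries` is made up, and the sonames `libcudart.so.10.0` and `libcudart.so.10.1` are an assumption about which libraries matter on a Linux runtime (the comment above only says "libXXX.10.0").

```python
import ctypes


def probe_shared_libraries(names):
    """Return {library name: True/False} depending on whether it can be loaded."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # ask the dynamic loader to find and load the library
            results[name] = True
        except OSError:
            # Raised when the loader cannot find or load the library.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumed sonames for the CUDA 10.0 and 10.1 runtime on Linux.
    candidates = ["libcudart.so.10.0", "libcudart.so.10.1"]
    for name, loaded in probe_shared_libraries(candidates).items():
        print(name, "loadable" if loaded else "not found")
```

If only the 10.1 library loads, a package built against 10.0 would be expected to fail in the way reported, which would be consistent with the CUDA-update explanation suggested earlier in the thread.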

CaptainPZ commented 4 years ago

I have the same issue. Colab worked fine with TF 2.x on GPU, but starting yesterday it is not working anymore.

colaboratory-team commented 4 years ago

b/144844033

colaboratory-team commented 4 years ago

@fortharrow Can you share a self-contained notebook reproducing the issue you observe with cupy?

We aren't able to reproduce the failure described with the bundled version, 6.5.0.

adam-adam commented 4 years ago

I have the same issue. Colab stopped working with the GPU two days ago. Was running tensorflow 2.0.0-beta1.

jakevdp commented 4 years ago

If you are using tensorflow 2.0 on Colab, we recommend using the bundled version, which you can enable with:

%tensorflow_version 2.x

This will switch your tensorflow version to the current 2.X version built for Colab (see https://colab.research.google.com/notebooks/tensorflow_version.ipynb for details)

If you install external releases of tensorflow via pip install tensorflow-gpu==2.0 or similar, it will install a pre-built binary that may not be compatible with the GPUs and drivers available in Colab's runtime.
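
To see which TensorFlow build a notebook actually ended up with, and whether that build sees a GPU, something like the following sketch may help. The helper name `describe_tensorflow_gpu` is hypothetical; it relies only on `tf.__version__`, `tf.test.is_built_with_cuda()`, and `tf.test.gpu_device_name()`, and returns `None` when TensorFlow is not installed.

```python
def describe_tensorflow_gpu():
    """Summarize the active TensorFlow build, or return None if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Empty string when TensorFlow cannot see a GPU.
        "gpu_device_name": tf.test.gpu_device_name(),
    }


if __name__ == "__main__":
    print(describe_tensorflow_gpu())
```

Running this before and after a `pip install` of an external release shows whether the install changed the active version or the GPU visibility.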

fortharrow commented 4 years ago

@fortharrow Can you share a self-contained notebook reproducing the issue you observe with cupy?

We aren't able to reproduce the failure described with the bundled version, 6.5.0.

The situation is fixed now. Thank you.

adam-adam commented 4 years ago

I've switched to the bundled version of tensorflow 2.0 and it's working again now. Thank you.

bigmw commented 4 years ago

@adam-adam Yes, it works now. Thanks for the update.

@jakevdp Thanks for the comments. I did use "%tensorflow_version 2.x", as shown in the test code above, and that was exactly the setting under which the GPU device was not found in Colab. That is why we needed to install tf-nightly-gpu. Hope this makes sense.

If you are using tensorflow 2.0 on Colab, we recommend using the bundled version, which you can enable with:

%tensorflow_version 2.x

milansoliya4210 commented 4 years ago

When I load TensorFlow 2.0 using the bundled package, I get the error below:

Traceback (most recent call last):
  File "keras_retinanet/bin/train.py", line 527, in <module>
    main()
  File "keras_retinanet/bin/train.py", line 522, in main
    initial_epoch=args.initial_epoch
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 100, in fit_generator
    callbacks.set_model(callback_model)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py", line 68, in set_model
    callback.set_model(model)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v2.py", line 116, in set_model
    super(TensorBoard, self).set_model(model)
  File "/tensorflow-2.1.0/python3.6/tensorflow_core/python/keras/callbacks.py", line 1532, in set_model
    self.log_dir, self.model._get_distribution_strategy())  # pylint: disable=protected-access
AttributeError: 'Model' object has no attribute '_get_distribution_strategy'
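
The traceback above passes through both the standalone `keras` package (`dist-packages/keras`) and TensorFlow's own Keras code (`tensorflow_core/python/keras`). Whether that mix is the cause is not established in this thread, but as a first step it may help to confirm which of the two packages are installed in the runtime. A minimal sketch, with a hypothetical helper name `installed_keras_flavours`:

```python
import importlib.util


def installed_keras_flavours():
    """Report whether standalone `keras` and `tensorflow` are importable."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("keras", "tensorflow")
    }


if __name__ == "__main__":
    for name, present in installed_keras_flavours().items():
        print(name, "installed" if present else "not installed")
```

The check only looks up the packages without importing them, so it is safe to run even when one of them fails on import.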

DevashishX commented 4 years ago

Is this issue solved now?

Is 2.2.0-rc3 the right version?