googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

Unable to use TPU-v2 for "tf.distribute.cluster_resolver.TPUClusterResolver()" #4686

Closed moathkhaleel closed 6 days ago

moathkhaleel commented 1 month ago

A couple of weeks ago, I was able to run the following code without a problem while connected to the 'TPU (deprecated)' runtime. Today I tried the same code on the 'TPU v2' runtime, but I keep getting an error indicating that I am not connected to a TPU:

!pip install tensorflow==2.9 t5 tensorflow-text==2.9
!pip install jax==0.4.9
!pip install jaxlib==0.4.9
!pip install mesh-tensorflow==0.1.21 t5
!pip install tflite
!pip install registrar
!pip install google-auth-oauthlib==0.4.1
!pip install protobuf==3.20.3
!pip install datasets==2.5.0
!pip install --upgrade tensorflow-datasets

import t5.models
import seqio
!pip install -U tensorflow-gcs-config==2.9.1

import tensorflow as tf
import os

BASE_DIR = "gs://"  # @param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
  raise ValueError("You must enter a BASE_DIR.")

DATA_DIR = os.path.join(BASE_DIR, "data")
FINETUNE_MODELS_DIR = os.path.join(BASE_DIR, "optimized_model")
ON_CLOUD = True

# Enable eager execution
tf.config.experimental_run_functions_eagerly(True)
if ON_CLOUD:
  print("Setting up GCS access...")
  import tensorflow_gcs_config
  from google.colab import auth
  # Set credentials for GCS reading/writing from Colab and TPU.
  TPU_TOPOLOGY = "v2-8"
  try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    TPU_ADDRESS = tpu.get_master()
    print('Running on TPU:', TPU_ADDRESS)
  except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
  auth.authenticate_user()
  tf.config.experimental_connect_to_cluster(tpu)
  tensorflow_gcs_config.configure_gcs_from_colab_auth()

tf.compat.v1.disable_v2_behavior()

# Improve logging.
from contextlib import contextmanager
import logging as py_logging

if ON_CLOUD:
  tf.compat.v1.get_logger().propagate = False
  py_logging.root.setLevel('INFO')

@contextmanager
def tf_verbosity_level(level):
  og_level = tf.compat.v1.logging.get_verbosity()
  tf.compat.v1.logging.set_verbosity(level)
  yield
  tf.compat.v1.logging.set_verbosity(og_level)

# Enable eager execution
tf.config.experimental_run_functions_eagerly(True)

sagelywizard commented 1 month ago

Many deep learning libraries have TPU-specific versions of their packages. You're pip-installing tensorflow, and the wheel that pip pulls from PyPI doesn't include TPU support. I managed to find an old tensorflow 2.9+TPU wheel which may work for you.

!pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.9.3/tensorflow-2.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

You'll likely need to pass in tpu='local' to the TPUClusterResolver too.
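
As a rough sketch of what that change might look like, assuming a TPU-enabled TensorFlow build is installed: the helper name `make_tpu_resolver` and its `resolver_cls` parameter are hypothetical and used here only for illustration (the parameter lets the helper be exercised without TPU hardware). Only `TPUClusterResolver` and its `tpu` argument come from TensorFlow itself.

```python
def make_tpu_resolver(tpu="local", resolver_cls=None):
    """Build a TPU cluster resolver, raising a clear error on failure.

    `make_tpu_resolver` and `resolver_cls` are illustrative names, not part
    of TensorFlow. By default the real TPUClusterResolver class is used.
    """
    if resolver_cls is None:
        # Imported lazily so this helper can be loaded without TensorFlow.
        import tensorflow as tf
        resolver_cls = tf.distribute.cluster_resolver.TPUClusterResolver
    try:
        # tpu="local" asks for the TPU attached to the current VM
        # instead of a remote TPU address.
        return resolver_cls(tpu=tpu)
    except ValueError as err:
        # Same failure case the original notebook handles, but with a
        # regular exception type instead of BaseException.
        raise RuntimeError("Not connected to a TPU runtime") from err
```

In the notebook above, `tpu = make_tpu_resolver()` would take the place of the bare `TPUClusterResolver()` call on the TPU detection line; the rest of the cell could stay as it is. Whether this is enough on the 'TPU v2' runtime is untested here.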

sagelywizard commented 1 month ago

Also see: https://github.com/googlecolab/colabtools/issues/4481

github-actions[bot] commented 6 days ago

Without additional information we're not able to resolve this issue, so it will be closed at this time. You're still free to add more info and respond to any questions above, though. We'll re-open the issue if you do. Thanks for your contribution!