googlecolab / colabtools

Python libraries for Google Colaboratory

TPU Node to TPU VM Migration #4481

Closed sagelywizard closed 2 months ago

sagelywizard commented 6 months ago

We recently released TPU VM accelerators for Colab (backed by TPU v2-8)! This deprecates the legacy TPU Node accelerators (see this documentation for the technical details). The new runtime improves usability, reliability, and debuggability, and enables support for modern JAX on TPU! Between April and June 2024, we'll begin migrating existing legacy TPU notebooks to modern TPU VM machines. This may require some action from Colab TPU notebook owners.

Limited capacity

As with legacy TPUs, we have limited free tier capacity. Use of "TPU v2" accelerators is subject to availability.

Migration plan

We'll gradually switch legacy TPU notebooks to be backed by TPU VM runtimes. No user action is required to initiate the migration, but some action may be required to update legacy TPU notebooks.

As we migrate legacy TPU notebooks, we'll update this GitHub ticket with the current migration status.

You can also manually migrate your legacy TPU notebooks by changing the accelerator type from "TPU (deprecated)" to "TPU v2".

What changes might be necessary to my existing TPU notebook?

Default installed packages

TPU v2 runtimes have different packages installed by default than our other runtimes. The set of packages installed on the TPU v2 runtime is smaller and focused on deep learning/AI applications. We've verified that the most commonly installed packages will still be included in the new TPU v2 runtime, but some uncommon packages may be removed. You may need to manually install these packages using pip install, e.g.:

!pip install <my-uncommon-package>
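If you're unsure whether a package survived the migration, one option is to guard the install from a cell. A minimal sketch, where some_uncommon_package / some-uncommon-package are hypothetical placeholder names:

    import importlib.util
    import subprocess
    import sys

    # Install only if the module isn't already present in the runtime.
    if importlib.util.find_spec("some_uncommon_package") is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "some-uncommon-package"])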

JAX will undergo a significant upgrade

JAX will undergo a very significant upgrade (from 0.3.25 to 0.4.x). There may be some API changes as part of this upgrade. You can manually downgrade to the previous version using pip install, e.g.:

!pip install 'jax[tpu]==0.3.25' -f https://storage.googleapis.com/jax-releases/libtpu_releases.html -f https://storage.googleapis.com/jax-releases/jax_releases.html
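After downgrading (or on the default 0.4.x install), a quick sanity check can confirm the version and that JAX sees the TPU cores:

    import jax

    print(jax.__version__)  # the version you pinned, e.g. 0.3.25
    print(jax.devices())    # should list 8 TPU devices on a v2-8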

TensorFlow TPU initialization changes slightly

The arguments for TPU initialization in TensorFlow change slightly, now that TPU v2 notebooks connect to a local TPU.

You can direct TensorFlow to connect to the local TPU by setting tpu='local' in TPUClusterResolver. For example:

tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

Changes to:

tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
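For reference, a minimal end-to-end initialization sketch on the new runtime might look like the following (assuming the TF 2.x version preinstalled on the TPU v2 runtime; the toy Dense model is purely illustrative):

    import tensorflow as tf

    # On a TPU VM the TPU is local to the runtime, so no gRPC address is needed.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    print("TPU cores:", strategy.num_replicas_in_sync)  # 8 on a v2-8

    with strategy.scope():
        # Variables created here are replicated across the TPU cores.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])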
Ryukijano commented 4 months ago

Is there going to be support for TPU v3-8 VMs in the future? Also, does this mean the TPU v1 accelerators are not available at all in Colab?

sagelywizard commented 4 months ago

@Ryukijano Nothing to announce as far as new accelerators. As far as the old accelerator availability: yes, the TPU (deprecated) runtimes will no longer be available in the coming weeks. (A technical note: TPU (deprecated) was backed by v2-8 too, but they used the old-style TPU Node architecture. TPU v2 uses the same v2-8 accelerator, but it uses the new-style TPU VM architecture. They were not backed by TPU v1 chips.)

shakthiman commented 4 months ago

Will the TPU VM machines support TF v2.16.1?

JersonGB22 commented 4 months ago

@sagelywizard I've been trying to use the 'TPU v2' on Colab for several days, but I keep getting the following message: "Failed to assign a backend. There are no TPUs available. Would you like to use a runtime without an accelerator?"

sagelywizard commented 4 months ago

@shakthiman The TPU runtime currently has TF 2.15 installed. I'm not sure about the timeline for upgrading to 2.16.
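A quick way to check the preinstalled version from a cell:

    import tensorflow as tf
    print(tf.__version__)  # e.g. 2.15.x on the TPU v2 runtime at the time of writing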

sagelywizard commented 4 months ago

@JersonGB22 TPU availability is subject to our available resource capacity, which varies throughout the day. However, Pro and Pro+ subscribers get priority access. Subscribe to Pro or Pro+ if you'd like more consistent access to TPU runtimes.

sagelywizard commented 4 months ago

Migration status update: We're in the process of disabling the creation of new TPU (deprecated) notebooks. In the coming weeks, we'll migrate the remaining legacy TPU (deprecated) notebooks to TPU v2 runtimes.

BenKlee commented 4 months ago

@sagelywizard Is this only supported on tensorflow>=2.15.0? I am using 2.13.0 (and tested 2.14.0), and changing to TPUClusterResolver(tpu='local') results in the resolver not recognizing any TPUs (resolver.num_accelerators()['TPU'] == 0).

Emrhdgr commented 3 months ago

I am using TensorFlow 2.15.0 and Python 3.10.12 in a Google Colab cell. This:

    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)

runs, but this training command:

    !python /content/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path={TRAIN_FOLDER}/pipeline.config \
    --model_dir={TRAIN_FOLDER}/model/checkpoint/ \
    --num_train_steps=20000 \
    --num_eval_steps=1000 \
    --sample_1_of_n_eval_examples=10 \
    --use_tpu={True} \

fails. The relevant part of model_main_tf2.py:

    ...
    if FLAGS.use_tpu:
      # TPU is automatically inferred if tpu_name is None and
      # we are running under cloud ai-platform.
      resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
          tpu='local')
      tf.config.experimental_connect_to_cluster(resolver)
      tf.tpu.experimental.initialize_tpu_system(resolver)
      strategy = tf.distribute.experimental.TPUStrategy(resolver)
    elif FLAGS.num_workers > 1:
      strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    else:
      strategy = tf.compat.v2.distribute.MirroredStrategy()
    ...

I'm getting errors when training starts:

    Traceback (most recent call last):
      File "/content/models/research/object_detection/model_main_tf2.py", line 114, in <module>
        tf.compat.v1.app.run()
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/app.py", line 36, in run
        _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
      File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
        _run_main(main, args)
      File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
        sys.exit(main(argv))
      File "/content/models/research/object_detection/model_main_tf2.py", line 97, in main
        tf.tpu.experimental.initialize_tpu_system(resolver)
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py", line 72, in initialize_tpu_system
        return tpu_strategy_util.initialize_tpu_system_impl(
      File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/tpu/tpu_strategy_util.py", line 142, in initialize_tpu_system_impl
        raise errors.NotFoundError(
    tensorflow.python.framework.errors_impl.NotFoundError: TPUs not found in the cluster. Failed in initialization: No matching devices found for '/device:TPU_SYSTEM:0' [Op:inferencetpu_init_fn_4]

I'm using TPU v2 (multiple cores) and I can't start training because the script can't connect to the TPU.

h4ck4l1 commented 3 months ago

How do I upgrade TensorFlow to 2.16 on the TPU runtime? My code was tested locally on a 64 GB i9 CPU. I get some errors that point me to previous Keras versions.

sagelywizard commented 3 months ago

@BenKlee We don't explicitly support TensorFlow 2.13.x, though it may work. It's possible you're not installing TensorFlow correctly.

sagelywizard commented 3 months ago

@Emrhdgr Are you sure you're connected to a TPU runtime? Please file a separate GitHub issue.

agbruno-git commented 2 months ago

@sagelywizard I've been trying to use the 'TPU v2' on Colab for several days, but I keep getting the following message: "Failed to assign a backend. There are no TPUs available. Would you like to use a runtime without an accelerator?"

I am getting the same message.

sagelywizard commented 2 months ago

Hi @agbruno-git! That means there were no TPUs available at that time. GPU and TPU notebooks are subject to availability. Colab Pro and Pro+ subscribers get priority access to accelerators, so you can subscribe if you'd like more reliable access to accelerators such as TPUs.

sagelywizard commented 2 months ago

Hi folks! The TPU Node to TPU VM migration is complete, so we're closing out this ticket. If you have any new issues with TPUs, please file a new issue on GitHub or send feedback in the Colab UI (in the Help dropdown menu, under "Help > Send feedback").