Closed — sagelywizard closed this issue 2 months ago
Is there going to be support for TPU v3-8 VMs in the future? Also, does this mean the TPU v1 accelerators are not available at all in Colab?
@Ryukijano Nothing to announce as far as new accelerators. As for old accelerator availability: yes, the "TPU (deprecated)" runtimes will no longer be available in the coming weeks. (A technical note: "TPU (deprecated)" was backed by v2-8 chips too, but used the old-style TPU Node architecture. "TPU v2" uses the same v2-8 accelerator, but with the new-style TPU VM architecture. Neither was backed by TPU v1 chips.)
Will the TPU VM machines support TF v2.16.1?
@sagelywizard I've been trying to use the 'TPU v2' on Colab for several days, but I keep getting the following message:
> Failed to assign a backend: There are no TPUs available. Would you like to use a runtime without an accelerator?
@shakthiman The TPU runtime currently has TF 2.15 installed. I'm not sure about the timeline on upgrading to 2.16
@JersonGB22 TPU availability is subject to our available resource capacity, which varies throughout the day. However, Pro and Pro+ subscribers get priority access. Subscribe to Pro or Pro+ if you'd like more consistent access to TPU runtimes.
Migration status update: We're in the process of disabling the creation of new "TPU (deprecated)" notebooks. In the coming weeks, we'll migrate the remaining legacy "TPU (deprecated)" notebooks to "TPU v2" runtimes.
@sagelywizard Is this only supported on tensorflow>=2.15.0? I am using 2.13.0 (and tested 2.14.0), and changing to `TPUClusterResolver(tpu='local')` results in the resolver not recognizing any TPUs (`resolver.num_accelerators()['TPU'] == 0`).
I am using TensorFlow 2.15.0 and Python 3.10.12 in a Google Colab cell. This runs without errors:

```python
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
```

but this training command fails:

```shell
!python /content/models/research/object_detection/model_main_tf2.py \
  --pipeline_config_path={TRAIN_FOLDER}/pipeline.config \
  --model_dir={TRAIN_FOLDER}/model/checkpoint/ \
  --num_train_steps=20000 \
  --num_eval_steps=1000 \
  --sample_1_of_n_eval_examples=10 \
  --use_tpu={True}
```

The relevant part of `model_main_tf2.py`:

```python
...
if FLAGS.use_tpu:
  # TPU is automatically inferred if tpu_name is None and
  # we are running under cloud ai-platform.
  resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
  tf.config.experimental_connect_to_cluster(resolver)
  tf.tpu.experimental.initialize_tpu_system(resolver)
  strategy = tf.distribute.experimental.TPUStrategy(resolver)
elif FLAGS.num_workers > 1:
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
else:
  strategy = tf.compat.v2.distribute.MirroredStrategy()
...
```
I'm getting this error when training starts:

```
Traceback (most recent call last):
  File "/content/models/research/object_detection/model_main_tf2.py", line 114, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/models/research/object_detection/model_main_tf2.py", line 97, in main
    tf.tpu.experimental.initialize_tpu_system(resolver)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py", line 72, in initialize_tpu_system
    return tpu_strategy_util.initialize_tpu_system_impl(
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/tpu/tpu_strategy_util.py", line 142, in initialize_tpu_system_impl
    raise errors.NotFoundError(
tensorflow.python.framework.errors_impl.NotFoundError: TPUs not found in the cluster. Failed in initialization: No matching devices found for '/device:TPU_SYSTEM:0' [Op:__inference_tpu_init_fn_4]
```
I'm using multiple TPU v2 notebooks and I can't start training because I can't connect.
How do I upgrade TensorFlow to 2.16 on the TPU runtime? My code is tested locally on a 64 GB i9 CPU machine. I get errors that point me back to previous Keras versions.
@BenKlee We don't explicitly support TensorFlow 2.13.x, though it may work. It's possible you're not installing TensorFlow correctly.
@Emrhdgr Are you sure you're connected to a TPU runtime? Please file a separate GitHub issue.
> @sagelywizard I've been trying to use the 'TPU v2' on Colab for several days, but I keep getting the following message:
> Failed to assign a backend: There are no TPUs available. Would you like to use a runtime without an accelerator?

I am getting the same message.
Hi @agbruno-git! That means there were no TPUs available at that time. GPU and TPU notebooks are subject to availability. Colab Pro and Pro+ subscribers get priority access to accelerators, so you can subscribe if you'd like more reliable access to accelerators such as TPUs.
Hi folks! The TPU Node to TPU VM migration has completed, so we're closing out this ticket. If you have any new issues with TPUs, please file a new issue on GitHub or send feedback in the Colab UI (via the Help menu, under "Send feedback").
We recently released TPU VM accelerators for Colab (backed by TPU v2-8)! This deprecates the legacy TPU Node accelerators (see this documentation for the technical details). This improves usability, reliability, and debuggability, as well as enables support for modern JAX on TPU! Between April and June 2024, we'll begin migrating existing legacy TPU notebooks to modern TPU VM machines. This may require some action from Colab TPU notebook owners.
Limited capacity
As with legacy TPUs, we have limited free tier capacity. Use of "TPU v2" accelerators is subject to availability.
Migration plan
We'll gradually switch legacy TPU notebooks to be backed by TPU VM runtimes. No user action is required to initiate the migration, but some action may be required to update legacy TPU notebooks.
As we migrate legacy TPU notebooks, we'll update this Github ticket with the current migration status.
You can also manually migrate your legacy TPU notebooks by changing the accelerator type from "TPU (deprecated)" to "TPU v2".
What changes might be necessary to my existing TPU notebook?
Default installed packages
TPU v2 runtimes have a different set of default packages than our other runtimes: it is smaller and focused on deep learning/AI applications. We've verified that the most commonly installed packages are still included in the new TPU v2 runtime, but some uncommon packages may have been removed. You may need to install these manually using `pip install`.
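If a package you relied on is missing from the new runtime, a small guard like the following can reinstall it on demand. This is an illustrative sketch, not part of Colab: `ensure_installed` is a hypothetical helper name, and the package you pass is whatever your notebook actually needs.

```python
import importlib.util
import subprocess
import sys

def ensure_installed(import_name, pip_name=None):
    """Install a package with pip if it isn't already importable.

    `pip_name` covers cases where the PyPI name differs from the
    import name (e.g. pip package `pillow` for `import PIL`).
    """
    if importlib.util.find_spec(import_name) is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or import_name]
        )

# Already-importable modules are a no-op, so this is safe to re-run.
ensure_installed("json")  # stdlib module: found, nothing is installed
```

Running it at the top of a migrated notebook keeps the notebook working whether or not the runtime still ships the package by default.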
JAX will undergo a significant upgrade
JAX will undergo a very significant upgrade (0.3.25 to 0.4.x). There may be some API changes during this upgrade. You can manually downgrade to the previous version using `pip install`.
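A downgrade pin would look something like this; the 0.3.25 version comes from the announcement above, but the matching `jaxlib` pin (and any TPU-specific `libtpu` wheel) is an assumption you should verify against the JAX release notes before relying on it:

```shell
pip install jax==0.3.25 jaxlib==0.3.25
```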
TensorFlow TPU initialization changes slightly
The arguments for TPU instantiation in TensorFlow change slightly, now that TPU v2 notebooks connect to a local TPU. You can direct TensorFlow to connect to the local runtime by passing `tpu='local'` to `TPUClusterResolver`.
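Concretely, the change is in the `tpu=` argument of `TPUClusterResolver`. This is a hedged sketch assembled from the snippets in this thread; the exact old-style form varied by notebook (many passed the remote gRPC address that legacy Colab exposed in the `COLAB_TPU_ADDR` environment variable), and it requires a TPU runtime to actually run:

```python
import tensorflow as tf

# Old-style TPU Node (deprecated): the resolver pointed at a remote TPU
# worker, e.g. via the address Colab exposed in COLAB_TPU_ADDR:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
#     tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

# New-style TPU VM: the TPU is attached to the runtime itself.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
```

The connect/initialize/strategy calls after the resolver are unchanged from the legacy flow; only the resolver's `tpu=` argument needs to move to `'local'`.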