ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray core] TPU support #25222

Closed JiahaoYao closed 2 years ago

JiahaoYao commented 2 years ago

Description

When we do ray.init(), it would be great if TPU devices could be detected and supported automatically.

I found that the current GPU instances are detected like this:

https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/worker.py?L544-555

More specifically, the GPUs are detected like this:

https://github.com/ray-project/ray/blob/e142bb3874bb14d76da4fbd2d3808595fb6265d6/python/ray/_private/utils.py?q=CUDA_VISIBLE_DEVICE#L266-L307
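For context, a minimal sketch of the kind of CUDA_VISIBLE_DEVICES parsing the second link performs (the helper name is mine, not Ray's, and the unset-vs-empty distinction is an assumption about the linked code):

import os

def get_cuda_visible_devices():
    # Hypothetical helper approximating Ray's CUDA_VISIBLE_DEVICES parsing.
    gpu_ids_str = os.environ.get("CUDA_VISIBLE_DEVICES")
    if gpu_ids_str is None:
        return None  # unset: every GPU on the machine is visible
    if gpu_ids_str == "":
        return []  # set but empty: no GPUs are visible
    return [int(i) for i in gpu_ids_str.split(",")]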

I wish Ray could also support TPUs. A possible example of detecting the TPU cores:

%tensorflow_version 2.x  # Colab magic to select TF 2.x
import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise RuntimeError('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

# Connect to the TPU cluster and initialize it before creating the strategy.
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.TPUStrategy(tpu)
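Once the TPU system is initialized, the core count can be read back for resource accounting. Assuming the snippet above has run:

num_tpu_cores = len(tf.config.list_logical_devices('TPU'))  # typically 8 on a Colab v2/v3 TPU
print('TPU cores:', num_tpu_cores)  # matches tpu_strategy.num_replicas_in_sync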

Use case

https://github.com/ray-project/ray/issues/22251

JiahaoYao commented 2 years ago

@scv119

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.

stale[bot] commented 2 years ago

Hi again! This issue is being closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen it or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!

jjyao commented 1 year ago

TPU support has been added to Ray.
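For anyone reading later, a rough sketch of how this is typically consumed once Ray detects the accelerator (a sketch, not official docs; it assumes Ray registers the device under the resource key "TPU", as the status table in the next comment suggests):

import ray

ray.init()  # on a TPU VM, Ray should detect and register a "TPU" resource
print(ray.cluster_resources())  # expect an entry like 'TPU': 4.0

@ray.remote(resources={"TPU": 1})
def uses_one_tpu_chip():
    # Runs on a node with a free unit of the "TPU" resource.
    ...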

steveepreston commented 1 month ago

Hey @jjyao, can you say how we should use it?

I'm running a notebook in a TPU VM environment, and Ray itself shows CPU: 0.0/96.0 - TPU 0.0/4.0 in the trial status table,

but when I set this:

tune.with_resources(train_model, resources={"cpu": 96, "tpu": 4}),

it throws:

Error: No available node types can fulfill resource request {'CPU': 96.0, 'tpu': 4.0}. Add suitable node types to this cluster to resolve this issue.
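(Note: Ray resource names are case-sensitive, and the error above shows 'CPU' normalized to uppercase while 'tpu' is passed through as an unknown custom resource. A likely fix, assuming the accelerator is registered under the uppercase key TPU shown in the status table, is:)

tune.with_resources(train_model, resources={"cpu": 96, "TPU": 4}),
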
steveepreston commented 1 month ago

Also, please tell me: do we need this block?

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

If yes, where should it be placed? And where should strategy be used: should tune.Tuner be wrapped with strategy, or should model.fit inside train_model(config) be wrapped with strategy?
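For what it's worth, the usual TensorFlow pattern is to run that block inside the per-trial training function and put model construction and compile() under strategy.scope(), with model.fit() called as normal afterwards. A hedged sketch (build_model and make_dataset are hypothetical helpers, not Ray or TF APIs):

import tensorflow as tf

def train_model(config):
    # Per-trial TPU setup: resolve, connect, initialize, then build the strategy.
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.TPUStrategy(tpu)
    except ValueError:
        strategy = tf.distribute.get_strategy()  # CPU/GPU fallback

    # Variable creation (model + optimizer) must happen under the scope;
    # model.fit() itself is called outside the scope as usual.
    with strategy.scope():
        model = build_model(config)  # hypothetical model-building helper
        model.compile(optimizer="adam", loss="mse")

    model.fit(make_dataset(config), epochs=config["epochs"])  # hypothetical data helper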