googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0
2.17k stars 701 forks

"tf.distribute.cluster_resolver.TPUClusterResolver()" is not working #4699

Closed PlutoSejin closed 2 weeks ago

PlutoSejin commented 1 month ago

Three months ago, I built a model using the following notebook. After the runtime changed from TPU (deprecated) to TPU v2, I get an error at the tf.distribute.cluster_resolver.TPUClusterResolver() call, which no longer returns the TPU address. The specific code is below.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

The output of the above code is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      5 try:
----> 6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
      7   print('Running on TPU ', tpu.cluster_spec().as_dict())

2 frames
/usr/local/lib/python3.10/dist-packages/tensorflow/python/distribute/cluster_resolver/tpu/tpu_cluster_resolver.py in __init__(self, tpu, zone, project, job_name, coordinator_name, coordinator_address, credentials, service, discovery_url)
    234       # Default Cloud environment
--> 235       self._cloud_tpu_client = client.Client(
    236           tpu=tpu,

/usr/local/lib/python3.10/dist-packages/cloud_tpu_client/client.py in __init__(self, tpu, zone, project, credentials, service, discovery_url)
    138     if tpu is None:
--> 139       raise ValueError('Please provide a TPU Name to connect to.')
    140 

ValueError: Please provide a TPU Name to connect to.

During handling of the above exception, another exception occurred:

BaseException                             Traceback (most recent call last)
<ipython-input-35-bc480ed05132> in <cell line: 5>()
      9   print('Running on TPU:', TPU_ADDRESS)
     10 except ValueError:
---> 11   raise BaseException(
     12     'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
     13 #tf.config.experimental_connect_to_host(TPU_ADDRESS)

BaseException: ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!
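(Side note on why two tracebacks appear: raising inside an except block implicitly chains the original exception. A TensorFlow-free sketch of that chaining, with a hypothetical resolve_tpu helper standing in for TPUClusterResolver:)

```python
def resolve_tpu(tpu_name=None):
    # Hypothetical stand-in for TPUClusterResolver: raises the same
    # ValueError the real resolver raises when no TPU name is found.
    if tpu_name is None:
        raise ValueError('Please provide a TPU Name to connect to.')
    return tpu_name

try:
    try:
        resolve_tpu()
    except ValueError:
        # Raising inside an except block implicitly chains the original
        # ValueError onto the new exception (its __context__ attribute),
        # which is why the notebook prints two tracebacks.
        raise RuntimeError('Not connected to a TPU runtime')
except RuntimeError as err:
    assert isinstance(err.__context__, ValueError)
```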

I also tried passing tpu='local' to the tf.distribute.cluster_resolver.TPUClusterResolver() call.

try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  print('Running on TPU ', tpu.cluster_spec().as_dict())
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

But the output is below:

Running on TPU  {}
Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-bbb33ea8b6de> in <cell line: 18>()
     16 auth.authenticate_user()
     17 tf.enable_eager_execution()
---> 18 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     19 tensorflow_gcs_config.configure_gcs_from_colab_auth()
     20 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

I can't get the TPU address. Also, when I use tpu='local', the return value of tpu.cluster_spec().as_dict() is empty, and accessing tpu.cluster_spec().as_dict()['worker'] raises the error below.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-38-a02f5f80852d> in <cell line: 5>()
      5 try:
      6   tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
----> 7   print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
      8   TPU_ADDRESS = tpu.get_master()
      9   print('Running on TPU:', TPU_ADDRESS)

KeyError: 'worker'

Do I have to use the Cloud TPU API to get the TPU address, or is there another way to get it?
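(For now I can work around the KeyError with a defensive lookup, sketched here with a plain dict standing in for tpu.cluster_spec().as_dict():)

```python
spec = {}  # what tpu.cluster_spec().as_dict() returned with tpu='local'

# dict.get with a default avoids the KeyError raised by spec['worker']
workers = spec.get('worker', [])
if not workers:
    print('No remote TPU workers in the cluster spec')
```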

mayankmalik-colab commented 1 month ago

Similar issue - #4686. Check the comments from one of our team members.

PlutoSejin commented 1 month ago

I already tried installing the wheel mentioned in #4686:

"!pip install https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/tf-2.15.0/tensorflow-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl" 

in my notebook, but it still doesn't work.

EvanWiederspan commented 1 month ago

Tracking internally as b/353976964

sagelywizard commented 1 month ago

The "TPU v2" runtimes are no longer on the "TPU Node" architecture. This means the notebook VM has direct access to the TPU, rather than the TPU residing on a remote worker machine. You're not seeing any workers because there's no worker VM on the new TPU VM architecture.

You can see the TPUs attached to your VM with tpu.num_accelerators().

sagelywizard commented 1 month ago

You can find more information about the TPU Node and TPU VM architecture differences here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_architectures

PlutoSejin commented 1 month ago

I checked tpu.num_accelerators() and it returns 8.

Then how do I get the TPU device name?

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
  tf.config.experimental_connect_to_cluster(tpu)
  #tf.tpu.experimental.initialize_tpu_system(tpu)
  TPU_ADDRESS = tpu.get_master()
  print('Running on TPU:', TPU_ADDRESS)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tf.config.experimental_connect_to_host(TPU_ADDRESS)
tensorflow_gcs_config.configure_gcs_from_colab_auth()

I used the code above, but the result and error are below.

Running on TPU: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-98ea436009aa> in <cell line: 23>()
     21 #tf.tpu.experimental.initialize_tpu_system(tpu)
     22 #strategy = tf.distribute.experimental.TPUStrategy(tpu)
---> 23 tf.config.experimental_connect_to_host(TPU_ADDRESS)
     24 #with strategy.scope():
     25 tensorflow_gcs_config.configure_gcs_from_colab_auth()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/remote.py in connect_to_remote_host(remote_host, job_name)
     65   """
     66   if not remote_host:
---> 67     raise ValueError("Must provide at least one remote_host")
     68 
     69   remote_hosts = nest.flatten(remote_host)

ValueError: Must provide at least one remote_host

tpu.get_master() returns nothing. I also saw another issue that couldn't proceed because of a TensorFlow version mismatch, but my tensorflow, tensorflow-gcs-config, and tensorflow-text versions are all 2.15.0.

sagelywizard commented 1 month ago

Then how to get tpu device name?

I think you're referring to the TPU network address. The new TPUs on the new TPU VMs are not attached to the network, so they don't have a network address. That's why tpu.get_master() returns '' (i.e. there's no network address).

Can you delete tf.config.experimental_connect_to_host(TPU_ADDRESS) and all the references to TPU_ADDRESS?
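If you want the notebook to keep working on both architectures, one option is to guard on whether get_master() returned anything. A sketch, with a plain string standing in for the resolver's return value:

```python
def maybe_connect(tpu_address):
    # On the old TPU Node runtime, get_master() returned an address such as
    # 'grpc://10.0.0.2:8470'; on a TPU VM it returns '' because the TPU is
    # attached directly to the VM and has no network address.
    if tpu_address:
        return 'connect to remote host ' + tpu_address
    return 'TPU is local; skip experimental_connect_to_host'

# On a TPU VM runtime:
print(maybe_connect(''))
```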

PlutoSejin commented 1 month ago

I deleted all the references to TPU_ADDRESS; the resulting code is below.

import tensorflow_gcs_config
import tensorflow.compat.v1 as tf
TPU_TOPOLOGY = "2x2"
try:
  tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')  # TPU detection
  tf.config.experimental_connect_to_cluster(tpu)
  tf.tpu.experimental.initialize_tpu_system(tpu)
except ValueError:
  raise BaseException(
    'ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')

tensorflow_gcs_config.configure_gcs_from_colab_auth()

Then the error below occurred.

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-4-01c48cbb3b56> in <cell line: 22>()
     21 strategy = tf.distribute.TPUStrategy(tpu)
     22 with strategy.scope():
---> 23   tensorflow_gcs_config.configure_gcs_from_colab_auth()
     24 
     25 tf.disable_v2_behavior()

11 frames
/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs_from_colab_auth(device)
    128   with open(adc_filename) as f:
    129     data = json.load(f)
--> 130   return configure_gcs(credentials=data, device=device)
    131 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure_gcs(credentials, block_cache, device)
    116   if device:
    117     with ops.device(device):
--> 118       return configure(credentials, block_cache)
    119   return configure(credentials, block_cache)
    120 

/usr/local/lib/python3.10/dist-packages/tensorflow_gcs_config/__init__.py in configure(credentials, block_cache)
    100       if isinstance(credentials, dict):
    101         credentials = json.dumps(credentials)
--> 102       creds = gcs_configure_credentials(credentials)
    103     else:
    104       creds = tf.constant(0)

<string> in gcs_configure_credentials(json, name)

<string> in gcs_configure_credentials_eager_fallback(json, name, ctx)

/usr/local/lib/python3.10/dist-packages/tensorflow/python/profiler/trace.py in wrapped(*args, **kwargs)
    181         with Trace(trace_name, **trace_kwargs):
    182           return func(*args, **kwargs)
--> 183       return func(*args, **kwargs)
    184 
    185     return wrapped

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
    694   # TODO(b/142518781): Fix all call-sites and remove redundant arg
    695   preferred_dtype = preferred_dtype or dtype_hint
--> 696   return tensor_conversion_registry.convert(
    697       value, dtype, name, as_ref, preferred_dtype, accepted_result_types
    698   )

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py in convert(value, dtype, name, as_ref, preferred_dtype, accepted_result_types)
    232 
    233     if ret is None:
--> 234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    235 
    236     if ret is NotImplemented:

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    333                                          as_ref=False):
    334   _ = as_ref
--> 335   return constant(v, dtype=dtype, name=name)
    336 
    337 # Register the conversion function for the "unconvertible" types

/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/weak_tensor_ops.py in wrapper(*args, **kwargs)
    140   def wrapper(*args, **kwargs):
    141     if not ops.is_auto_dtype_conversion_enabled():
--> 142       return op(*args, **kwargs)
    143     bound_arguments = signature.bind(*args, **kwargs)
    144     bound_arguments.apply_defaults()

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)
    273 

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    282       with trace.Trace("tf.constant"):
    283         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 284     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285 
    286   const_tensor = ops._create_graph_constant(  # pylint: disable=protected-access

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    294 ) -> ops._EagerTensorBase:
    295   """Creates a constant on the current device."""
--> 296   t = convert_to_eager_tensor(value, ctx, dtype)
    297   if shape is None:
    298     return t

/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    101       dtype = dtypes.as_dtype(dtype).as_datatype_enum
    102   ctx.ensure_initialized()
--> 103   return ops.EagerTensor(value, ctx.device_name, dtype)
    104 
    105 

InvalidArgumentError: /job:worker/replica:0/task:0/device:CPU:0 unknown device.

The error occurs at tensorflow_gcs_config.configure_gcs_from_colab_auth().

PlutoSejin commented 3 weeks ago

Isn't there any solution? tensorflow_gcs_config.configure_gcs_from_colab_auth() gives the same error even though I use the TPU without TPU_ADDRESS.

sagelywizard commented 3 weeks ago

We don't own the tensorflow_gcs_config library, so I'd recommend contacting the library owners (I believe that's the TensorFlow team) and asking them for help.

But from looking at the code, it looks like there's a default kwarg to configure_gcs_from_colab_auth, device="/job:worker/replica:0/task:0/device:CPU:0". That looks incorrect to me, and I suspect you'll want to pass in a different device name. I'm not sure whether it wants a physical or logical device, but you can list the physical devices on the system with tf.config.list_physical_devices() and the logical devices with tf.config.list_logical_devices().

Hope that's helpful!

PlutoSejin commented 2 weeks ago

Thank you for your comment. I solved it by referring to your advice.