google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Never-ending "The TPU worker may not be ready (still scheduling)" warning #38

Closed: danyaljj closed this issue 4 years ago

danyaljj commented 4 years ago

I'm trying to connect to my TPU instance but keep getting this warning (for the past ~10 hours). I'm not sure whether it's a T5-related issue or something to do with how I set up the TPU. Any thoughts on what could be wrong?

(env37) danielk0014-2:text-to-text-transfer-transformer danielk$ t5_mesh_transformer  \
>   --tpu="daniels-tpu" \
>   --gcp_project="testing-out-tpus" \
>   --tpu_zone="europe-west4-a" \
>   --t5_tfds_data_dir="gs://t5-files" \
>   --gin_file="dataset.gin" \
>   --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
>   --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
>   --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
>   --model_dir="gs://t5-files/models" \
>   --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin" 
WARNING:tensorflow:From /Users/danielk/ideaProjects/text-to-text-transfer-transformer/env37/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:model_type=bitransformer
I0108 00:22:22.273789 4582849984 utils.py:1625] model_type=bitransformer
INFO:tensorflow:mode=train
I0108 00:22:22.273949 4582849984 utils.py:1626] mode=train
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0108 00:22:22.274010 4582849984 utils.py:1627] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=2048
I0108 00:22:22.274062 4582849984 utils.py:1628] batch_size=2048
INFO:tensorflow:train_steps=1000000000
I0108 00:22:22.274109 4582849984 utils.py:1629] train_steps=1000000000
INFO:tensorflow:mesh_shape=Shape[batch=8]
I0108 00:22:22.274163 4582849984 utils.py:1630] mesh_shape=Shape[batch=8]
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I0108 00:22:22.274212 4582849984 utils.py:1631] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I0108 00:22:22.277817 4582849984 utils.py:1646] Building TPUConfig with tpu_job_name=None
I0108 00:22:22.280086 4582849984 discovery.py:271] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0108 00:22:22.579149 4582849984 discovery.py:867] URL being requested: GET https://tpu.googleapis.com/v1/projects/testing-out-tpus/locations/europe-west4-a/nodes/daniels-tpu?alt=json
I0108 00:22:22.579305 4582849984 transport.py:157] Attempting refresh to obtain initial access_token
I0108 00:22:22.608117 4582849984 client.py:777] Refreshing access_token
I0108 00:22:23.236666 4582849984 discovery.py:271] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest
I0108 00:22:23.555082 4582849984 discovery.py:867] URL being requested: GET https://tpu.googleapis.com/v1/projects/testing-out-tpus/locations/europe-west4-a/nodes/daniels-tpu?alt=json
I0108 00:22:23.555221 4582849984 transport.py:157] Attempting refresh to obtain initial access_token
I0108 00:22:23.584164 4582849984 client.py:777] Refreshing access_token
INFO:tensorflow:Using config: {'_model_dir': 'gs://t5-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.240.1.2:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x160559410>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.240.1.2:8470', '_evaluation_master': 'grpc://10.240.1.2:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x15a9b1cd0>}
I0108 00:22:24.162949 4582849984 estimator.py:212] Using config: {'_model_dir': 'gs://t5-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.240.1.2:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x160559410>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.240.1.2:8470', '_evaluation_master': 'grpc://10.240.1.2:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x15a9b1cd0>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0108 00:22:24.163352 4582849984 tpu_context.py:220] _TPUContext: eval_on_tpu True
INFO:tensorflow:Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
I0108 00:22:24.408021 4582849984 tpu_system_metadata.py:78] Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2020-01-08 00:22:24.409535: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.240.1.2:8470).
W0108 00:27:24.414439 4582849984 tpu_system_metadata.py:97] Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.240.1.2:8470).
WARNING:tensorflow:Retrying (1/288).
W0108 00:27:24.414757 4582849984 tpu_system_metadata.py:98] Retrying (1/288).
INFO:tensorflow:Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
I0108 00:27:24.414932 4582849984 tpu_system_metadata.py:78] Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2020-01-08 00:27:24.415889: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.240.1.2:8470).
W0108 00:32:24.421372 4582849984 tpu_system_metadata.py:97] Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.240.1.2:8470).
[... the same five-minute query/timeout cycle repeats verbatim, with only the timestamps and the retry counter changing: Retrying (2/288), (3/288), ... up through (7/288) at the point this log was captured. At 288 retries of 5 minutes each, the retry loop runs for up to 24 hours before giving up. ...]
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://10.240.1.2:8470).

For completeness, here is how I launched my TPU:

ctpu up --name=daniels-tpu --zone=europe-west4-a --tpu-size=v3-8 --tf-version=1.15  --disk-size-gb=2000 
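One quick sanity check (a minimal sketch; the host and port are just the values from the log above) is to test whether the gRPC endpoint the trainer keeps retrying is reachable at the TCP level at all:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The master address from the log. Note that 10.240.x.x is an internal GCP
# address, so this check is only meaningful when run from a VM inside the
# same project/network, not from a laptop outside GCP.
reachable = can_reach("10.240.1.2", 8470)
```

If this returns False from a Compute Engine VM in the same project, the TPU is likely not up (or the address is stale); if it returns False only from a local machine, the problem may simply be that the client is outside the GCP network.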
craffel commented 4 years ago

Did the TPU launch successfully and is it in a healthy state? You can try running

gcloud compute tpus list --zone=europe-west4-a

and checking the status of the TPU.
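To script that check, one option (a sketch, not an official tool; the field names follow the Cloud TPU API's Node resource, and the node name/zone are the ones from this thread) is to parse the JSON output of `gcloud compute tpus describe` and look at the `state` and `health` fields:

```python
import json
import subprocess

def tpu_is_ready(node):
    """True when the node has finished scheduling and reports healthy.

    `node` is the parsed JSON from `gcloud compute tpus describe`.
    `health` may be absent while the node is still starting up.
    """
    return node.get("state") == "READY" and node.get("health") == "HEALTHY"

def describe_tpu(name, zone):
    # Shells out to gcloud; assumes the Cloud SDK is installed and authenticated.
    out = subprocess.check_output(
        ["gcloud", "compute", "tpus", "describe", name,
         "--zone", zone, "--format", "json"])
    return json.loads(out)

# Example: tpu_is_ready(describe_tpu("daniels-tpu", "europe-west4-a"))
```

Any state other than READY (e.g. CREATING, PREEMPTED) would explain the endless retry loop in the log above.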

craffel commented 4 years ago

See also https://cloud.google.com/tpu/docs/troubleshooting#trouble-connecting