tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0
15.5k stars 3.49k forks

t2t_trainer Failed to connect to the Tensorflow master #920

Open eyaler opened 6 years ago

eyaler commented 6 years ago

Description

I am trying to follow https://github.com/tensorflow/tensor2tensor/blob/master/docs/cloud_tpu.md and get "Failed to connect to the Tensorflow master" when running t2t-trainer.

Environment information

n1-standard-2 vm

OS: Linux eyal-vm1 4.9.0-6-amd64 #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 GNU/Linux

$ pip freeze | grep tensor
tensor2tensor==1.6.6
tensorboard==1.9.0
tensorflow==1.9.0rc2

$ python3 -V
Python 3.5.3

For bugs: reproduction and error logs

Steps to reproduce:

t2t-trainer --model=transformer --hparams_set=transformer_tpu --problem=translate_ende_wmt8k --train_steps=10 --eval_steps=10 --local_eval_frequency=10 --data_dir=$DATA_DIR --output_dir=$OUT_DIR --cloud_tpu --cloud_delete_on_done

Error logs:

INFO:tensorflow:Running on Cloud TPU
Will delete VM and TPU instance on done.
Confirm (Y/n)? > Y
Listed 0 items.
INFO:tensorflow:VM eyaler-vm already exists, reusing.
Creating TPU instance eyaler-tpu
Confirm (Y/n)? > Y
Listed 0 items.
Waiting for operation [projects/vae-st/locations/us-central1-f/operations/operation-1530889609094-570560357271f-91684d02-e0181ff4] to complete...done.
Created tpu [eyaler-tpu].
INFO:tensorflow:VM (Name, IP): eyaler-vm, 104.197.107.1
INFO:tensorflow:TPU (Name, IP): eyaler-tpu, 10.240.112.2
INFO:tensorflow:To delete the VM, run: gcloud compute instances delete eyaler-vm --quiet
INFO:tensorflow:To delete the TPU instance, run: gcloud beta compute tpus delete eyaler-tpu --quiet
WARNING: The public SSH key file for gcloud does not exist.
WARNING: The private SSH key file for gcloud does not exist.
WARNING: You do not have an SSH key for gcloud.
WARNING: SSH keygen will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): INFO:tensorflow:Set up port forwarding. Local ports: {'tpu_profile': 34945, 'tpu': 38199}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7fa88485db70>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_train_distribute': None, '_service': None, '_model_dir': 'gs://vae-st-storage/t2t/training/transformer_v1', '_device_fn': None, '_save_checkpoints_steps': 100, '_num_worker_replicas': 1, '_is_chief': True, '_tf_random_seed': None, '_master': 'grpc://localhost:38199', 'use_tpu': True, '_task_type': 'worker', '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, computation_shape=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None), '_keep_checkpoint_max': 20, '_log_step_count_steps': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': None, '_save_summary_steps': 100, '_evaluation_master': 'grpc://localhost:38199', '_global_id_in_cluster': 0, '_num_ps_replicas': 0, '_cluster': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa87d1b7710>, '_task_id': 0}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 600 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Querying Tensorflow master (grpc://localhost:38199) for TPU system metadata.
2018-07-06 15:11:42.090327: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://localhost:38199).
WARNING:tensorflow:Retrying (1/120).
INFO:tensorflow:Querying Tensorflow master (grpc://localhost:38199) for TPU system metadata.
2018-07-06 15:12:42.092534: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling) or the Tensorflow master address is incorrect: got (grpc://localhost:38199).
WARNING:tensorflow:Retrying (2/120).
INFO:tensorflow:Querying Tensorflow master (grpc://localhost:38199) for TPU system metadata.
2018-07-06 15:13:42.095482: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
...
[continues like this]

eyaler commented 6 years ago

I think the problem was that the script interfered with the SSH key passphrase prompt. Once I ran the following command manually and created a passphrase, the issue was solved:

gcloud compute ssh $USER-vm
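
For anyone hitting the same thing, here is a rough pre-flight check that could be run before t2t-trainer. It is only a sketch: it assumes gcloud keeps its SSH key at ~/.ssh/google_compute_engine (the usual default, but not confirmed in this thread), and the helper name gcloud_key_ready is made up for illustration. It does not call gcloud itself; it only reports whether the key pair already exists.

```shell
#!/bin/sh
# Assumed default location of the gcloud SSH key; override KEY_FILE if yours differs.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/google_compute_engine}"

# Hypothetical helper: succeeds only if both the private and public key files exist.
gcloud_key_ready() {
  [ -f "$1" ] && [ -f "$1.pub" ]
}

if gcloud_key_ready "$KEY_FILE"; then
  echo "gcloud SSH key found at $KEY_FILE."
else
  # Without a key, the keygen passphrase prompt appears in the middle of the
  # t2t-trainer output (see "Enter passphrase" in the log above).
  echo "No gcloud SSH key at $KEY_FILE."
  echo "Run this once interactively, then retry t2t-trainer:"
  echo "  gcloud compute ssh $USER-vm"
fi
```

If the check reports a missing key, running the gcloud compute ssh command by hand lets the passphrase prompt complete normally before the trainer sets up port forwarding.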