tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

TPU failing to reinitialize after eval #1202

Open mikeymezher opened 5 years ago

mikeymezher commented 5 years ago

TPUs running tensorflow 1.11.0 are failing to reinitialize after evaluation during the "continuous_train_and_eval" schedule. I'm not sure whether this is a bug in Tensor2Tensor or in TF's scheduling for TPUs.

This problem does not exist with TPUs running tensorflow 1.9.0.

I'm still able to train, but I have to set the schedule to "train" and avoid evaluation, as in the sketch below.
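
For reference, the workaround amounts to running t2t-trainer with the training-only schedule, along the lines of the following sketch (every flag value here is a placeholder for your own setup, not taken from this run):

t2t-trainer \
  --use_tpu \
  --master=$TPU_GRPC_URL \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --problem=$PROBLEM \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --schedule=train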

Environment information

tensor2tensor==1.10.0
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-metadata==0.9.0
tensorflow-serving-api==1.11.0
tensorflow-transform==0.9.0

Trace:

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-05-18:14:34
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (30) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (30) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2018-11-05-18:15:24
INFO:tensorflow:Saving dict for global step 1000: global_step = 1000, loss = 5.689253, targets/metrics-routes_packed/accuracy = 0.14732198, targets/metrics-routes_packed/accuracy_per_sequence = 0.0, targets/metrics-routes_packed/accuracy_top5 = 0.31858853, targets/metrics-routes_packed/neg_log_perplexity = -5.773135, targets_position/metrics-routes_packed/accuracy = 0.019026693, targets_position/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_position/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_position/metrics-routes_packed/neg_log_perplexity = -12.372437, targets_segmentation/metrics-routes_packed/accuracy = 0.019026693, targets_segmentation/metrics-routes_packed/accuracy_per_sequence = 0.0, targets_segmentation/metrics-routes_packed/accuracy_top5 = 0.041145835, targets_segmentation/metrics-routes_packed/neg_log_perplexity = -12.372437
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: gs://besc-route-prediction/tensor2tensor/train_variableGHR/tpu/model_kinSparse_singleShard/model.ckpt-1000
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7897573, step = 1000 (88.605 sec)
INFO:tensorflow:global_step/sec: 1.1286
INFO:tensorflow:examples/sec: 9.02881
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: The TPU system has not been initialized.
  [[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompile[NumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
  [[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recv[client_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge1039...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]()]]
INFO:tensorflow:Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 33, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 379, in main
    execute_schedule(exp)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 341, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 391, in continuous_train_and_eval
    self._eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: The TPU system has not been initialized.
  [[{{node TPUReplicate/_compile/_18113138105308393941/_194}} = TPUCompile[NumDynamicShapes=0, Tguaranteed_constants=[], function=cluster_14266836224844131312[], metadata="\n\006\010...7\323\224M", num_computations=1, _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]
  [[{{node training/transformer/body/decoder/layer_3/self_attention/layer_prepostprocess/layer_norm/layer_norm_scale/Adafactor/read/_67_G943}} = _Recv[client_terminated=false, recv_device="/job:tpu_worker/replica:0/task:0/device:TPU:0", send_device="/job:tpu_worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-8638422706644811085, tensor_name="edge1039...r/read/_67", tensor_type=DT_FLOAT, _device="/job:tpu_worker/replica:0/task:0/device:TPU:0"]()]]

afrozenator commented 5 years ago

We see this error internally as well -- best to work around it for now as you've done; we will try to get to the bottom of this.


BogdanDidenko commented 5 years ago

Same problem.

mikeymezher commented 5 years ago

This still seems to be a problem in T2T 1.12.0. Is this the case internally as well? @afrozenator

edumotya commented 5 years ago

Same here: it is able to train, but it fails to reinitialize training after evaluation. Trace:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Initialized TPU in 7 seconds
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Initialized dataset iterators in 0 seconds
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Stop infeed thread controller
INFO:tensorflow:Shutting down InfeedController thread.
INFO:tensorflow:InfeedController received shutdown signal, stopping.
INFO:tensorflow:Infeed thread finished, shutting down.
INFO:tensorflow:infeed marked as finished
INFO:tensorflow:Stop output thread controller
INFO:tensorflow:Shutting down OutfeedController thread.
INFO:tensorflow:OutfeedController received shutdown signal, stopping.
INFO:tensorflow:Outfeed thread finished, shutting down.
INFO:tensorflow:outfeed marked as finished
INFO:tensorflow:Shutdown TPU system.
INFO:tensorflow:Finished evaluation at 2019-02-10-19:23:16
INFO:tensorflow:Saving dict for global step 100: global_step = 100, loss = 7.272978, mae = 2.8202152
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: gs://xxxx/model.ckpt-100
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:loss = 4.7624865, step = 100
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from training_loop: Combined status information from 9 operations:

Status code: Failed precondition [9x]
  The TPU system has not been initialized.
     [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]] [1x]
  The TPU system has not been initialized.
     [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
     [[{{node cluster/_variable_copy/_980}}]] [3x]
  The TPU system has not been initialized.
     [[node TPUReplicateMetadata (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1171) ]]
     [[{{node cluster/_variable_copy/_980}}]]
     [[{{node cluster/_variable_copy/_980}}]] [5x]
(0 successful operations.)

dapatil211 commented 5 years ago

Has there been any progress on this issue? I think I might be encountering the same problem, except while using InMemoryEvaluatorHook: https://stackoverflow.com/questions/55071546/using-inmemoryevaluatorhook-with-tpu-throws-exception
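
For context, here is a hedged, CPU-runnable sketch of that pattern: evaluation driven from inside training via tf.estimator.experimental.InMemoryEvaluatorHook (tf.contrib.estimator in older 1.x releases). On a TPU, the evaluation pass shuts the TPU system down underneath the still-open training session, which matches the failure above. All names and values below are illustrative, not taken from the linked question:

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # Trivial linear model, just to keep the example self-contained.
    preds = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = None
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    x = np.random.rand(64, 4).astype(np.float32)
    y = x.sum(axis=1, keepdims=True)
    return tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(8)

estimator = tf.estimator.Estimator(model_fn)
# Evaluate in-process every 100 training steps; with TPUEstimator this
# is the point where the TPU shutdown/re-init problem shows up.
evaluator = tf.estimator.experimental.InMemoryEvaluatorHook(
    estimator, input_fn, every_n_iter=100)
estimator.train(input_fn, hooks=[evaluator], max_steps=300)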

cwbeitel commented 5 years ago

Same with tf-gpu 1.13.1, t2t@3b38635f12348036ea1e0166857f43a6b971ab07, on a TPU v3, running the "train_and_evaluate" schedule; the same code trains on a GPU without error.

emukans commented 4 years ago

Is there any progress with this issue?

jpark621 commented 4 years ago

Getting similar errors with TPU 1.13 while trying to implement a CheckpointSaverHook that does the same thing.

wjxgeorge commented 4 years ago

I have the same problem.

At the end of the evaluation session, TPUInfeedOutfeedSessionHook calls a routine that shuts down the TPU system. However, because the training session is not recreated (its initialization runs only once when evaluation happens inside a saving listener or hook), the TPU system is never re-initialized for training.
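
For anyone who cannot avoid evaluation, here is a minimal, untested sketch of explicitly re-initializing the TPU system before training resumes. It assumes TF 1.x contrib APIs and a reachable Cloud TPU; the TPU name is a placeholder, and whether this alone is enough to unblock TPUEstimator's training loop is unverified:

import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

# Resolve the TPU master address ('my-tpu' is a placeholder name).
resolver = TPUClusterResolver(tpu='my-tpu')

# Run tpu.initialize_system() in a fresh graph and session against the
# TPU worker, mirroring what TPUInfeedOutfeedSessionHook does when a
# session is created.
with tf.Graph().as_default():
    init_op = tpu.initialize_system()
    with tf.Session(resolver.master()) as sess:
        sess.run(init_op)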