TPU training problem - Githubissues

Hi all, thanks for the cool contribution.

I am training ALBERT base on a 115 GB dataset (split into 4124 shards, 1.6 TB as TFRecord files) on a v3-8 Cloud TPU.

I'm using the following parameters:

    --num_train_steps 2000000
    --num_warmup_steps 12500
    --save_checkpoints_steps 50000
    --keep_checkpoint_max 1000
    --iterations_per_loop 4000
    --learning_rate 0.00022
    --max_seq_length 512
    --max_predictions_per_seq 77
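For reference, here is a small sketch of how these flag values relate to each other. It is plain arithmetic on the numbers above; the variable names simply mirror the flag names, and nothing here calls into ALBERT or TensorFlow.

```python
# Flag values copied from the command line above.
num_train_steps = 2000000
num_warmup_steps = 12500
save_checkpoints_steps = 50000
keep_checkpoint_max = 1000
iterations_per_loop = 4000

# Number of host loops needed to reach num_train_steps.
loops_total = num_train_steps // iterations_per_loop

# Number of checkpoints if one is written every save_checkpoints_steps.
checkpoints_total = num_train_steps // save_checkpoints_steps

# Share of training spent in learning-rate warmup.
warmup_fraction = num_warmup_steps / num_train_steps

# Host loops between two checkpoints (not a whole number here).
loops_per_checkpoint = save_checkpoints_steps / iterations_per_loop

print("host loops:", loops_total)
print("checkpoints:", checkpoints_total, "(limit:", keep_checkpoint_max, ")")
print("warmup fraction: {:.3%}".format(warmup_fraction))
print("loops per checkpoint:", loops_per_checkpoint)
```

With these values the checkpoint interval is not a multiple of iterations_per_loop (12.5 loops per checkpoint), and keep_checkpoint_max is far above the 40 checkpoints the run would produce. Whether either matters for the problems below is not something this arithmetic can tell.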

Problem 1:

When launching training, I get the following warning repeatedly:

INFO:tensorflow:Init TPU system
I0325 07:44:16.229176 140644474742528 tpu_estimator.py:567] Init TPU system
INFO:tensorflow:Initialized TPU in 1 seconds
I0325 07:44:17.937542 140644474742528 tpu_estimator.py:576] Initialized TPU in 1 seconds
INFO:tensorflow:Starting infeed thread controller.
I0325 07:44:17.938689 140643652859648 tpu_estimator.py:521] Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
I0325 07:44:17.939245 140643618825984 tpu_estimator.py:540] Starting outfeed thread controller.
I0325 07:44:17.984356 140643334616832 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
W0325 07:44:18.045144 140643334616832 preempted_hook.py:91] TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
INFO:tensorflow:Enqueue next (4000) batch(es) of data to infeed.
I0325 07:44:18.305288 140644474742528 tpu_estimator.py:600] Enqueue next (4000) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (4000) batch(es) of data from outfeed.
I0325 07:44:18.305603 140644474742528 tpu_estimator.py:604] Dequeue next (4000) batch(es) of data from outfeed.
I0325 07:44:48.117219 140643334616832 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
W0325 07:44:48.186919 140643334616832 preempted_hook.py:91] TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
INFO:tensorflow:Outfeed finished for iteration (0, 0)
I0325 07:45:02.538997 140643618825984 tpu_estimator.py:279] Outfeed finished for iteration (0, 0)
I0325 07:45:18.255510 140643334616832 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
W0325 07:45:18.321858 140643334616832 preempted_hook.py:91] TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
I0325 07:45:48.392218 140643334616832 transport.py:157] Attempting refresh to obtain initial access_token
WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
W0325 07:45:48.446225 140643334616832 preempted_hook.py:91] TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
INFO:tensorflow:Outfeed finished for iteration (0, 84)
I0325 07:46:02.626998 140643618825984 tpu_estimator.py:279] Outfeed finished for iteration (0, 84)
I0325 07:46:18.513511 140643334616832 transport.py:157] Attempting refresh to obtain initial access_token
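As a rough sanity check on throughput, the two "Outfeed finished for iteration" lines in the log above can be turned into a steps-per-second estimate. This is only a sketch: it assumes the second number in the iteration tuple is the step index within the current loop, and it uses just two data points taken right after startup, so the result may not be representative.

```python
from datetime import datetime


def parse_ts(stamp):
    """Parse the HH:MM:SS.ffffff part of an absl-style log prefix."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


# Timestamps copied from the two outfeed lines in the log above.
t_first = parse_ts("07:45:02.538997")  # iteration (0, 0)
t_later = parse_ts("07:46:02.626998")  # iteration (0, 84)

# Assumption: the second tuple element counts steps inside the loop.
steps_done = 84 - 0
elapsed_s = (t_later - t_first).total_seconds()
steps_per_sec = steps_done / elapsed_s

# Time for one host loop of 4000 steps (--iterations_per_loop).
loop_minutes = 4000 / steps_per_sec / 60

print("elapsed: {:.3f} s".format(elapsed_s))
print("throughput: {:.2f} steps/s".format(steps_per_sec))
print("one 4000-step loop: about {:.0f} min".format(loop_minutes))
```

Under that assumption, one loop of 4000 steps would take roughly three quarters of an hour, so the crash reported under Problem 2 (about 15 minutes after the first enqueue) would have happened before the first loop completed.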

Problem 2:

After a while, the process crashes:

WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
W0325 07:58:51.939475 140643334616832 preempted_hook.py:91] TPUPollingThread found TPU b'vm-tpu-1' in state READY, and health HEALTHY.
ERROR:tensorflow:Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
E0325 07:59:11.895795 140643652859648 error_handling.py:75] Error recorded from infeed: Step was cancelled by an explicit call to Session::Close().
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
E0325 07:59:11.896760 140643618825984 error_handling.py:75] Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
8 root error(s) found.
  (0) Cancelled: Node was closed
  (1) Cancelled: Node was closed
  (2) Cancelled: Node was closed
  (3) Cancelled: Node was closed
  (4) Cancelled: Node was closed
  (5) Cancelled: Node was closed
  (6) Cancelled: Node was closed
  (7) Cancelled: Node was closed
1 successful operations.
0 derived errors ignored.
E0325 07:59:11.897448 140644474742528 error_handling.py:75] Error recorded from training_loop: From /job:worker/replica:0/task:0:
8 root error(s) found.
  (0) Cancelled: Node was closed
  (1) Cancelled: Node was closed
  (2) Cancelled: Node was closed
  (3) Cancelled: Node was closed
  (4) Cancelled: Node was closed
  (5) Cancelled: Node was closed
  (6) Cancelled: Node was closed
  (7) Cancelled: Node was closed
1 successful operations.
0 derived errors ignored.
INFO:tensorflow:training_loop marked as finished
I0325 07:59:11.899694 140644474742528 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0325 07:59:11.900030 140644474742528 error_handling.py:139] Reraising captured error
Traceback (most recent call last):
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to Session::Close().

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ALBERT/run_pretraining.py", line 577, in <module>
    tf.app.run()
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "ALBERT/run_pretraining.py", line 534, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 140, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 109, in catch_errors
    yield
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 536, in _run_infeed
    session.run(self._enqueue_ops)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/manuto/albert_env/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to Session::Close().

Regarding Problem 1, I found this issue describing the same problem. Deleting and recreating the TPU instances seems to have solved it for other users, but I tried that and still get the same warnings.
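While the cause of the repeated warnings is unclear, it can help to strip them from a saved copy of the log so the remaining messages are easier to read. The sketch below is plain text filtering in Python; the function name `drop_polling_noise` and the marker strings are my own choices based on the log above, and this only hides the lines after the fact, it does not change what TensorFlow emits.

```python
# Substrings that identify the repeating lines from Problem 1.
NOISE_MARKERS = (
    "TPUPollingThread found TPU",
    "Attempting refresh to obtain initial access_token",
)


def drop_polling_noise(lines):
    """Return only the lines that contain none of the NOISE_MARKERS."""
    return [ln for ln in lines if not any(m in ln for m in NOISE_MARKERS)]


# A few lines copied from the log above, used as sample input.
sample = [
    "INFO:tensorflow:Outfeed finished for iteration (0, 0)",
    "I0325 07:45:18.255510 140643334616832 transport.py:157] "
    "Attempting refresh to obtain initial access_token",
    "WARNING:tensorflow:TPUPollingThread found TPU b'vm-tpu-1' "
    "in state READY, and health HEALTHY.",
    "INFO:tensorflow:Outfeed finished for iteration (0, 84)",
]

kept = drop_polling_noise(sample)
for line in kept:
    print(line)
```

For a real log file, the same function can be applied to the list returned by reading the file line by line.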

Has anybody faced the same issue and managed to solve it? Thanks a lot in advance for the help :)

google-research / albert

TPU training problem #188
