Hello. I'm having problems resuming training from a checkpoint on Google Colab. The only workaround I've found is to delete all the checkpoints and start training from scratch, which of course isn't a good option after hours of training.
I'm using TensorFlow and tensor2tensor version 1.14.0 on Ubuntu 18.04, on the normal runtime, because of a problem using Colab's GPU that I've reported here.
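In case it helps, this is a quick diagnostic cell I run from the notebook to see what the checkpoint state file points at and which checkpoint files are actually present in the output directory. It's only a sketch for illustration, not part of my training code; the path and the step number 13000 are the ones from my run:

```python
import tensorflow as tf  # TF 1.14

output_dir = "/content/gdrive/My Drive/TCC/T2T LibriSpeech/output"

# What the 'checkpoint' state file in the output dir records as the latest checkpoint.
state = tf.train.get_checkpoint_state(output_dir)
print("model_checkpoint_path:", state.model_checkpoint_path if state else None)

# Which files for that step are actually visible on the mounted Drive
# (the Saver needs the .index and .data-* files for the prefix to restore).
for path in tf.io.gfile.glob(output_dir + "/model.ckpt-13000*"):
    print(path)
```

The console output from the failed resume attempt is below: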
WARNING: Logging before flag parsing goes to stderr.
W0828 15:46:33.243587 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:68: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
W0828 15:46:34.237717 139684777486208 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0828 15:46:36.218165 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/adafactor.py:27: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
W0828 15:46:36.218689 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/multistep_optimizer.py:32: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
W0828 15:46:36.231890 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py:4237: The name tf.train.CheckpointSaverListener is deprecated. Please use tf.estimator.CheckpointSaverListener instead.
W0828 15:46:36.232139 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py:4260: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
W0828 15:46:36.251127 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/research/neural_stack.py:38: The name tf.nn.rnn_cell.RNNCell is deprecated. Please use tf.compat.v1.nn.rnn_cell.RNNCell instead.
W0828 15:46:36.288087 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/rl/gym_utils.py:235: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
W0828 15:46:36.311170 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:111: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.
W0828 15:46:36.326797 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensorflow_gan/python/contrib_utils.py:305: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.
W0828 15:46:36.327040 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensorflow_gan/python/contrib_utils.py:310: The name tf.estimator.tpu.TPUEstimatorSpec is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimatorSpec instead.
W0828 15:46:37.165019 139684777486208 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
W0828 15:46:37.165243 139684777486208 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
W0828 15:46:37.165358 139684777486208 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:33: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
W0828 15:46:37.166135 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/hparams_lib.py:49: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.
I0828 15:46:37.167073 139684777486208 hparams_lib.py:64] Loading hparams from existing json /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/hparams.json
W0828 15:46:37.167232 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/hparams_lib.py:65: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.
W0828 15:46:37.169995 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:839: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
W0828 15:46:37.170993 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:123: The name tf.GraphOptions is deprecated. Please use tf.compat.v1.GraphOptions instead.
W0828 15:46:37.171175 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:129: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.
W0828 15:46:37.171345 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:242: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
I0828 15:46:37.171534 139684777486208 trainer_lib.py:265] Configuring DataParallelism to replicate the model.
I0828 15:46:37.171617 139684777486208 devices.py:76] schedule=continuous_train_and_eval
I0828 15:46:37.171699 139684777486208 devices.py:77] worker_gpu=1
I0828 15:46:37.171761 139684777486208 devices.py:78] sync=False
W0828 15:46:37.171855 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/devices.py:139: The name tf.logging.warn is deprecated. Please use tf.compat.v1.logging.warn instead.
W0828 15:46:37.171929 139684777486208 devices.py:141] Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
I0828 15:46:37.172624 139684777486208 devices.py:170] datashard_devices: ['gpu:0']
I0828 15:46:37.172721 139684777486208 devices.py:171] caching_devices: None
I0828 15:46:37.173149 139684777486208 devices.py:172] ps_devices: ['gpu:0']
I0828 15:46:37.173902 139684777486208 estimator.py:209] Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0aa908abe0>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_protocol': None, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
optimizer_options {
global_jit_level: OFF
}
}
isolate_session_state: true
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/content/gdrive/My Drive/TCC/T2T LibriSpeech/output/', 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f0aa908af28>}
W0828 15:46:37.174193 139684777486208 model_fn.py:630] Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f0aa9087ae8>) includes params argument, but params are not passed to Estimator.
W0828 15:46:37.174434 139684777486208 trainer_lib.py:783] ValidationMonitor only works with --schedule=train_and_evaluate
I0828 15:46:37.185815 139684777486208 estimator_training.py:186] Not using Distribute Coordinator.
I0828 15:46:37.186260 139684777486208 training.py:612] Running training and evaluation locally (non-distributed).
I0828 15:46:37.186565 139684777486208 training.py:700] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
E0828 15:46:37.192399 139684777486208 checkpoint_management.py:348] Couldn't match files for checkpoint /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/model.ckpt-13000
W0828 15:46:37.197599 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0828 15:46:37.208258 139684777486208 problem.py:644] Reading data files from /content/gdrive/My Drive/TCC/T2T LibriSpeech/data/librispeech_clean_small-train*
I0828 15:46:37.229276 139684777486208 problem.py:670] partition: 0 num_data_files: 100
W0828 15:46:37.232276 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/data_generators/problem.py:680: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0828 15:46:37.275019 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_audio.py:92: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0828 15:46:37.562360 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_audio.py:115: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0828 15:46:37.750620 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:275: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
W0828 15:46:38.267626 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:395: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0828 15:46:38.267972 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:398: The name tf.logging.warning is deprecated. Please use tf.compat.v1.logging.warning instead.
W0828 15:46:38.268058 139684777486208 data_reader.py:399] Shapes are not fully defined. Assuming batch_size means tokens.
W0828 15:46:38.323740 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/grouping.py:193: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0828 15:46:38.372743 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:231: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
I0828 15:46:38.437698 139684777486208 estimator.py:1145] Calling model_fn.
I0828 15:46:38.450161 139684777486208 t2t_model.py:2248] Setting T2TModel mode to 'train'
W0828 15:46:38.529374 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py:244: The name tf.summary.text is deprecated. Please use tf.compat.v1.summary.text instead.
I0828 15:46:39.269068 139684777486208 api.py:255] Using variable initializer: uniform_unit_scaling
I0828 15:46:39.718456 139684777486208 t2t_model.py:2248] Transforming feature 'inputs' with speech_recognition_modality.bottom
W0828 15:46:39.720613 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/modalities.py:439: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
I0828 15:46:40.186799 139684777486208 t2t_model.py:2248] Transforming feature 'targets' with symbol_modality_256_384.targets_bottom
I0828 15:46:40.323158 139684777486208 t2t_model.py:2248] Building model body
W0828 15:46:40.388057 139684777486208 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/transformer.py:96: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0828 15:46:40.435504 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_layers.py:3077: The name tf.layers.Dense is deprecated. Please use tf.compat.v1.layers.Dense instead.
W0828 15:46:40.844527 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_attention.py:1249: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
I0828 15:46:48.565067 139684777486208 t2t_model.py:2248] Transforming body output with symbol_modality_256_384.top
W0828 15:46:48.689695 139684777486208 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/learning_rate.py:120: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
I0828 15:46:48.691083 139684777486208 learning_rate.py:29] Base learning rate: 2.000000
I0828 15:46:48.704310 139684777486208 optimize.py:338] Trainable Variables Total size: 70343552
I0828 15:46:48.704722 139684777486208 optimize.py:338] Non-trainable variables Total size: 5
I0828 15:46:48.705073 139684777486208 optimize.py:193] Using optimizer adam
I0828 15:47:00.715373 139684777486208 estimator.py:1147] Done calling model_fn.
I0828 15:47:00.717198 139684777486208 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0828 15:47:05.476253 139684777486208 monitored_session.py:240] Graph was finalized.
2019-08-28 15:47:05.480538: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-08-28 15:47:05.480819: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x22a5640 executing computations on platform Host. Devices:
2019-08-28 15:47:05.480857: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
W0828 15:47:05.483572 139684777486208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
File "/usr/local/bin/t2t-trainer", line 33, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 412, in main
execute_schedule(exp)
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 367, in execute_schedule
getattr(exp, FLAGS.schedule)()
File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py", line 456, in continuous_train_and_eval
self._eval_spec)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1480, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 871, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 647, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 290, in prepare_session
config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 220, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1278, in restore
compat.as_text(save_path))
ValueError: The passed save_path is not a valid checkpoint: /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/model.ckpt-13000
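For what it's worth, the restore that fails above can also be triggered on its own from a notebook cell. This is a minimal sketch of that check under the same assumptions (same checkpoint prefix as in the error, TF 1.14 compat.v1 API), not something taken from my training script:

```python
import tensorflow as tf

ckpt_prefix = "/content/gdrive/My Drive/TCC/T2T LibriSpeech/output/model.ckpt-13000"

# The same existence test the Saver runs before restoring (saver.py:1276 in the log above).
print(tf.compat.v1.train.checkpoint_exists(ckpt_prefix))

# Reading the checkpoint directly also fails if the .index/.data files are missing.
reader = tf.compat.v1.train.NewCheckpointReader(ckpt_prefix)
print(len(reader.get_variable_to_shape_map()), "variables in checkpoint")
```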