persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

what to do with interrupted training? #180

Open drzraf opened 6 years ago

drzraf commented 6 years ago

For the second time, training got interrupted after several hours, just after completing epoch 31. (Here is the stack trace.)

2018-07-22 21:03:20.250391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1046] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-22 21:03:33.019454: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 637.77MiB.  Current allocation summary follows.
[...]
2018-07-22 21:03:33.041741: W tensorflow/core/common_runtime/bfc_allocator.cc:279] 
[...]
2018-07-22 21:03:33.041534: I tensorflow/core/common_runtime/bfc_allocator.cc:671]      Summary of in-use Chunks by size: 
2018-07-22 21:03:33.041549: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 55 Chunks of size 256 totalling 13.8KiB
2018-07-22 21:03:33.041557: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 84 Chunks of size 1024 totalling 84.0KiB
2018-07-22 21:03:33.041565: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-07-22 21:03:33.041572: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792 totalling 1.8KiB
2018-07-22 21:03:33.041580: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 26 Chunks of size 4096 totalling 104.0KiB
2018-07-22 21:03:33.041588: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 10752 totalling 10.5KiB
2018-07-22 21:03:33.041595: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 94208 totalling 552.0KiB
2018-07-22 21:03:33.041603: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 174080 totalling 170.0KiB
2018-07-22 21:03:33.041611: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 12 Chunks of size 1492224 totalling 17.08MiB
2018-07-22 21:03:33.041619: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 1792512 totalling 1.71MiB
2018-07-22 21:03:33.041626: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2097152 totalling 2.00MiB
2018-07-22 21:03:33.041634: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2675200 totalling 2.55MiB
2018-07-22 21:03:33.041641: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 9 Chunks of size 3000064 totalling 25.75MiB
2018-07-22 21:03:33.041649: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4194304 totalling 4.00MiB
2018-07-22 21:03:33.041656: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 4492288 totalling 4.28MiB
2018-07-22 21:03:33.041663: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5139712 totalling 4.90MiB
2018-07-22 21:03:33.041671: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5388544 totalling 5.14MiB
2018-07-22 21:03:33.041678: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 5984512 totalling 5.71MiB
2018-07-22 21:03:33.041685: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 668750080 totalling 1.25GiB
2018-07-22 21:03:33.041692: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 1.32GiB
2018-07-22 21:03:33.041703: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats: 
Limit:                  3411738624
InUse:                  1415132672
MaxInUse:               1509176832
NumAllocs:               460689205
MaxAllocSize:            668750080
********************_________________********************____________________*__________________*_**
2018-07-22 21:03:33.042256: W tensorflow/core/framework/op_kernel.cc:1290] CtxFailure at reverse_sequence_op.cc:135: Resource exhausted: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Unhandled exception
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: layer_1/bidirectional_rnn/bw/ReverseSequence = ReverseSequence[T=DT_FLOAT, Tlen=DT_INT32, batch_dim=0, seq_dim=1, _device="/job:localhost/replica:0/task:0/device:GPU:0"](layer_0/concat, _arg_batch_x_lens_0_4/_141)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: logits/_217 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_697_logits", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train.py", line 14, in <module>
    experiment.train_ready(corp)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 101, in train_ready
    model.train(min_epochs=20, early_stopping_steps=3)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 384, in train
    self.eval(restore_model_path=self.saved_model_path)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/model.py", line 182, in eval
    feed_dict=feed_dict)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2675,125,500] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Caused by op 'layer_1/bidirectional_rnn/bw/ReverseSequence', defined at:
  File "./train.py", line 14, in <module>
    experiment.train_ready(corp)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 100, in train_ready
    model = get_simple_model(exp_dir, corpus)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/experiment.py", line 91, in get_simple_model
    decoding_merge_repeated=True)
  File "/home/user/.local/lib/python3.6/site-packages/persephone/rnn_ctc.py", line 66, in __init__
    time_major=False)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 424, in bidirectional_dynamic_rnn
    seq_dim=time_dim, batch_dim=batch_dim)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 417, in _reverse
    seq_dim=seq_dim, batch_dim=batch_dim)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2638, in reverse_sequence
    name=name)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6531, in reverse_sequence
    seq_dim=seq_dim, batch_dim=batch_dim, name=name)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3417, in create_op
    op_def=op_def)
  File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1743, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
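For what it's worth, the 637.77 MiB allocation that fails corresponds exactly to the [2675,125,500] float tensor named in the error (modulo a little allocator rounding, it is also the 668750080-byte MaxAllocSize), so it looks like the whole evaluation batch is being fed at once. A quick back-of-the-envelope check; the RunOptions line is only a hypothetical illustration of the hint printed in the log, not something persephone currently exposes:

import tensorflow as tf

# The OOM tensor is [2675, 125, 500] float32, i.e. 4 bytes per element:
size_bytes = 2675 * 125 * 500 * 4   # 668,750,000 bytes
print(size_bytes / 2**20)           # ~637.77 MiB, matching the failed allocation

# The hint in the log would translate to roughly this around the failing
# session.run() call (hypothetical; it would mean patching persephone/model.py
# to pass the options through):
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# sess.run(fetches, feed_dict=feed_dict, options=run_options)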

(The previous failure was because a couple of files referenced in one of the *_prefix.txt files were missing.)

Anyway, what should I do when this happens? Here is exp/:

$ tree exp/
exp/
└── 0
    ├── best_scores.txt
    ├── decoded
    │   ├── best_hyps
    │   ├── epoch0_hyps
    │   ├── epoch1_hyps
    │   ├── ...
    │   ├── epoch31_hyps
    │   └── refs
    ├── model
    │   ├── checkpoint
    │   ├── model_best.ckpt.data-00000-of-00001
    │   ├── model_best.ckpt.index
    │   └── model_best.ckpt.meta
    ├── model_description.txt
    ├── train_description.txt
    ├── train_log.txt
    └── version.txt

Even though I have a number of restore/checkpoint files in exp/, and after a close look at the TensorFlow documentation for Saver, I still can't find a way to actually resume the interrupted training.
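The standard restore pattern from the Saver docs, as far as I understand it, would be something like this (a minimal sketch on my part; the checkpoint prefix comes from exp/0/model/, and the last comment is exactly the part I'm missing):

import tensorflow as tf

# Rebuild the graph from the saved .meta file instead of reconstructing it by hand.
saver = tf.train.import_meta_graph("exp/0/model/model_best.ckpt.meta")

with tf.Session() as sess:
    # Restore the weights from the checkpoint prefix (the .index/.data files).
    saver.restore(sess, "exp/0/model/model_best.ckpt")
    # ...but I don't see how to hand this session back to model.train()
    # so that training continues from epoch 32.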

Hints/docs welcome.

shuttle1987 commented 6 years ago

> Using train() and passing a restore_model_path (which actually expects a file) does not work (apparently I don't yet have a fully built model).

Could you give some more details as to how this is failing? It might help me figure out what needs to be done in your case.

shuttle1987 commented 6 years ago

So after looking at this myself for a while, it appears really hard to solve this using only the TensorFlow checkpoint files. Using tf.keras model saving together with an HDF5 dump of all the weights might make a restore easier to implement. This would require some work, but it could also be substantially better for reproducible research, because as far as I know the model training is not fully deterministic.
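To illustrate what I mean, roughly this kind of round trip (a toy stand-in model, not the actual persephone RNN-CTC graph):

import tensorflow as tf

# Toy model standing in for the real one, just to show the HDF5 save/restore.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Saving to HDF5 captures the architecture, the weights and the optimizer
# state in a single file...
model.save("checkpoint.h5")

# ...so an interrupted run could be resumed by loading that file and calling
# fit() again, instead of reconstructing a graph around raw TF checkpoints.
resumed = tf.keras.models.load_model("checkpoint.h5")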

shuttle1987 commented 6 years ago

This is related to #117