drzraf opened this issue 6 years ago
> Using `train()` and passing a `restore_model_path` (which actually expects a file) does not work (apparently I don't yet have a fully built model).
Could you give some more details as to how this is failing? It might help me figure out what needs to be done in your case.
After looking at this myself for a while, it appears really hard to solve this using TensorFlow checkpoint files alone. Maybe using `tf.keras.save` along with an HDF5 dump of all the weights would make enabling a restore easier. This would require some work, but it might also be substantially better for reproducible-research reasons, because model training is, as far as I know, not fully deterministic.
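A rough sketch of what that could look like (not code from this repo), assuming a TF1-style graph/session and that `h5py` is available; `dump_weights` and `load_weights` are hypothetical helper names:

```python
import h5py
import tensorflow as tf


def dump_weights(sess, path):
    """Write the current value of every global variable to an HDF5 file."""
    with h5py.File(path, "w") as f:
        for var in tf.global_variables():
            # ':' has no meaning in HDF5 keys, so replace it; '/' simply
            # becomes nested groups, which is fine.
            f.create_dataset(var.name.replace(":", "_"), data=sess.run(var))


def load_weights(sess, path):
    """Assign the stored values back onto variables with matching names."""
    with h5py.File(path, "r") as f:
        for var in tf.global_variables():
            key = var.name.replace(":", "_")
            if key in f:
                sess.run(var.assign(f[key][...]))
```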
This is related to #117
For the second time, training got interrupted after several hours, just after completing epoch 31. (Here is the stack.)

(The previous failure was because a couple of files from one of the `*_prefix.txt` files were missing.)

Anyway, what should I do when this happens? Here is `exp/`:

- Using `train()` and passing a `restore_model_path` (which actually expects a file) does not work (apparently I don't yet have a fully built model).
- Changing `train()` to `load_metagraph('exp/0/model/model_best.ckpt')` and then `saver.restore(sess, tf.train.latest_checkpoint("exp/0/model"))` does not work either (training restarts from epoch 0).

Even though I have a lot of restore/checkpoint files in `exp/`, and after a deep look at the TensorFlow documentation of `Saver`, I still can't find a way to actually restore that interrupted training. Hints/docs welcome.
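For what it's worth, here is a minimal sketch of the generic TF1 checkpoint-restore pattern, assuming the checkpoints live under `exp/0/model` as in the report above. This is not this project's `train()`, and resuming at the right epoch still requires the training loop to read back its own step/epoch counter:

```python
import tensorflow as tf

ckpt_dir = "exp/0/model"  # path taken from the report above
latest = tf.train.latest_checkpoint(ckpt_dir)
if latest is None:
    raise RuntimeError("no checkpoint found in %s" % ckpt_dir)

with tf.Session() as sess:
    # Rebuild the graph from the .meta file written next to the checkpoint,
    # then restore the variable values from the checkpoint itself.
    saver = tf.train.import_meta_graph(latest + ".meta")
    saver.restore(sess, latest)

    # Restoring variables is not enough on its own: unless the training loop
    # reads back its step/epoch counter, it will silently restart at epoch 0.
    global_step = tf.train.get_global_step()
    if global_step is not None:
        print("resuming from step", sess.run(global_step))
```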