Open · KishoreP1 opened this issue 8 months ago
Hi, currently the training scripts do not support resuming training. As you can see from the code, the `--checkpoint_dir` argument only specifies the path where model checkpoints are saved; it does not look for an existing checkpoint to continue training from. You should be able to adapt the code to add logic that looks for the latest model checkpoint if required. Here's an example of manually loading a checkpoint file in the `eval_agent.py` module:
```python
if FLAGS.load_checkpoint_file:
  checkpoint.restore(FLAGS.load_checkpoint_file)
```
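A minimal sketch of how that pattern could be adapted in the training script, assuming TensorFlow-style checkpoint files and reusing the existing `--checkpoint_dir` flag; `checkpoint` below is assumed to be the same object the script already saves with:

```python
import tensorflow as tf

# Look up the newest checkpoint written to --checkpoint_dir and restore it
# before the training loop starts. Returns None if the directory is empty.
latest = tf.train.latest_checkpoint(FLAGS.checkpoint_dir)
if latest is not None:
  # This restores model weights only; the iteration counter and optimizer
  # state still start from scratch unless they are also checkpointed.
  checkpoint.restore(latest)
```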
However, keep in mind that the code only saves the model state, not the optimizer or the agent's internal state (number of updates, etc.), and you would also need to correctly handle appending to the TensorBoard and CSV logs when resuming.
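For a full resume, one possible approach (a sketch only, not this repository's actual API) is to bundle the network, optimizer, and an iteration counter into a single `tf.train.Checkpoint` managed by a `tf.train.CheckpointManager`; `network`, `optimizer`, `run_one_iteration`, and `FLAGS.num_iterations` below are placeholders for whatever the training script actually constructs:

```python
import tensorflow as tf

step = tf.Variable(0, dtype=tf.int64)  # number of completed iterations
ckpt = tf.train.Checkpoint(model=network, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(ckpt, FLAGS.checkpoint_dir, max_to_keep=3)

# Restore the latest checkpoint if one exists; restore(None) is a no-op,
# so a fresh run simply starts from iteration 0.
ckpt.restore(manager.latest_checkpoint)

for iteration in range(int(step.numpy()), FLAGS.num_iterations):
  run_one_iteration()        # placeholder for the existing learner update
  step.assign(iteration + 1)
  manager.save()
  # Logging should append to the existing TensorBoard/CSV output rather
  # than overwrite it, so resumed runs do not duplicate earlier iterations.
```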
When training is interrupted and later resumed, I expect the process to restart from the last saved checkpoint iteration. However, even when specifying the same `--checkpoint_dir` flag, the training process restarts from iteration 0, disregarding previously completed iterations.

I tried interrupting a run and then launching training again with the same `--checkpoint_dir` flag. I expected training to resume from iteration 13, since the last completed iteration was 12. However, training restarts from iteration 1, ignoring the checkpoints saved in the specified directory.
Inside `run_learner` of `main_loop.py`, the checkpointing and iteration-logging logic seems correct. However, I cannot find where the code loads a checkpoint to resume training from the last saved state.