Resuming training from tf checkpoints

kamalkraj / ALBERT-TF2.0

ALBERT model Pretraining and Fine Tuning using TF2.0

Apache License 2.0

Resuming training from tf checkpoints #15

Closed steindor closed 4 years ago

steindor commented 4 years ago

After training a model for some epochs, how can I restore it and continue training from the checkpoints that were written, given that they are not in the HDF5 format?

kamalkraj commented 4 years ago

@steindor If you re-run the script with more epochs, it will automatically restore the last saved weights. Always use the custom training loop.
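For readers who want to see what "automatically restore the last weights" can look like in a custom training loop, here is a minimal sketch using `tf.train.Checkpoint` and `tf.train.CheckpointManager`. It is not taken from the repository's code: the `train` function, the toy `weights` variable, and the step counts are assumptions made for illustration, with the real model and optimizer standing where the toy variable is. Because the restore reads from files in the checkpoint directory, it should also work in a fresh process, not only while the weights are still in memory.

```python
import tempfile

import tensorflow as tf


def train(ckpt_dir, total_steps):
    """Train until `total_steps`, resuming from `ckpt_dir` if a checkpoint exists."""
    # Hypothetical stand-ins for the real model weights and global step.
    weights = tf.Variable(tf.zeros([3]))
    step = tf.Variable(0, dtype=tf.int64)

    # Everything passed here is written to, and restored from, the checkpoint.
    checkpoint = tf.train.Checkpoint(weights=weights, step=step)
    manager = tf.train.CheckpointManager(checkpoint, ckpt_dir, max_to_keep=3)

    # `latest_checkpoint` is None on a first run; otherwise it is the path
    # prefix of the newest checkpoint found on disk.
    if manager.latest_checkpoint:
        checkpoint.restore(manager.latest_checkpoint)

    while int(step.numpy()) < total_steps:
        weights.assign_add(tf.ones([3]))  # placeholder for one optimizer update
        step.assign_add(1)

    manager.save(checkpoint_number=int(step.numpy()))
    return int(step.numpy()), weights.numpy().tolist()


ckpt_dir = tempfile.mkdtemp()
first = train(ckpt_dir, total_steps=5)   # starts from scratch
second = train(ckpt_dir, total_steps=8)  # resumes at step 5, runs 3 more steps
```

In this sketch the second call only performs the three remaining steps, since both the weights and the step counter come back from the checkpoint written by the first call.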

steindor commented 4 years ago

Yes, I noticed that was possible. I presume that only works while the stored weights are still in memory?

I'm wondering if it's possible with a new session, that is, training for an arbitrary number of epochs, restarting the session, and loading the weights from the saved checkpoint files?

For example, if the script crashes while pretraining from scratch, one wouldn't have to start from the beginning.

kamalkraj commented 4 years ago

@steindor I hope this is what you're suggesting: save the weights every fixed number of steps, rather than only at the end of each epoch, so that if the training script crashes, it can continue from the latest checkpoint rather than from the last epoch checkpoint.
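The suggestion above, saving every fixed number of steps and resuming from the newest checkpoint after a crash, could be sketched roughly as follows. This is an illustration under assumptions, not the repository's implementation: `SAVE_EVERY`, the `run` function, the `crash_at` parameter, and the scalar `weights` variable are all hypothetical names introduced here.

```python
import tempfile

import tensorflow as tf

SAVE_EVERY = 4  # hypothetical checkpoint interval, in steps


def run(ckpt_dir, total_steps, crash_at=None):
    """Train to `total_steps`, saving every SAVE_EVERY steps.

    `crash_at` is a hypothetical knob that simulates a crash at that step.
    """
    weights = tf.Variable(0.0)  # stand-in for the real model weights
    step = tf.Variable(0, dtype=tf.int64)
    checkpoint = tf.train.Checkpoint(weights=weights, step=step)
    manager = tf.train.CheckpointManager(checkpoint, ckpt_dir, max_to_keep=2)

    # Resume from the newest checkpoint on disk, if there is one.
    if manager.latest_checkpoint:
        checkpoint.restore(manager.latest_checkpoint)
    start = int(step.numpy())

    while int(step.numpy()) < total_steps:
        if crash_at is not None and int(step.numpy()) == crash_at:
            raise RuntimeError("simulated crash")
        weights.assign_add(1.0)  # placeholder for one training step
        step.assign_add(1)
        # Save on a step interval instead of waiting for the end of an epoch.
        if int(step.numpy()) % SAVE_EVERY == 0:
            manager.save(checkpoint_number=int(step.numpy()))
    return start, int(step.numpy())


ckpt_dir = tempfile.mkdtemp()
try:
    run(ckpt_dir, total_steps=10, crash_at=6)  # last save happened at step 4
except RuntimeError:
    pass
resumed_from, finished_at = run(ckpt_dir, total_steps=10)
```

Here the simulated crash at step 6 loses only the two steps taken since the save at step 4. A smaller interval loses less work per crash at the cost of more frequent disk writes.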

steindor commented 4 years ago

Yes, I guess that would solve it. Would be great to be able to use the checkpoints though since they are already generated. Thanks!