hvgazula opened 6 months ago
In a nutshell, resumption from an existing checkpoint using the API tools is still not working cleanly. It works just fine with the TF built-in `BackupAndRestore` callback.
Appending `/` to `checkpoint_filepath` resolved this. See https://github.com/neuronets/nobrainer_training_scripts/commit/d5d1de07f6b8fde0d6471326f50c9bae6289aad1 🤦♂️
The `getctime` function only works if the checkpoint filepath has `epoch` in its name. For example, if `checkpoint_filepath = f"output/{output_dirname}/nobrainer_ckpts/" + "{epoch:02d}"`, then the output (in addition to other folders) will look as follows:
Explanation of the folders:
Summary: keeping `epoch` in the checkpoint filepath and doing away with 3 will enable loading from checkpoints cleanly. Otherwise, we may want to write improved logic for `load` when no folders are created for each epoch.

In hindsight, we should include `BackupAndRestore` in addition to `ModelCheckpoint`, because the latter only saves a checkpoint at the end of each epoch. That will not be enough if the model passes through the entire dataset and fails just before writing, whereas `BackupAndRestore` has a `save_freq` argument that can be taken advantage of.
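A minimal sketch of that combination, assuming a TF/Keras version where `BackupAndRestore` exposes `save_freq` (paths and the batch frequency are made up):

```python
import tensorflow as tf

callbacks = [
    # Backs up training state every 100 batches; on restart, fit() resumes
    # from the last backup instead of starting the epoch from scratch.
    tf.keras.callbacks.BackupAndRestore(backup_dir="output/backup", save_freq=100),
    # Writes a checkpoint (here, only the best one) at the end of each epoch.
    tf.keras.callbacks.ModelCheckpoint(
        filepath="output/nobrainer_ckpts/{epoch:02d}",
        monitor="val_loss",
        save_best_only=True,
    ),
]

# model.fit(dataset, epochs=epochs, callbacks=callbacks)
```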
Ouch: https://github.com/keras-team/tf-keras/issues/430. Looks like we will have to stay put with `ModelCheckpoint` for now. 😞 This is because I intend to save the best model as well.
https://github.com/neuronets/nobrainer/blob/976691d685824fd4bba836498abea4184cffd798/nobrainer/processing/checkpoint.py#L57
What am I trying to do? Initialize from a previous checkpoint to resume training over more epochs.

For example, the following snippet should initialize from a checkpoint if `checkpoint_filepath` exists. However, the `getctime` part conflicts with other folders created during training (these could be predictions or other folders).

Solution: