[Open] nicolasnn opened this issue 2 years ago
@nicolasnn - Thanks for reporting this issue with detailed examples.
I am working on an update to the BackupAndRestore callback to address this specific scenario and will submit a PR in the next few weeks. @rchao - please assign this issue to me.
Thanks Ramesh!
I have run into the same issue. @sampathweb, are you still planning to fix this? @rchao
Hello, just curious if this issue has been fixed or if it is still in the works?
System information.
Describe the problem. I am using the ModelCheckpoint callback to save the best model, combined with the BackupAndRestore callback to handle interruptions. The problem arises when re-running a training script after an interruption: the model restored by BackupAndRestore doesn't carry over the previous values of the losses and metrics. As a result, ModelCheckpoint saves the model on the first epoch of the new run regardless of the loss value, and even overwrites the "best" model with a not-as-good one.
Describe the current behavior.
Describe the expected behavior.
Standalone code to reproduce the issue.
First training
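The original script did not survive here; below is a minimal sketch of the setup described above. The toy data, model, and the `Interrupt` helper are assumptions — only the combination of `ModelCheckpoint(save_best_only=True)` with `BackupAndRestore`, plus a simulated interruption, matters for the repro.

```python
import numpy as np
import tensorflow as tf

# Toy data and model (assumptions; any model/dataset should do).
x = np.random.rand(200, 8).astype("float32")
y = np.random.rand(200, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")


# Raises partway through training to simulate an interruption.
class Interrupt(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        if epoch == 3:
            raise RuntimeError("Simulated interruption")


callbacks = [
    # Saves the model only when val_loss improves on the best so far.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model", monitor="val_loss", save_best_only=True, verbose=1
    ),
    # Backs up training state after each epoch so a later run can resume.
    tf.keras.callbacks.BackupAndRestore(backup_dir="backup"),
    Interrupt(),
]

model.fit(x, y, validation_split=0.2, epochs=6, callbacks=callbacks)
```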
The output is:
2nd training
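Re-running the same fit (without the simulated interruption) resumes from the backed-up epoch, but the fresh `ModelCheckpoint` instance starts over from scratch — a sketch continuing from the first run:

```python
# Same model and data as above, but without the Interrupt callback.
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        "best_model", monitor="val_loss", save_best_only=True, verbose=1
    ),
    tf.keras.callbacks.BackupAndRestore(backup_dir="backup"),
]

# BackupAndRestore resumes at the interrupted epoch, but the new
# ModelCheckpoint instance starts with best = inf, so its first epoch
# always "improves" and overwrites the previously saved best model.
model.fit(x, y, validation_split=0.2, epochs=6, callbacks=callbacks)
```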
The output is:
The problem shows up at `val_loss improved from inf to 2.09097`: the model restored by BackupAndRestore doesn't restore the previous best value of val_loss. ModelCheckpoint is re-initialized with an `inf` best value, so it doesn't fulfill what it is supposed to do and even overwrites the "best" model with a not-as-good one.
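Until the callback is fixed, one possible workaround is to seed the callback with the previously saved best score before resuming. This is only a sketch: it relies on ModelCheckpoint's internal `best` attribute, and `x_val`/`y_val` standing in for the validation data `fit()` uses are assumptions.

```python
import os
import tensorflow as tf

# Hypothetical: x_val / y_val mirror the validation split used by fit().
# With validation_split=0.2 above, Keras takes the last 20% of the data.
x_val, y_val = x[-40:], y[-40:]

mcp = tf.keras.callbacks.ModelCheckpoint(
    "best_model", monitor="val_loss", save_best_only=True, verbose=1
)

# If a best model from the interrupted run exists, recompute its
# val_loss and seed the callback's internal `best` attribute so the
# resumed run cannot overwrite it with a worse model.
if os.path.isdir("best_model"):
    previous_best = tf.keras.models.load_model("best_model")
    mcp.best = previous_best.evaluate(x_val, y_val, verbose=0)
```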