omertov / encoder4editing

Official implementation of "Designing an Encoder for StyleGAN Image Manipulation" (SIGGRAPH 2021) https://arxiv.org/abs/2102.02766
MIT License

Error on resuming from checkpoint #72

Closed (Gaelium closed 2 years ago)

Gaelium commented 2 years ago

I'm having a similar problem to https://github.com/omertov/encoder4editing/issues/22#issue-869018928. That issue is closed, but no solution was posted.

```
Traceback (most recent call last):
  File "scripts/train.py", line 88, in <module>
    main()
  File "scripts/train.py", line 28, in main
    coach = Coach(opts, previous_train_ckpt)
  File "./training/coach.py", line 87, in __init__
    self.load_from_train_checkpoint(prev_train_checkpoint)
  File "./training/coach.py", line 93, in load_from_train_checkpoint
    self.best_val_loss = ckpt['best_val_loss']
KeyError: 'best_val_loss'
```
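For reference, the failing line in `training/coach.py` indexes the checkpoint dict unconditionally. A minimal sketch of the failure, assuming (hypothetically) a checkpoint that contains only the model weights and options:

```python
# Hypothetical checkpoint contents, not the repo's exact save format:
# a checkpoint saved without training state holds only weights and options.
ckpt = {"state_dict": {}, "opts": {}}  # no 'best_val_loss' key present

best_val_loss = ckpt["best_val_loss"]  # raises KeyError: 'best_val_loss'
```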

Can anyone let me know if there is a way to resolve this issue? Thanks!

omertov commented 2 years ago

Hi @Gaelium! There are two types of resuming training from checkpoint:

  1. If you wish to initialize the encoder weights from a previous checkpoint (such as the official checkpoints), you can do so by specifying the --checkpoint_path flag and pointing it to the pretrained checkpoint.

  2. In order to resume training from a specific training step, the optimizer state, global step, discriminator weights, and best loss value must be kept as part of the checkpoint, which is done by providing the --save_training_data flag. By default this behaviour is disabled due to the large size of each resulting checkpoint. If you do run a new training session with the --save_training_data flag, you can later continue from a saved checkpoint using the --resume_training_from_ckpt flag (see the sketch after this list).
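For concreteness, here is a hedged sketch of the two checkpoint flavours described above. Apart from 'best_val_loss' (which appears in the traceback), the key names are assumptions about what --save_training_data would add, not taken from the repo's save code:

```python
# Sketch only: key names other than 'best_val_loss' are assumptions.

weights_only_ckpt = {        # default save; suitable for --checkpoint_path
    "state_dict": {},        # encoder weights
    "opts": {},              # training options
}

full_training_ckpt = {       # saved with --save_training_data; required
    **weights_only_ckpt,     # by --resume_training_from_ckpt
    "global_step": 0,        # training step to resume from
    "optimizer": {},         # optimizer state
    "discriminator_state": {},       # discriminator weights
    "best_val_loss": float("inf"),   # the key whose absence raised the KeyError
}
```

Loading code could in principle fall back to `ckpt.get('best_val_loss', float('inf'))` for older weights-only checkpoints, but full resumption still needs the rest of the training state, hence the --save_training_data flag.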

Hope it helps!

Best,
Omer

Gaelium commented 2 years ago

Thank you! Adding the --save_training_data flag let me resume from the checkpoint.