Open qAp opened 2 years ago
Normally, validation is carried out after each epoch of training, and model checkpoints are saved based on a monitored validation metric. This means validation runs once per epoch, and at most one checkpoint is saved per epoch.
In pytorch lightning, Trainer.val_check_interval
controls how often validation is carried out. If it is set to 0.25, validation runs after each quarter of an epoch of training, i.e. 4 times per epoch. This makes it possible to check how well the model is doing more frequently.
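Roughly, the scheduling described above can be sketched in plain Python (illustrative only; Lightning's internal logic is more involved, and the batch count here is made up):

```python
# Sketch: which training-batch counts trigger validation for a given
# fractional val_check_interval. Not Lightning code, just the arithmetic.

def validation_batches(num_batches: int, val_check_interval: float) -> list[int]:
    """Return the 1-indexed batch counts after which validation would run."""
    every_n = int(num_batches * val_check_interval)
    return [b for b in range(1, num_batches + 1) if b % every_n == 0]

# With 100 training batches per epoch and val_check_interval=0.25,
# validation runs after batches 25, 50, 75, 100 -> 4 times per epoch.
print(validation_batches(100, 0.25))  # [25, 50, 75, 100]
```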
And, for the version of pytorch lightning currently in use, a ModelCheckpoint callback with every_n_val_epochs
set should check the monitored validation metric after each validation run and decide whether or not to save a checkpoint.
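Putting the two together, the intended setup looks roughly like this (a sketch only; the metric name `val_loss` and the checkpoint directory are placeholders, not taken from this repo):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",     # validation metric to monitor (assumed name)
    mode="min",             # checkpoint when the monitored metric improves (decreases)
    every_n_val_epochs=1,   # check after every validation run
    save_top_k=1,           # keep only the best checkpoint
    dirpath="checkpoints/", # placeholder output directory
)

trainer = pl.Trainer(
    val_check_interval=0.25,          # validate 4 times per epoch
    callbacks=[checkpoint_callback],
)
# trainer.fit(model, datamodule=dm)  # model / datamodule defined elsewhere
```

Note that with val_check_interval=0.25, each validation run counts as one "val epoch" for this callback, so every_n_val_epochs=1 gives up to 4 checkpoint opportunities per training epoch. In later pytorch lightning releases this argument was renamed every_n_epochs.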
At the moment, a single epoch takes well over 9 hours (~14 hr) to train, so not a single checkpoint is saved before the Kaggle session times out. This makes resuming training impossible.