qAp / gresearch_crypto_forecasting_kaggle

Transformer-based solution for crypto time-series forecasting Kaggle

Checkpointing before session timed out #10

Open qAp opened 2 years ago

qAp commented 2 years ago

At the moment, a single epoch takes well over 9 hours (~14 hr) to train, so not a single checkpoint is saved before the Kaggle session times out. This makes resuming training impossible.

qAp commented 2 years ago

Normally, validation is carried out after each epoch of training, and model checkpoints are saved by monitoring some validation metric. This means validation runs once per epoch, and checkpoints are saved at most once per epoch.

In PyTorch Lightning, the Trainer argument val_check_interval sets how often validation is carried out. If it's set to 0.25, validation runs after each quarter of an epoch of training, i.e. 4 times per epoch. This allows checking how well the model is doing more frequently; see the sketch below.
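
A minimal sketch of that Trainer setting; `MyLitModel` and `datamodule` are placeholders for this repo's own LightningModule and data module:

```python
# Minimal sketch (PyTorch Lightning): validate every quarter epoch
# instead of only once at the end of the epoch.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=1,
    val_check_interval=0.25,  # run validation after every 25% of the training epoch
)
# trainer.fit(MyLitModel(), datamodule=datamodule)  # placeholders for this project's objects
```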

And, for the version of PyTorch Lightning currently in use, ModelCheckpoint with every_n_val_epochs set should, after each validation run, check the monitored validation metric and decide whether to save a checkpoint.
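
A sketch of how the two settings might be combined so a checkpoint can be saved well before the Kaggle session times out. This assumes a Lightning version that still accepts every_n_val_epochs (renamed every_n_epochs in later releases); `val_loss`, `MyLitModel`, and `datamodule` are placeholder names, not confirmed from this repo:

```python
# Sketch: mid-epoch validation plus checkpointing on each validation run.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",     # validation metric to track (placeholder name)
    mode="min",             # lower is better for this metric
    save_top_k=1,           # keep only the best checkpoint
    every_n_val_epochs=1,   # consider saving after every validation run
)

trainer = pl.Trainer(
    max_epochs=1,
    val_check_interval=0.25,  # 4 validation runs per epoch -> up to 4 checkpoint chances
    callbacks=[checkpoint_cb],
)
# trainer.fit(MyLitModel(), datamodule=datamodule)  # placeholders for this project's objects
```

With this combination, a checkpoint can land roughly every quarter epoch (a few hours in), so training can be resumed even if the session is killed mid-epoch.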