marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Support for interval based Checkpointing #97

Closed basavaraj29 closed 2 years ago

basavaraj29 commented 2 years ago

Introduces support for periodic checkpointing. Checkpointing is disabled by default.

When enabled by setting training.checkpoint.interval to k, the checkpointed model is written to the <model_dir>/checkpoint_x directory, where x = n*k (a multiple of k). Resuming training from a checkpointed model was tested and works fine.
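To make the naming rule above concrete, here is a minimal Python sketch of when a checkpoint would be due and where it would land. It is an illustration only, not the code from this PR: the helper name checkpoint_dir is hypothetical, and it assumes the interval is counted in epochs, that epochs are numbered from 1, and that a missing or non-positive interval means checkpointing is disabled.

```python
from pathlib import Path
from typing import Optional


def checkpoint_dir(model_dir: str, epoch: int, interval: Optional[int]) -> Optional[Path]:
    """Return the checkpoint directory for `epoch`, or None if no checkpoint is due.

    Hypothetical helper. Assumes `interval` (k) is counted in epochs and that
    epochs are numbered starting from 1.
    """
    # Assumption: an unset or non-positive interval means checkpointing is off.
    if interval is None or interval <= 0:
        return None
    # A checkpoint is due only when the epoch number is a positive multiple of k.
    if epoch <= 0 or epoch % interval != 0:
        return None
    return Path(model_dir) / f"checkpoint_{epoch}"


if __name__ == "__main__":
    # With k = 2, only epochs 2 and 4 produce a checkpoint directory.
    for epoch in range(1, 6):
        print(epoch, checkpoint_dir("model_dir", epoch, 2))
```

With k = 2, epochs 2 and 4 map to model_dir/checkpoint_2 and model_dir/checkpoint_4, and the other epochs produce no checkpoint.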

Previous checkpoints can be preserved by setting training.checkpoint.save_prev_num to the number of earlier checkpoints to keep.
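One possible reading of the retention setting, sketched in Python: keep the latest checkpoint plus save_prev_num earlier ones and remove the rest. The function name checkpoints_to_delete is hypothetical, and whether the latest checkpoint counts toward save_prev_num is an assumption here, not something this PR states.

```python
from typing import List


def checkpoints_to_delete(existing: List[int], save_prev_num: int) -> List[int]:
    """Return the checkpoint numbers (the x in checkpoint_x) that would be removed.

    Hypothetical helper. Assumes the newest checkpoint is always kept, plus
    `save_prev_num` checkpoints before it.
    """
    if save_prev_num < 0:
        raise ValueError("save_prev_num must be non-negative")
    ordered = sorted(existing)
    keep = save_prev_num + 1  # newest checkpoint plus the preserved previous ones
    if len(ordered) <= keep:
        return []
    return ordered[:-keep]


if __name__ == "__main__":
    # Checkpoints exist for epochs 2, 4, 6, 8; one previous checkpoint is preserved.
    print(checkpoints_to_delete([2, 4, 6, 8], save_prev_num=1))
```

Under this assumption, with checkpoints 2, 4, 6, 8 on disk and save_prev_num set to 1, checkpoints 2 and 4 would be removed and 6 and 8 kept.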

Checkpointing overhead: for the freebase86m dataset on the cosmos 1 machine, an epoch takes about 37.2 min (2234 s) and a checkpoint takes about 1 min. That is roughly 3% overhead (59756 ms / 2233951 ms ≈ 2.7%) if we checkpoint after every epoch.

[05/21/22 06:42:12.880] Epoch Runtime: 2233951ms
[05/21/22 06:43:12.637] Checkpoint Time: 59756ms
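The overhead figure can be reproduced from the two log lines above. The short Python sketch below does the arithmetic and also shows how the relative cost would shrink with a larger interval, assuming the interval is counted in epochs and that epoch and checkpoint times stay constant. The function name is hypothetical.

```python
def checkpoint_overhead_pct(epoch_ms: float, checkpoint_ms: float, interval: int = 1) -> float:
    """Checkpoint time as a percentage of training time over one interval.

    Assumes one checkpoint is written every `interval` epochs and that both
    timings are constant across epochs.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return 100.0 * checkpoint_ms / (interval * epoch_ms)


if __name__ == "__main__":
    epoch_ms = 2233951      # Epoch Runtime from the log
    checkpoint_ms = 59756   # Checkpoint Time from the log
    for k in (1, 5):
        pct = checkpoint_overhead_pct(epoch_ms, checkpoint_ms, k)
        print(f"interval={k}: {pct:.2f}% overhead")
```

With the logged values, checkpointing after every epoch comes to about 2.7%, consistent with the ~3% estimate.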