Introduced support for periodic checkpointing; checkpointing is disabled by default.
When enabled with training.checkpoint.interval set to k,
the checkpointed model is written into the <model_dir>/checkpoint_x directory, where x = n*k. Tested resuming training from a checkpointed model; it works fine.
Previous checkpoints can be preserved by setting training.checkpoint.save_prev_num; the specified number of previous checkpoints is retained.
Checkpointing Overhead
For the freebase86m dataset on one cosmos machine, the epoch time is around 37.2 min (2234 s) and the checkpointing time is around 1 min, i.e. ~3% overhead if we checkpoint every epoch.
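The ~3% figure follows directly from the measurements above:

```python
epoch_secs = 2234       # measured epoch time on freebase86m (cosmos, 1 machine)
checkpoint_secs = 60    # approximate checkpointing time (~1 min)
overhead = checkpoint_secs / epoch_secs
print(f"{overhead:.1%}")  # prints "2.7%"
```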