emjotde closed this issue 6 years ago.
Just as a reminder for myself: some code for single-GPU training with Adam exists in the pretty old branch feature-save-opt-params. It doesn't yet work properly; saving of scheduler parameters or a proper restoration of the data iteration is probably missing. I already have regression tests.
Adam or Adagrad statistics are now saved in model.npz.optimizer.npz. Scheduler parameters are stored in model.npz.yml. Saving and restoring the training state should work regardless of the number of devices used.
We have some regression tests.
Restoring is not yet fully seamless as the corpus state is not saved yet.
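As a minimal sketch of what the saved optimizer state looks like on disk, assuming the standard NumPy .npz container (the array names "adam_mt" and "adam_vt" are illustrative, not Marian's actual keys):

```python
# Sketch: write and read back optimizer statistics in NumPy's .npz format,
# mirroring how Adam moments could sit in model.npz.optimizer.npz next to
# the model. Key names here are illustrative only.
import numpy as np

params = np.random.rand(4, 4).astype(np.float32)

# First and second Adam moments, same shapes as the parameters.
np.savez("model.npz.optimizer.npz",
         adam_mt=np.zeros_like(params),
         adam_vt=np.zeros_like(params))

# On restart, the trainer can reload the moments instead of starting
# from zero statistics.
state = np.load("model.npz.optimizer.npz")
print(sorted(state.files))  # -> ['adam_mt', 'adam_vt']
```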
Losing the corpus state is probably relatively harmless. We should, however, think of a way to save it. My suggestion would be to enable this only with the SQLite container.
I've implemented proper corpus restoration and already added it to the master branch. It works for default corpora and for SQLite-based data management. We have several rather basic regression tests for this feature, but I recommend more testing.
After discussion with Marcin, I'm going to change how the restoration is implemented for SQLite and pass the state of the random number generator to the custom random_seed() function as a void* to avoid multiple shuffling. This is just a refactoring and shouldn't change anything for the end user or the regression tests.
Corpus restoration works if --restore-corpus is used. I'm going to enable it by default for the release of version 1.4 (and fix the current bug where the scheduler logs show an incorrect number of sentences/batches when the option is not used).
Fixed, as corpus state restoration is now enabled by default.
I would keep this issue open, as we still need to review how we handle continuation when models are being smoothed. I think this is not fully correct yet.
@snukky
To finish this, I believe we need proper handling when --exponential-smoothing is set, i.e. the original unsmoothed parameters need to be saved in the checkpoint as well. Restoring should then set the saved model to the smoothed average and restore the unsmoothed parameters as the training model. That's a bit confusing.
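The bookkeeping can be sketched like this: keep the raw parameters for the optimizer and an exponential moving average for decoding, and store both in the checkpoint. The decay value and key names are illustrative, not Marian's actual settings.

```python
# Sketch: a checkpoint carrying both unsmoothed (training) and smoothed
# (decoding) parameters. Decay 0.999 and the dict keys are illustrative.
import numpy as np

decay = 0.999
params = np.random.rand(8).astype(np.float32)  # updated by the optimizer
smoothed = params.copy()                       # exponential moving average

for _ in range(100):                 # stand-in for training steps
    params += np.float32(0.01)       # stand-in for a gradient update
    smoothed = decay * smoothed + (1 - decay) * params

checkpoint = {"params": params, "smoothed": smoothed}

# On restore: the saved model file gets the smoothed weights for decoding,
# while training must continue from the raw ones.
model_for_decoding = checkpoint["smoothed"]
resume_training_from = checkpoint["params"]
```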
This is a blocker for a lot of the work @afaji and @XapaJIaMnu should be doing because CSD3 is the best place but has an annoying 36 hour limit.
For now, do not use --exponential-smoothing; rather, average checkpoints manually like everyone else does. Then it should work.
As a reminder for myself: restarting the training with --exponential-smoothing is implemented for single-GPU in the save-exp-smoothing branch.
It's implemented for single- and multi-GPU training in save-exp-smoothing. Regression tests have been added for this, except for asynchronous SGD, which is too nondeterministic. I'm going to merge it into master soon.
Continuation of training is a bit shaky right now. Adam optimizer statistics are not being saved and cannot be used for resumed training. They should be saved in a special *.npz file next to the model. This is complicated by sharded training: the shards need to be restored as well. The scheduler should also save its parameters, possibly in the same file?
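One way the sharded part could work, sketched under the assumption that each device holds a contiguous shard of the flattened optimizer statistics (the helper names are hypothetical, not Marian's API): saving concatenates the shards into one flat array, and restoring re-splits it for the current, possibly different, number of devices.

```python
# Sketch: save/restore sharded optimizer statistics. save_shards() and
# restore_shards() are hypothetical helpers for illustration.
import numpy as np

def save_shards(shards):
    """Concatenate per-device shards into one flat array for the .npz file."""
    return np.concatenate(shards)

def restore_shards(flat, num_devices):
    """Re-split the flat statistics for the current number of devices."""
    return np.array_split(flat, num_devices)

# Trained on 4 devices, restored on 2: the statistics survive the change.
shards_4 = [np.full(5, i, dtype=np.float32) for i in range(4)]
flat = save_shards(shards_4)
shards_2 = restore_shards(flat, 2)
print([s.shape for s in shards_2])  # -> [(10,), (10,)]
```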