marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Save training parameters in separate *.npz file to allow seamless continuation #64

Closed. emjotde closed this issue 6 years ago.

emjotde commented 7 years ago

Continuation of training is a bit shady right now. Adam optimizer statistics are not being saved and cannot be used for resumed training. They should be saved in a special *.npz file next to the model. This is a bit complicated with sharded training, as the shards need to be restored as well.
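For reference, the statistics in question are Adam's running per-parameter moment estimates (plus the step counter used for bias correction); if they are not checkpointed, a resumed run silently restarts the optimizer from scratch. The textbook update, not Marian-specific code, is:

```latex
% Adam keeps two running moment estimates per parameter, both of which
% must be checkpointed (together with the step count t) for seamless continuation.
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\theta_t &= \theta_{t-1} - \eta\, \frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)} + \epsilon}
\end{aligned}
```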

The scheduler should also save its parameters, possibly in the same file?

snukky commented 6 years ago

Just as a reminder for myself: some code for single-GPU training with Adam exists in the pretty old branch feature-save-opt-params. It doesn't work properly yet; probably saving the scheduler parameters or a proper restoration of the data iteration is missing. I already have regression tests.

snukky commented 6 years ago

Adam and Adagrad statistics are now saved in model.npz.optimizer.npz. Scheduler parameters are stored in model.npz.yml. Saving and restoring the training state should work regardless of the number of devices used.
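As a quick sanity check of what gets written next to the model, the two files can be inspected with numpy and pyyaml. This is only a sketch; it assumes the default model.npz prefix, and the array and YAML keys inside are Marian internals that may differ between versions:

```python
# Sketch: inspect the files Marian writes next to model.npz when checkpointing.
# Key names inside these files are version-dependent; treat them as examples.
import numpy as np
import yaml

# Optimizer statistics (e.g. Adam/Adagrad accumulators), stored as named arrays.
opt = np.load("model.npz.optimizer.npz")
for name in opt.files:
    print(name, opt[name].shape, opt[name].dtype)

# Scheduler / training progress state (epochs, batches, etc.).
with open("model.npz.yml") as f:
    state = yaml.safe_load(f)
print(state)
```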

We have some regression tests.

Restoring is not fully seamless yet, as the corpus state is not saved.

emjotde commented 6 years ago

Losing the corpus state is probably relatively harmless. We should, however, think of a way to save and restore it. My suggestion would be to enable this only with the SQLite container.

snukky commented 6 years ago

I've implemented proper corpus restoration and already added it to the master branch. It works for default corpora and for SQLite-based data management. We have several rather basic regression tests for that feature; however, I recommend more testing.

After discussion with Marcin, I'm going to change how the restoration is implemented for SQLite and pass the state of the random number generator to the custom random_seed() function as void* to avoid multiple shuffling. This is just a refactoring and shouldn't change anything for the end user or the regression tests.

snukky commented 6 years ago

Corpus restoration works if --restore-corpus is used. I'm going to enable it by default for the release of version 1.4 (and fix the bug we currently have, where the scheduler logs show an incorrect number of sentences/batches if the option is not used).

snukky commented 6 years ago

Fixed as corpus state restoration is now enabled by default.

emjotde commented 6 years ago

I would still keep this issue open, as we need to review how we handle continuation when models are being smoothed. I think this is still not fully correct.

emjotde commented 6 years ago

@snukky To finish this, I believe we need to include proper handling when --exponential-smoothing is set, i.e. the original unsmoothed parameters need to be saved in the checkpoint as well. On continuation, the model should then be restored to the averaged parameters, and the unsmoothed parameters to the original model. That's a bit confusing.
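To spell out the bookkeeping (my reading of the above, using a generic smoothing formulation rather than Marian's exact code): the model file on disk holds the running average of the parameters, while the optimizer keeps updating the raw parameters, so an exact restart needs both in the checkpoint:

```latex
% Generic exponential smoothing of parameters; \alpha is the smoothing rate.
% \bar{\theta}_t is what gets saved as the decodable model,
% \theta_t is what training must continue from.
\bar{\theta}_t = (1 - \alpha)\,\bar{\theta}_{t-1} + \alpha\,\theta_t
```

On restart, both sets are loaded: training continues from the raw parameters, and the running average keeps being updated from them.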

kpu commented 6 years ago

This is a blocker for a lot of the work @afaji and @XapaJIaMnu should be doing, because CSD3 is the best place to run it but has an annoying 36-hour limit.

emjotde commented 6 years ago

For now, do not use --exponential-smoothing; rather, average checkpoints manually like everyone else does. Then it should work.
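For reference, manually averaging a handful of saved checkpoints is straightforward with numpy. A minimal sketch with placeholder file names, not an official Marian tool:

```python
# Sketch: average the parameters of several Marian checkpoints saved as .npz files.
# File names are placeholders; non-float entries (e.g. embedded config bytes)
# are copied from the first checkpoint rather than averaged.
import numpy as np

checkpoints = ["model.iter10000.npz", "model.iter20000.npz", "model.iter30000.npz"]

models = [np.load(path) for path in checkpoints]
averaged = {}
for name in models[0].files:
    arrays = [m[name] for m in models]
    if np.issubdtype(arrays[0].dtype, np.floating):
        averaged[name] = np.mean(arrays, axis=0)
    else:
        averaged[name] = arrays[0]

np.savez("model.avg.npz", **averaged)
```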

snukky commented 6 years ago

As a reminder for myself: restarting training with --exponential-smoothing is implemented for single-GPU training in the save-exp-smoothing branch.

snukky commented 6 years ago

It's implemented for single- and multi-GPU training in save-exp-smoothing. Regression tests have been added for this, except for asynchronous SGD, as it is too nondeterministic. I'm going to merge it into master soon.