tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Continue training with new data #588

Open tmkhalil opened 6 years ago

tmkhalil commented 6 years ago

Hello all,

I'm using my own data to train a Transformer model for machine translation. I am using the standard pipeline with t2t-datagen and t2t-trainer, and it works fine for training the model. In some use cases, such as domain adaptation, I need to continue training on a new dataset (in-domain data, for example) and, if possible, update the vocabulary with the new subwords. Is this scenario supported in tensor2tensor?

Thank you! Talaat

martinpopel commented 6 years ago

Updating the vocabulary during training is not supported (there is a technique for this implemented in some other NMT frameworks, but not in T2T; it has been discussed, I think, here in the issues, on Gitter, or in the Google Groups). However, the T2T internal subwords are robust enough to encode unseen words or even characters (although not optimally).
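As a small illustration (a sketch only, assuming a T2T version that provides SubwordTextEncoder.build_from_generator; the tiny corpus, target size and test sentence are purely illustrative), the subword encoder can round-trip words it never saw during vocabulary generation by falling back to smaller subwords and escaped characters:

```python
from tensor2tensor.data_generators import text_encoder

# Build a toy subword vocabulary from a tiny, made-up corpus.
corpus = ["the cat sat on the mat", "dogs chase cats"]
encoder = text_encoder.SubwordTextEncoder.build_from_generator(corpus, target_size=100)

# Encode a string containing words (and characters) never seen in the corpus.
ids = encoder.encode("unbelievable zebras")
print(ids)
print(encoder.decode(ids))  # round-trips back to the original string
```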

Resumed training (e.g. for the purpose of domain adaptation) is supported; just be careful with the learning rate (which follows a given decay schedule guided by the global_step stored in the checkpoint) and the ADAM moments (which are also stored in the checkpoint). A simple way (a hack) to set global_step to zero in a given checkpoint is to use avg_checkpoints.py with just that one file. I am not saying it is a good idea to set global_step to zero.
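For illustration, a minimal sketch of this checkpoint-editing hack (assuming TF 1.x APIs; the paths are placeholders), in the same spirit as running avg_checkpoints.py over a single checkpoint:

```python
import numpy as np
import tensorflow as tf

ckpt_in = "/path/to/model.ckpt-250000"   # placeholder: your trained checkpoint
ckpt_out = "/path/to/reset/model.ckpt"   # placeholder: where to write the edited copy

# Read every variable from the old checkpoint.
reader = tf.train.load_checkpoint(ckpt_in)
values = {name: reader.get_tensor(name)
          for name in reader.get_variable_to_shape_map()}

# Reset the step counter, keep everything else unchanged.
values["global_step"] = np.zeros_like(values["global_step"])

# Rebuild the variables and save them as a new checkpoint.
tf_vars = [tf.Variable(v, name=name) for name, v in values.items()]
saver = tf.train.Saver(tf_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, ckpt_out)
```

Pointing t2t-trainer's --output_dir at the directory containing the rewritten checkpoint should then make training resume from it.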

tmkhalil commented 6 years ago

Thanks a lot Martin, that's really useful! It would also be great if we could have control over the optimizer's state.

I have some more follow-up questions:

- Is it as simple as providing a different data directory to the trainer?
- Will setting global_step to zero reset the Adam parameters to the initial state?

Thank you!

martinpopel commented 6 years ago

It would also be great if we could have control over the optimizer's state.

Yes, there are papers reporting that resetting the ADAM moments to zero from time to time helps (even when not doing domain adaptation). I'm not convinced this is the best way, but if you want, you can edit the checkpoints ad hoc; see avg_checkpoints.py for inspiration (I've tried to simplify it, but was not successful).
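A minimal sketch of the same checkpoint-editing pattern, this time zeroing the Adam moment (slot) variables. It assumes TF 1.x and that the slots are stored under names ending in /Adam and /Adam_1; inspect the variable names in your own checkpoint first, since they depend on the optimizer actually used:

```python
import numpy as np
import tensorflow as tf

reader = tf.train.load_checkpoint("/path/to/model.ckpt-250000")  # placeholder

tf_vars = []
for name in reader.get_variable_to_shape_map():
    value = reader.get_tensor(name)
    if name.endswith("/Adam") or name.endswith("/Adam_1"):
        value = np.zeros_like(value)  # reset first/second moment estimates
    tf_vars.append(tf.Variable(value, name=name))

saver = tf.train.Saver(tf_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "/path/to/adam_reset/model.ckpt")  # placeholder
```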

Is it as simple as providing a different data directory to the trainer?

Yes, I think so.

Will setting global_step to zero reset the Adam parameters to the initial state?

No. Adam has bias correction, which is a kind of warmup, and I am not sure offhand whether the current implementation depends on global_step. In addition, there is the learning rate warmup if you use the default noam scheme (renamed in the newest T2T version). In my experiments, restarting a trained model with global_step=0 resulted in diverged training.
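For intuition, a sketch of the noam schedule from "Attention Is All You Need" (the constants here are illustrative; T2T additionally scales it by its learning-rate hparams), which shows why resetting global_step to zero restarts the warmup:

```python
def noam_lr(step, hidden_size=512, warmup_steps=4000):
    # Linear warmup up to warmup_steps, then inverse-square-root decay.
    step = max(step, 1)
    return hidden_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(1))        # tiny learning rate at the (re)start of warmup
print(noam_lr(4000))     # the peak, reached at warmup_steps
print(noam_lr(250000))   # much smaller rate late in training
```

So a model whose weights expect the small late-training rate would suddenly be trained with warmup-then-peak rates again, which is consistent with the divergence mentioned above.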

surafelml commented 6 years ago

I had a similar issue and raised it in the Google Groups discussion. Lukasz mentioned something similar to @martinpopel: "...T2T internal subwords are robust enough to encode unseen words or even characters...", and explained how the individual characters that are part of the generated vocabulary can help in mapping unseen words.

The OpenNMT approach for updating the vocabularies is discussed here: http://opennmt.net/OpenNMT/training/retraining/#updating-the-vocabularies

Best, Surafel.

cwlinghk commented 6 years ago

Is it as simple as providing a different data directory to the trainer? Yes, I think so.

The problem is: how can I tokenize and encode the new training data using the subword vocabulary generated from my old data? Thank you very much.

martinpopel commented 6 years ago

For new data, you need to run t2t-datagen first and let it use the original vocabulary file (it will reuse the file automatically if it already exists in the data directory).
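If you want to inspect or reuse the old vocabulary directly, here is a small sketch (the vocab filename below is a placeholder; use the actual vocab.*.subwords file that t2t-datagen wrote into your data_dir):

```python
from tensor2tensor.data_generators import text_encoder

# Load the subword vocabulary generated from the old data.
encoder = text_encoder.SubwordTextEncoder(
    "data_dir/vocab.translate_mydomain.32768.subwords")  # placeholder filename

ids = encoder.encode("A sentence from the new in-domain data.")
print(ids)
print(encoder.decode(ids))
```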