tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

How to change vocabulary size while running model re-training with new data? #243

Closed Sabyasachi18 closed 6 years ago

Sabyasachi18 commented 6 years ago

I want to run incremental training on my trained German-English engine using NMT with subword BPE encoding. Can I update my vocab file with new words from the incremental training data? If yes, kindly let me know the process.

Should I append the new words to the end of the existing vocabulary file before running incremental training? Or should I sort the vocab file after appending the new words to it?

Sabyasachi18 commented 6 years ago

I think I have the answer now. Vocab size cannot be changed; it has to be decided before training the model. I used a vocab size of 32,000 for training my NMT model, and it gave good results!

divyashreepathihalli commented 1 year ago

You can now do this. Keras (in the tf-nightly build) has added a new utility, keras.utils.warmstart_embedding_matrix, which lets you keep training your model as the vocabulary changes: https://www.tensorflow.org/api_docs/python/tf/keras/utils/warmstart_embedding_matrix
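The idea behind warm-starting can be sketched without TensorFlow: build a matrix for the new, larger vocabulary, copy over the trained rows for tokens that already existed, and randomly initialize rows for new tokens. This is a minimal NumPy illustration of that idea, not the Keras implementation itself; the function name, vocabularies, and initialization range here are assumptions for demonstration.

```python
import numpy as np

def warmstart_embedding(base_vocab, new_vocab, base_embeddings, seed=0):
    """Illustrative warm-start: reuse trained rows for tokens shared with
    base_vocab, randomly initialize rows for tokens that are new.
    (Hypothetical helper mimicking the idea behind
    keras.utils.warmstart_embedding_matrix, not the library code.)"""
    rng = np.random.default_rng(seed)
    dim = base_embeddings.shape[1]
    base_index = {tok: i for i, tok in enumerate(base_vocab)}
    # Start from a small random init (assumed range), then overwrite
    # the rows whose tokens were already in the old vocabulary.
    new_matrix = rng.uniform(-0.05, 0.05, size=(len(new_vocab), dim))
    for row, tok in enumerate(new_vocab):
        if tok in base_index:
            new_matrix[row] = base_embeddings[base_index[tok]]
    return new_matrix

# Old model: 4-token vocabulary with 3-dim embeddings (toy values).
base_vocab = ["<pad>", "hallo", "welt", "."]
base_emb = np.arange(12, dtype=float).reshape(4, 3)

# Retraining data adds one token; the vocabulary grows to 5.
new_vocab = ["<pad>", "hallo", "welt", ".", "neu"]
new_emb = warmstart_embedding(base_vocab, new_vocab, base_emb)

print(new_emb.shape)                       # new matrix covers 5 tokens
print((new_emb[1] == base_emb[1]).all())   # trained row carried over
```

The new embedding layer can then be created from this matrix and training resumed, so only the rows for genuinely new subwords start from scratch.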