finetuning the model with new data samples

kmario23 commented 7 years ago

I have trained a seq2seq NMT model (EN-DE) on 1M samples and saved the latest checkpoint. I now have 50K domain-specific sentence pairs that were not part of the original training data. How can I adapt the current model to this new data?

Specifically, I'd like to fine-tune the model on the new domain so that, as I add more samples from that domain, it produces reasonably good translations for test sentences in that domain. Fine-tuning is common in computer vision, but I'm not sure how to do it with a seq2seq architecture.

I'm aware that the vocabulary files for both languages have to be updated to cover the new sentence pairs. But does that mean we have to start training from scratch again? Isn't there a smarter way to continue training from the current checkpoint after dynamically updating the necessary components?

Any ideas or relevant papers which address this issue?

lmthang commented 7 years ago

You can change the train_prefix but keep the same out_dir argument. Technically, it should work (or require only minimal changes to get working). I would recommend using subword units (BPE), if you aren't already, to represent unseen words. Also, make sure you back up the checkpoints before trying :)

See Section 3 of this paper for NMT adaptation: https://nlp.stanford.edu/~lmthang/data/papers/iwslt15.pdf
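For readers unfamiliar with subword units, here is a minimal sketch of how BPE lets a fixed symbol inventory cover words that never appeared in training. It is an illustration only, not the code used by tensorflow/nmt or any particular BPE tool; the function names (learn_bpe, apply_bpe) and the toy word counts are hypothetical.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only at whole-symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a {word: count} dict."""
    # Start from characters, with an end-of-word marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying merges in learned order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols


# Toy counts (hypothetical); "lowest" is not among the training words.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(apply_bpe("lowest", merges))
```

Because an unseen word falls back to smaller pieces (down to single characters), the new-domain data can often be encoded with the original subword vocabulary, provided its characters were already covered.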

kmario23 commented 7 years ago

@lmthang Thank you very much for your insights! Yes, we're trying to use subword units. Since we want to update the vocabulary files in accordance with the new training samples, this also involves changing vocab_prefix, right? If so, how do the tensor shapes of the old parameters (in the checkpoint) agree with the new vocab size? :)
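To make the shape question concrete: the embedding matrix has one row per vocabulary entry, so a matrix saved with the old vocabulary does not fit a model built with a larger one. One possible workaround, sketched below in plain Python, is to build a new matrix that copies the trained rows for tokens present in both vocabularies and randomly initializes rows for new tokens. This is an assumption about how one might approach it, not something tensorflow/nmt is confirmed to do; remap_embedding and its parameters are hypothetical names, and the same idea would have to be applied to the output projection.

```python
import random


def remap_embedding(old_vocab, old_matrix, new_vocab, init_scale=0.1, seed=0):
    """Build an embedding matrix for new_vocab from one trained on old_vocab.

    Rows for tokens found in old_vocab are copied; rows for new tokens are
    drawn uniformly from [-init_scale, init_scale] (an assumed init scheme).
    """
    rng = random.Random(seed)
    dim = len(old_matrix[0])
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    new_matrix = []
    for tok in new_vocab:
        if tok in old_index:
            # Token was seen before: reuse its trained vector.
            new_matrix.append(list(old_matrix[old_index[tok]]))
        else:
            # New token: start from a small random vector.
            new_matrix.append(
                [rng.uniform(-init_scale, init_scale) for _ in range(dim)]
            )
    return new_matrix


# Toy example (hypothetical values): "enzyme" is new, the rest are reused.
old_vocab = ["<unk>", "<s>", "</s>", "the"]
old_matrix = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new_vocab = ["<unk>", "<s>", "</s>", "enzyme", "the"]
new_matrix = remap_embedding(old_vocab, old_matrix, new_vocab)
```

The remapped values would then need to be loaded into the new model's variables in place of a direct checkpoint restore; how to do that depends on the training code.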

ghost commented 7 years ago

@kmario23 Did you find a way to update the vocabulary for the new data samples?

Sabyasachi18 commented 6 years ago

@ssokhey Same question here!

tensorflow/nmt, issue #112