marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Domain adaptation using Marian #224

Open MukundKhandelwal opened 5 years ago

MukundKhandelwal commented 5 years ago

Hi @emjotde

I have been trying my hand at training machine translation models with Marian, and I would like to know how to do domain adaptation in Marian. I see that OpenNMT provides guidelines for this. It would be great if you could help me with how to do the same using Marian.

Thanks.

emjotde commented 5 years ago

Hi, do you have links to the OpenNMT guidelines? I guess these should be pretty universal?

MukundKhandelwal commented 5 years ago

Here is the link to the OpenNMT guidelines: http://opennmt.net/OpenNMT-tf/training.html#fine-tune-an-existing-model

emjotde commented 5 years ago

OK, not much of a guideline really :)

What I see there are two things: 1) vocabulary change, 2) continued training.

  1. This seems like a horrible idea since your embedding indices and embedding matrix dimensions are going to be tied to your vocabulary. You would really really need to know exactly what you are doing here to not cause damage to your models. Since vocabulary files are external to Marian models, you can always replace them. It's just not a good idea. Now with SentencePiece support I would also say there is absolutely no reason to ever do this.
  2. That's actually easy. Marian has two ways of reusing models and weights:
    • just via --model path/to/model.npz. If you copy your model to a new folder and point this option at it, Marian will reload the model from that path and overwrite it at the next checkpoint. This overrides the model parameters with the parameters stored in the file, so you cannot change architectures between continued trainings. This method also works well for normal continued training: you can interrupt a running training, change the training corpus, and rerun the same command you used before to resume. When you change the training files, you will want Marian not to restore the corpus positions, which can be set with --no-restore-corpus. You can also change other training parameters like the learning rate or early-stopping criteria (see the sketch after this list).
    • via --pretrained-model path/to/model.npz. This will load the weight matrices from model.npz that match by name the corresponding parameters of your architecture. It is more flexible than the method above, as it allows you to mix model types: for instance, you can initialize the decoder of an RNN encoder-decoder translation model with an RNN language model, or deep models with shallow models. This can be used for domain adaptation or transfer learning. Non-matching parameters will be initialized randomly. With differing model types you should only choose this method when the first one is not working for you and you have a reason to go for partial initialization; with matching model parameters it is quite safe.
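
As a rough sketch of the first method (file names and most options below are made-up placeholders, not an exact recipe), continued training on new in-domain data could look like this:

# copy the converged model into a new folder first, e.g. cp model/model.npz finetune/
./build/marian \
    --model finetune/model.npz \
    --train-sets corpus/indomain.en corpus/indomain.es \
    --vocabs model/vocab.en.yml model/vocab.es.yml \
    --no-restore-corpus \
    --early-stopping 10

The important parts are pointing --model at the copied checkpoint, switching --train-sets to the in-domain corpus, and adding --no-restore-corpus; the remaining options should just stay whatever the original training used.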
emjotde commented 5 years ago

Thanks for the question. We should probably turn this into a documentation item.

MukundKhandelwal commented 5 years ago

Thanks for the detailed answer. Appreciate it. I will follow the above methods and will re-post if I have any queries moving forward.

MukundKhandelwal commented 5 years ago

Hi @emjotde

I ran training on the English-Spanish UN dataset, which has 11 million sentences. After completing 5K steps, the process saved the checkpoints and stopped due to an out-of-memory error. It's surprising that there was a memory error, as I was using 8 GPUs, each with 12 GB of memory. Do you think I need to make changes elsewhere to get better memory utilization?

Thanks so much.

wingsyuan commented 5 years ago

@emjotde Hi,

I could not understand the domain adaptation via --pretrained-model path/to/model.npz, can you explain it more explicitly? Thanks a lot!

If I have trained an out-of-domain Transformer encoder-decoder model, how do I use this pretrained model to implement domain adaptation with Marian? Is there any guideline?

Thanks very much, @emjotde!

alvations commented 5 years ago

Just found @mjpost's and Sockeye's https://awslabs.github.io/sockeye/tutorials/adapt.html and the "continue" training routine from https://arxiv.org/pdf/1612.06897v1.pdf

Sometimes we do something like that ourselves by writing a bash script that substitutes the training directory with the out-of-domain data.

It would be nice if there were a built-in option to do this, maybe something like --continue outdomain.src outdomain.trg? After early stopping hits the threshold, it would swap to the continuation directory to access the new training data.

ykl7 commented 5 years ago

@emjotde I followed the first way of reusing models and weights for continued training (the --no-restore-corpus option). I'm doing continued training on a converged model.

According to the logs, it loads the model, restarts training and finishes immediately, giving me no indication of the number of batches and epochs covered. I suspect the model isn't being retrained at all. My continued training dataset has ~12k sentences. I also tried deleting the old model's optimizer value file, but to no avail.

The end of the log is:

[2019-04-13 11:03:43] Loading model from $path/model.npz.orig.npz
[2019-04-13 11:03:49] Training started
[2019-04-13 11:03:49] Training finished

Is there another parameter I need to reset? Or something else I'm missing?

geovedi commented 5 years ago

perhaps try to increase --after-epochs

ykl7 commented 5 years ago

@geovedi I have that option set, to no avail.

Also, should I be deleting the optimizer.npz file for the model, or keep all the files?

geovedi commented 5 years ago

Another option would be to modify or remove progress.yml, but I'm not sure if that's the correct method.

snukky commented 5 years ago

@ykl7 The training finishes immediately after starting because the stopping condition you used previously is already met. Usually --early-stopping, --after-epochs or --after-batches needs to be increased. Be aware that these parameters can be loaded implicitly from model.npz.yml.

If I want to do model fine-tuning by just continuing the training on a different data set, I usually remove model.npz.yml from the training directory and modify the training command: I set new paths to the training corpus, increase --early-stopping (as it's my main stopping criterion), and add --no-restore-corpus.

If I want to continue the training with a constant learning rate, I additionally remove the parameters modifying learning rate scheduling from my training command (like --lr-decay-inv-sqrt), and set --learn-rate to the value of eta from model.npz.progress.yml. Alternatively, the learning rate can be set to an arbitrary value like 0.0001.

Removing model.npz.optimizer.npz resets the optimizer parameters. This might or might not be helpful.
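
Putting that together, a fine-tuning run might look roughly like the sketch below (paths, values and the non-essential options are placeholders I am making up, not an exact recipe):

# continue from the converged model with new in-domain data and a constant learning rate
rm model/model.npz.yml              # so old stopping criteria are not loaded implicitly
rm model/model.npz.optimizer.npz    # optional: reset the optimizer state
./build/marian \
    --model model/model.npz \
    --train-sets corpus/indomain.en corpus/indomain.es \
    --vocabs model/vocab.en.yml model/vocab.es.yml \
    --no-restore-corpus \
    --early-stopping 20 \
    --learn-rate 0.0001             # or the last eta from model.npz.progress.yml

Keep the rest of the options from the original training command, minus the learning-rate scheduling ones like --lr-decay-inv-sqrt.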

@wingsyuan With --pretrained-model path/to/model.npz it is possible to initialize the neural network weights from another model trained on different training data. This can be used as a domain-adaptation method if the initial model was trained on a large amount of out-of-domain data, but it also requires a larger in-domain training data set than model fine-tuning, as it starts a new training.
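
In practice that could look something like this sketch (file names and the other options are illustrative placeholders; the architecture options should match the new model you want to train):

# start a *new* training on in-domain data, initializing weights from the out-of-domain model
./build/marian \
    --model indomain/model.npz \
    --pretrained-model outdomain/model.npz \
    --type transformer \
    --train-sets corpus/indomain.en corpus/indomain.es \
    --vocabs vocab.en.yml vocab.es.yml

Parameters whose names match are copied from outdomain/model.npz; anything else is initialized randomly, and training then proceeds as usual.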

noe commented 5 years ago

From the Marian Google group: if you are switching the training corpus to do domain adaptation, you may want to use the --no-restore-corpus option to avoid Marian trying to restore the same position in the new file.