marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Alternating datasets #357

Open sianvolta opened 3 years ago

sianvolta commented 3 years ago

Is there a way to alternate between training corpora for different batches?

E.g. I have two sets of files:

- train-1.src
- train-1.trg
- train-2.src
- train-2.trg

For every batch, training should alternate between the pair train-1.src / train-1.trg and the pair train-2.src / train-2.trg.

snukky commented 3 years ago

Marian has no built-in option for this, but I think you can prepare the batches yourself and have Marian simply consume them in order: disable data shuffling with --no-shuffle, disable Marian's own batch generation with --maxi-batch-sort none, and specify the size of your batches with --mini-batch <NUMBER>.
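
A minimal sketch of the "prepare the batches yourself" step, in Python. It is not part of Marian; the function names and the output file names (train-mixed.src, train-mixed.trg) are hypothetical. It cuts both corpora into fixed-size batches and writes them out alternately, so that reading the result in order, N sentences at a time, yields one batch from train-1, then one from train-2, and so on. It assumes that --mini-batch counts sentences and that nothing else in your configuration (for example automatic batch fitting or multiple devices) changes how consecutive lines are grouped, so please check that against your setup.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def read_pairs(src_path: str, trg_path: str) -> Iterator[Pair]:
    """Read a parallel corpus line by line as (source, target) pairs."""
    with open(src_path, encoding="utf-8") as src, open(trg_path, encoding="utf-8") as trg:
        for s, t in zip(src, trg):
            yield s.rstrip("\n"), t.rstrip("\n")


def chunks(pairs: Iterable[Pair], size: int) -> Iterator[List[Pair]]:
    """Split a stream of pairs into consecutive lists of at most `size` pairs."""
    it = iter(pairs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def alternate_batches(
    corpus_a: Iterable[Pair], corpus_b: Iterable[Pair], batch_size: int
) -> Iterator[List[Pair]]:
    """Yield full batches alternately from corpus A and corpus B.

    Stops as soon as either corpus cannot supply a full batch, because a
    short batch in the middle would shift every later batch boundary.
    """
    for batch_a, batch_b in zip(chunks(corpus_a, batch_size), chunks(corpus_b, batch_size)):
        if len(batch_a) < batch_size or len(batch_b) < batch_size:
            return
        yield batch_a
        yield batch_b


def write_interleaved(batches: Iterable[List[Pair]], src_out: str, trg_out: str) -> None:
    """Write the batches back out as one parallel corpus, preserving order."""
    with open(src_out, "w", encoding="utf-8") as src, open(trg_out, "w", encoding="utf-8") as trg:
        for batch in batches:
            for s, t in batch:
                src.write(s + "\n")
                trg.write(t + "\n")


if __name__ == "__main__":
    BATCH_SIZE = 64  # must match the value passed to --mini-batch
    mixed = alternate_batches(
        read_pairs("train-1.src", "train-1.trg"),
        read_pairs("train-2.src", "train-2.trg"),
        BATCH_SIZE,
    )
    # Hypothetical output names; pass these files to Marian as the training set.
    write_interleaved(mixed, "train-mixed.src", "train-mixed.trg")
```

The resulting files would then be used as the training set together with the three options above. Sentences left over after the shorter corpus runs out are dropped in this sketch; shuffle each corpus beforehand if you do not want the same sentences dropped every time.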

You can also generate batches on the fly, as Marian can read training data from STDIN (read more about this here: https://groups.google.com/g/marian-nmt/c/zSb7MT4kZ6M). If you train from STDIN, consider redefining an epoch as a specific number of batches with --logical-epoch <NUMBER>u.
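
For the on-the-fly variant, a generator along these lines could feed Marian through a pipe. Again this is only a sketch with hypothetical names: it emits tab-separated "source<TAB>target" lines, switching corpus every batch_size lines and wrapping around when a corpus is exhausted. It assumes your Marian build accepts tab-separated training data on STDIN (check `marian --help` for the exact options) and that no sentence contains a tab character.

```python
import sys
from itertools import cycle, islice
from typing import Iterator, Sequence, TextIO, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def alternating_tsv_lines(corpora: Sequence[Sequence[Pair]], batch_size: int) -> Iterator[str]:
    """Endlessly yield TSV lines, taking `batch_size` pairs from each corpus in turn.

    Each corpus is cycled independently, so a shorter corpus simply wraps
    around sooner than a longer one.
    """
    if not corpora or any(len(c) == 0 for c in corpora):
        raise ValueError("need at least one corpus, and no corpus may be empty")
    streams = [cycle(c) for c in corpora]
    while True:
        for stream in streams:
            for src, trg in islice(stream, batch_size):
                yield f"{src}\t{trg}"


def stream_batches(
    corpora: Sequence[Sequence[Pair]],
    batch_size: int,
    num_batches: int,
    out: TextIO = sys.stdout,
) -> None:
    """Write exactly `num_batches` batches of `batch_size` lines each to `out`."""
    lines = alternating_tsv_lines(corpora, batch_size)
    for line in islice(lines, batch_size * num_batches):
        out.write(line + "\n")


if __name__ == "__main__":
    def load(src_path: str, trg_path: str) -> list:
        # Load a whole parallel corpus into memory; fine for a sketch.
        with open(src_path, encoding="utf-8") as s, open(trg_path, encoding="utf-8") as t:
            return [(a.rstrip("\n"), b.rstrip("\n")) for a, b in zip(s, t)]

    corpora = [load("train-1.src", "train-1.trg"), load("train-2.src", "train-2.trg")]
    # 64 must match --mini-batch; 100000 is an arbitrary total number of batches.
    stream_batches(corpora, batch_size=64, num_batches=100000)
```

Because such a stream has no natural end of epoch, this is where --logical-epoch <NUMBER>u is useful: it lets validation and checkpoint schedules be expressed in terms of a fixed number of batches.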