hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

Recommended values for modifiers #48

Open eu9ene opened 8 months ago

eu9ene commented 8 months ago

It's not clear from the examples in the README or from the paper what a good initial choice for the modifiers' probabilities would be. I understand that this likely depends a lot on the language and the data. However, developing an intuition for setting those probabilities and other settings will take a lot of experimentation. It would help if the paper disclosed the full OpusTrainer config for the French-English case study; that would provide a good starting point and improve reproducibility (a config is listed in the paper, but it's not clear whether it's the real training config or just an example).
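
For reference, this is the kind of setting I mean: the `modifiers` block of an OpusTrainer config, where each modifier is listed with a probability. The values below are only illustrative placeholders I made up, not numbers taken from the paper or the README:

```yaml
# Sketch of a modifiers block in an OpusTrainer config.
# Probabilities are placeholders, not recommended values.
modifiers:
- UpperCase: 0.05   # uppercase ~5% of sentence pairs
- TitleCase: 0.05   # titlecase ~5% of sentence pairs
```

The question is how to choose these probabilities (and the settings of the other modifiers) for a new language pair without extensive trial and error.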

For context, we're trying to reproduce the results from the paper by adding the same methods to our training pipeline to increase the robustness of our models. So far we've successfully integrated UpperCase, TitleCase, and SentencePiece sampling.

jelmervdl commented 8 months ago

CC @XapaJIaMnu