SYSTRAN’s submission to the WMT 2017 shared news translation task for English-German
Back-translation and Hyper-specialization
Uses OpenNMT
Details
WMT 2017 News Translation Task
Data: 4.6M-sentence parallel corpus
Training
Trained on an Nvidia GTX 1080 with minibatches of ~64 sentences
SGD with learning rate 0.1 and annealing (decay) rate 0.7, as sketched below
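A minimal sketch of the SGD-with-annealing schedule, assuming a PyTorch-style training loop; the model, data, and the epoch at which decay starts are placeholders, and only the 0.1 learning rate and 0.7 decay factor come from the note above.

```python
import torch
import torch.nn as nn

# Placeholder model; in the paper this would be the OpenNMT seq2seq model.
model = nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate from the note

decay = 0.7               # annealing rate from the note
start_decay_epoch = 8     # hypothetical: epoch at which decay kicks in

for epoch in range(1, 14):                 # 13 epochs, as in the note
    # ... run one epoch of minibatch (~64 sentences) training here ...
    if epoch >= start_decay_epoch:
        for group in optimizer.param_groups:
            group["lr"] *= decay           # multiply the learning rate by 0.7
```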
Back Translation
Translate target-language monolingual data back into the source language and use the resulting synthetic pairs as additional parallel data
After 13 epochs of training on the original 4.5M parallel corpus, training continues on 4.5M synthetically back-translated sentences plus the original 4.5M (see the sketch after this list)
It improves performance!
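A rough sketch of how the back-translated corpus could be assembled; `translate_target_to_source` stands in for a reverse (target-to-source) NMT model and is purely hypothetical, as are the toy sentences.

```python
from typing import Callable, List, Tuple

def build_backtranslated_corpus(
    target_monolingual: List[str],
    translate_target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each monolingual target sentence with a synthetic source sentence."""
    return [(translate_target_to_source(tgt), tgt) for tgt in target_monolingual]

# Usage: mix the synthetic pairs with the original parallel data and keep training.
original_parallel = [("a house", "ein Haus")]     # toy (source, target) pair
synthetic = build_backtranslated_corpus(
    ["ein Baum"],
    translate_target_to_source=lambda s: s,       # dummy reverse model
)
training_data = original_parallel + synthetic
```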
Data Selection via LM
A smaller, domain-relevant subset of the data is used to fine-tune the model
Sentences are scored with two 3-gram LMs, one trained on a news corpus and one on a random sample; when the cross-entropy difference between them is large, the sentence is treated as news-related and added to the fine-tuning corpus
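A sketch of the cross-entropy-difference selection described above; the two per-sentence cross-entropy callables stand in for the news and random-sample 3-gram LMs, and the threshold is an assumption.

```python
from typing import Callable, List

def select_news_sentences(
    sentences: List[str],
    xent_news: Callable[[str], float],    # per-sentence cross-entropy under the news 3-gram LM
    xent_random: Callable[[str], float],  # per-sentence cross-entropy under the random-sample LM
    threshold: float = 1.0,               # hypothetical cutoff
) -> List[str]:
    """Keep sentences that the news LM models much better than the generic LM."""
    selected = []
    for s in sentences:
        # A large positive difference means the sentence looks news-related.
        if xent_random(s) - xent_news(s) > threshold:
            selected.append(s)
    return selected
```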
Hyper-specialization
A ~25K news-related set is used for fine-tuning with learning rate 0.7 (see the sketch below)
Improves BLEU by +0.3 to +0.5
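A minimal sketch of the hyper-specialization step, assuming a PyTorch-style loop: resume from the converged model and run a few short passes over the small news-related set at a high learning rate. Only the 0.7 learning rate and the ~25K set come from the note; the model, data, and number of passes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the converged NMT model; a real run would first
# load the baseline checkpoint, e.g. model.load_state_dict(torch.load("baseline.pt")).
model = nn.Linear(512, 512)

optimizer = torch.optim.SGD(model.parameters(), lr=0.7)        # learning rate from the note
loss_fn = nn.MSELoss()                                         # stand-in objective

news_batches = [(torch.randn(64, 512), torch.randn(64, 512))]  # stand-in for the ~25K news set

for epoch in range(2):                    # a couple of quick passes (assumption)
    for x, y in news_batches:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```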
Personal Thoughts
Good to see SYSTRAN openly participating in and contributing to WMT 2017
The amount of data is a real strength once it is augmented via back-translation, distillation, and monolingual corpora!
Hyper-specialization is a competition-fit strategy for squeezing out performance, but it likely amounts to overfitting
Link: https://arxiv.org/pdf/1709.03814.pdf
Authors: Deng et al., 2017