jowagner opened this issue 4 years ago
As for using machine translation output: following footnote 12 of Way (2018), Quality Expectations of Machine Translation, we could get synthetic output that is more remote from the original source text by adding a few pivot translation steps, e.g. starting with German text, translating it to French, then from French to English, then back to German, repeating that loop a few times, and finally translating from English to Irish. Further ideas to avoid situations where the iterations converge would be to (a) vary the MT engine for each language pair and (b) add more language pairs and perform a random walk in the graph of supported language pairs.
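A rough sketch of idea (b), assuming a generic `translate(text, src, tgt)` wrapper around whatever MT engines we end up using (not a real API) and a placeholder set of supported language pairs:

```python
import random

# Hypothetical language-pair graph: edges are (src, tgt) pairs some MT engine
# supports. The pairs below are placeholders, not a fixed decision.
SUPPORTED_PAIRS = {
    ("de", "fr"), ("fr", "en"), ("en", "de"),
    ("de", "en"), ("en", "fr"), ("fr", "de"),
    ("en", "ga"),  # final hop into Irish
}

def translate(text, src, tgt, engine="default"):
    """Placeholder: call whatever MT engine covers (src, tgt) here.
    Returns the input unchanged so the sketch runs without an MT backend."""
    return text

def pivot_walk(text, start="de", end="ga", max_hops=6, seed=None):
    """Random walk through the language-pair graph, translating at each hop,
    then finishing with a hop into the target language (Irish here)."""
    rng = random.Random(seed)
    lang = start
    for _ in range(max_hops):
        # Candidate next hops from the current language, excluding the final
        # target so the walk does not end prematurely.
        options = [tgt for (src, tgt) in SUPPORTED_PAIRS
                   if src == lang and tgt != end]
        if not options:
            break
        nxt = rng.choice(options)
        # Idea (a), varying the MT engine per language pair, would go here.
        text = translate(text, lang, nxt)
        lang = nxt
    # Final hop into the end language, pivoting via English if no direct pair.
    if (lang, end) not in SUPPORTED_PAIRS:
        text = translate(text, lang, "en")
        lang = "en"
    return translate(text, lang, end)
```

The number of hops and the shape of the graph are knobs we would have to tune against how degraded the output becomes.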
Relevant ACL 2022 paper: Ri and Tsuruoka (2022), "Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models", compare different methods for generating synthetic pre-training data in terms of LM perplexity.
We could augment the BERT training data with English text (or text in other languages) machine-translated into Irish, and/or with automatic paraphrases of Irish text.
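Purely as a sketch of how such synthetic text could be mixed into the pretraining corpus; the one-sentence-per-line file layout, paths and the 25% ratio below are assumptions, not anything decided in this issue:

```python
import random

def mix_corpora(authentic_path, synthetic_path, out_path,
                synthetic_ratio=0.25, seed=0):
    """Write a combined pretraining corpus in which roughly `synthetic_ratio`
    of the lines come from the synthetic (MT or paraphrase) file."""
    rng = random.Random(seed)
    with open(authentic_path, encoding="utf-8") as f:
        authentic = [line.rstrip("\n") for line in f if line.strip()]
    with open(synthetic_path, encoding="utf-8") as f:
        synthetic = [line.rstrip("\n") for line in f if line.strip()]
    # Number of synthetic lines needed so they make up `synthetic_ratio`
    # of the combined corpus.
    n_synth = int(len(authentic) * synthetic_ratio / (1 - synthetic_ratio))
    sampled = rng.sample(synthetic, min(n_synth, len(synthetic)))
    combined = authentic + sampled
    rng.shuffle(combined)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(combined) + "\n")
```

Whether a fixed ratio like this or simple concatenation works better would be an empirical question.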
Is there previous work on adding synthetic text in the target language to the BERT training data, such as output from a machine translation model?