j0ma / mrl_nmt22

NMT for Morphologically Rich Languages

Bilingual models with back-translation #7

Open j0ma opened 2 years ago

j0ma commented 2 years ago

Need to take bilingual baselines and improve upon them using (iterative) backtranslation.

Started work on this by training a German-to-English reverse model on HPCC (ongoing).

Random comment: damn SLURM jobs getting preempted :D

The implementation will follow the backtranslation example from fairseq.
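
For reference, here is a minimal sketch of how the reverse model could later be used to generate the synthetic data via fairseq's Python hub interface. The checkpoint directory, data-bin path, SentencePiece model, and file names are hypothetical placeholders, not the actual experiment setup:

```python
# Sketch only: all paths below are hypothetical placeholders.
from fairseq.models.transformer import TransformerModel

# Load the trained reverse (de->en) model from its checkpoint directory.
de2en = TransformerModel.from_pretrained(
    "checkpoints/de-en-reverse",            # hypothetical checkpoint dir
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/de-en",     # hypothetical binarized data dir
    bpe="sentencepiece",
    sentencepiece_model="spm/de-en.model",  # hypothetical SPM model
)
de2en.eval()

# Translate monolingual German into synthetic English; Edunov et al. (2018)
# prefer sampled outputs, which would be configured via generation kwargs
# instead of plain beam search here.
with open("mono.de", encoding="utf-8") as fin, \
        open("synthetic.en", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if line:
            fout.write(de2en.translate(line, beam=5) + "\n")
```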

j0ma commented 2 years ago

The original paper by Edunov et al. (2018) uses sharding when creating the monolingual data, presumably because they use so much of it.

Sharding will be important if we want to include significant amounts of backtranslated data, but it may not be needed if we add only an amount that fits into RAM, i.e. around the same order of magnitude as the original data.
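
If sharding does become necessary, on the monolingual side it would amount to splitting the corpus into fixed-size pieces that can be binarized and translated independently. A minimal sketch, with arbitrary file names and shard size:

```python
# Minimal sketch: split a monolingual corpus into fixed-size shards.
# File names and shard size are arbitrary placeholders.
from itertools import islice

def shard_file(path, lines_per_shard=1_000_000, prefix="mono.shard"):
    """Write consecutive chunks of `lines_per_shard` lines to numbered files."""
    with open(path, encoding="utf-8") as fin:
        shard_idx = 0
        while True:
            chunk = list(islice(fin, lines_per_shard))
            if not chunk:
                break
            with open(f"{prefix}{shard_idx:03d}", "w", encoding="utf-8") as fout:
                fout.writelines(chunk)
            shard_idx += 1

shard_file("mono.txt")
```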

Thus, for the first iteration of backtranslation, BT data will be constructed as follows (a sketch is given after this list):

  1. Download monolingual data, e.g. NewsCrawl.
  2. Concatenate all data into one file and subsample $N = 5000000$ items.
  3. Convert to UTF-8 with iconv and deduplicate.
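
A rough Python sketch of steps 2-3, assuming the downloaded files have already been concatenated into one file. Reservoir sampling keeps memory bounded to the $N$ sampled lines, the encoding cleanup uses Python's codec error handling in place of iconv, and all file names are placeholders:

```python
# Sketch of steps 2-3: uniformly subsample N lines from the concatenated
# monolingual file, clean up the encoding, and deduplicate.
# File names and the random seed are arbitrary placeholders.
import random

N = 5_000_000
random.seed(42)

reservoir = []
seen = 0
with open("newscrawl.all.txt", "rb") as fin:
    for raw in fin:
        # Decode permissively; the actual pipeline would use iconv for this.
        line = raw.decode("utf-8", errors="replace").strip()
        if not line:
            continue
        seen += 1
        # Reservoir sampling (Algorithm R): uniform sample of N lines
        # in a single pass over the corpus.
        if len(reservoir) < N:
            reservoir.append(line)
        else:
            j = random.randrange(seen)
            if j < N:
                reservoir[j] = line

# Deduplicate while preserving order (dict keys keep insertion order).
deduped = dict.fromkeys(reservoir)

with open("mono.sampled.dedup.txt", "w", encoding="utf-8") as fout:
    fout.writelines(line + "\n" for line in deduped)
```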
j0ma commented 2 years ago

Note that the difficulties in incorporating sharding stem from the current design of `ExperimentPreprocessingPipeline`.