The original paper by Edunov et al. (2018) uses sharding when creating the monolingual data, presumably because they use so much of it.
Sharding will be important if we want to include significant amounts of backtranslated data, but it may not be relevant if we only add an amount that fits into RAM, i.e. around the same order of magnitude as the original data.
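If we do go the sharding route later, a minimal sketch of what it could look like (shard size and file naming here are arbitrary assumptions, not anything from the paper or our pipeline):

```python
from pathlib import Path

def shard_monolingual(path, out_dir, lines_per_shard=1_000_000):
    """Split a large monolingual corpus into fixed-size shards.

    Streams line by line, so the full corpus never has to fit in RAM.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_idx, n_lines, out = 0, 0, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if out is None or n_lines >= lines_per_shard:
                if out:
                    out.close()
                out = open(out_dir / f"shard_{shard_idx:04d}.txt", "w", encoding="utf-8")
                shard_idx += 1
                n_lines = 0
            out.write(line)
            n_lines += 1
    if out:
        out.close()
```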
Thus, for the first iteration of backtranslation, BT data will be constructed by:

- re-encoding with iconv
- deduplicating

Note that the difficulties in incorporating sharding come from the current design of `ExperimentPreprocessingPipeline`.
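A rough sketch of those two steps, assuming iconv is available on the system and that exact line-level deduplication is enough (the encodings are placeholders):

```python
import subprocess

def normalize_and_dedup(in_path, out_path, from_enc="ISO-8859-1", to_enc="UTF-8"):
    """Re-encode a corpus with iconv, then drop exact duplicate lines.

    Holds everything in memory, which matches the assumption above
    that the data fits in RAM.
    """
    # iconv -f FROM -t TO FILE writes the re-encoded text to stdout;
    # //TRANSLIT approximates characters that have no mapping
    proc = subprocess.run(
        ["iconv", "-f", from_enc, "-t", to_enc + "//TRANSLIT", in_path],
        capture_output=True, check=True,
    )
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for line in proc.stdout.decode("utf-8").splitlines():
            if line not in seen:
                seen.add(line)
                out.write(line + "\n")
```

Something like `sort -u` would sidestep the in-memory set if the data grows, but the version above is simplest while everything fits in RAM.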
Need to take bilingual baselines and improve upon them using (iterative) backtranslation.
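For reference, the loop I have in mind looks roughly like this; `train`, `flip`, and `translate` are placeholders, not existing pipeline code:

```python
def iterative_backtranslation(bitext, mono_tgt, n_rounds=2):
    """Sketch of the iterative BT loop (all helpers are hypothetical)."""
    forward = train(bitext)                           # src->tgt baseline
    for _ in range(n_rounds):
        reverse = train(flip(bitext))                 # tgt->src reverse model
        synthetic = [(reverse.translate(t), t) for t in mono_tgt]
        forward = train(bitext + synthetic)           # real + synthetic bitext
    return forward
```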
Started work on this by training a German-to-English reverse model on HPCC (ongoing).
Random comment: damn SLURM jobs getting preempted :D
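On that note, one way to soften preemptions, assuming the jobs are submitted as requeueable (`--requeue`) and with a warning signal (e.g. `--signal=B:USR1@60`); `save_checkpoint` is a placeholder for whatever checkpointing the training script does:

```python
import signal
import sys

def install_preemption_handler(save_checkpoint):
    """Checkpoint and exit cleanly when SLURM warns before killing the job.

    Assumes SIGUSR1 arrives shortly before preemption/time limit, per
    the --signal flag above.
    """
    def handler(signum, frame):
        save_checkpoint()   # persist model/optimizer state
        sys.exit(0)         # exit cleanly so the requeued job can resume
    signal.signal(signal.SIGUSR1, handler)
```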
The approach to implementing the logic will follow the backtranslation example from .
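For concreteness, here is roughly what generating the synthetic side of the BT data could look like via fairseq's hub interface; the model paths and sampling settings are my assumptions, not settings from that example:

```python
from fairseq.models.transformer import TransformerModel

# Load the trained de->en reverse model; all paths are placeholders.
de2en = TransformerModel.from_pretrained(
    "checkpoints/de2en",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/de2en",
)

# Edunov et al. (2018) report that sampling beats beam search for
# generating BT data; extra kwargs are forwarded to the generator.
mono_de = ["Das ist ein Beispiel .", "Noch ein Satz ."]
synthetic_en = de2en.sample(mono_de, sampling=True, sampling_topk=10)

# Each (synthetic_en, mono_de) pair becomes training data for en->de.
for en, de in zip(synthetic_en, mono_de):
    print(f"{en}\t{de}")
```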