j0ma / mrl_nmt22

NMT for Morphologically Rich Languages

Feature: Subsampler for multilingual corpora #1

Open j0ma opened 2 years ago

j0ma commented 2 years ago

Multilingual corpora tend to get large, and higher-resourced languages can overpower lower-resourced ones.

To get around this, we need functionality to subsample corpora. It should be easy to specify in a YAML config.
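A minimal sketch of what such a subsampler could look like, assuming each corpus is already loaded as a list of aligned (source, target) sentence pairs. The function name `subsample_corpora`, the `max_lines` config key, and the language-pair labels are hypothetical and not part of mrl_nmt22. The config is shown as a plain dict that mirrors what a YAML file might contain.

```python
import random
from typing import Dict, List, Tuple

SentencePair = Tuple[str, str]

# Hypothetical config, equivalent to a YAML fragment such as:
#   subsample:
#     seed: 1234
#     max_lines:
#       en-fi: 2
CONFIG = {"seed": 1234, "max_lines": {"en-fi": 2}}


def subsample_corpora(
    corpora: Dict[str, List[SentencePair]],
    max_lines: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[SentencePair]]:
    """Cap each language pair at max_lines[pair] randomly chosen sentence pairs.

    Pairs without a cap, or already at or below it, are returned unchanged.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    result = {}
    for lang_pair, pairs in corpora.items():
        cap = max_lines.get(lang_pair)
        if cap is None or len(pairs) <= cap:
            result[lang_pair] = list(pairs)
            continue
        # Sample indices (not sentences) so source and target stay aligned,
        # then sort them to preserve the original corpus order.
        keep = sorted(rng.sample(range(len(pairs)), cap))
        result[lang_pair] = [pairs[i] for i in keep]
    return result


# Toy data: a "high-resource" pair and a "low-resource" pair.
corpora = {
    "en-fi": [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")],
    "en-iu": [("x", "X")],
}
sampled = subsample_corpora(corpora, CONFIG["max_lines"], seed=CONFIG["seed"])
```

A per-pair cap is only one possible design. Sampling a fixed fraction or using temperature-based rebalancing would fit the same interface.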

j0ma commented 2 years ago

This may not be needed if using fairseq's translation_multi_simple_epoch task, which can rebalance language pairs at sampling time (e.g. with temperature-based sampling).