mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Consider statistically translating short sentences from monolingual datasets. #880

Open gregtatum opened 2 weeks ago

gregtatum commented 2 weeks ago

Short sentences are frequently removed from parallel datasets, so there aren't enough to train on.

In HPLT 2.0, the data is filtered at the document level rather than the sentence level. We could take high-scoring documents and extract short sentences from them. These short sentences are likely to be higher quality, given that they are embedded in a higher-quality document. Then, using tokenization and word alignment over the existing parallel sentences, we could statistically extract a corresponding translation for each short sentence. These synthesized pairs could be used for training.
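
As a rough illustration of the alignment step, here is a minimal sketch. It assumes the parallel sentences are already tokenized on whitespace and that word alignments are available in the Pharaoh `i-j` format (as emitted by aligners such as fast_align). The function names, the toy English-Spanish corpus, and the `min_count` threshold are all hypothetical and not part of the existing pipeline; a real version would also need proper tokenization and much larger counts.

```python
from collections import Counter, defaultdict


def parse_alignment(line):
    """Parse a Pharaoh-format alignment such as '0-0 1-2' into (src, tgt) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def extract_phrase_pairs(src_tokens, tgt_tokens, alignment, max_len=4):
    """Yield (source phrase, target phrase) pairs that are consistent with the alignment."""
    for start in range(len(src_tokens)):
        for end in range(start, min(start + max_len, len(src_tokens))):
            tgt_idx = [t for s, t in alignment if start <= s <= end]
            if not tgt_idx:
                continue  # no aligned target words for this source span
            lo, hi = min(tgt_idx), max(tgt_idx)
            # Reject the span if a target word inside it aligns to a source
            # word outside the source span.
            if any(lo <= t <= hi and not (start <= s <= end) for s, t in alignment):
                continue
            yield tuple(src_tokens[start:end + 1]), tuple(tgt_tokens[lo:hi + 1])


def build_phrase_table(corpus):
    """Count how often each source phrase is aligned to each target phrase."""
    table = defaultdict(Counter)
    for src, tgt, align in corpus:
        pairs = extract_phrase_pairs(src.split(), tgt.split(), parse_alignment(align))
        for src_phrase, tgt_phrase in pairs:
            table[src_phrase][tgt_phrase] += 1
    return table


def translate_short(sentence, table, min_count=2):
    """Return the most frequent aligned translation, or None if support is too low."""
    candidates = table.get(tuple(sentence.split()))
    if not candidates:
        return None
    phrase, count = candidates.most_common(1)[0]
    return " ".join(phrase) if count >= min_count else None


# Toy parallel corpus: (source, target, alignment). Purely illustrative.
corpus = [
    ("thank you very much", "muchas gracias", "0-1 1-1 2-0 3-0"),
    ("thank you for coming", "gracias por venir", "0-0 1-0 2-1 3-2"),
    ("I said thank you", "dije gracias", "0-0 1-0 2-1 3-1"),
]
table = build_phrase_table(corpus)
print(translate_short("thank you", table))  # gracias
```

The `min_count` threshold is one simple way to drop translations seen only once; it does nothing about the morphology concern, since the most frequent aligned form may still be the wrong one for a standalone sentence.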

The biggest risk I see here is the model learning bad translations because the wrong part of speech or morphology is used for a word. For instance, if a word's declension is different when it appears in isolation, the extracted pair may translate the sentence into an awkward form. Perhaps this would only work for the subset of languages with fewer morphological differences between word forms.