Short sentences are frequently removed from parallel datasets during filtering, so there aren't enough of them to train on.
In HPLT 2.0 the data is filtered at the document level rather than the sentence level. We could take high-scoring documents and extract short sentences from them; these short sentences are likely to be higher quality given that they are embedded in a higher-quality document. Then, using tokenization and alignment, we can statistically extract a corresponding translation from the parallel side. These synthesized pairs could be used for training.
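A minimal sketch of what that extraction step could look like, assuming parallel documents arrive as (source, target) text pairs with a document-level quality score, and substituting multilingual sentence embeddings (LaBSE via sentence-transformers) for the alignment step; the model choice, thresholds, splitter, and function names are illustrative assumptions, not HPLT tooling:

```python
import re

from sentence_transformers import SentenceTransformer, util

# LaBSE gives language-agnostic sentence embeddings; any multilingual
# sentence encoder could stand in here. (Assumed choice, not prescribed by HPLT.)
model = SentenceTransformer("sentence-transformers/LaBSE")

MAX_SHORT_TOKENS = 6    # "short sentence" cutoff in whitespace tokens (assumed)
MIN_SIMILARITY = 0.85   # alignment confidence threshold (assumed)
MIN_DOC_SCORE = 0.8     # document-level quality cutoff (assumed)


def split_sentences(text: str) -> list[str]:
    # Naive splitter for illustration; a proper per-language segmenter
    # should be used in practice.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_short_pairs(src_doc: str, tgt_doc: str, doc_score: float):
    """Return (short source sentence, best-matching target sentence) pairs."""
    if doc_score < MIN_DOC_SCORE:
        return []
    src_sents = [s for s in split_sentences(src_doc)
                 if len(s.split()) <= MAX_SHORT_TOKENS]
    tgt_sents = split_sentences(tgt_doc)
    if not src_sents or not tgt_sents:
        return []
    src_emb = model.encode(src_sents, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True)
    sims = util.cos_sim(src_emb, tgt_emb)  # (n_src, n_tgt) cosine similarities
    pairs = []
    for i, src in enumerate(src_sents):
        best = int(sims[i].argmax())
        # Keep only confidently aligned pairs to limit noisy synthetic data.
        if float(sims[i][best]) >= MIN_SIMILARITY:
            pairs.append((src, tgt_sents[best]))
    return pairs
```

Word-alignment tooling (e.g. statistical aligners over tokenized text, as the paragraph above suggests) could replace the embedding similarity here; the overall flow of filtering by document score, isolating short source sentences, and picking the best-scoring target sentence would stay the same.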
The biggest risk I see here is the model learning bad translations from the wrong part of speech or morphology for a word. For instance, if a word's declension differs in its isolated form, the model may mistranslate the sentence into an awkward form. Perhaps this would only work for subsets of languages with less morphological divergence between them.