Closed johnml1135 closed 7 months ago
So: a 50% random mix; if one source has no text, use the other source. Only 2 sources are needed. Follow the SILNLP implementation.
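A minimal sketch of the rule described above, assuming the two sources are already aligned segment-by-segment with the target and that an empty string marks a missing segment. The function name `mix_two_sources` and the `seed` parameter are hypothetical and are not taken from SILNLP or Serval.

```python
import random
from typing import List, Optional, Sequence


def mix_two_sources(
    source_a: Sequence[str],
    source_b: Sequence[str],
    seed: Optional[int] = None,
) -> List[str]:
    """Pick each segment from one of two aligned sources (hypothetical helper).

    Where both sources have text, choose between them with probability 0.5.
    Where only one has text, use that one. Where neither has text, the
    result is an empty string.
    """
    if len(source_a) != len(source_b):
        raise ValueError("sources must be aligned to the same number of segments")

    rng = random.Random(seed)  # seeded so a mix can be reproduced
    mixed: List[str] = []
    for seg_a, seg_b in zip(source_a, source_b):
        has_a = bool(seg_a.strip())
        has_b = bool(seg_b.strip())
        if has_a and has_b:
            mixed.append(seg_a if rng.random() < 0.5 else seg_b)
        elif has_a:
            mixed.append(seg_a)
        else:
            # Covers "only b has text" and "neither has text" (seg_b is empty).
            mixed.append(seg_b)
    return mixed
```

Each target segment is paired with exactly one source segment, so the target sentences are not repeated in the training data.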
Proposal: for implementation, when posting the single corpora:
@ddaspit - here is a proposal of the changes:
We need API documentation for this feature.
Spin-off of https://github.com/sillsdev/serval/issues/266 for the Serval implementation.
Mixing multiple sources from different NLLB-200 languages has been shown to give a big improvement, especially if the backtranslation language is different from that of the source text (say, an English backtranslation with a Spanish source text). Including the target sentences multiple times makes the behavior worse (the model memorizes the target sentences), but interweaving half of the target with each of 2 different sources works pretty well. How can this be implemented:
Proposal:
mixingRatio, which defaults to 1.0 but can accept any valid float number, including 1e100, etc.
mixingRatio (such as 2.0), a source can be aligned with more of the target sentences
mixingRatio specified and Mark and Matthew will be exclusively from the other source.
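One way to read the mixingRatio proposal is as a per-source weight in a random choice, applied only among the sources that actually have text for a given segment. The sketch below follows that reading. It is an assumption, not the Serval API: `mix_weighted`, `mixing_ratios`, and the source names are hypothetical, and ratios are assumed to be positive, finite floats.

```python
import random
from typing import Dict, List, Optional


def mix_weighted(
    sources: Dict[str, List[str]],
    mixing_ratios: Optional[Dict[str, float]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Choose one source per segment, weighted by a per-source ratio.

    A source without an entry in `mixing_ratios` gets the default 1.0.
    Sources with no text for a segment are skipped, so the segment comes
    from whichever sources remain.
    """
    ratios = mixing_ratios or {}
    lengths = {len(segments) for segments in sources.values()}
    if len(lengths) > 1:
        raise ValueError("sources must be aligned to the same number of segments")
    count = lengths.pop() if lengths else 0

    rng = random.Random(seed)
    mixed: List[str] = []
    for i in range(count):
        # Only sources that have text for this segment are candidates.
        candidates = [name for name, segs in sources.items() if segs[i].strip()]
        if not candidates:
            mixed.append("")
            continue
        weights = [ratios.get(name, 1.0) for name in candidates]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        mixed.append(sources[chosen][i])
    return mixed
```

Under this reading, a ratio of 2.0 against a default of 1.0 gives a source about two thirds of the segments that both sources cover, a very large ratio such as 1e100 makes that source win wherever it has text, and books that one source lacks (Mark and Matthew in the example above) come entirely from the other source.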