mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

allow for more than 99 training datasets (fixes #653) #682

Open bhearsum opened 1 week ago

bhearsum commented 1 week ago

The most visible change here is the addition of post-bicleaner-dummy tasks that sit between bicleaner and merge-corpus. Each of these tasks depends on up to 98 bicleaner tasks and republishes the artifacts from each of them. merge-corpus then pulls its artifacts directly from these dummy tasks.
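The fan-in described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `group_bicleaner_tasks`, the `post-bicleaner-dummy-N` label format, and the `bicleaner-dataset-N` labels are all hypothetical, and the only value taken from the description is the limit of 98 dependencies per task.

```python
from typing import Dict, List

# Per-task dependency limit stated in the PR description.
MAX_DEPENDENCIES = 98


def group_bicleaner_tasks(
    bicleaner_labels: List[str], chunk_size: int = MAX_DEPENDENCIES
) -> Dict[str, List[str]]:
    """Assign bicleaner tasks to dummy tasks, at most `chunk_size` per dummy.

    The dummy task labels are hypothetical; the real pipeline may name
    them differently.
    """
    groups: Dict[str, List[str]] = {}
    for start in range(0, len(bicleaner_labels), chunk_size):
        index = start // chunk_size + 1
        groups[f"post-bicleaner-dummy-{index}"] = bicleaner_labels[
            start : start + chunk_size
        ]
    return groups


# Example: 250 hypothetical bicleaner tasks need three dummy tasks.
labels = [f"bicleaner-dataset-{i}" for i in range(250)]
groups = group_bicleaner_tasks(labels)
for dummy, deps in groups.items():
    print(dummy, len(deps))
```

In this sketch merge-corpus would depend only on the keys of `groups`, so its own dependency count stays small even when the number of datasets is large.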

Ideally we would still pull artifacts from the bicleaner tasks, but doing so is not a trivial matter. I did some math on the cost of republishing, and my best estimate is in the three figures per year if we had 100 full-sized training runs, so IMO it's not worth the cost or complexity to take another path.

We can depend on up to 98 post-bicleaner-dummy tasks, so we can have 98*98 training datasets now (~9600), which is more than enough for the foreseeable future.
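To make the arithmetic concrete, here is a small sketch of the capacity check. It assumes only the limit of 98 dependencies per task from the description; the helper names `max_datasets` and `dummy_tasks_needed` are hypothetical and do not come from the pipeline.

```python
import math

# Per-task dependency limit stated in the PR description.
MAX_DEPENDENCIES = 98


def max_datasets(limit: int = MAX_DEPENDENCIES) -> int:
    """Upper bound on datasets with one layer of dummy tasks."""
    # merge-corpus depends on up to `limit` dummy tasks, and each dummy
    # task depends on up to `limit` bicleaner tasks.
    return limit * limit


def dummy_tasks_needed(num_datasets: int, limit: int = MAX_DEPENDENCIES) -> int:
    """Number of dummy tasks required for `num_datasets` bicleaner tasks."""
    needed = math.ceil(num_datasets / limit)
    if needed > limit:
        raise ValueError(
            f"{num_datasets} datasets exceed the capacity of {max_datasets(limit)}"
        )
    return needed


print(max_datasets())            # 98 * 98 = 9604
print(dummy_tasks_needed(100))   # 100 datasets fit in 2 dummy tasks
```

The exact bound is 9604, which matches the rounded ~9600 figure above.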

Some refactoring and preparatory work was necessary before fixing this; it is described in the individual commits.