mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

allow for more than 99 training datasets (fixes #653) #682

Closed: bhearsum closed this 1 month ago

bhearsum commented 4 months ago

The most visible change here is the addition of post-bicleaner-dummy tasks that sit between bicleaner and merge-corpus. These tasks depend on up to 98 bicleaner tasks themselves, and republish the artifacts from each. merge-corpus pulls artifacts directly from these dummy tasks.

Ideally we would still pull artifacts from the bicleaner tasks, but doing so is not a trivial matter. I did some math on the cost of republishing, and my best estimate is in the three figures per year even if we had 100 full-sized training runs - so IMO it's not worth the cost or complexity to take another path.

We can depend on up to 98 post-bicleaner-dummy tasks, so we can now have up to 98*98 (~9,600) training datasets, which is more than enough for the foreseeable future.
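Roughly, the fan-in works out as follows (a conceptual sketch only; the helper and dataset names below are made up and this is not the actual pipeline code):

```python
# Conceptual sketch of the two-level fan-in described above (not pipeline code).
# Taskcluster caps direct dependencies per task, so merge-corpus depends on a
# layer of post-bicleaner-dummy tasks, each of which depends on up to 98
# bicleaner tasks and republishes their artifacts.

MAX_DEPS_PER_DUMMY = 98  # per the description above


def chunk_bicleaner_tasks(bicleaner_labels):
    """Group bicleaner task labels into dummy-task-sized chunks."""
    return [
        bicleaner_labels[i : i + MAX_DEPS_PER_DUMMY]
        for i in range(0, len(bicleaner_labels), MAX_DEPS_PER_DUMMY)
    ]


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    datasets = [f"bicleaner-dataset-{n}-en-ru" for n in range(250)]
    dummies = chunk_bicleaner_tasks(datasets)
    print(len(dummies), "post-bicleaner-dummy tasks")   # 3
    # merge-corpus can itself depend on up to 98 dummies:
    print(98 * MAX_DEPS_PER_DUMMY, "datasets max")      # 9604
```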

There was some necessary refactoring and preamble before fixing this which is described in the individual commits.

bhearsum commented 4 months ago

On the topic of increasing the limit: I agree that is the ideal fix (and it would be useful in a whole lot of other places as well). I've been chatting with the Taskcluster folks about it. The crux of the issue is that there are some inner loops that are O(dependencies) and have caused reliability issues in the past. https://bugzilla.mozilla.org/show_bug.cgi?id=1316593 speaks to this a bit, and there's a thread in #taskcluster on Matrix from June 19th with a bit more background.

It's not impossible to do this, but the Taskcluster folks are hesitant because of issues in the past. I'm going to continue that conversation and see if we can either prove that we won't hurt reliability by increasing or removing the limit, or find a way to change or rearchitect the code to remove the problematic inner loops.

I don't expect either of these things to be quick, though. I'm happy to simply forgo this "fix" in the meantime if we can live with the 99 training dataset limit - totally up to you folks.

bhearsum commented 4 months ago

> I don't understand how the dummy chunking happens. I generated a taskgraph locally with a couple hundred fake datasets. The graph produced 3 all- tasks: all-en-ru-1, all-en-ru-2, all-en-ru-3. The chunking for the post-bicleaner step was named a bit oddly, post-bicleaner-dummy-ai-opus-ELRC-5067-SciPar_v1-en-ru-2, as it included a dataset name. I'm unsure where this splitting is happening.

Hm; I may have missed a needed from-deps setting in post-bicleaner-dummy (from-deps is what's dealing with the chunking: https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/transforms/from_deps.html). I'll look into this - thanks for pointing it out.
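For context, the difference between the expected and observed behaviour comes down to how generated tasks are grouped against their upstream dependencies. A rough sketch of the two strategies (simplified and hypothetical; this is not the real from-deps implementation):

```python
# Rough illustration of two grouping strategies a from-deps-style transform
# might use (simplified; not the real taskgraph code).

def group_all(dep_labels):
    """One downstream task that depends on every upstream task."""
    return [list(dep_labels)]


def group_single(dep_labels):
    """One downstream task per upstream task; the generated name then carries
    the upstream dataset, e.g. ...-ai-opus-ELRC-5067-SciPar_v1-en-ru."""
    return [[label] for label in dep_labels]


deps = ["bicleaner-ai-opus-ELRC-5067-SciPar_v1-en-ru", "bicleaner-other-en-ru"]
print(len(group_all(deps)))     # 1 generated task
print(len(group_single(deps)))  # 2 generated tasks, each named after a dataset
```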

bhearsum commented 4 months ago

Just to crosslink, https://github.com/mozilla/firefox-translations-training/issues/653#issuecomment-2203472524 contains another possible solution for this problem.

bhearsum commented 3 months ago

> I don't understand how the dummy chunking happens. I generated a taskgraph locally with a couple hundred fake datasets. The graph produced 3 all- tasks: all-en-ru-1, all-en-ru-2, all-en-ru-3. The chunking for the post-bicleaner step was named a bit oddly, post-bicleaner-dummy-ai-opus-ELRC-5067-SciPar_v1-en-ru-2, as it included a dataset name. I'm unsure where this splitting is happening.

To explain the naming a bit more: it happens in a few stages, and in taskgraph terms the label is actually the thing we're talking about.

First, by default from-deps derives the name for generated tasks by stripping the kind off the primary-dependency, which is the first kind listed in kind-dependencies when not explicitly specified. For the all kind the primary dependency is export, which means we end up with, e.g., en-ru (from export-en-ru).

Second, dependency_dummies comes along and potentially splits this one task into multiple tasks (depending on the number of dependencies). This gives us things like en-ru-1.

Finally, taskgraph sets the label based on the name, which is how the kind name (all in this example) gets prepended.

(All of this can be bypassed by setting an explicit label, but that doesn't work very well when you're generating more than 1 task from a single kind entry.)
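As a worked example of those stages (a simplified, illustrative sketch, not the actual taskgraph or dependency_dummies code):

```python
# Simplified walk-through of the label derivation described above
# (illustrative only; constants and logic are approximations).

MAX_DEPS = 98

# 1. from-deps derives the name by stripping the primary dependency's kind.
primary_dep_label = "export-en-ru"
name = primary_dep_label.removeprefix("export-")           # "en-ru"

# 2. dependency_dummies splits one task into several when it has too many
#    dependencies, numbering each chunk.
num_deps = 250
num_chunks = -(-num_deps // MAX_DEPS)                       # ceil division -> 3
names = [f"{name}-{i}" for i in range(1, num_chunks + 1)]   # en-ru-1 ... en-ru-3

# 3. taskgraph prepends the kind to form the label.
kind = "all"
labels = [f"{kind}-{n}" for n in names]
print(labels)  # ['all-en-ru-1', 'all-en-ru-2', 'all-en-ru-3']
```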

eu9ene commented 3 months ago

I'm ok with merging this if there's no other way based on https://bugzilla.mozilla.org/show_bug.cgi?id=1316593.

bhearsum commented 3 months ago

> I'm ok with merging this if there's no other way based on https://bugzilla.mozilla.org/show_bug.cgi?id=1316593.

As it turns out, we've made some headway on adjusting the maximum number of dependencies in the past few days! See https://github.com/taskcluster/taskcluster/issues/7151. It's not guaranteed to happen, but recent investigations are encouraging.

If you'd like to hold off on this until we see the outcome of that, I'm happy to let it sit.

eu9ene commented 3 months ago

> > I'm ok with merging this if there's no other way based on https://bugzilla.mozilla.org/show_bug.cgi?id=1316593.
>
> As it turns out, we've made some headway on adjusting the maximum number of dependencies in the past few days! See taskcluster/taskcluster#7151. It's not guaranteed to happen, but recent investigations are encouraging.
>
> If you'd like to hold off on this until we see the outcome of that, I'm happy to let it sit.

I think we're not in a rush to land this, since our workaround of removing small datasets works fine for now.

bhearsum commented 1 month ago

Taskcluster has gained support for 10,000 dependencies per task, so this change is no longer needed. After #858 merges we'll be able to take advantage of this.