On the topic of increasing the limit: I agree that is the ideal fix (and would be useful in a whole lot of other places as well). I've been chatting with the Taskcluster folks about it. The crux of the issue is that there are some inner loops that are O(dependencies), which have caused reliability issues in the past. https://bugzilla.mozilla.org/show_bug.cgi?id=1316593 speaks to this a bit, and there's a thread in #taskcluster on Matrix from June 19th with a bit more background.
It's not impossible to do, but the Taskcluster folks are hesitant because of those past issues. I'm going to continue that conversation and see if we can either prove that increasing or removing the limit won't hurt reliability, or find a way to change or rearchitect the code to remove the problematic inner loops.
I don't expect either of these to be quick, though. I'm happy to simply forgo this "fix" in the meantime if we can live with the 99 training dataset limit - totally up to you folks.
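For a sense of where the O(dependencies) cost shows up, here's a purely illustrative sketch (not Taskcluster's actual resolver code): a naive "which tasks are ready?" pass that rescans every dependency of every task does work proportional to tasks times dependencies, which is why raising the cap is treated cautiously.

```python
# Purely illustrative sketch of the concern, NOT Taskcluster's real code.
# A naive readiness check rescans every dependency of every task, so each
# pass costs O(num_tasks * max_dependencies). Raising the dependency cap
# makes that inner loop proportionally more expensive.

def unblocked(tasks: dict[str, list[str]], finished: set[str]) -> list[str]:
    ready = []
    for label, deps in tasks.items():          # outer loop: every task
        if all(d in finished for d in deps):   # inner loop: O(dependencies)
            ready.append(label)
    return ready

# Example: one fan-in task with many dependencies keeps the inner loop long.
tasks = {"merge-corpus-en-ru": [f"bicleaner-ds{i}-en-ru" for i in range(500)]}
print(unblocked(tasks, finished={f"bicleaner-ds{i}-en-ru" for i in range(500)}))
```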
I don't understand how the dummy chunking happens. I generated a taskgraph locally with a couple hundred fake datasets. The graph produced 3 `all-` tasks: `all-en-ru-1`, `all-en-ru-2`, `all-en-ru-3`. The chunking for the post-bicleaner step was named a bit more oddly, `post-bicleaner-dummy-ai-opus-ELRC-5067-SciPar_v1-en-ru-2`, as it included a dataset name. I'm unsure where this splitting is happening.
Hm; I may have missed a needed `from-deps` setting in `post-bicleaner-dummy` (`from-deps` is what's dealing with the chunking: https://taskcluster-taskgraph.readthedocs.io/en/latest/reference/transforms/from_deps.html). I'll look into this - thanks for pointing it out.
Just to crosslink, https://github.com/mozilla/firefox-translations-training/issues/653#issuecomment-2203472524 contains another possible solution for this problem.
> I don't understand how the dummy chunking happens. I generated a taskgraph locally with a couple hundred fake datasets. The graph produced 3 `all-` tasks: `all-en-ru-1`, `all-en-ru-2`, `all-en-ru-3`. The chunking for the post-bicleaner step was named a bit more oddly, `post-bicleaner-dummy-ai-opus-ELRC-5067-SciPar_v1-en-ru-2`, as it included a dataset name. I'm unsure where this splitting is happening.
To explain the naming a bit more, it's done in a few stages. In taskgraph terms, `label` is actually the thing we're talking about.
First, the default behaviour of `from-deps` will derive the name for generated tasks. The default (which we use) is to strip the `kind` off the `primary-dependency`, which is the first one listed in `kind-dependencies` when not explicitly specified. For `all` this is `export`, which means we end up with, e.g., `en-ru` (from `export-en-ru`).
Second, `dependency_dummies` comes along and potentially splits this one task into multiple tasks (depending on the number of dependencies). This gives us things like `en-ru-1`.
Finally, taskgraph sets the `label` based on the `name` over here, which is how the kind name (`all` in this example) gets prepended.
(All of this can be bypassed by setting an explicit `label`, but that doesn't work very well when you're generating more than 1 task from a single `kind` entry.)
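Purely as an illustration of how those three stages compose (this is not the actual taskgraph or `dependency_dummies` code; the helper names and the 99-dependency cap are assumptions for the example), the flow looks roughly like this:

```python
# Rough sketch of the three naming stages described above. The helpers and
# the 99-dependency cap are hypothetical; the real logic lives in taskgraph's
# from_deps transform and this repo's dependency_dummies transform.

MAX_DEPS = 99  # assumed per-task dependency limit being worked around

def strip_kind(dep_label: str, dep_kind: str) -> str:
    # Stage 1: from-deps derives the name from the primary dependency by
    # stripping its kind, e.g. "export-en-ru" -> "en-ru".
    return dep_label[len(dep_kind) + 1:]

def split_into_dummies(name: str, deps: list[str]) -> dict[str, list[str]]:
    # Stage 2: dependency_dummies splits one task into numbered tasks when
    # it has too many dependencies, e.g. "en-ru" -> "en-ru-1", "en-ru-2", ...
    chunks = [deps[i:i + MAX_DEPS] for i in range(0, len(deps), MAX_DEPS)]
    if len(chunks) == 1:
        return {name: deps}
    return {f"{name}-{i}": chunk for i, chunk in enumerate(chunks, start=1)}

def label_for(kind: str, name: str) -> str:
    # Stage 3: taskgraph prepends the kind to the name to form the label,
    # e.g. "all" + "en-ru-2" -> "all-en-ru-2".
    return f"{kind}-{name}"

# A couple hundred fake datasets, as in the local taskgraph experiment above.
deps = [f"export-ds{i}-en-ru" for i in range(250)]
name = strip_kind("export-en-ru", "export")                    # "en-ru"
labels = [label_for("all", n) for n in split_into_dummies(name, deps)]
print(labels)  # ['all-en-ru-1', 'all-en-ru-2', 'all-en-ru-3']
```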
I'm ok with merging this if there's no other way based on https://bugzilla.mozilla.org/show_bug.cgi?id=1316593.
> I'm ok with merging this if there's no other way based on https://bugzilla.mozilla.org/show_bug.cgi?id=1316593.
As it turns out, we've been making some headway on adjusting the maximum number of dependencies in the past few days! See https://github.com/taskcluster/taskcluster/issues/7151. It's not a guarantee that it will happen, but recent investigations are encouraging.
If you'd like to hold off on this until we see the outcome of that, I'm happy to let it sit.
> As it turns out, we've been making some headway on adjusting the maximum number of dependencies in the past few days! See taskcluster/taskcluster#7151. It's not a guarantee that it will happen, but recent investigations are encouraging.
>
> If you'd like to hold off on this until we see the outcome of that, I'm happy to let it sit.
I think we're not in a rush to land this, since our workaround of removing small datasets works fine for now.
Taskcluster has gained support for 10,000 dependencies per task. This is no longer needed. After #858 merges we'll be able to take advantage of this.
The most visible change here is the addition of `post-bicleaner-dummy` tasks that sit between `bicleaner` and `merge-corpus`. These tasks depend on up to 98 bicleaner tasks themselves, and republish the artifacts from each. `merge-corpus` pulls artifacts directly from these dummy tasks.

Ideally we would still pull artifacts from the `bicleaner` tasks, but doing so is not a trivial matter. I did some math on the cost of republishing, and my best estimate is in the three figures per year if we had 100 full-sized training runs - so IMO it's not worth the cost or complexity to take another path.

We can depend on up to 98 `post-bicleaner-dummy` tasks, so we can have 98*98 training datasets now (~9600), which is more than enough for the foreseeable future.

There was some necessary refactoring and preamble before fixing this, which is described in the individual commits.
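To make the fan-in concrete, here's a rough sketch (not the real transform code; the task names, the 98-dependency cap, and the helper are assumptions for illustration) of how the dummy layer keeps every task under the limit while `merge-corpus` still reaches all datasets:

```python
# Illustrative sketch of the post-bicleaner-dummy fan-in, not the real
# transform code. Task names and the 98-dependency cap are assumptions.
from math import ceil

MAX_DEPS = 98  # dependencies each task may have under the old limit

def make_dummy_tasks(bicleaner_labels: list[str]) -> dict[str, list[str]]:
    """Group bicleaner tasks into post-bicleaner-dummy tasks of <= 98 deps each."""
    n_dummies = ceil(len(bicleaner_labels) / MAX_DEPS)
    return {
        f"post-bicleaner-dummy-en-ru-{i + 1}": bicleaner_labels[i * MAX_DEPS:(i + 1) * MAX_DEPS]
        for i in range(n_dummies)
    }

# merge-corpus depends on the dummy tasks instead of on bicleaner directly,
# so the effective ceiling is 98 dummies * 98 datasets each = 9604 datasets.
bicleaner_tasks = [f"bicleaner-ds{i}-en-ru" for i in range(250)]
dummies = make_dummy_tasks(bicleaner_tasks)
merge_corpus_deps = list(dummies)   # 3 dummy tasks cover 250 datasets
print(len(merge_corpus_deps), MAX_DEPS * MAX_DEPS)  # 3 9604
```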