The most visible change here is the addition of post-bicleaner-dummy tasks that sit between bicleaner and merge-corpus. These tasks depend on up to 98 bicleaner tasks themselves, and republish the artifacts from each. merge-corpus pulls artifacts directly from these dummy tasks.
Ideally we would still pull artifacts from the bicleaner tasks, but doing so is not a trivial matter. I did some math on the cost of republishing, and my best estimate is in the three figures per year if we had 100 full-sized training runs, so IMO it's not worth the cost or complexity to take another path.
merge-corpus can depend on up to 98 post-bicleaner-dummy tasks, each of which covers up to 98 bicleaner tasks, so we can now have 98*98 = 9,604 training datasets, which is more than enough for the foreseeable future.
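To make the fan-in arithmetic concrete, here is a minimal sketch of how bicleaner tasks could be grouped under dummy tasks. It is not the code from this PR: the function names, the numbered label scheme, and the `MAX_DEPS` constant are assumptions made for illustration, and only the limit of 98 comes from the description above.

```python
MAX_DEPS = 98  # per-task dependency limit quoted in this description


def chunk_labels(labels, size=MAX_DEPS):
    """Split upstream task labels into groups of at most `size`."""
    return [labels[i:i + size] for i in range(0, len(labels), size)]


def plan_dummy_tasks(bicleaner_labels):
    """Map hypothetical dummy-task labels to the bicleaner tasks they republish."""
    groups = chunk_labels(bicleaner_labels)
    # merge-corpus is subject to the same limit, so one layer of dummy
    # tasks can cover at most MAX_DEPS groups.
    if len(groups) > MAX_DEPS:
        raise ValueError("too many datasets for one layer of dummy tasks")
    return {
        f"post-bicleaner-dummy-{n}": group
        for n, group in enumerate(groups, start=1)
    }


# Illustrative input: 250 made-up bicleaner task labels.
labels = [f"bicleaner-{i}" for i in range(250)]
plan = plan_dummy_tasks(labels)
print(len(plan))            # 3 dummy tasks (98 + 98 + 54 upstream tasks)
print(MAX_DEPS * MAX_DEPS)  # 9604, the ceiling with one dummy layer
```

With this grouping, merge-corpus would depend only on the three dummy tasks rather than on all 250 bicleaner tasks directly.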
Some refactoring and preparatory work was necessary before fixing this; it is described in the individual commits.