eu9ene opened this issue 4 months ago
I agree that we need to step back and take a more holistic look at how to deal with this.
We don't really have a great analogue for "things on disk" in Taskcluster. The closest is cached tasks, but obviously we've run into a number of issues with that in the past, and most of the tools listed above have been workarounds for issues we've had with it. To be quite honest, if the level of flexibility that Snakemake provided (i.e., being able to manipulate the pipeline by adjusting things on disk) is needed, it may be worth considering reviving it, and perhaps adding the ability to scale into the cloud with Slurm. I seriously doubt that Taskcluster will be able to provide the same level of flexibility and simplicity at the same time.
With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
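To make this idea a bit more tangible, here is a minimal sketch of what such a config section could look like; the key names (`task`, `gcs`), labels, and values are all hypothetical and don't reflect an existing schema.

```python
# Hypothetical sketch only: the keys, labels, and values are invented for
# illustration and do not match any existing training config schema.
upstream_artifacts = {
    # Reuse a finished Taskcluster task by its ID.
    "train-backwards-ru-en": {"task": "TASK_ID_PLACEHOLDER"},
    # Reuse an artifact that was previously uploaded to GCS.
    "corpus-clean-ru-en": {"gcs": "gs://example-bucket/ru-en/corpus/"},
    # Any upstream not listed here would be scheduled as a new task.
}
```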
Whatever the goals and/or use cases are, having a complete list of them written down would be very helpful. We can't design a solution until we know the extent of the problem!
> With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
Incidentally, https://github.com/mozilla/firefox-translations-training/pull/683 has largely implemented this already when it comes to pulling from tasks (we can still only pull pretrained models from elsewhere). As we use it a bit more, it may help us evaluate this option.
One implication of taking things to this extreme is that I imagine we'd need a fairly good CLI tool and/or UI to make it practical. It wouldn't be shocking to end up pointing at 5 or 6 different locations (be they task groups or buckets), and it would be unfortunate if collecting all of the existing tasks or artifact locations ends up taking hours and/or being error-prone.
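To get a feel for how much tooling that would take, here is a minimal sketch of a helper that lists completed tasks across several task groups so a person (or a wrapper script) can pick what to reuse. It assumes the `taskcluster` Python client and the firefox-ci root URL, and it uses a task's metadata name as a stand-in for its label; real tooling would need to do considerably more.

```python
import sys

import taskcluster

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumed root URL


def completed_tasks(queue, task_group_ids):
    """Yield (name, task_id) for every completed task in the given task groups."""
    for group_id in task_group_ids:
        query = {}
        while True:
            # Only pass a query once we have a continuation token.
            kwargs = {"query": query} if query else {}
            response = queue.listTaskGroup(group_id, **kwargs)
            for entry in response["tasks"]:
                if entry["status"]["state"] == "completed":
                    # Metadata name as a stand-in for the taskgraph label.
                    yield entry["task"]["metadata"]["name"], entry["status"]["taskId"]
            token = response.get("continuationToken")
            if not token:
                break
            query = {"continuationToken": token}


if __name__ == "__main__":
    queue = taskcluster.Queue({"rootUrl": ROOT_URL})
    for name, task_id in completed_tasks(queue, sys.argv[1:]):
        print(f"{task_id}  {name}")
```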
@eu9ene, @ahal, and I met today to talk about this. There are some fairly extensive notes from it, but here's a high-level summary with the main takeaways:

- `existing_tasks` is the most flexible mechanism for adjusting dependencies that taskgraph provides, and is the most comparable thing to the way we used to use files on disk to move things along with Snakemake (a small sketch of its shape follows this list).
- Populating `existing_tasks` is not necessarily a trivial thing. Ideally, all upstream tasks would end up in it, which will often be from many different task groups. To do this effectively, we need good tooling that can help us find and sort through possible candidates for upstream tasks we wish to reuse.
- No `taskcluster` code should be needed to try this out, as it will already prioritize `existing_tasks` ahead of anything else.
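For concreteness, `existing_tasks` is essentially a mapping from task labels to task IDs. A toy sketch of combining mappings gathered from several earlier runs (later runs winning on conflicts, on the assumption that the newest rerun of a stage is the one to reuse) could be as simple as this; labels and IDs are placeholders.

```python
def merge_existing_tasks(*runs):
    """Merge {label: task_id} mappings from several earlier runs.

    Later mappings win on conflict, so the most recent rerun of a stage is
    the one that gets reused. A sketch, not pipeline code.
    """
    merged = {}
    for mapping in runs:
        merged.update(mapping)
    return merged


# Placeholder labels and task IDs.
first_run = {"train-backwards-ru-en": "TASK_ID_A"}
second_run = {
    "train-backwards-ru-en": "TASK_ID_B",
    "translate-mono-src-ru-en-1": "TASK_ID_C",
}

existing_tasks = merge_existing_tasks(first_run, second_run)
# This mapping would go into the training config, and (as noted above) the
# decision task prefers these task IDs over scheduling new tasks.
```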
While working on the big training run and bug fixes, I ran into many issues with scheduling specific tasks. Basically, the graph and caches can be in an arbitrary state, and we should still be able to run the pipeline starting from specific stages while reusing the stages that ran before.
We currently have several tools to work with (a sketch of how they fit together follows this list):

- `target-stage`
- `start_stage` and `previous_group_ids`
- `existing_tasks`
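For the tools just listed, here is a hedged sketch of how they might sit together in a training config, written as a Python dict for illustration; the real config is YAML, and the exact key spellings, nesting, and semantics may differ from what's shown.

```python
# Hedged illustration only: key names come from the list above; values are
# placeholders, and the comments describe the semantics only roughly.
training_config_excerpt = {
    # Roughly: don't generate anything past this stage.
    "target-stage": "STAGE_TO_STOP_AT",
    # Roughly: (re)run the pipeline from this stage onward...
    "start_stage": "STAGE_TO_START_FROM",
    # ...reusing earlier results found in these previous task groups.
    "previous_group_ids": ["GROUP_ID_1", "GROUP_ID_2"],
    # Explicit label -> task ID overrides, preferred over everything else.
    "existing_tasks": {"train-backwards-ru-en": "TASK_ID_PLACEHOLDER"},
}
```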
Usually the issue shows up when the pipeline starts scheduling tasks I don't need, and I have to come up with a workaround using those tools. Using the current tools also adds significant mental load and is hard when training tens of languages with fixes at the same time (see this PR). We should rethink this approach to make it more flexible and easier to use.
Maybe introducing a concept of state, similar to the data on disk in Snakemake, could help here. Snakemake had an option to skip the smart scheduling based on file creation dates and information about past runs, and instead treat everything present on disk as completed tasks and schedule only the rest.
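As a thought experiment, the "treat everything present on disk as completed" behaviour is easy to express; a toy sketch with invented stage names and paths:

```python
from pathlib import Path

# Invented stage -> expected-output mapping, standing in for pipeline state.
EXPECTED_OUTPUTS = {
    "clean-corpus": "data/clean/corpus.en-ru.zst",
    "train-backwards": "models/backwards/model.npz",
    "train-teacher": "models/teacher/model.npz",
}


def plan(workdir="."):
    """Treat any stage whose expected output exists as completed, ignoring
    timestamps and run history, and return the remaining stages to schedule."""
    completed, to_schedule = [], []
    for stage, relative_path in EXPECTED_OUTPUTS.items():
        bucket = completed if (Path(workdir) / relative_path).exists() else to_schedule
        bucket.append(stage)
    return completed, to_schedule
```

The open question is what the Taskcluster analogue of "present on disk" would be; based on the discussion above, presumably some combination of `existing_tasks` entries and artifacts already uploaded to GCS.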