eu9ene opened this issue 4 months ago
I agree that we need to step back and take a more holistic look at how to deal with this.
We don't really have a great analogue for "things on disk" in Taskcluster. The closest is cached tasks, but obviously we've run into a number of issues with that in the past, and most of the tools listed above have been workarounds for issues we've had with it. To be quite honest, if the level of flexibility that Snakemake provided (i.e., being able to manipulate the pipeline by adjusting things on disk) is needed, it may be worth considering reviving it, and perhaps adding the ability to scale into the cloud with Slurm. I seriously doubt that Taskcluster will be able to provide the same level of flexibility and simplicity at the same time.
With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
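To make this idea a bit more tangible, here is a minimal sketch of what such a config section could look like; the key names (`task`, `gcs`), labels, and values are all hypothetical and don't reflect an existing schema.

```python
# Hypothetical sketch only: the keys, labels, and values are invented for
# illustration and do not match any existing training config schema.
upstream_artifacts = {
    # Reuse a finished Taskcluster task by its ID.
    "train-backwards-ru-en": {"task": "TASK_ID_PLACEHOLDER"},
    # Reuse an artifact that was previously uploaded to GCS.
    "corpus-clean-ru-en": {"gcs": "gs://example-bucket/ru-en/corpus/"},
    # Any upstream not listed here would be scheduled as a new task.
}
```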
Whatever the goals and/or use cases are, having a complete list of them written down would be very helpful. We can't design a solution until we know the extent of the problem!
> With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
Incidentally, https://github.com/mozilla/firefox-translations-training/pull/683 has largely implemented this already when it comes to pulling from tasks (we can still only pull pretrained models from elsewhere). As we use it a bit more, it may help us evaluate this option.
One implication of taking things to this extreme is that I imagine we'd need a fairly good CLI tool and/or UI to make it practical. It wouldn't be shocking to end up pointing at 5 or 6 different locations (be they task groups or buckets), and it would be unfortunate if collecting all of the existing tasks or artifact locations ends up taking hours and/or being error-prone.
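To get a feel for how much tooling that would take, here is a minimal sketch of a helper that lists completed tasks across several task groups so a person (or a wrapper script) can pick what to reuse. It assumes the `taskcluster` Python client and the firefox-ci root URL, and it uses a task's metadata name as a stand-in for its label; real tooling would need to do considerably more.

```python
import sys

import taskcluster

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumed root URL


def completed_tasks(queue, task_group_ids):
    """Yield (name, task_id) for every completed task in the given task groups."""
    for group_id in task_group_ids:
        query = {}
        while True:
            # Only pass a query once we have a continuation token.
            kwargs = {"query": query} if query else {}
            response = queue.listTaskGroup(group_id, **kwargs)
            for entry in response["tasks"]:
                if entry["status"]["state"] == "completed":
                    # Metadata name as a stand-in for the taskgraph label.
                    yield entry["task"]["metadata"]["name"], entry["status"]["taskId"]
            token = response.get("continuationToken")
            if not token:
                break
            query = {"continuationToken": token}


if __name__ == "__main__":
    queue = taskcluster.Queue({"rootUrl": ROOT_URL})
    for name, task_id in completed_tasks(queue, sys.argv[1:]):
        print(f"{task_id}  {name}")
```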
@eu9ene, @ahal, and I met today to talk about this. There are some fairly extensive notes from it, but here's a high-level summary with the main takeaways:

- `existing_tasks` is the most flexible mechanism for adjusting dependencies that taskgraph provides, and is the most comparable thing to the way we used to use files on disk to move things along with Snakemake (a small sketch of its shape follows this list).
- Populating `existing_tasks` is not necessarily a trivial thing. Ideally, all upstream tasks would end up in it, which will often be from many different task groups. To do this effectively, we need good tooling that can help us find and sort through possible candidates for upstream tasks we wish to reuse.
- No `taskcluster` code should be needed to try this out, as it will already prioritize `existing_tasks` ahead of anything else.
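For concreteness, `existing_tasks` is essentially a mapping from task labels to task IDs. A toy sketch of combining mappings gathered from several earlier runs (later runs winning on conflicts, on the assumption that the newest rerun of a stage is the one to reuse) could be as simple as this; labels and IDs are placeholders.

```python
def merge_existing_tasks(*runs):
    """Merge {label: task_id} mappings from several earlier runs.

    Later mappings win on conflict, so the most recent rerun of a stage is
    the one that gets reused. A sketch, not pipeline code.
    """
    merged = {}
    for mapping in runs:
        merged.update(mapping)
    return merged


# Placeholder labels and task IDs.
first_run = {"train-backwards-ru-en": "TASK_ID_A"}
second_run = {
    "train-backwards-ru-en": "TASK_ID_B",
    "translate-mono-src-ru-en-1": "TASK_ID_C",
}

existing_tasks = merge_existing_tasks(first_run, second_run)
# This mapping would go into the training config, and (as noted above) the
# decision task prefers these task IDs over scheduling new tasks.
```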
While working on the big training run and bug fixes, I ran into many issues with scheduling specific tasks. Basically, the graph and caches can be in an arbitrary state, and we should still be able to run the pipeline starting from specific stages while reusing the stages that ran before.
We currently have several tools to work with (a sketch of how they fit together follows this list):

- `target-stage`
- `start_stage` and `previous_group_ids`
- `existing_tasks`
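For the tools just listed, here is a hedged sketch of how they might sit together in a training config, written as a Python dict for illustration; the real config is YAML, and the exact key spellings, nesting, and semantics may differ from what's shown.

```python
# Hedged illustration only: key names come from the list above; values are
# placeholders, and the comments describe the semantics only roughly.
training_config_excerpt = {
    # Roughly: don't generate anything past this stage.
    "target-stage": "STAGE_TO_STOP_AT",
    # Roughly: (re)run the pipeline from this stage onward...
    "start_stage": "STAGE_TO_START_FROM",
    # ...reusing earlier results found in these previous task groups.
    "previous_group_ids": ["GROUP_ID_1", "GROUP_ID_2"],
    # Explicit label -> task ID overrides, preferred over everything else.
    "existing_tasks": {"train-backwards-ru-en": "TASK_ID_PLACEHOLDER"},
}
```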
Usually the issue shows up when the pipeline starts scheduling tasks I don't need, and I have to come up with a workaround using those tools. Using the current tools also adds significant mental load and is hard when training tens of languages with fixes at the same time (see this PR). We should rethink this approach to make it more flexible and easier to use.
Maybe introducing a concept of state, similar to the data on disk in Snakemake, could help here. Snakemake had an option to skip the smart scheduling based on file creation dates and information about past runs, and instead treat everything present on disk as completed tasks and schedule only the rest.
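As a thought experiment, the "treat everything present on disk as completed" behaviour is easy to express; a toy sketch with invented stage names and paths:

```python
from pathlib import Path

# Invented stage -> expected-output mapping, standing in for pipeline state.
EXPECTED_OUTPUTS = {
    "clean-corpus": "data/clean/corpus.en-ru.zst",
    "train-backwards": "models/backwards/model.npz",
    "train-teacher": "models/teacher/model.npz",
}


def plan(workdir="."):
    """Treat any stage whose expected output exists as completed, ignoring
    timestamps and run history, and return the remaining stages to schedule."""
    completed, to_schedule = [], []
    for stage, relative_path in EXPECTED_OUTPUTS.items():
        bucket = completed if (Path(workdir) / relative_path).exists() else to_schedule
        bucket.append(stage)
    return completed, to_schedule
```

The open question is what the Taskcluster analogue of "present on disk" would be; based on the discussion above, presumably some combination of `existing_tasks` entries and artifacts already uploaded to GCS.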