mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Improve usability of running selected tasks #719

Open eu9ene opened 4 months ago

eu9ene commented 4 months ago

While working on the big training and bug fixes, I ran into many issues with scheduling specific tasks. Basically, the graph and caches can be in an arbitrary state, and we should still be able to run the pipeline starting from specific stages while reusing the stages that ran before.

We currently have several tools to work with:

  1. target-stage
  2. start_stage and previous_group_ids
  3. existing_tasks
  4. pre-trained models
  5. Git branches
  6. Adding extra tasks to the graph

Usually, I notice there's an issue when the pipeline starts scheduling tasks I don't need, and I then try to come up with a workaround using those tools. The current tools also add significant mental load and are hard to use when training tens of languages with fixes at the same time (see this PR). We should rethink this approach to make it more flexible and easier to use.

Maybe introducing a concept of state, similar to the data on disk in Snakemake, can help here. There was an option there to skip smart scheduling based on file creation dates and information about past runs, and instead just treat everything present on disk as completed tasks and schedule the rest.
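
To make the idea concrete, here is a minimal sketch of that "treat whatever exists as done" behavior. It is not how Taskcluster or Snakemake work internally; the stage names and the `tasks_to_schedule` helper are made up for illustration.

```python
# Hypothetical stage graph: each stage maps to the stages it depends on.
GRAPH = {
    "clean-corpus": [],
    "train-backward": ["clean-corpus"],
    "translate-mono": ["train-backward"],
    "train-teacher": ["clean-corpus", "translate-mono"],
    "evaluate-teacher": ["train-teacher"],
}


def tasks_to_schedule(graph, target, completed):
    """Return the stages needed for `target`, skipping anything already completed.

    A completed stage is treated as a leaf: we never look behind it, no matter
    how or when it was produced.
    """
    needed = []
    seen = set()

    def visit(stage):
        if stage in seen or stage in completed:
            return
        seen.add(stage)
        for dep in graph[stage]:
            visit(dep)
        needed.append(stage)  # post-order, so dependencies come first

    visit(target)
    return needed


print(tasks_to_schedule(GRAPH, "evaluate-teacher", {"clean-corpus", "train-backward"}))
```

The point is that `completed` is the only input describing state: there are no cache digests or timestamps to reason about, just "this exists, so don't rerun it".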

bhearsum commented 3 months ago

I agree that we need to step back and take a more holistic look at how to deal with this.

We don't really have a great analogue for "things on disk" in Taskcluster. The closest is the cached tasks, but obviously we've run into a number of issues with that in the past, and most of the tools listed above have been workarounds for issues we've had with it. To be quite honest, if the level of flexibility that Snakemake provided (i.e., being able to manipulate it by adjusting things on disk) is needed, it may be worth considering reviving it, and perhaps adding the ability to scale into the cloud with Slurm. I seriously doubt that Taskcluster will be able to provide the same level of flexibility and simplicity at the same time.

With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.
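
As a rough sketch of what that could look like: the config shape, the `task:` and `gs://` pointer syntax, and every name below are assumptions for illustration, not something the pipeline supports today.

```python
from dataclasses import dataclass

# Hypothetical: every stage of the pipeline, in order.
STAGES = ["clean-corpus", "train-backward", "train-teacher", "train-student"]

# Hypothetical config section: stage -> pointer to where its artifacts already live.
# The task ID and bucket path are placeholders.
PREVIOUS_ARTIFACTS = {
    "clean-corpus": "task:EXAMPLE-TASK-ID",
    "train-backward": "gs://example-bucket/models/en-ru/backward",
}


@dataclass(frozen=True)
class ArtifactPointer:
    kind: str      # "task" or "gcs"
    location: str  # task ID, or bucket path without the gs:// prefix


def parse_pointer(raw):
    """Turn a raw config string into a typed pointer, rejecting unknown schemes."""
    if raw.startswith("task:"):
        return ArtifactPointer("task", raw[len("task:"):])
    if raw.startswith("gs://"):
        return ArtifactPointer("gcs", raw[len("gs://"):])
    raise ValueError(f"unrecognized artifact pointer: {raw!r}")


def plan(stages, previous_artifacts):
    """Split stages into those reused via a pointer and those that must be scheduled."""
    unknown = set(previous_artifacts) - set(stages)
    if unknown:
        raise ValueError(f"pointers given for unknown stages: {sorted(unknown)}")
    reused = {s: parse_pointer(previous_artifacts[s]) for s in stages if s in previous_artifacts}
    to_schedule = [s for s in stages if s not in previous_artifacts]
    return reused, to_schedule


reused, to_schedule = plan(STAGES, PREVIOUS_ARTIFACTS)
print(to_schedule)
```

Rejecting pointers for unknown stages up front is deliberate: a typo in a stage name would otherwise silently cause that stage to be rescheduled.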

Whatever the goals and/or use cases are, having a complete list of them written down would be very helpful. We can't design a solution until we know the extent of the problem!

bhearsum commented 2 months ago

> With that all said, I think we can probably make things a lot better than they are. If the goal is "be able to override anything, anywhere", one idea is to require training configs to have all previous artifacts specified. Perhaps this would take the form of pointers to either tasks, or task artifacts that have been uploaded to GCS. Anything not present would then be scheduled.

Incidentally, https://github.com/mozilla/firefox-translations-training/pull/683 has largely implemented this already when it comes to pulling from tasks (we can still only pull pretrained models from elsewhere). As we use it a bit more, it may help us evaluate this option.

One implication of taking things to this extreme is that I imagine we'd need a fairly good CLI tool and/or UI to make it practical. It wouldn't be shocking to end up pointing at 5 or 6 different locations (be they task groups or buckets), and it would be unfortunate if collecting all of the existing tasks or artifact locations ends up taking hours, or ends up being error-prone.
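
For example, the core of such a tool might be little more than merging label-to-task mappings from several task groups and flagging conflicts. A minimal sketch, with made-up data standing in for whatever the tool would actually fetch from Taskcluster or a bucket listing (the group IDs, labels, and `collect_existing_tasks` are all hypothetical):

```python
def collect_existing_tasks(groups):
    """Merge completed tasks from several task groups into one label -> task ID map.

    `groups` is an ordered list of (group_id, tasks) pairs, where `tasks` is a list
    of dicts with "label", "task_id", and "state". Earlier groups win; a label that
    completed in more than one group is reported so a human can double check it.
    """
    existing = {}
    conflicts = {}
    for group_id, tasks in groups:
        for task in tasks:
            if task["state"] != "completed":
                continue  # failed or still-running tasks can't be reused
            label = task["label"]
            if label in existing:
                conflicts.setdefault(label, [existing[label]]).append(task["task_id"])
            else:
                existing[label] = task["task_id"]
    return existing, conflicts


# Made-up data for illustration only.
groups = [
    ("group-A", [
        {"label": "train-backward-en-ru", "task_id": "task-1", "state": "completed"},
        {"label": "train-teacher-en-ru", "task_id": "task-2", "state": "failed"},
    ]),
    ("group-B", [
        {"label": "train-backward-en-ru", "task_id": "task-3", "state": "completed"},
        {"label": "train-teacher-en-ru", "task_id": "task-4", "state": "completed"},
    ]),
]

existing, conflicts = collect_existing_tasks(groups)
print(existing)
print(conflicts)
```

Surfacing conflicts instead of silently picking one is the part aimed at the "error-prone" concern: when two groups both have a completed copy of the same task, the person training should see that before anything is scheduled.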

bhearsum commented 3 weeks ago

@eu9ene, @ahal, and I met today to talk about this. There are some fairly extensive notes, but here's a high-level summary with the main takeaways: