mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

start_stage often reruns almost all "evaluate" tasks #728

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

I ran this one to do the export task, but since the "evaluate" tasks are not sequential, they get rerun each time I use start_stage, which wastes GPU resources.

https://firefox-ci-tc.services.mozilla.com/tasks/aAroslw9Ru-c5cam6SvLBg

bhearsum commented 3 months ago

As you've already identified, the problem here is that the evaluate tasks being rerun are not in any of the previous groups provided, i.e., this is working as designed at the moment. The simple "fix" is to provide additional previous groups that contain the evaluate tasks.
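To illustrate why this happens, here is a toy Python model, not the pipeline's actual code. It assumes a task is reused only when its label appears in one of the provided previous groups; the task labels and the `tasks_to_run` helper are hypothetical.

```python
def tasks_to_run(target_labels, previous_groups):
    """Return the labels that must be scheduled again.

    Simplifying assumption: a task is reused only if its label appears
    in at least one of the provided previous task groups.
    """
    already_done = set()
    for group in previous_groups:
        already_done.update(group)
    return sorted(label for label in target_labels if label not in already_done)


# Hypothetical labels, for illustration only.
target_labels = {
    "train-teacher",
    "evaluate-teacher",
    "train-student",
    "evaluate-student",
    "export",
}

training_group = {"train-teacher", "train-student"}
evaluate_group = {"evaluate-teacher", "evaluate-student"}

# Only the training group is provided: the evaluate tasks are scheduled again.
rerun_without_eval = tasks_to_run(target_labels, [training_group])

# Also providing the group that holds the evaluate tasks leaves only the export.
rerun_with_eval = tasks_to_run(target_labels, [training_group, evaluate_group])
```

In this model the evaluate tasks stop being rerun only once a group containing them is passed in, which matches the "provide additional previous groups" workaround.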

Of course, that's annoying, and not ideal. This is another case we should consider when we discuss https://github.com/mozilla/firefox-translations-training/issues/719.

If we'd like to do something in the meantime, we could remove the evaluate tasks from the dependencies of "all", which is how they get pulled in. Doing so would mean that nothing depends on them, and they would have to be targeted explicitly through a target-stage. This might be fine if you're already targeting things like evaluate-teacher for the first phase of training. We might be able to make this less annoying by turning target-stage into target-stages.
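A rough sketch of that idea, under the assumption that targeting a task selects it plus everything it transitively depends on. The graph, the labels, and the `resolve` helper are hypothetical, and passing several targets stands in for the proposed target-stages.

```python
def resolve(targets, dependencies):
    """Return the targets plus everything they transitively depend on."""
    selected = set()
    stack = list(targets)
    while stack:
        label = stack.pop()
        if label in selected:
            continue
        selected.add(label)
        stack.extend(dependencies.get(label, ()))
    return selected


# Hypothetical graph in which "all" no longer depends on the evaluate tasks.
dependencies = {
    "all": ["export"],
    "export": ["train-student"],
    "train-student": ["train-teacher"],
    "evaluate-teacher": ["train-teacher"],
    "evaluate-student": ["train-student"],
}

# Targeting "all" alone no longer pulls in any evaluate task.
only_all = resolve(["all"], dependencies)

# Evaluate tasks run only when they are named explicitly as extra targets.
with_eval = resolve(["all", "evaluate-teacher"], dependencies)
```

With a single target, each evaluate task would need its own run; accepting a list of targets is what makes the explicit approach less tedious.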

bhearsum commented 2 months ago

@eu9ene - Any thoughts on the two options above?

eu9ene commented 2 months ago

Another option is to run them right after training and make the next task depend on them. It makes sense because this way we don't continue the pipeline until we have good evaluation results. There was even a suggestion to add a sanity check (see https://github.com/mozilla/firefox-translations-training/issues/78). They would then no longer be dependencies of the "all" task.

bhearsum commented 1 month ago

Sure, if you want to always run them when you run a training, adjusting the stage entries such as https://github.com/mozilla/firefox-translations-training/blob/f7247a60a095015d39fb73065830eb9e980147e2/taskcluster/kinds/evaluate/kind.yml#L118 to be the same as their associated training step would be a good idea. Doing that would mean they always end up in the same task groups, which would let you remove them from the all and all-pr dependencies with no real downside, and should fix this.
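As a toy illustration of the stage change, assuming tasks are selected for a run by matching a stage attribute: the mapping and the `select_stage` helper below are hypothetical and do not reproduce the real kind.yml structure.

```python
def select_stage(task_stages, stage):
    """Return the labels whose stage attribute equals the requested stage."""
    return sorted(label for label, s in task_stages.items() if s == stage)


# Hypothetical: the evaluate task has its own stage.
separate_stages = {
    "train-teacher": "train-teacher",
    "evaluate-teacher": "evaluate-teacher",
}

# Hypothetical: the evaluate task shares the stage of its training step.
shared_stages = {
    "train-teacher": "train-teacher",
    "evaluate-teacher": "train-teacher",
}

# With separate stages, running the training stage leaves the evaluation out.
separate_run = select_stage(separate_stages, "train-teacher")

# With a shared stage, both tasks land in the same run, and so the same group.
shared_run = select_stage(shared_stages, "train-teacher")
```

Under this assumption, any previous group that contains a training task would also contain its evaluation, so a later start_stage run would find both.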