mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Taskcluster train actions fail #810

Closed eu9ene closed 1 month ago

eu9ene commented 2 months ago

I’m trying to restart some training runs that failed with "deadline exceeded", and the restart actions all fail with a 404 in the logs. Apparently the action can’t find some task. For example: https://firefox-ci-tc.services.mozilla.com/tasks/Jyi8Pmf8TZ-ve9Q8JE8RcA/runs/0

It fails for all the languages I'm trying to restart.

eu9ene commented 2 months ago

It's likely related to the workaround for the deadline issue: https://github.com/mozilla/firefox-translations-training/issues/691

bhearsum commented 1 month ago

I do not think this is related to the deadline issue - we wouldn't see 404s for tasks that are past their deadline, only for tasks that have expired. It looks to me like this is a new edge case in how we're handling previous_group_ids. We walk up the graph of any tasks in the given group(s) to find all ancestors. The stack trace shows we are a few levels deep into that walk when we hit this 404, so we're presumably getting it when trying to fetch the task definition of a transitive dependency of one of the tasks from previous_group_ids.

The fix here is most likely to ignore 404s when fetching ancestors, which is something I'll need to fix upstream in taskgraph.
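
For illustration only, here is a minimal Python sketch of the idea described above: walk up the dependency graph from a set of starting tasks and skip any ancestor whose definition can no longer be fetched. This is not the taskgraph implementation. `find_existing_ancestors`, `TaskNotFound`, and the injected `fetch_definition` callable are hypothetical names invented for this sketch; the only detail taken from Taskcluster is that a task definition lists its parents under `dependencies`.

```python
from typing import Callable, Dict, Iterable


class TaskNotFound(Exception):
    """Hypothetical error standing in for a 404 from the task API."""


def find_existing_ancestors(
    task_ids: Iterable[str],
    fetch_definition: Callable[[str], dict],
) -> Dict[str, dict]:
    """Return definitions of the given tasks and all reachable ancestors.

    Tasks whose definition cannot be fetched (for example because they
    have expired) are skipped instead of aborting the whole walk.
    """
    found: Dict[str, dict] = {}
    visited = set()
    stack = list(task_ids)
    while stack:
        task_id = stack.pop()
        if task_id in visited:
            continue
        visited.add(task_id)
        try:
            definition = fetch_definition(task_id)
        except TaskNotFound:
            # Assumption: a missing task is treated as "expired, ignore it".
            # Its own ancestors are unreachable from here, so stop this branch.
            continue
        found[task_id] = definition
        stack.extend(definition.get("dependencies", []))
    return found


# Usage with an in-memory stand-in for the task API (made-up task IDs).
fake_tasks = {
    "train": {"dependencies": ["corpus", "vocab"]},
    "corpus": {"dependencies": ["download"]},
    "vocab": {"dependencies": []},
    # "download" is deliberately absent to simulate an expired task.
}


def fetch(task_id: str) -> dict:
    try:
        return fake_tasks[task_id]
    except KeyError:
        raise TaskNotFound(task_id)


ancestors = find_existing_ancestors(["train"], fetch)
print(sorted(ancestors))
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular client library. A real fix would need to catch whatever error the actual client raises for a 404, and only that error, so that other failures still surface.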

bhearsum commented 1 month ago

Upstream fix is being worked on in https://github.com/taskcluster/taskgraph/pull/569

eu9ene commented 1 month ago

@bhearsum I see, thanks for the investigation and for prioritizing this! I'd like to point out that this issue has been blocking us for two weeks and all the training in the big batch is currently stopped. I didn't want to ping other people since you have the most context on all this. I guess we'll need to figure out how to cherry-pick the required fixes into the release branch, because we have not upgraded taskgraph there yet.

bhearsum commented 1 month ago

I'll see what I can do about release. I can probably have something up for that today or tomorrow.

eu9ene commented 1 month ago

The action worked with the fix: https://firefox-ci-tc.services.mozilla.com/tasks/O7cfmFR_SuaZNg8d8b0EWQ

bhearsum commented 1 month ago

https://github.com/mozilla/firefox-translations-training/pull/834 fixed this on main by picking up the upstream fix.