mozilla / code-coverage

Code Coverage analysis for Mozilla products

https://coverage.moz.tools/

Mozilla Public License 2.0

Retry failed tasks #159

Open marco-c opened 7 years ago

marco-c commented 7 years ago

Most of the failures are due to transient issues, so we should implement some kind of retry mechanism.

Does it apply to static analysis too, @La0 @jankeromnes?

mutterroland commented 6 years ago

I would love to have some more info. Maybe I can give this one a try :).

marco-c commented 6 years ago

@mutterroland this issue requires knowledge of asyncio, are you familiar with it? If you aren't, I'd suggest working on another good-first-bug.

mutterroland commented 6 years ago

@marco-c I will move to another good-first-bug.

jankeromnes commented 6 years ago

> Does it apply to static analysis too?

Yes, it just happened during today's release, because pulse-listener and static-analysis didn't update at exactly the same time (in this window, an incompatibility between bot versions caused a few analyses to fail).

La0 commented 6 years ago

The retrigger code could be reused here.

La0 commented 6 years ago

I'm starting to work on this one. Here is what I can do:

Immediately retry tasks in exception state (with a maximum of N retries, N=3). Exception states are mostly due to Taskcluster (workers down, timeouts, ...)
Retry tasks in error state once, after a waiting time

If I use the retrigger code, a new task group & task id will be created, so I need to keep track of this in the code. Using rerunTask would be much cleaner, but it is marked deprecated (I asked on #taskcluster about an alternative).
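
The retry rules proposed above could be sketched with asyncio roughly as follows. This is only an illustration under stated assumptions, not the bot's actual code: `run_task` is a hypothetical coroutine that triggers a fresh task and returns its new task id together with its final state, and the state names `"exception"` and `"error"` simply mirror the wording of this thread.

```python
import asyncio

MAX_EXCEPTION_RETRIES = 3  # N=3, as proposed in this thread
ERROR_RETRY_DELAY = 60.0   # seconds; hypothetical default waiting time


async def run_with_retries(run_task, error_delay=ERROR_RETRY_DELAY):
    """Run a task, retrying according to the rules proposed in this thread.

    `run_task` is a hypothetical coroutine function: each call triggers a
    new task and returns a `(task_id, state)` tuple once it has finished.
    Returns the final state and the ids of every attempt, since each
    retrigger produces a new task id that has to be tracked.
    """
    exception_retries = 0
    error_retried = False
    attempts = []

    while True:
        task_id, state = await run_task()
        attempts.append(task_id)

        # Exception state: retry immediately, up to MAX_EXCEPTION_RETRIES times.
        if state == "exception" and exception_retries < MAX_EXCEPTION_RETRIES:
            exception_retries += 1
            continue

        # Error state: retry only once, after waiting.
        if state == "error" and not error_retried:
            error_retried = True
            await asyncio.sleep(error_delay)
            continue

        return state, attempts
```

The list of attempted task ids is returned alongside the final state so that callers can keep track of the tasks created by each retrigger.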

marco-c commented 6 years ago

Sounds good to me. Maybe the retry with a waiting time can be configurable per project, as different projects will possibly have different needs. I know tasks in error state for code coverage are often due to non-recoverable errors. It would be nice to have a way not to retrigger them, but it's not that easy to know whether an error is recoverable or not.
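
One possible shape for such per-project configuration is sketched below. All names and values here (`RetryPolicy`, `PROJECT_POLICIES`, the delays) are hypothetical and chosen only for illustration; the thread does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-project retry settings."""
    max_exception_retries: int = 3   # N=3, as proposed in this thread
    retry_on_error: bool = True      # whether to retry tasks in error state
    error_retry_delay: float = 60.0  # seconds to wait before that retry


DEFAULT_POLICY = RetryPolicy()

# Illustrative values only: code coverage opts out of retrying errors,
# since those are often non-recoverable for that project.
PROJECT_POLICIES = {
    "code-coverage": RetryPolicy(retry_on_error=False),
    "static-analysis": RetryPolicy(error_retry_delay=120.0),
}


def policy_for(project):
    """Return the retry policy for a project, falling back to the default."""
    return PROJECT_POLICIES.get(project, DEFAULT_POLICY)
```

A per-project switch like `retry_on_error` sidesteps the harder problem of detecting whether an individual error is recoverable.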

La0 commented 6 years ago

From #taskcluster:

<&pmoore> | bastien: although it isn't documented there, i believe the preferred approach is to create a new task with the same task definition, but with a different taskId. we should probably update the docs
<&pmoore> | bastien: i've created https://github.com/taskcluster/taskcluster-queue/pull/292 for this (feedback welcome as this is a particularly delicate issue - aki-away/bstack/dustin/jhford)
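
The approach described in this quote (same task definition, different taskId) might look roughly like the sketch below. It makes several assumptions: `create_task` is a hypothetical callable standing in for whatever client call actually submits the task, and the new id is generated with the standard library as a 22-character URL-safe base64 encoding of a random UUID, which is the general form of a Taskcluster slugid. A real task definition probably also needs its timestamps refreshed before resubmission, which is not handled here.

```python
import base64
import copy
import uuid


def new_task_id():
    """Generate a 22-character URL-safe id from a random UUID (slugid-style)."""
    raw = uuid.uuid4().bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def retrigger(old_task_id, definition, create_task):
    """Submit a copy of `definition` under a fresh task id.

    `create_task(task_id, definition)` is a hypothetical callable that
    performs the actual submission. `old_task_id` is only used to make
    sure the new id differs from the one being retried.
    """
    task_id = new_task_id()
    while task_id == old_task_id:  # practically never happens, but cheap to check
        task_id = new_task_id()

    # Deep copy so later changes to the new definition never affect the original.
    create_task(task_id, copy.deepcopy(definition))
    return task_id
```

Because the returned id differs from the original, the caller has to record it in order to follow the retried task.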

La0 commented 5 years ago

Bumping this, as it's blocking static analysis users: mozilla/release-services#1846