Closed pocmo closed 3 years ago
@JohanLorenzo Do you have an idea what we could do here? Is there a way we could make this task fail?
Thanks for cc'ing me on this issue! I see how much of a problem this is. I think we can solve it in 2 different ways.
Option 1: change how the complete-push task behaves
The complete-push task can have its definition changed so that it starts whenever all its dependencies are resolved (instead of completed). This means all tasks will still have to run before the complete-push task does. Therefore, it's not a way to fail fast.
If we go this way, we should change the complete-push logic to error out if one of its dependencies is busted.
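As a rough illustration of the "error out if one of its dependencies is busted" step, here is a minimal Python sketch. It assumes the complete-push task has already obtained the final state of each dependency as a plain mapping; how those states are fetched is out of scope here. The names `busted_dependencies` and `exit_code` are hypothetical and are not part of the existing task. The state names follow Taskcluster's task states, where `failed` and `exception` are resolved but unsuccessful.

```python
# Sketch only: assumes dependency states were collected elsewhere.
# Taskcluster states that are resolved but did not succeed.
BUSTED_STATES = {"failed", "exception"}


def busted_dependencies(states):
    """Return the sorted labels of dependencies that did not succeed.

    `states` maps a (hypothetical) dependency label to its final state.
    """
    return sorted(label for label, state in states.items()
                  if state in BUSTED_STATES)


def exit_code(states):
    """Return 1 if any dependency is busted, 0 otherwise."""
    return 1 if busted_dependencies(states) else 0


example = {"build": "completed", "test-unit": "failed", "lint": "exception"}
print(busted_dependencies(example))
print(exit_code(example))
```

With logic like this, the complete-push task would finish as failed, rather than stay unscheduled, when a dependency fails, which is what gives bors a status to react to.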
Option 2: let bors match the checks by pattern
bors allows matching patterns: https://bors.tech/documentation/#configuration-borstoml. If we change taskcluster-github to prepend taskcluster/ to all GitHub Checks, then we can tell bors to watch for taskcluster/%. I'm not 100% sure, but it may enable fast failure.
Then, the complete-push task would become useless, and we could retire it.
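To make the idea concrete, here is a small Python sketch of how a `taskcluster/%` pattern could select checks and fail fast. It is not bors's implementation. It assumes `%` behaves as a "match anything" wildcard, and the function names and the status values (`success`, `failure`, `pending`) are illustrative assumptions.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern where '%' matches any run of characters."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts))


def overall_status(pattern, checks):
    """Combine the statuses of all checks whose name matches `pattern`.

    `checks` maps a check name to "success", "failure" or "pending".
    One failure is enough to report "failure", even while other
    checks are still pending: that is the fail-fast behaviour.
    """
    regex = pattern_to_regex(pattern)
    matched = [status for name, status in checks.items()
               if regex.fullmatch(name)]
    if "failure" in matched:
        return "failure"
    if not matched or "pending" in matched:
        return "pending"
    return "success"


checks = {
    "taskcluster/build": "success",
    "taskcluster/test": "failure",
    "taskcluster/lint": "pending",
}
print(overall_status("taskcluster/%", checks))
```

Whether bors really reports a failure before all matching checks have finished is exactly the open question above; the sketch only shows what "fast failure" would mean.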
I'd first lean towards the second solution, which seems cleaner and easier to me. What do you think @tomprince ?
> This means all tasks will still have to run before the complete-push does. Therefore, it's not a way to fail fast.
Not as fast as possible, but already significantly better. Is that a quick fix we could do easily?
> If we change taskcluster-github to prepend taskcluster/ to all Github Checks, then we can tell bors to watch for taskcluster/%. I'm not 100% sure, but it may enable fast failure. Then, the complete-push task would become useless, and we could retire it.
That would be pretty neat! But also sounds like it will require more work / coordination.
This has been quite annoying over the last few working days. There have been intermittent failures causing bors to do nothing for quite a while until it times out.
Okay. I'll take a shot at the first solution tomorrow.
@JohanLorenzo Is this done? I think I still see this happen.
Option 1 should be implemented. Do you have an example I can look into?
@JohanLorenzo I do not remember, but I'll make sure to post it here once I see it again.
I recently saw a slightly different variant that I remember though. In that case the decision task failed which caused no complete-push task to get scheduled at all. I wonder if it would help to add the decision task to bors.toml as an additional task it should look at? I hope that bors will fail as soon as one of the tasks fails and doesn't wait for all of them.
Here is an example of a timeout. https://github.com/mozilla-mobile/android-components/pull/4948
However I can't see a failing task on that push.
I wonder if this timeout has the same root cause as the one in https://github.com/mozilla-mobile/fenix/issues/6139#issuecomment-545901930 🤔
We made bors wait for the complete-push task. However, this task does not "fail fast". In the following task group one task failed and the complete-push task is just unscheduled due to that (instead of failed). So bors continues to wait until it times out. This slows down our merge queue, since every failed task will cause bors to wait until its timeout.
https://tools.taskcluster.net/groups/PkeR4TLQS3eDHpOrBy6qfQ
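The behaviour described here can be modelled in a few lines of Python. This is a simplified sketch of the scheduling rule, not Taskcluster's code. It assumes the two dependency modes are named `all-completed` and `all-resolved`, as in the Taskcluster task definition's `requires` field, and `can_schedule` is a hypothetical helper.

```python
# Simplified model of when a dependent task becomes schedulable.
# A task is "resolved" once it has stopped running, whatever the outcome.
RESOLVED_STATES = {"completed", "failed", "exception"}


def can_schedule(requires, dependency_states):
    """Return True if a task with this `requires` mode may be scheduled.

    `dependency_states` maps a dependency label to its current state.
    """
    states = dependency_states.values()
    if requires == "all-completed":
        # Every dependency must have succeeded.
        return all(state == "completed" for state in states)
    if requires == "all-resolved":
        # Every dependency must have stopped, success or not.
        return all(state in RESOLVED_STATES for state in states)
    raise ValueError("unknown requires mode: " + requires)


group = {"build": "completed", "test-unit": "failed"}
print(can_schedule("all-completed", group))  # complete-push stays unscheduled
print(can_schedule("all-resolved", group))   # complete-push can run and report
```

Under `all-completed`, a single failed dependency leaves complete-push unscheduled forever, so bors never receives a status for it and waits until its timeout.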