Bors timeout when task fails (complete-push unscheduled vs. failed)

pocmo commented 4 years ago

We made bors wait for the complete-push task. However, this task does not "fail fast". In the following task group one task failed, and the complete-push task is left unscheduled because of that (instead of failed). So bors continues to wait until it times out. This slows down our merge queue, since every failed task causes bors to wait until its timeout.

https://tools.taskcluster.net/groups/PkeR4TLQS3eDHpOrBy6qfQ

pocmo commented 4 years ago

@JohanLorenzo Do you have an idea what we could do here? Is there a way we could make this task fail?

JohanLorenzo commented 4 years ago

Thanks for cc'ing me on this issue! I see how much of a problem this is. I think we can solve it in 2 different ways.

Change how the `complete-push` task behaves

The complete-push task can have its definition changed so that it starts whenever all its dependencies are resolved (instead of completed). This means all tasks will still have to run before the complete-push does. Therefore, it's not a way to fail fast. If we go this way, we should change the complete-push logic to error out if one of its dependencies is busted.
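Roughly, that check could look like the sketch below. It assumes the complete-push task is switched to `requires: all-resolved` and can obtain the final state of each dependency (for example from the Taskcluster queue). The function names and the label-to-state mapping are hypothetical and only illustrate the idea; this is not the actual implementation.

```python
# Hypothetical sketch: decide whether complete-push should fail, given the
# final state of each of its dependencies. How the states are fetched is
# left out on purpose (assumption: something else provides this mapping).

# Taskcluster resolves a task as "completed", "failed" or "exception".
UNSUCCESSFUL_STATES = {"failed", "exception"}
RESOLVED_STATES = {"completed"} | UNSUCCESSFUL_STATES


def find_busted_dependencies(dependency_states):
    """Return the sorted labels of dependencies that resolved unsuccessfully.

    Raises ValueError if a dependency is not resolved yet, which should not
    happen once the task only starts after all dependencies are resolved.
    """
    unresolved = sorted(
        label
        for label, state in dependency_states.items()
        if state not in RESOLVED_STATES
    )
    if unresolved:
        raise ValueError(f"dependencies not resolved yet: {unresolved}")

    return sorted(
        label
        for label, state in dependency_states.items()
        if state in UNSUCCESSFUL_STATES
    )


def complete_push_exit_code(dependency_states):
    """Return 1 if any dependency is busted (so the task fails), else 0."""
    busted = find_busted_dependencies(dependency_states)
    for label in busted:
        print(f"dependency {label} did not complete successfully")
    return 1 if busted else 0
```

With this, complete-push ends up failed rather than unscheduled, so bors gets a definitive status instead of waiting for its timeout.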

Configure bors differently

bors allows matching status names against patterns: https://bors.tech/documentation/#configuration-borstoml. If we change taskcluster-github to prepend taskcluster/ to all GitHub Checks, then we can tell bors to watch for taskcluster/%. I'm not 100% sure, but it may enable fast failure. Then, the complete-push task would become useless, and we could retire it.

I'd first lean towards the second solution, which seems cleaner and easier to me. What do you think @tomprince ?

pocmo commented 4 years ago

> This means all tasks will still have to run before the complete-push does. Therefore, it's not a way to fail fast.

Not as fast as possible, but already significantly better. Is that a quick fix we could do easily?

> If we change taskcluster-github to prepend taskcluster/ to all Github Checks, then we can tell bors to watch for taskcluster/%. I'm not 100% sure, but it may enable fast failure. Then, the complete-push task would become useless, and we could retire it.

That would be pretty neat! But also sounds like it will require more work / coordination.

pocmo commented 4 years ago

This has been quite annoying over the last few working days. There have been intermittent failures that cause bors to do nothing for quite a while, until the timeout kicks in.

JohanLorenzo commented 4 years ago

Okay. I'll take a shot at the first solution tomorrow.

JohanLorenzo commented 4 years ago

Depends on https://phabricator.services.mozilla.com/D49402

pocmo commented 4 years ago

@JohanLorenzo Is this done? I think I still see this happen.

JohanLorenzo commented 4 years ago

Option 1 should be implemented. Do you have an example I can look into?

pocmo commented 4 years ago

@JohanLorenzo I do not remember, but I'll make sure to post it here once I see it again.

I recently saw a slightly different variant that I remember though. In that case the decision task failed which caused no complete-push task to get scheduled at all. I wonder if it would help to add the decision task to bors.toml as an additional task it should look at? I hope that bors will fail as soon as one of the tasks fails and doesn't wait for all of them.

pocmo commented 4 years ago

Here is an example of a timeout. https://github.com/mozilla-mobile/android-components/pull/4948

However I can't see a failing task on that push.

JohanLorenzo commented 4 years ago

I wonder if this timeout has the same root cause as the one in https://github.com/mozilla-mobile/fenix/issues/6139#issuecomment-545901930 🤔
