encukou opened 1 week ago
Thanks for opening this issue @encukou.
As a friendly side note: allow me to mention that I would have appreciated it if you had reached out to me first so we could work together, instead of proposing a counterproposal. That would have reduced the potential frustration of continuing to work on the PR only to then do something different. I think this is important because, even if I know there are only good intentions everywhere, it can really impact the energy to contribute. That being said, I am also happy to discuss this here, of course.
Copying the reasoning from https://github.com/python/buildmaster-config/pull/535#issuecomment-2407221523:
Just to clarify: this reasoning isn't coming only from me; it was the feedback I received after my Language Summit 2019 talk and from the discussions we have had in core development in the past. Indeed, a version of this was originally suggested by @willingc in 2019 and discussed with @ambv and other people in the buildbot watch at the core dev sprint at Microsoft, and then again at the core dev sprints at Bloomberg and Google.
This was discussed as a solution yet again when we had to manually trigger 4 or 5 different reverts for subinterpreters in ~3.9.
Does that sound reasonable?
I don't think these two solutions are orthogonal. The automation cannot detect all situations so it's beneficial to ALSO have a way to trigger this as a human.
I guess what I am failing to understand is the rationale to view this as "either this or that".
In any case, thanks a lot @encukou for caring about this and helping with this issue ❤️
Update: I have closed my proposal.
Yet a third option is enabling merge queue and moving some/all buildbot builds from post-merge to merge queue. This would prevent changes that break buildbots from being merged in the first place, without significantly impacting the usual flow in the common case.
Yet a third option is enabling merge queue and moving some/all buildbot builds from post-merge to merge queue. This would prevent changes that break buildbots from being merged in the first place, without significantly impacting the usual flow in the common case.
That has 3 challenges I can envision:
This was clearly a miscommunication. I don't want to assign blame. Let me explain how I see it, so I can figure out how to do better in the future.
I wasn't at the 2019 summit, and wasn't aware of the discussions and your work. I can only blame myself for not paying enough attention, but your PR was a genuine surprise to me. If I had known you were working on it, I would have reached out! I recall Łukasz telling me about this briefly near the end of this year's sprint, but I assume he assumed I had lots of prior context. I thought things were at the stage of getting a problem statement out.
I regret getting into this situation, but I'm having trouble apologizing because I don't know how to keep this from happening again in the future :(
the rationale to view this as "either this or that".
It's not that. My problem with the fully-automatic tool is that it would trigger so rarely. If you asked me to estimate the mean time to failure of an integration between 3 online services, I'd say it breaks several times a year. When some API changes, or there's a network hiccup at the wrong time, it would take more work to debug the tool than to do the reverts manually. It's still good that the PRs would be opened by a bot and not a human, but it's only 4 PRs a year, and I don't see a good path to making it trigger more often.
Frustratingly, even if I had been more aware of your effort, I'm not sure I'd have caught this before the tool was implemented :(
Thanks for the explanation @encukou! As there is some nuance and in the interest of having a focused discussion on the path forward here, maybe we can chat in discord about how to avoid miscommunication in the future.
My problem with the fully-automatic tool is that it would trigger so rarely.
I think this exposes an argument about ROI, but the fact that it triggers very rarely should not be a problem by itself. People still invest in monitoring systems, debuggers and metrics even if applications have these problems only on rare occasions.
From that perspective, it becomes a matter of weighing the maintenance cost, and in my opinion it is quite low (although I understand that others may have different opinions). The reasons are that:
I can understand if you still think that the ROI is very low. But on the other hand, when I introduced the automatic messages on PRs when buildbots fail, that reduced the failure rate dramatically, as you can see here. So I'm not sure what the right answer is here.
@encukou @pablogsal Since I can't recall what I said in 2019 (lol), I'm trying to understand where there is consensus and where we need to converge on more alignment.
Trying to understand context as we discuss this.
I apologize for making this look like an “either-or” situation. It's not; cherry-picker/miss-islington reverts could live alongside BuildbotHammer. (Though if we decide, at some point, to merge common code, it would need to move to the workflow that humans can trigger.)
I wasn't at the 2019 summit
Correction: I was there, I got my PyCons mixed up. Apologies for the misinformation.
I'm trying to understand where there is consensus and where we need to converge on more alignment.
IMO, we're agreeing on everything but the ROI of a fully automated solution. Please correct me if I'm wrong.
Automating the process of reverting the commit and checking that it reverted cleanly.
Yes, that is useful.
Whether to automate a trigger to revert a commit (yes/no)
Yes to opening the PR automatically, no to merging it.
When (time passed) to trigger a revert
When (under what conditions) to trigger a revert
Immediately when the breaking commit is identified. (Currently, the human investigating the issue will often decide that a fix is better; automation can afford to open a PR immediately and start the CI.)
Does automating a trigger make more sense than manually triggering?
That's the ROI question.
Pablo ran the simulation: last year the tool would have triggered 4 times. IMO, a fully automatic tool can't do much better than this, unless we use heuristics¹. I wish that number was higher.
From my experience looking at the failures, many times there are multiple new commits in the breaking build, so classic automation can't pick one. But it's often easy for a human¹ to identify the breaking one -- for example, a docs-only PR probably didn't cause a crash. Hence my focus on a dashboard to show the info to humans, after which I want to work on this issue.
¹ This might be one of the rare cases where AI could be useful, but I assume we don't want a bot to revert people's work without hard evidence.
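On the point above about opening the revert PR automatically but never merging it automatically, here is a rough sketch of what that could look like through the GitHub REST API. The branch name, token handling and PR text are illustrative assumptions, not part of any existing bot:

```python
# Sketch: open (but do not merge) a revert PR via the GitHub REST API.
# Assumes a revert branch has already been created and pushed.
import os
import requests

GITHUB_API = "https://api.github.com/repos/python/cpython/pulls"
TOKEN = os.environ["GITHUB_TOKEN"]  # assumption: a bot token with repo scope

def open_revert_pr(revert_branch: str, culprit_sha: str) -> str:
    """Open a PR from an already-pushed revert branch and return its URL."""
    response = requests.post(
        GITHUB_API,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"Revert {culprit_sha[:10]} (buildbot breakage)",
            "head": revert_branch,
            "base": "main",
            "body": "Proposed revert; failing CI can hide other issues.",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html_url"]
```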
IMO, we're agreeing on everything but the ROI of a fully automated solution.
Awesome!
When (under what conditions) to trigger a revert
Immediately when the breaking commit is identified. (Currently, the human investigating the issue will often decide that a fix is better; automation can afford to open a PR immediately and start the CI.)
Does automating a trigger make more sense than manually triggering?
That's the ROI question.
I'm going to walk through my understanding:
I guess the path taken really boils down to time to maintain, time to implement, and the activity level of automated issues/PRs.
As a first iteration towards something like that, BuildbotHammer is starting to sound much better!
I apologise for the delay in answering. My last week has been quite challenging :(
@encukou @pablogsal Since I can't recall what I said in 2019 (lol), I'm trying to understand where there is consensus and where we need to converge on more alignment.
Thanks a lot for chiming in @willingc! ❤️
I'm going to walk through my understanding:
- A buildbot is failing.
- Notify the maintainers right away by either of the following (ROI decisions pending):
  a. Active notification:
     i. by opening a PR that reverts all commits prior to the failure, or
     ii. by opening an issue that lists all commits that landed since the last passing run; then:
        - an automated revert PR is opened after some human intervention to identify the likely commit causing the failure, or
        - a manual PR is opened to revert the likely commit causing the failure.
  b. Dashboard notification, followed by manual or automated intervention.
Notice that only the commit that breaks the build would be automatically reverted (not all the commits prior to the failure). This is because the automatic revert would only trigger if we can identify a single PR as the culprit. A small but very important clarification: buildbots unfortunately sometimes fail randomly due to test instability, upgrades and other things that are not real problems that need to be addressed right away. To minimise this, we discussed that an automatic revert PR would only be opened if there are at least N consecutive failures since the last green build. This means we are quite confident SOMETHING broke the build, and we reduce spurious revert PRs.
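For concreteness, here is a minimal sketch of that trigger condition. The names (`Build`, `find_revert_candidate`) are hypothetical, not buildbothammer's actual code: only propose a revert when there are at least N consecutive failures since the last green build and the first failing build introduced exactly one new commit.

```python
# Sketch of the "only revert when we're confident" heuristic described above.
from dataclasses import dataclass

@dataclass
class Build:
    succeeded: bool
    commits: list[str]  # commits new in this build, oldest first

def find_revert_candidate(builds: list[Build], min_failures: int = 3) -> str | None:
    """Return the culprit commit SHA, or None if we are not confident enough.

    `builds` is ordered oldest to newest and starts at the last green build.
    """
    if not builds or not builds[0].succeeded:
        return None  # we need a known-green starting point
    failing = builds[1:]
    if len(failing) < min_failures or any(b.succeeded for b in failing):
        return None  # not enough consecutive failures to rule out flakiness
    first_failing = failing[0]
    if len(first_failing.commits) != 1:
        return None  # more than one candidate commit: leave it to a human
    return first_failing.commits[0]
```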
The other important thing to clarify: the benefit of an automated process is that it reduces the cost of having to do the reverts. When there have been a bunch of commits on top of the one that breaks the build, making the revert becomes much more challenging, and that process by itself can introduce even more bugs. So addressing this promptly has the advantage of reducing the risk of the operation (if we decide to proceed with it).
As a first iteration towards something like that, BuildbotHammer is starting to sound much better!
Maybe in the meanwhile I can modify buildbothammer to just post to a new Discord channel for high-priority problems when it detects that a builder has been failing and we have a culprit. This channel will be different from the "all buildbot failures" channel, because we'll know that something really is happening and needs to be addressed as soon as possible.
This will allow core devs in residence, but also other core devs, to act appropriately, knowing that everything that goes to that channel is important.
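In case it helps to see the shape of it, a rough sketch of that notification follows. The webhook environment variable and message format are assumptions for illustration, not the actual buildbothammer change:

```python
# Sketch: post a high-priority alert to a dedicated Discord channel
# via an incoming webhook when a failing builder has a likely culprit.
import os
import requests

HIGH_PRIORITY_WEBHOOK = os.environ["DISCORD_HIGH_PRIORITY_WEBHOOK"]  # assumed config

def post_high_priority_alert(builder: str, culprit_sha: str, pr_url: str) -> None:
    """Notify the high-priority channel that a builder is consistently failing."""
    content = (
        f":rotating_light: `{builder}` has been failing since {culprit_sha[:10]} "
        f"(likely culprit: {pr_url}). Please investigate or revert."
    )
    requests.post(HIGH_PRIORITY_WEBHOOK, json={"content": content}, timeout=10)
```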
How does that sound @encukou @willingc ?
Maybe in the meanwhile I can modify buildbothammer to just post to a new Discord channel for high-priority problems when it detects that a builder has been failing and we have a culprit. This channel will be different from the "all buildbot failures" channel, because we'll know that something really is happening and needs to be addressed as soon as possible.
I think this is a really good idea. (BTW, I would definitely welcome Pablo's buildbot hammer 🔨)
Sounds great! Hopefully it'll not post often, but enough that we'll notice if it starts failing.
Sounds great! Hopefully it'll not post often, but enough that we'll notice if it starts failing.
I think it is conservative enough that it won't post often. As you mentioned, this has a very low rate of occurrence due to the restriction to a single culprit commit plus several failures in a row.
I love the solutions @encukou and @pablogsal. Here's to better workflows for all :tada:
@pablogsal recently opened a PR for automating revert commits. I agree with the issue but respectfully disagree with that particular proposal.
Copying the reasoning from Pablo's comment:
My own plan of action would be to allow humans to initiate the process:
- Extend cherry-picker to --revert commits. (Note that in Git, cherry-picks and reverts are very similar once started; see the --continue option for cherry_pick and revert, and the sketch below.)
- Open a revert PR automatically when a needs-revert label is added. The GitHub comment should explain that we revert because failing CI can hide other issues.

Does that sound reasonable?
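To make that similarity concrete, here is a minimal sketch (a hypothetical helper, not cherry-picker's actual API) of how a --revert mode could reuse essentially the same flow as cherry-picking, since both operations stop on conflicts and are resumed with --continue:

```python
# Sketch: apply or revert a single commit, leaving conflict resolution to the user.
import subprocess

def apply_commit(sha: str, revert: bool = False) -> None:
    """Cherry-pick the given commit, or revert it if revert=True."""
    cmd = ["git", "revert", "--no-edit", sha] if revert else ["git", "cherry-pick", "-x", sha]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Both operations are resumed the same way after conflicts are fixed:
        #   git revert --continue   /   git cherry-pick --continue
        verb = "revert" if revert else "cherry-pick"
        print(f"Conflicts while applying {sha}; resolve them, then run `git {verb} --continue`.")
```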
(FWIW, my current priority is putting breaking PRs, and individual test failures, on the dashboard -- the first step for a revert or fix is for the failure to show up there.)