encukou opened 1 week ago
Thanks for opening this issue @encukou.
As a friendly side note: allow me to mention that I would have appreciated it if you had reached out to me first so we could work together, instead of proposing a counterproposal. That would have reduced the potential frustration of continuing to work on the PR only to then do something different. I think this is important because, even if I know there are only good intentions everywhere, it can really impact the energy to contribute. That being said, I am also happy to discuss this here, of course.
Copying the reasoning from https://github.com/python/buildmaster-config/pull/535#issuecomment-2407221523:
Just to clarify: this reasoning isn't coming only from me; it was the feedback I received after my Language Summit 2019 talk and from the discussions we have had in core development in the past. Indeed, a version of this was originally suggested by @willingc in 2019 and discussed with @ambv and other people in the buildbot watch at the core dev sprint at Microsoft, and then again at the core dev sprints at Bloomberg and Google.
This was discussed as a solution yet again when we had to manually trigger 4 or 5 different reverts for subinterpreters in ~3.9.
Does that sound reasonable?
I don't think these two solutions are orthogonal. The automation cannot detect all situations so it's beneficial to ALSO have a way to trigger this as a human.
I guess what I am failing to understand is the rationale to view this as "either this or that".
In any case, thanks a lot @encukou for caring about this and helping with this issue ❤️
Update: I have closed my proposal.
Yet a third option is enabling merge queue and moving some/all buildbot builds from post-merge to merge queue. This would prevent changes that break buildbots from being merged in the first place, without significantly impacting the usual flow in the common case.
Yet a third option is enabling merge queue and moving some/all buildbot builds from post-merge to merge queue. This would prevent changes that break buildbots from being merged in the first place, without significantly impacting the usual flow in the common case.
That has 3 challenges I can envision:
This was clearly a miscommunication. I don't want to assign blame. Let me explain how I see it, so I can figure out how to do better in the future.
I wasn't at the 2019 summit, and wasn't aware of the discussions and your work. I can only blame myself for not paying enough attention, but your PR was a genuine surprise to me. If I had known you were working on it, I would have reached out! I recall Łukasz telling me about this briefly near the end of this year's sprint, but I assume he assumed I had lots of prior context. I thought things were at the stage of getting a problem statement out.
I regret getting into this situation, but I'm having trouble apologizing because I don't know how to keep this from happening again in the future :(
the rationale to view this as "either this or that".
It's not that. My problem with the fully-automatic tool is that it would trigger so rarely. If you asked me to estimate the mean time to failure of an integration between 3 online services, I'd say it breaks several times a year. When some API changes, or there's a network hiccup at the wrong time, it would take more work to debug the tool than to do the reverts manually. It's still good that the PRs would be opened by a bot and not a human, but it's only 4 PRs a year, and I don't see a good path to making it trigger more often.
Frustratingly, even if I had been more aware of your effort, I'm not sure I'd have caught this before the tool was implemented :(
Thanks for the explanation @encukou! As there is some nuance and in the interest of having a focused discussion on the path forward here, maybe we can chat in discord about how to avoid miscommunication in the future.
My problem with the fully-automatic tool is that it would trigger so rarely.
I think this exposes an argument about ROI, but the fact that it triggers very rarely should not be a problem by itself. People still invest in monitoring systems, debuggers and metrics even if applications have these problems only on rare occasions.
From that perspective, it becomes a matter of weighing the maintenance cost, and in my opinion it is quite low (although I understand that others may have different opinions). The reasons are that:
I can understand if you still think that the ROI is very low. But on the other hand, when I introduced the automatic messages on PRs when buildbots fail, that reduced the failure rate dramatically, as you can see here. So I'm not sure what the right answer is here.
@encukou @pablogsal Since I can't recall what I said in 2019 (lol), I'm trying to understand where there is consensus and where we need to converge on more alignment.
Trying to understand context as we discuss this.
I apologize for making this look like an “either-or” situation. It's not; cherry-picker/miss-islington reverts could live alongside BuildbotHammer. (Though if we decide, at some point, to merge common code, it would need to move to the workflow that humans can trigger.)
I wasn't at the 2019 summit
Correction: I was there, I got my PyCons mixed up. Apologies for the misinformation.
I'm trying to understand where there is consensus and where we need to converge on more alignment.
IMO, we're agreeing on everything but the ROI of a fully automated solution. Please correct me if I'm wrong.
Automating the process of reverting the commit and checking that it reverted cleanly.
Yes, that is useful.
Whether to automate a trigger to revert a commit (yes/no)
Yes to opening the PR automatically, no to merging it.
When (time passed) to trigger a revert
When (under what conditions) to trigger a revert
Immediately when the breaking commit is identified. (Currently, the human investigating the issue will often decide that a fix is better; automation can afford to open a PR immediately and start the CI.)
Does automating a trigger make more sense than manually triggering?
That's the ROI question.
Pablo ran the simulation: last year the tool would have triggered 4 times. IMO, a fully automatic tool can't do much better than this, unless we use heuristics¹. I wish that number was higher.
From my experience looking at the failures, many times there are multiple new commits in the breaking build, so classic automation can't pick one. But it's often easy for a human¹ to identify the breaking one -- for example, a docs-only PR probably didn't cause a crash. Hence my focus on a dashboard to show the info to humans, after which I want to work on this issue.
¹ This might be one of the rare cases where AI could be useful, but I assume we don't want a bot to revert people's work without hard evidence.
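On the point above about opening the revert PR automatically but never merging it automatically, here is a rough sketch of what that could look like through the GitHub REST API. The branch name, token handling and PR text are illustrative assumptions, not part of any existing bot:

```python
# Sketch: open (but do not merge) a revert PR via the GitHub REST API.
# Assumes a revert branch has already been created and pushed.
import os
import requests

GITHUB_API = "https://api.github.com/repos/python/cpython/pulls"
TOKEN = os.environ["GITHUB_TOKEN"]  # assumption: a bot token with repo scope

def open_revert_pr(revert_branch: str, culprit_sha: str) -> str:
    """Open a PR from an already-pushed revert branch and return its URL."""
    response = requests.post(
        GITHUB_API,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"Revert {culprit_sha[:10]} (buildbot breakage)",
            "head": revert_branch,
            "base": "main",
            "body": "Proposed revert; failing CI can hide other issues.",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html_url"]
```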
IMO, we're agreeing on everything but the ROI of a fully automated solution.
Awesome!
When (under what conditions) to trigger a revert
Immediately when the breaking commit is identified. (Currently, the human investigating the issue will often decide that a fix is better; automation can afford to open a PR immediately and start the CI.)
Does automating a trigger make more sense than manually triggering?
That's the ROI question.
I'm going to walk through my understanding:
I guess the path taken really boils down to time to maintain, time to implement, and the activity level of automated issues/PRs.
As a first iteration towards something like that, BuildbotHammer is starting to sound much better!
I apologise for the delay in answering. My last week has been quite challenging :(
@encukou @pablogsal Since I can't recall what I said in 2019 (lol), I'm trying to understand where there is consensus and where we need to converge on more alignment.
Thanks a lot for chiming in @willingc! ❤️
I'm going to walk through my understanding:
- A buildbot is failing.
- Notify the maintainers right away by either of the following (ROI decisions pending):
  a. Active notification:
     i. by opening a PR that reverts all commits prior to the failure, or
     ii. by opening an issue that lists all commits that landed since the last passing run; then:
        - an automated revert PR is opened after some human intervention to identify the likely commit causing the failure, or
        - a manual PR is opened to revert the likely commit causing the failure.
  b. Dashboard notification, followed by manual or automated intervention.
Notice that only the commit that breaks the build would be automatically reverted (not all the commits prior to the failure). This is because the automatic revert would only trigger if we can identify a single PR as the culprit. A small but very important clarification: buildbots unfortunately sometimes fail randomly due to test instability, upgrades and other things that are not real problems that need to be addressed right away. To minimise this, we discussed that an automatic revert PR would only be opened if there are at least N consecutive failures since the last green build. This means we are quite confident SOMETHING broke the build, and we reduce spurious revert PRs.
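For concreteness, here is a minimal sketch of that trigger condition. The names (`Build`, `find_revert_candidate`) are hypothetical, not buildbothammer's actual code: only propose a revert when there are at least N consecutive failures since the last green build and the first failing build introduced exactly one new commit.

```python
# Sketch of the "only revert when we're confident" heuristic described above.
from dataclasses import dataclass

@dataclass
class Build:
    succeeded: bool
    commits: list[str]  # commits new in this build, oldest first

def find_revert_candidate(builds: list[Build], min_failures: int = 3) -> str | None:
    """Return the culprit commit SHA, or None if we are not confident enough.

    `builds` is ordered oldest to newest and starts at the last green build.
    """
    if not builds or not builds[0].succeeded:
        return None  # we need a known-green starting point
    failing = builds[1:]
    if len(failing) < min_failures or any(b.succeeded for b in failing):
        return None  # not enough consecutive failures to rule out flakiness
    first_failing = failing[0]
    if len(first_failing.commits) != 1:
        return None  # more than one candidate commit: leave it to a human
    return first_failing.commits[0]
```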
The other important thing to clarify: the benefit of an automated process is that it reduces the cost of having to do the reverts. When there have been a bunch of commits on top of the one that breaks the build, making the revert becomes much more challenging, and that process by itself can introduce even more bugs. So addressing this promptly has the advantage of reducing the risk of the operation (if we decide to proceed with it).
As a first iteration towards something like that, BuildbotHammer is starting to sound much better!
Maybe in the meanwhile I can modify buildbothammer to just post to a new Discord channel for high-priority problems when it detects that a builder has been failing and we have a culprit. This channel will be different from the "all buildbot failures" channel, because we'll know that something really is happening and needs to be addressed as soon as possible.
This will allow core devs in residence, but also other core devs, to act appropriately, knowing that everything that goes to that channel is important.
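In case it helps to see the shape of it, a rough sketch of that notification follows. The webhook environment variable and message format are assumptions for illustration, not the actual buildbothammer change:

```python
# Sketch: post a high-priority alert to a dedicated Discord channel
# via an incoming webhook when a failing builder has a likely culprit.
import os
import requests

HIGH_PRIORITY_WEBHOOK = os.environ["DISCORD_HIGH_PRIORITY_WEBHOOK"]  # assumed config

def post_high_priority_alert(builder: str, culprit_sha: str, pr_url: str) -> None:
    """Notify the high-priority channel that a builder is consistently failing."""
    content = (
        f":rotating_light: `{builder}` has been failing since {culprit_sha[:10]} "
        f"(likely culprit: {pr_url}). Please investigate or revert."
    )
    requests.post(HIGH_PRIORITY_WEBHOOK, json={"content": content}, timeout=10)
```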
How does that sound @encukou @willingc ?
Maybe in the meanwhile I can modify buildbothammer to just post to a new Discord channel for high-priority problems when it detects that a builder has been failing and we have a culprit. This channel will be different from the "all buildbot failures" channel, because we'll know that something really is happening and needs to be addressed as soon as possible.
I think this is a really good idea. (BTW, I would definitely welcome Pablo's buildbot hammer 🔨)
Sounds great! Hopefully it'll not post often, but enough that we'll notice if it starts failing.
Sounds great! Hopefully it'll not post often, but enough that we'll notice if it starts failing.
I think it is conservative enough that it won't post often. As you mentioned, this has a very low rate of occurrence due to the restriction to a single culprit commit plus several failures in a row.
I love the solutions @encukou and @pablogsal. Here's to better workflows for all :tada:
@pablogsal recently opened a PR for automating revert commits. I agree with the issue but respectfully disagree with that particular proposal.
Copying the reasoning from Pablo's comment:
My own plan of action would be to allow humans to initiate the process:
- Extend cherry-picker to --revert commits. (Note that in Git, cherry-picks and reverts are very similar once started; see the --continue option for cherry_pick and revert, and the sketch below.)
- Open a revert PR automatically when a needs-revert label is added. The GitHub comment should explain that we revert because failing CI can hide other issues.

Does that sound reasonable?
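To make that similarity concrete, here is a minimal sketch (a hypothetical helper, not cherry-picker's actual API) of how a --revert mode could reuse essentially the same flow as cherry-picking, since both operations stop on conflicts and are resumed with --continue:

```python
# Sketch: apply or revert a single commit, leaving conflict resolution to the user.
import subprocess

def apply_commit(sha: str, revert: bool = False) -> None:
    """Cherry-pick the given commit, or revert it if revert=True."""
    cmd = ["git", "revert", "--no-edit", sha] if revert else ["git", "cherry-pick", "-x", sha]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Both operations are resumed the same way after conflicts are fixed:
        #   git revert --continue   /   git cherry-pick --continue
        verb = "revert" if revert else "cherry-pick"
        print(f"Conflicts while applying {sha}; resolve them, then run `git {verb} --continue`.")
```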
(FWIW, my current priority is putting breaking PRs, and individual test failures, on the dashboard -- the first step for a revert or fix is for the failure to show up there.)