rust-lang / homu

A bot that integrates with GitHub and your favorite continuous integration service
MIT License
183 stars 56 forks source link

Sometimes approves the wrong commit #174

Open ehuss opened 2 years ago

ehuss commented 2 years ago

There have been two circumstances where a normal @bors r+ approved an old commit:

My hunch is that somehow bors missed the push notification, and the state.head_sha doesn't get updated to the latest version.

I think Homu should never approve a commit that is not the latest commit. I imagine it should check what the latest commit is when approving instead of assuming that the database is in sync. There's definitely risks about race conditions here that may not be solvable, but some extra effort might make it more resilient.

What's scary is that this may be happening and nobody notices. These two instances only got noticed because the old commit failed in CI.

pinkforest commented 2 years ago

Related Zulip: https://rust-lang.zulipchat.com/#narrow/stream/242791-t-infra/topic/Bors.20races.20.2F.20out.20of.20sync/near/288119101

So homu is the result of GitHub REST API rate limitation if I understood it correctly where as bors-ng is stateless and homu stateful relying entirely on the ingress webhooks for state synchronisation - which seems to have failed few or more times.

I am unsure if there is some type of retry mechanism in the ingress GH webhook calls and whether homu honors retries or whether this is something that could be configured on GH side or make Homu support - which could provide easy low hanging fruit fix to increase stability - webhooks are supposed to get status back until they are delivered EDIT: Yes there is new API for exactly that ^^^ :partying_face:

1 - Figure out the impact first

I am going to see if I can do a GraphQL query/ies to scrape off all the PR r+ comments/commits and correlate to figure out how many times - and exactly where - homu has picked up the wrong commit vs what the latest commit was given PR

That analysis would / will allow us to see which and how many PRs are affected and then see how much of an issue this is.

2 - Potentially fix something

Then to potentially fix this in Homu it would mean either - depending on reliability / real impact factors as above

1) Using GraphQL queries to ensure the state is synchronsised that doesn't hurt rate limits - if impact was high

pinkforest commented 2 years ago

Re: GitHub delivery tracking API

Seems GitHub has introduced API exactly for what we might need re 2.2) ..

https://github.blog/changelog/2021-06-30-webhook-deliveries-api/

https://docs.github.com/en/rest/reference/repos#webhooks

Can someone go see what the situation looks for 30 last days?

Also it may as well be that simple "healthcheck" polling monitor that can trigger the retries via this new API in homu might greatly improve webhook deliverability if required ?

And if the "healthcheck" monitor simply reports to Homu that there are undelivered webhooks it would hang on before acting on things..

Mark-Simulacrum commented 2 years ago

I think rather than focusing on approval time, we should probably have a check at (bors) merge commit generation time that the approved source commit is equivalent to the PR head commit at that time. That also helps us avoid issues with further pushes being missed.

I don't think focusing on missed webhooks is going to work well; GitHub has just not delivered things in the past for some time, we shouldn't depend on that for reliability here.

pinkforest commented 2 years ago

That would require querying GH one per commit generation for those but I guess there isn't too many of those that would break the rate limit as was the reason for Homu to exist Also that would be just to provide a failure and feedback mechanism but not really fixing the sync issue that happens at approval time Are we happy with that? Or do we mean that homu will pick nearest-before-approval commit regardless what was supposed approved commit (out of sync) without providing feedback about some internal sync failure e.g. just deal with it?

Mark-Simulacrum commented 2 years ago

We would post an error message in that case. There's no real issue with rate limits -- we're only generating new merge commits probably roughly <15 times/day, so checking at that time is very cheap.

I think there's more work that can be done to improve homu's synchronization with real state, but that's a separate issue from this one and doesn't need to be coupled.

pinkforest commented 2 years ago

I'll also try to finish the impact analysis during the weekend off curiosity -

just needs some regular expression on bors messages and comparing them to commit history

I just got some quick GraphQL running on GH API for doing analysis what has been going on: https://github.com/pinkforest/rust-gh-homu-off-syncs

---- PR(true): Add a `--build-dir` flag to rustbuild
 79f8dc0 - commit
bors - :pushpin: Commit 79f8dc0b898b0a387df684a539cd97446a0f964f has been approved by `jyn514`

<!-- @bors r=jyn514 79f8dc0b898b0a387df684a539cd97446a0f964f -->
<!-- homu: {"type":"Approved","sha":"79f8dc0b898b0a387df684a539cd97446a0f964f","approver":"jyn514"} -->
---- PR(true): Add macro_rules! rustdoc change to 1.62 relnotes
 4ea18cc - commit
bors - :v: @ can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":""} -->
bors - :v: @CAD97 can now approve this pull request
<!-- homu: {"type":"Delegated","delegator":"jyn514","delegate":"CAD97"} -->
bors - :pushpin: Commit ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 has been approved by `Mark-Simulacrum`

<!-- @bors r=Mark-Simulacrum ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5 -->
<!-- homu: {"type":"Approved","sha":"ab437ad5e46db30f1dc08d21cd73f9ef9ffa13b5","approver":"Mark-Simulacrum"} -->
bors - :pushpin: Commit 4ea18ccf7e4e4afe213c2b3a74c558135c423fde has been approved by `jyn514`

<!-- @bors r=jyn514 4ea18ccf7e4e4afe213c2b3a74c558135c423fde -->
<!-- homu: {"type":"Approved","sha":"4ea18ccf7e4e4afe213c2b3a74c558135c423fde","approver":"jyn514"} -->
---- PR(true): Rollup of 9 pull requests
 1d845bd - commit
 c6f362a - commit
 ddb6313 - commit
 debee1e - commit
 8931fbd - commit
 e043821 - commit
 c29e584 - commit
 cd7bd8b - commit
 3fcf84a - commit
 d791310 - commit
 79f8dc0 - commit
 4ea18cc - commit
 0d5636c - commit
 41e7991 - commit
 0420231 - commit
 e59693a - commit
 734f21c - commit
 80dd48b - commit
 c4acd06 - commit
 335e7d3 - commit
 18d4228 - commit
bors - :pushpin: Commit 18d4228456a98fd6d8950f74fd117aba7fb45757 has been approved by `matthiaskrgr`

<!-- @bors r=matthiaskrgr 18d4228456a98fd6d8950f74fd117aba7fb45757 -->
<!-- homu: {"type":"Approved","sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","approver":"matthiaskrgr"} -->
bors - :hourglass: Testing commit 18d4228456a98fd6d8950f74fd117aba7fb45757 with merge 7e2733bb1dd9afe5fd20370ca4d539d42ac50419...
<!-- homu: {"type":"BuildStarted","head_sha":"18d4228456a98fd6d8950f74fd117aba7fb45757","merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->
bors - :sunny: Test successful - [checks-actions](https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true)
Approved by: matthiaskrgr
Pushing 7e2733bb1dd9afe5fd20370ca4d539d42ac50419 to master...
<!-- homu: {"type":"BuildCompleted","approved_by":"matthiaskrgr","base_ref":"master","builders":{"checks-actions":"https://github.com/rust-lang-ci/rust/runs/7148849106?check_suite_focus=true"},"merge_sha":"7e2733bb1dd9afe5fd20370ca4d539d42ac50419"} -->

After that I can push a PR for Homu to do that merge commit commit sync check

notriddle commented 2 years ago

if I understood it correctly where as bors-ng is stateless and homu stateful

Unfortunately, no. There are three borses, not two.

pinkforest commented 2 years ago

Raised #178 for suspected general WebHook delivery instability which addressing commit mixups alone would not fix (this ticket)

tgross35 commented 6 months ago

For reference, here is another case where this (or something similar) happened https://github.com/rust-lang/rust/pull/119748#issuecomment-1971772018