kubernetes / test-infra

Test infrastructure for the Kubernetes project.

tide should not fail to sync subpool with unmergable pr that is already merged #16643

Closed · BenTheElder closed this issue 4 years ago

BenTheElder commented 4 years ago

What happened:

jsonPayload: {
  base-sha: "9e853ce03cf83525699f5a623372c6b09694ab9e"   
  branch: "master"   
  component: "tide"   
  controller: "sync"   
  error: "failed merging [16641]: [PR is unmergable. Do the Tide merge requirements match the GitHub settings for the repo? Pull Request is not mergeable]"   
  file: "prow/tide/tide.go:410"   
  func: "k8s.io/test-infra/prow/tide.(*Controller).Sync.func2"   
  level: "error"   
  msg: "Error syncing subpool."   
  org: "kubernetes"   
  repo: "test-infra"   
 }

and yet https://github.com/kubernetes/test-infra/pull/16641#event-3103161369 is already merged ??

What you expected to happen:

tide should continue merging and not fail to sync the subpool

How to reproduce it (as minimally and precisely as possible):

no idea!

Please provide links to example occurrences, if any:

see above link and the log snippet

Anything else we need to know?:

this is bizarre

/area prow

cjwagner commented 4 years ago

We have seen this behavior when GitHub's search index becomes incorrect. If that is the case, commenting on the PR is enough to resolve the problem, but we should still report it to GH support so they can address the bug. Did you by chance check whether the PR was showing up in queries for open PRs in the repo (using the GH search UI, not the API)?
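
For reference, a search of roughly the following shape in the GitHub UI is enough to check; the exact qualifiers Tide adds depend on its query configuration, so treat this as an illustration. If the index is healthy, an already-merged PR should not show up in the results.

```
repo:kubernetes/test-infra is:pr state:open
```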

Katharine commented 4 years ago

It was not, which was a little surprising to me.

chases2 commented 4 years ago

This (or a similar issue) is occurring again:

jsonPayload: {
  base-sha: "a9226bb268e3234e671cbf56ad9109b801d298fc"   
  branch: "master"   
  component: "tide"   
  controller: "sync"   
  error: "PR is unmergable. Do the Tide merge requirements match the GitHub settings for the repo? Pull Request is not mergeable"   
  file: "prow/tide/tide.go:1085"   
  func: "k8s.io/test-infra/prow/tide.(*Controller).mergePRs"   
  level: "error"   
  merge-targets: [1]   
  msg: "Merge failed."   
  org: "kubernetes"   
  pr: 2949   
  repo: "autoscaler"   
  sha: "8e619f67ad496f21400a4486d4fc14db9a53de9c"   
 }

The PR in question merged about 7 hours ago. Here is the other error, paired with the one above:

{
  base-sha: "a9226bb268e3234e671cbf56ad9109b801d298fc"   
  branch: "master"   
  component: "tide"   
  controller: "sync"   
  error: "failed merging [2949]: [PR is unmergable. Do the Tide merge requirements match the GitHub settings for the repo? Pull Request is not mergeable]"   
  file: "prow/tide/tide.go:417"   
  func: "k8s.io/test-infra/prow/tide.(*Controller).Sync.func2"   
  level: "error"   
  msg: "Error syncing subpool."   
  org: "kubernetes"   
  repo: "autoscaler"   
 }

chases2 commented 4 years ago

Queries for the repositories' open issues did not surface 2949. I only found it after looking for a closed PR.

munnerz commented 4 years ago

We've also seen this: https://github.com/jetstack/cert-manager-nginx-plus-lab/pull/21

munnerz commented 4 years ago

It appears that this one is probably caused by GitHub returning incorrect results:

[Screenshot 2020-03-26 at 15 30 01: https://user-images.githubusercontent.com/203583/77664853-b7e0ab80-6f76-11ea-8cf0-12fccfa35aff.png]

BenTheElder commented 4 years ago

Getting stuck due to this without much visibility is sub-optimal. We could probably query the unmergable PR to discover whether this is the case and blacklist it. Alternatively, we could recommend alerting on stuck Tide, I guess.

So far this seems to lock up until someone realizes their PRs are taking suspiciously long to test and merge, which is difficult given how slow and flaky our tests are >.> Even test-infra has >10m presubmits again :/

cjwagner commented 4 years ago

we could probably query the unmergable PR to discover if this is the case and blacklist it.

We're actually already checking that information so the issue is likely that GH is returning incorrect info.

alternatively, we could recommend alerting on stuck tide I guess.

This should already be happening if a repo is stuck for more than 15 mins (besides kubeflow repos, which have been broken forever): https://github.com/kubernetes/test-infra/blob/cf4eb1c6d2060675dbf72d04afdc13276ea9f070/prow/cluster/monitoring/mixins/prometheus/tide_alerts.libsonnet#L33-L45

If it isn't, then that alert isn't working as intended for that case for some reason.
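
For anyone reading along without the repo open, that alert lives in a jsonnet Prometheus mixin; the sketch below only shows the general shape of a "pool stuck for 15 minutes" rule. The metric and label names here are placeholders rather than the ones Prow actually exports, so refer to the linked file for the real definition.

```jsonnet
// Sketch only: tide_pool_last_sync_timestamp_seconds is a placeholder metric
// name, not one Prow actually exports. See tide_alerts.libsonnet (linked
// above) for the real rule.
{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'tide-sketch',
        rules: [
          {
            alert: 'TidePoolStuck',
            // Fire when a pool has not synced successfully for 15 minutes.
            expr: 'time() - max by (org, repo, branch) (tide_pool_last_sync_timestamp_seconds) > 15 * 60',
            'for': '5m',
            labels: { severity: 'warning' },
            annotations: {
              message: 'Tide pool {{ $labels.org }}/{{ $labels.repo }}:{{ $labels.branch }} has not synced in over 15 minutes.',
            },
          },
        ],
      },
    ],
  },
}
```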

BenTheElder commented 4 years ago

We're actually already checking that information so the issue is likely that GH is returning incorrect info.

via what API? the search results are clearly wrong, but when I loaded the page it was merged, so one of the APIs must know.

cjwagner commented 4 years ago

We're actually already checking that information so the issue is likely that GH is returning incorrect info.

via what API? the search results are clearly wrong, but when I loaded the page it was merged, so one of the APIs must know.

Via the v4 search API. (We avoid the v3 API since it is far more expensive in terms of rate limit when handling lots of PRs; that is one of the major distinctions between Tide and mungegithub.) Unfortunately we've seen that the fact that something works on GitHub's front end doesn't imply a corresponding API exists or works properly.

The mergeable status of PRs is unique in that it is lazily computed. As a result it is possible that Tide tries to merge an unmergeable PR if the status isn't populated yet, but the act of requesting the PR is supposed to trigger a background job to recalculate mergeability, so it should quickly identify the PR as unmergeable on subsequent sync loops.

https://developer.github.com/v3/pulls/#response-1
https://developer.github.com/v4/object/pullrequest/#fields
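
For concreteness, these are the relevant fields on the v4 PullRequest object (a hand-written illustration, not the query Tide actually sends; Tide folds this information into its search and pool queries):

```graphql
query {
  repository(owner: "kubernetes", name: "test-infra") {
    pullRequest(number: 16641) {
      state      # OPEN / CLOSED / MERGED
      merged     # true once the PR has landed
      mergeable  # MERGEABLE / CONFLICTING / UNKNOWN (lazily recomputed)
    }
  }
}
```

If `mergeable` comes back as `UNKNOWN`, GitHub is still recomputing it in the background, while `merged` and `state` are read from the PR record itself rather than the search index.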

If we see this again I'll try to check out the v3 behavior on the offending PR, but either way this seems like a problem with GH's API, so I would prefer to have them address it rather than work around it (especially if it costs rate limit to do so).

Katharine commented 4 years ago

To risk asking a really dumb question: if a PR is unmergeable, why not just move on to the next PR and try merging that? We have seen many times that it is unwise to assume that PRs that we think are mergeable actually always are.

We see lots of single PR failures in interesting ways. It's not clear to me that this should hold up the rest of the repo.

BenTheElder commented 4 years ago

It has to be costing us API tokens to continue to do useless things instead of actually merging.

I bet the information that a PR is "merged" is not lazy, unlike the "mergeable" data and the search API. I'll try to check next time I see this.

Continuing to rely solely on the search API seems to be a mistake: we know that it involves running a complex indexing job and frequently returns incorrect information. We've even had the indexing for our repos manually repaired before by contacting support, IIRC, which seems less than ideal.

+1 for moving on to the next PR.
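
A rough sketch of the cross-check I have in mind (not Tide code; Tide would presumably go through its existing GitHub client rather than raw HTTP like this): fetch the PR directly from the v3 endpoint, and if `merged` is already true, drop it from the pool and keep going.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal fields we care about from GET /repos/{org}/{repo}/pulls/{number}.
type pullRequest struct {
	Merged    bool   `json:"merged"`
	Mergeable *bool  `json:"mergeable"` // null while GitHub recomputes it
	State     string `json:"state"`     // "open" or "closed"
}

// alreadyMerged asks the v3 API whether the PR has already landed.
func alreadyMerged(org, repo string, number int) (bool, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/pulls/%d", org, repo, number)
	resp, err := http.Get(url) // a real client would add auth and Accept headers
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var pr pullRequest
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		return false, err
	}
	return pr.Merged, nil
}

func main() {
	merged, err := alreadyMerged("kubernetes", "test-infra", 16641)
	if err != nil {
		fmt.Println("cross-check failed:", err)
		return
	}
	if merged {
		fmt.Println("PR is already merged; drop it from the pool and move on to the next PR")
	}
}
```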

cpoole commented 4 years ago

For what it's worth, we have seen this on our internal repos as well (repos with orders of magnitude less activity than the k8s community repos).

Same symptom and resolution: the GitHub v4 API shows the PR as open even though Tide has already merged it. Commenting on the PR causes GitHub to properly update its state.

cjwagner commented 4 years ago

@cpoole I left this note in Slack; please ping me when you see such occurrences so I can try to get GH to take this seriously. They won't reopen the support ticket until I can link them to live examples (even though I have done that repeatedly on prior support requests for the same problem).

If anyone sees Tide get stuck on a repo trying to merge a PR that has already been merged please avoid touching the problematic PR (e.g. don't comment or change labels) and ping me ASAP! We've been having some problems with a GitHub search indexing bug(s) on and off for over a year and GH support is requesting some more live examples of the problem. We normally can fix the search index by commenting or otherwise updating the problematic PR, but please avoid doing so if the repo can survive a temporary merge automation outage so that GH support can inspect it. Thanks!

cpoole commented 4 years ago

Will do @cjwagner

DoomGerbil commented 4 years ago

If anyone sees Tide get stuck on a repo trying to merge a PR that has already been merged please avoid touching the problematic PR (e.g. don't comment or change labels) and ping me ASAP! We've been having some problems with a GitHub search indexing bug(s) on and off for over a year and GH support is requesting some more live examples of the problem. We normally can fix the search index by commenting or otherwise updating the problematic PR, but please avoid doing so if the repo can survive a temporary merge automation outage so that GH support can inspect it. Thanks!

I've just had this occur on an internal repo in my company's org (improbable) with our private instance of Tide.

In our case, the PR was stuck for 6 hours, and it became unstuck and was merged by Tide when I started sending search queries via the GitHub UI; nobody touched the PR in question while it was in the stuck state.

@cjwagner - If this might help, you can point the GitHub people at my example:

{
  "base-sha": "a266dc919b94dfd76efa5ce97d7b76a66339421d",
  "batch-passing": null,
  "batch-pending": null,
  "branch": "master",
  "component": "tide",
  "controller": "sync",
  "file": "prow/tide/tide.go:1470",
  "func": "k8s.io/test-infra/prow/tide.(*Controller).syncSubpool",
  "level": "info",
  "msg": "Subpool accumulated.",
  "org": "improbable",
  "prs-missing": null,
  "prs-passing": [
    104,
    100
  ],
  "prs-pending": null,
  "repo": "config-intinf",
  "time": "2020-04-30T15:25:30Z"
}

cjwagner commented 4 years ago

@DoomGerbil Thanks for sharing here. I've included a link to your comment in the email thread with GH support.

pabrahamsson commented 4 years ago

@cjwagner in case you need another sample, redhat-cop/containers-quickstarts#323. This PR was merged last week, but Tide is still trying to merge it and other PRs are now waiting behind it. Tide log:

{
  "base-sha": "e60b85d992e43cb07c583b9b69c793146a8458a6",
  "batch-passing": null,
  "batch-pending": null,
  "branch": "master",
  "component": "tide",
  "controller": "sync",
  "file": "prow/tide/tide.go:1180",
  "func": "k8s.io/test-infra/prow/tide.(*Controller).syncSubpool",
  "level": "info",
  "msg": "Subpool accumulated.",
  "org": "redhat-cop",
  "prs-missing": null,
  "prs-passing": [
    332,
    323,
    333
  ],
  "prs-pending": null,
  "repo": "containers-quickstarts",
  "time": "2020-05-06T21:16:02Z"
}

cjwagner commented 4 years ago

@pabrahamsson I'd poke the PR with a comment to try to fix the search indexing. It sounds like GH has enough examples for the time being. This was the last response from May 4th:

Thanks for all the details. We investigated and found this to be the result of stale search indices. We therefore started working on a patch which will effectively restart failed indexing jobs.

I don't have an estimate on when this fix will ship, but we have all the info we need, so you can go ahead and comment on the pull requests in question and unblock your developers.

Thanks for your patience and collaboration in this matter.
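
For anyone who needs to apply that workaround programmatically, the poke can be as small as a comment posted through the v3 issues API (the issues comments endpoint works for PRs too). A minimal sketch; the org/repo/PR number and the GITHUB_TOKEN environment variable are placeholders, and nothing in Tide does this automatically:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Placeholders: point this at the stuck PR. The issues comments endpoint
	// also accepts pull request numbers.
	url := "https://api.github.com/repos/ORG/REPO/issues/123/comments"
	body := strings.NewReader(`{"body": "Poking the PR to refresh GitHub's search index."}`)

	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_TOKEN")) // assumed token with repo scope
	req.Header.Set("Accept", "application/vnd.github.v3+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // expect 201 Created on success
}
```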

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/16643#issuecomment-703180544):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.