istio / test-infra

Apache License 2.0
73 stars 175 forks source link

Cherrypicker missing many (all?) cherrypicks from label #2309

Open howardjohn opened 4 years ago

howardjohn commented 4 years ago

Example: https://github.com/istio/istio/pull/20543, https://github.com/istio/istio/pull/20442

Do we need to setup config per-branch for this label? I could not find anything where we specify this so I assumed it applies to any cherrypick/* label. If the labels don't work we should just remove the labels and have everyone use the command since it seems to work more reliably

clarketm commented 4 years ago

Do we need to setup config per-branch for this label? I could not find anything where we specify this so I assumed it applies to any cherrypick/* label. If the labels don't work we should just remove the labels and have everyone use the command since it seems to work more reliably

@howardjohn

There is no special configuration; the plugin simply looks for the cherrypick/ label prefix. Interestingly, not all 1.5 cherrypicks were failures: https://github.com/istio/istio/pull/20244; https://github.com/istio/istio/pull/20408; https://github.com/istio/istio/pull/20530 though a large portion were 🤔. I would surmise that it is not an issue specifically with the label but more generally with the plugin (since this is not the first occurrence of it flaking: #1598). Moreover, the slash comment (i.e. /cherrypick ...) version seems to be prone to error as well: https://github.com/istio/istio/pull/20511#issuecomment-578378155, which further leads me to this conclusion. There is, of course, a possiblity that this is environmental (i.e. GitHub not sending webhooks), but upon initial inspection I do not think this is the case. Regardless, I recommend going with the slash comment version if it is more reliable - at least until I can delve into the logs, determine the cause, and provide a solution.

ericvn commented 4 years ago

FYI: https://github.com/istio/istio.io/issues/6332

clarketm commented 4 years ago

All of the errors that I am seeing thus far are git fetch timeouts (related: https://github.com/openshift/release/issues/6110, kubernetes/test-infra#15225).

e.g. https://github.com/istio/istio/pull/20543

{
 insertId: "170kjw2f50zixp"  
 jsonPayload: {
  error: "git fetch error: exit status 128. output: fatal: remote error: upload-pack: not our ref 2df38397f8b7ef9e1415dd777ddce3c64a0d149d
fatal: The remote end hung up unexpectedly
"   
  event-GUID: "ecdba800-4157-11ea-9f97-d30febca7c89"   
  event-type: "pull_request"   
  level: "info"   
  msg: "Cherry-pick failed."   
  plugin: "cherrypick"   
 }
 labels: {
  compute.googleapis.com/resource_name: "gke-prow-prod-pool-fb335f1f-5rds"   
  container.googleapis.com/namespace_name: "default"   
  container.googleapis.com/pod_name: "cherrypicker-6895fb69fc-mcchq"   
  container.googleapis.com/stream: "stderr"   
 }
 logName: "projects/istio-testing/logs/cherrypicker"  
 receiveTimestamp: "2020-01-27T22:54:27.758925058Z"  
 resource: {
  labels: {
   cluster_name: "prow"    
   container_name: "cherrypicker"    
   instance_id: "294245110686867759"    
   namespace_id: "default"    
   pod_id: "cherrypicker-6895fb69fc-mcchq"    
   project_id: "istio-testing"    
   zone: "us-west1-a"    
  }
  type: "container"   
 }
 severity: "ERROR"  
 timestamp: "2020-01-27T22:54:22Z"  
}

cc @cjwagner

cjwagner commented 4 years ago

Well I don't see a 2df38397f8b7ef9e1415dd777ddce3c64a0d149d commit anywhere in the istio org so it makes sense that that fails. I wonder where that came from... Looks like we only fetch with git when checking out the repo and use GitHub to get the PR patch?

howardjohn commented 4 years ago

That SHA is in my fork (and the PR referenced is from me): https://github.com/howardjohn/istio/commit/2df38397f8b7ef9e1415dd777ddce3c64a0d149d. I am not exactly sure where its from, but its not completely random?

cjwagner commented 4 years ago

Interesting, that commit is a merge of the merge commit of istio/istio#20543 and some other commit. I don't see anywhere we try to clone the user's fork and that commit shouldn't matter anyways since it is entirely separate from the PR...

clarketm commented 4 years ago

There are plenty more, but here are a few of the other commits scraped from the logs which also failed on fetch: istio/istio@c36fd36db6639120a0a2151a31d97dbf2d607b27, istio/istio@975734e3124d5019ad99fbdbbf69bbb85f8a8215, istio/istio@d9af9f83e59a924b805544f88dc4207c5b931196. They all appear to be merge commits.

cjwagner commented 4 years ago

In particular they are all merge commits in which one of the 2 parent commits is the merge commit of a PR that requested a cherry pick.

clarketm commented 4 years ago

Running a load test using the following code and git version 2.24.1 I was able to reproduce the error for istio/istio in an isolated environment. That tells me it is likely a git or GitHub issue:

export GIT_TRACE=1

tmp_dir=$(mktemp -d -t ci-XXXXXXXXXX)
trap 'rm -rf $mktemp' EXIT

org="istio"
format_string="https://github.com/%s/%s.git"

repos=(
    "istio.io"
    "istio"
    "release-builder"
    "proxy"
    "envoy"
    "cni"
    "operator"
    "client-go"
)

pushd "$tmp_dir"

for repo in "${repos[@]}"; do
    src="$(printf "$format_string" "$org" "$repo")"

    git clone --mirror "$src"

    pushd "$(basename "$src")"
    git fetch
    popd
done

popd

with error:

15:54:11.760238 git.c:439               trace: built-in: git clone --mirror https://github.com/istio/istio.git
Cloning into bare repository 'istio.git'...
warning: templates not found in /Users/clarketm/.git-templates
15:54:11.767597 run-command.c:663       trace: run_command: git-remote-https origin https://github.com/istio/istio.git
15:54:12.697024 run-command.c:663       trace: run_command: git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://github.com/istio/istio.git/
15:54:12.704064 git.c:439               trace: built-in: git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://github.com/istio/istio.git/

fatal: remote error: upload-pack: not our ref 39c3dbcd9b9c18c21ea97898922749528bbf96fe

Note: istio/istio@39c3dbcd9b9c18c21ea97898922749528bbf96fe follows the same pattern as above.

clarketm commented 4 years ago

@stevekuznetsov @petr-muller

I see both openshift (https://github.com/openshift/release/issues/6110) and prow (https://github.com/kubernetes/test-infra/issues/15225) have been experiencing pain related to this error remote error: upload-pack: not our ref ... in the recent past (~2-3 months). I am curious if anyone has found a solution, workaround, or cause (short of increasing retries)?

stevekuznetsov commented 4 years ago

I wonder if that merge commit is somehow related to in-repo config fetching?

The upload-pack: not our ref error is in general just related to GitHub's REST API racing with what's available over git from their remotes. The workaround in most cases is just to try again, the most common case is when the REST API/webhooks tell us about a refspec that's not pullable (yet) from the git servers.

clarketm commented 4 years ago

I wonder if that merge commit is somehow related to in-repo c fetching?

I am still unsure about the merge commit origination. I will need to dig into the in-repo code to see.

The upload-pack: not our ref error is in general just related to GitHub's REST API racing with what's available over git from their remotes. The workaround in most cases is just to try again, the most common case is when the REST API/webhooks tell us about a refspec that's not pullable (yet) from the git servers.

That makes sense. So, it sounds like this is less of a bug and more of a necessary side-effect of eventual consistency and replication. I am fine with more retries or a longer backoff factor; the problem is that these are hardcoded to 3 retries with a relatively short backoff factor of 2 in the git client. This is apparently not enough since the above missed cherrypicks only happen after all retries are exhausted.

I guess a good first step is to try to identify, with certainty, what/who creates the merge commit(s). The I can tweak the retryCmd in k8s/test-infra to make the retry count and/or backoff factor configurable - and increase the quantity for these values.

cjwagner commented 4 years ago

Still sounds like there is a bug somewhere given that we still don't know why the unrelated merge commit is being fetched. +1 for identifying that first before trying to work around with retries.

Since this looks like a git/GitHub bug and since you have a simple script to reproduce this it is probably time to escalate to GH support?

clarketm commented 4 years ago

Still sounds like there is a bug somewhere given that we still don't know why the unrelated merge commit is being fetched. +1 for identifying that first before trying to work around with retries.

Since this looks like a git/GitHub bug and since you have a simple script to reproduce this it is probably time to escalate to GH support?

GitHub support ticket created. In the meantime, I will add support for more (or longer time between) retries to see if that mitigates.

alvaroaleman commented 4 years ago

The git fetch error is emitted in the git clients Clone which doesn't get any kind of revision passed in.

The cherrypicker fetches the patch from the GitHub api and not via the git client: https://github.com/kubernetes/test-infra/blob/b2d632960440e47f671dd3b997709aa299911456/prow/external-plugins/cherrypicker/server.go#L378

This doesn't really answer the question why exactly this happens (maybe revisions that are not actually existing anymore are advertised by github?) but I doesn't look like a bug in our code.