Open howardjohn opened 4 years ago
Do we need to setup config per-branch for this label? I could not find anything where we specify this so I assumed it applies to any
cherrypick/*
label. If the labels don't work we should just remove the labels and have everyone use the command since it seems to work more reliably
@howardjohn
There is no special configuration; the plugin simply looks for the cherrypick/
label prefix. Interestingly, not all 1.5 cherrypicks were failures: https://github.com/istio/istio/pull/20244; https://github.com/istio/istio/pull/20408; https://github.com/istio/istio/pull/20530 though a large portion were 🤔. I would surmise that it is not an issue specifically with the label but more generally with the plugin (since this is not the first occurrence of it flaking: #1598). Moreover, the slash comment (i.e. /cherrypick ...
) version seems to be prone to error as well: https://github.com/istio/istio/pull/20511#issuecomment-578378155, which further leads me to this conclusion. There is, of course, a possiblity that this is environmental (i.e. GitHub not sending webhooks), but upon initial inspection I do not think this is the case. Regardless, I recommend going with the slash comment version if it is more reliable - at least until I can delve into the logs, determine the cause, and provide a solution.
All of the errors that I am seeing thus far are git fetch
timeouts (related: https://github.com/openshift/release/issues/6110, kubernetes/test-infra#15225).
{
insertId: "170kjw2f50zixp"
jsonPayload: {
error: "git fetch error: exit status 128. output: fatal: remote error: upload-pack: not our ref 2df38397f8b7ef9e1415dd777ddce3c64a0d149d
fatal: The remote end hung up unexpectedly
"
event-GUID: "ecdba800-4157-11ea-9f97-d30febca7c89"
event-type: "pull_request"
level: "info"
msg: "Cherry-pick failed."
plugin: "cherrypick"
}
labels: {
compute.googleapis.com/resource_name: "gke-prow-prod-pool-fb335f1f-5rds"
container.googleapis.com/namespace_name: "default"
container.googleapis.com/pod_name: "cherrypicker-6895fb69fc-mcchq"
container.googleapis.com/stream: "stderr"
}
logName: "projects/istio-testing/logs/cherrypicker"
receiveTimestamp: "2020-01-27T22:54:27.758925058Z"
resource: {
labels: {
cluster_name: "prow"
container_name: "cherrypicker"
instance_id: "294245110686867759"
namespace_id: "default"
pod_id: "cherrypicker-6895fb69fc-mcchq"
project_id: "istio-testing"
zone: "us-west1-a"
}
type: "container"
}
severity: "ERROR"
timestamp: "2020-01-27T22:54:22Z"
}
cc @cjwagner
Well I don't see a 2df38397f8b7ef9e1415dd777ddce3c64a0d149d
commit anywhere in the istio org so it makes sense that that fails. I wonder where that came from... Looks like we only fetch with git when checking out the repo and use GitHub to get the PR patch?
That SHA is in my fork (and the PR referenced is from me): https://github.com/howardjohn/istio/commit/2df38397f8b7ef9e1415dd777ddce3c64a0d149d. I am not exactly sure where its from, but its not completely random?
Interesting, that commit is a merge of the merge commit of istio/istio#20543 and some other commit. I don't see anywhere we try to clone the user's fork and that commit shouldn't matter anyways since it is entirely separate from the PR...
There are plenty more, but here are a few of the other commits scraped from the logs which also failed on fetch
: istio/istio@c36fd36db6639120a0a2151a31d97dbf2d607b27, istio/istio@975734e3124d5019ad99fbdbbf69bbb85f8a8215, istio/istio@d9af9f83e59a924b805544f88dc4207c5b931196. They all appear to be merge commits.
In particular they are all merge commits in which one of the 2 parent commits is the merge commit of a PR that requested a cherry pick.
Running a load test using the following code and git
version 2.24.1 I was able to reproduce the error for istio/istio
in an isolated environment. That tells me it is likely a git
or GitHub
issue:
export GIT_TRACE=1
tmp_dir=$(mktemp -d -t ci-XXXXXXXXXX)
trap 'rm -rf $mktemp' EXIT
org="istio"
format_string="https://github.com/%s/%s.git"
repos=(
"istio.io"
"istio"
"release-builder"
"proxy"
"envoy"
"cni"
"operator"
"client-go"
)
pushd "$tmp_dir"
for repo in "${repos[@]}"; do
src="$(printf "$format_string" "$org" "$repo")"
git clone --mirror "$src"
pushd "$(basename "$src")"
git fetch
popd
done
popd
with error:
15:54:11.760238 git.c:439 trace: built-in: git clone --mirror https://github.com/istio/istio.git
Cloning into bare repository 'istio.git'...
warning: templates not found in /Users/clarketm/.git-templates
15:54:11.767597 run-command.c:663 trace: run_command: git-remote-https origin https://github.com/istio/istio.git
15:54:12.697024 run-command.c:663 trace: run_command: git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://github.com/istio/istio.git/
15:54:12.704064 git.c:439 trace: built-in: git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://github.com/istio/istio.git/
fatal: remote error: upload-pack: not our ref 39c3dbcd9b9c18c21ea97898922749528bbf96fe
Note: istio/istio@39c3dbcd9b9c18c21ea97898922749528bbf96fe follows the same pattern as above.
@stevekuznetsov @petr-muller
I see both openshift (https://github.com/openshift/release/issues/6110) and prow (https://github.com/kubernetes/test-infra/issues/15225) have been experiencing pain related to this error remote error: upload-pack: not our ref ...
in the recent past (~2-3 months). I am curious if anyone has found a solution, workaround, or cause (short of increasing retries)?
I wonder if that merge commit is somehow related to in-repo config fetching?
The upload-pack: not our ref
error is in general just related to GitHub's REST API racing with what's available over git
from their remotes. The workaround in most cases is just to try again, the most common case is when the REST API/webhooks tell us about a refspec that's not pullable (yet) from the git
servers.
I wonder if that merge commit is somehow related to in-repo c fetching?
I am still unsure about the merge commit origination. I will need to dig into the in-repo code to see.
The
upload-pack: not our ref
error is in general just related to GitHub's REST API racing with what's available overgit
from their remotes. The workaround in most cases is just to try again, the most common case is when the REST API/webhooks tell us about a refspec that's not pullable (yet) from thegit
servers.
That makes sense. So, it sounds like this is less of a bug and more of a necessary side-effect of eventual consistency and replication. I am fine with more retries or a longer backoff factor; the problem is that these are hardcoded to 3 retries with a relatively short backoff factor of 2 in the git
client. This is apparently not enough since the above missed cherrypicks only happen after all retries are exhausted.
I guess a good first step is to try to identify, with certainty, what/who creates the merge commit(s). The I can tweak the retryCmd
in k8s/test-infra
to make the retry count and/or backoff factor configurable - and increase the quantity for these values.
Still sounds like there is a bug somewhere given that we still don't know why the unrelated merge commit is being fetched. +1 for identifying that first before trying to work around with retries.
Since this looks like a git/GitHub bug and since you have a simple script to reproduce this it is probably time to escalate to GH support?
Still sounds like there is a bug somewhere given that we still don't know why the unrelated merge commit is being fetched. +1 for identifying that first before trying to work around with retries.
Since this looks like a git/GitHub bug and since you have a simple script to reproduce this it is probably time to escalate to GH support?
GitHub support ticket created. In the meantime, I will add support for more (or longer time between) retries to see if that mitigates.
The git fetch
error is emitted in the git clients Clone
which doesn't get any kind of revision passed in.
The cherrypicker fetches the patch from the GitHub api and not via the git client: https://github.com/kubernetes/test-infra/blob/b2d632960440e47f671dd3b997709aa299911456/prow/external-plugins/cherrypicker/server.go#L378
This doesn't really answer the question why exactly this happens (maybe revisions that are not actually existing anymore are advertised by github?) but I doesn't look like a bug in our code.
Example: https://github.com/istio/istio/pull/20543, https://github.com/istio/istio/pull/20442
Do we need to setup config per-branch for this label? I could not find anything where we specify this so I assumed it applies to any
cherrypick/*
label. If the labels don't work we should just remove the labels and have everyone use the command since it seems to work more reliably