Also seeing what appear to be related failures permafailing the pull-kubernetes-node-e2e-containerd presubmit: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-e2e-containerd?buildId=1757818703933607936
failed to pull and unpack image "registry.k8s.io/e2e-test-images/perl:5.26": failed to resolve reference "registry.k8s.io/e2e-test-images/perl:5.26": unexpected status from HEAD request to https://registry.k8s.io/v2/e2e-test-images/perl/manifests/5.26: 504 Gateway Timeout
We're seeing this as well
Warning Failed 65s kubelet Failed to pull image "registry.k8s.io/kube-proxy:v1.27.10": rpc error: code = Unknown desc = failed to pull and unpack image "registry.k8s.io/kube-proxy:v1.27.10": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/kube-proxy/blobs/sha256:db7b01e105753475c198490cf875df1314fd1a599f67ea1b184586cb399e1cae: 504 Gateway Timeout
seems all better now
We're still seeing issues with image pulling being slow and/or timing out.
I see possible issues in us-west1 (GCP) specifically.
Notes for anyone following along:
From the Cloud Run logs I see a lot of error entries; we normally only see warnings (e.g. Artifactory instances spamming the non-standard catalog API => 404).
Spot-checking the errors I see us-west1, and when filtering the error logs I only see errors in us-west1.
Checking us-west1, we have high CPU utilization in that region, which started around 8 AM, along with a large increase in concurrent requests.
It looks to me like something deadlocked, maybe the sync.Map around cached S3 lookups.
Edit: probably attributable to impacts of the outage mentioned below.
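For context on that hypothesis: the pattern in question is a read-through cache in front of S3 existence lookups. The sketch below is a hypothetical illustration (the type and function names are mine, not the actual registry.k8s.io code) of what a sync.Map-backed cache like that typically looks like; the cached path returns immediately, but the uncached path still blocks on the upstream request, so a slow upstream makes requests pile up in a way that can resemble a deadlock.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// existsCache memoizes whether a blob is known to exist in the S3-backed
// layer store, so repeated lookups don't hit the upstream every time.
// Hypothetical sketch only; not the registry.k8s.io implementation.
type existsCache struct {
	m sync.Map // upstream URL (string) -> bool
}

// blobExists returns the cached answer if present, otherwise performs a HEAD
// request against the (hypothetical) upstream URL and caches a positive result.
func (c *existsCache) blobExists(ctx context.Context, upstreamURL string) (bool, error) {
	if v, ok := c.m.Load(upstreamURL); ok {
		return v.(bool), nil
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodHead, upstreamURL, nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Don't cache transient failures; the next request retries upstream.
		return false, err
	}
	defer resp.Body.Close()

	exists := resp.StatusCode == http.StatusOK
	if exists {
		// Only positive results are cached, so a flapping upstream can't
		// poison the cache with "missing" entries.
		c.m.Store(upstreamURL, true)
	}
	return exists, nil
}

func main() {
	c := &existsCache{}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Hypothetical blob URL, purely for demonstration.
	ok, err := c.blobExists(ctx, "https://example.com/some-blob")
	fmt.Println(ok, err)
}
```

In a regional incident like this one, the uncached path just gets slow rather than anything truly deadlocking, which lines up with the edit above attributing this to the outage.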
As pointed out by someone else on the call, us-west1 is experiencing issues https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58#wVCaQTRV9VEch1ZTHVQn
Given that this is impacting multiple services including load balancing, we're probably better off waiting this out for now (past experience: reconfiguring LBs during an outage like this is best avoided ...).
For the future we probably want to have outlier detection enabled, which might have let us automatically route around this (depending on the scope of the outage ...).
That config is in https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/modules/oci-proxy
I've been meaning to look into that myself, but registry.k8s.io is pretty far onto my backburner currently / too many other obligations ...
For immediate mitigation options: https://github.com/kubernetes/registry.k8s.io/blob/main/docs/mirroring/README.md
Also, outages affecting load balancers usually affect end users near the affected region trying to access services on GCP fronted by global GCP load balancers.
This should be mitigated now.
https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58#wVCaQTRV9VEch1ZTHVQn
I'm no longer seeing issues in my environments either; I ran some additional test pulls.
Spoke too soon, definitely still some errors.
It's trending in the right direction, but not error-free yet.
I manually deployed a spurious new revision to us-west1 (same config) to prompt cycling all of the running instances there immediately.
That looks promising
No new error logs since 14:17:13, my tests are not producing any errors, and the graphs are holding after https://github.com/kubernetes/registry.k8s.io/issues/274#issuecomment-1944812760
I think we can actually close this now; I'm going to check back in again later as well, though.
Still no further errors logged and the regional metrics graphs continue to look healthy now.
Is there an existing issue for this?
What did you expect to happen?
curl https://registry.k8s.io/v2/
should return something. Similarly, the ingress-nginx image pull should work.
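For anyone who wants to script that check during a retest, here's a minimal Go sketch equivalent to the curl above; it only assumes that a healthy frontend answers /v2/ with a non-5xx status (during this incident the failures were 504s).

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("https://registry.k8s.io/v2/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// During this incident the failure mode was 5xx (504 Gateway Timeout);
	// anything below 500 means the frontend is at least answering.
	fmt.Println("status:", resp.Status)
	if resp.StatusCode >= 500 {
		os.Exit(1)
	}
}
```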
Debugging Information
Anything else?
No response
Code of Conduct