kubernetes / registry.k8s.io

This project is the repo for registry.k8s.io, the production OCI registry service for Kubernetes' container image artifacts
https://registry.k8s.io
Apache License 2.0

regional outage due to GCP us-west1 incident #274

Closed: jwatte closed this issue 9 months ago

jwatte commented 9 months ago

What did you expect to happen?

curl https://registry.k8s.io/v2/ should return something

Similarly ingress-nginx image pull should work.
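
(Editorial aside, not part of the original report: the same check can be scripted. Below is a minimal Go sketch equivalent to the curl call above; the 10-second timeout is an arbitrary choice.)

```go
// A minimal probe equivalent to `curl https://registry.k8s.io/v2/`.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second} // arbitrary timeout
	resp, err := client.Get("https://registry.k8s.io/v2/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
	if resp.StatusCode >= 500 {
		// During the incident this endpoint was timing out upstream.
		os.Exit(1)
	}
}
```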

Debugging Information

upstream request timeout

Anything else?

No response

benluddy commented 9 months ago

Also seeing what appear to be related failures permafailing the pull-kubernetes-node-e2e-containerd presubmit: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-e2e-containerd?buildId=1757818703933607936

failed to pull and unpack image \"registry.k8s.io/e2e-test-images/perl:5.26\": failed to resolve reference \"registry.k8s.io/e2e-test-images/perl:5.26\": unexpected status from HEAD request to https://registry.k8s.io/v2/e2e-test-images/perl/manifests/5.26: 504 Gateway Timeout

kyle-render commented 9 months ago

We're seeing this as well

 Warning  Failed     65s                  kubelet            Failed to pull image "registry.k8s.io/kube-proxy:v1.27.10": rpc error: code = Unknown desc = failed to pull and unpack image "registry.k8s.io/kube-proxy:v1.27.10": failed to copy: httpReadSeeker: failed open: unexpected status code https://registry.k8s.io/v2/kube-proxy/blobs/sha256:db7b01e105753475c198490cf875df1314fd1a599f67ea1b184586cb399e1cae: 504 Gateway Timeout
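
(Editorial aside, not from the original comments: the failing requests quoted above can be reproduced directly against the registry API, which helps separate registry-side 5xx responses from local pull problems. A rough Go sketch follows, using the manifest and blob URLs from the two errors above; the Accept header lists common manifest media types and is harmless on the blob request.)

```go
// Reproduce the failing HEAD requests from the pull errors above.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	urls := []string{
		// manifest URL from the presubmit failure above
		"https://registry.k8s.io/v2/e2e-test-images/perl/manifests/5.26",
		// blob URL from the kube-proxy pull failure above
		"https://registry.k8s.io/v2/kube-proxy/blobs/sha256:db7b01e105753475c198490cf875df1314fd1a599f67ea1b184586cb399e1cae",
	}
	client := &http.Client{Timeout: 15 * time.Second}
	for _, u := range urls {
		req, err := http.NewRequest(http.MethodHead, u, nil)
		if err != nil {
			fmt.Println(u, "->", err)
			continue
		}
		// Manifest HEADs generally need an Accept header for manifest media types.
		req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json")
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println(u, "->", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(u, "->", resp.Status) // 504 Gateway Timeout was seen during the incident
	}
}
```
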
jwatte commented 9 months ago

seems all better now

kyle-render commented 9 months ago

We're still seeing issues with image pulling being slow and/or timing out.

BenTheElder commented 9 months ago

I see possible issues in us-west1 (GCP) specifically.

BenTheElder commented 9 months ago

Notes for anyone following along:

From the Cloud Run logs I see a lot of error logs; we normally only see warnings (e.g. Artifactory instances spamming the non-standard catalog API => 404).

Spot-checking the errors I see us-west1, and filtering the error logs I only see errors in us-west1.

Checking us-west1, we have high CPU utilization in that region, which started around 8 AM, along with a large increase in concurrent requests.

It looks to me like something deadlocked, maybe the sync.Map around cached S3 lookups.

Edit: probably attributable to impacts of the outage mentioned below.
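
(Editorial aside: the sketch below is a hypothetical illustration of the kind of sync.Map-backed lookup cache being speculated about, not the actual registry.k8s.io code; the blobCache type and lookup callback are made up. sync.Map itself doesn't deadlock, but handlers blocking on slow backend lookups can pile up under load in a way that looks similar.)

```go
// Hypothetical sketch of a sync.Map-backed cache for S3 existence lookups.
// Not the registry.k8s.io implementation; for illustration only.
package main

import (
	"fmt"
	"sync"
)

type blobCache struct {
	m sync.Map // digest (string) -> bool ("does this blob exist upstream?")
}

// has returns the cached answer for digest, or runs lookup (e.g. an S3 HEAD)
// and caches the result. Under heavy concurrency many goroutines can run
// lookup for the same digest before any of them stores a result; a
// singleflight-style guard avoids that, at the cost of extra synchronization
// that has to stay correct under load.
func (c *blobCache) has(digest string, lookup func(string) bool) bool {
	if v, ok := c.m.Load(digest); ok {
		return v.(bool)
	}
	exists := lookup(digest)
	c.m.Store(digest, exists)
	return exists
}

func main() {
	c := &blobCache{}
	lookup := func(d string) bool { return d != "" } // stand-in for a slow S3 HEAD
	fmt.Println(c.has("sha256:example", lookup))     // miss: calls lookup
	fmt.Println(c.has("sha256:example", lookup))     // hit: served from the cache
}
```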

BenTheElder commented 9 months ago

As pointed out by someone else on the call, us-west1 is experiencing issues https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58#wVCaQTRV9VEch1ZTHVQn

Given that this is impacting multiple services, including load balancing, we're probably better off waiting this out for now (past experience: reconfiguring LBs during an outage like this is best avoided ...).

For the future, we probably want outlier detection enabled, which might have let us automatically route around this (depending on the scope of the outage ...).

That config is in https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/modules/oci-proxy

I've been meaning to look into that myself, but registry.k8s.io is pretty far down my backburner currently / too many other obligations ...

For immediate mitigation options: https://github.com/kubernetes/registry.k8s.io/blob/main/docs/mirroring/README.md

upodroid commented 9 months ago

Also, outages affecting load balancers usually affect end users near the affected region who are trying to access services on GCP fronted by global GCP load balancers.

BenTheElder commented 9 months ago

This should be mitigated now.

https://status.cloud.google.com/incidents/u6rQ2nNVbhAFqGCcTm58#wVCaQTRV9VEch1ZTHVQn

BenTheElder commented 9 months ago

I'm no longer seeing issues in my environments either; I ran some additional test pulls.

BenTheElder commented 9 months ago

Spoke too soon, definitely still some errors.

BenTheElder commented 9 months ago

It's trending in the right direction, but not error-free yet.

(screenshot: regional metrics, 2024-02-14 1:37 PM)

BenTheElder commented 9 months ago

I manually deployed a spurious new revision to us-west1 (same config) to prompt cycling all of the running instances there immediately.

BenTheElder commented 9 months ago

That looks promising

(screenshots: regional metrics, 2024-02-14 2:22 PM)

BenTheElder commented 9 months ago

No new error logs since 14:17:13, my tests are not producing any errors, and the graphs are holding after https://github.com/kubernetes/registry.k8s.io/issues/274#issuecomment-1944812760

(screenshots: regional metrics, 2024-02-14 2:27 PM and 2:28 PM)

BenTheElder commented 9 months ago

I think we can actually close this now; I'll check back in again later as well, though.

BenTheElder commented 9 months ago

Still no further errors logged and the regional metrics graphs continue to look healthy now.