kubernetes / registry.k8s.io

This project is the repo for registry.k8s.io, the production OCI registry service for Kubernetes' container image artifacts
https://registry.k8s.io
Apache License 2.0

enable outlier detection #276

Open BenTheElder opened 4 months ago

BenTheElder commented 4 months ago

We should enable https://cloud.google.com/load-balancing/docs/https/setting-up-global-traffic-mgmt#configure_outlier_detection

https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/modules/oci-proxy

I think the last time I looked into this I got hung up on migrating to the current Terraform and module versions in k8s.io, and then got interrupted by other priorities. It might be possible to do this without updating those first.

xref: https://github.com/kubernetes/registry.k8s.io/issues/274#issuecomment-1944454342, previously https://github.com/kubernetes/registry.k8s.io/issues/234

upodroid commented 4 months ago

Might not work as we are using Serverless NEGs that don't have health checks.

BenTheElder commented 4 months ago

> Might not work as we are using Serverless NEGs that don't have health checks.

Outlier detection doesn't rely on (active) health checks; it acts on the response codes observed for normal requests.

It has been available for serverless NEGs for a while now, with some limitations (I forget which, but IIRC one of the config options wasn't applicable).
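
To illustrate the distinction between active health checks and passive outlier detection, here is a minimal toy sketch in Python. It is not how the Google Cloud load balancer is implemented, and it is not the Terraform change this issue asks for. The class names, the consecutive-error threshold, and the ejection duration are all hypothetical values chosen for the example. The only idea it shows is that a backend can be taken out of rotation purely from the status codes of ordinary requests, with no probe traffic.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    consecutive_errors: int = 0
    ejected_until: float = 0.0  # time until which this backend is skipped


@dataclass
class OutlierDetector:
    """Toy passive detector: eject a backend after N consecutive 5xx responses."""

    max_consecutive_errors: int = 5      # hypothetical threshold
    base_ejection_seconds: float = 30.0  # hypothetical ejection duration
    backends: dict = field(default_factory=dict)

    def record(self, name: str, status: int, now: float) -> None:
        """Update state from the response code of an ordinary request."""
        backend = self.backends.setdefault(name, Backend(name))
        if status >= 500:
            backend.consecutive_errors += 1
            if backend.consecutive_errors >= self.max_consecutive_errors:
                # Eject for a fixed window, then start counting again.
                backend.ejected_until = now + self.base_ejection_seconds
                backend.consecutive_errors = 0
        else:
            # Any non-5xx response breaks the error streak.
            backend.consecutive_errors = 0

    def healthy(self, now: float) -> list:
        """Names of backends currently eligible to receive traffic."""
        return sorted(
            name for name, b in self.backends.items() if b.ejected_until <= now
        )


detector = OutlierDetector(max_consecutive_errors=3, base_ejection_seconds=30.0)
detector.record("region-a", 200, now=0.0)
for t in range(3):
    detector.record("region-b", 503, now=float(t))

print(detector.healthy(now=3.0))   # ['region-a']
print(detector.healthy(now=40.0))  # ['region-a', 'region-b']
```

In this sketch "region-b" is skipped after three consecutive 503s and returns to rotation once the ejection window expires. The real feature exposes more knobs than this, and which of them apply to serverless NEGs should be checked against the linked documentation.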

BenTheElder commented 4 months ago

It might not have solved this particular outage though, if only because outlier detection is an LB behavior and the LBs themselves are impacted in some way (I haven't had a chance to look further for now).

k8s-triage-robot commented 5 days ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

BenTheElder commented 5 days ago

/remove-lifecycle stale
/lifecycle frozen

We should probably still do this. It's just hard to prioritize versus getting the rest of the infra migrated into the community: we're rarely having outages as-is, and it's not 100% clear whether this would solve the problem(s). That needs more investigation; I just haven't really had time, and nobody else seems to have looked yet.