BenTheElder opened this issue 8 months ago
Might not work as we are using Serverless NEGs that don't have health checks.
Outlier detection doesn't rely on (active) health checks; it acts on the response codes observed for normal requests.
It has been available for serverless NEGs for a while now, with some limitations (I forget which, but IIRC one of the config options wasn't applicable).
It might not have prevented this particular outage, though, if only because it's a load balancer behavior and the LBs themselves were impacted in some way (I haven't had a chance to look further yet).
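To make the "passive, no health checks" point concrete, here is a minimal Python sketch of outlier detection driven only by observed response codes: it counts consecutive 5xx responses on ordinary requests and temporarily ejects a backend that crosses a threshold. This is a simplified model under stated assumptions, not Google Cloud's implementation; the parameter names loosely mirror the `consecutiveErrors`, `baseEjectionTime`, and `maxEjectionPercent` fields of a backend service's `outlierDetection` settings, and the region names used as backends are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    consecutive_errors: int = 0   # 5xx responses seen in a row
    ejection_count: int = 0       # how many times this backend has been ejected
    ejected_until: float = 0.0    # receives no traffic until this timestamp


class OutlierDetector:
    """Simplified passive outlier detection driven by observed response codes."""

    def __init__(self, names, consecutive_errors=5, base_ejection_time=30.0,
                 max_ejection_percent=50):
        self.backends = {n: Backend(n) for n in names}
        self.consecutive_errors = consecutive_errors
        self.base_ejection_time = base_ejection_time
        self.max_ejection_percent = max_ejection_percent

    def is_ejected(self, name, now):
        return self.backends[name].ejected_until > now

    def record(self, name, status, now):
        """Feed the status code of an ordinary request; no health check is sent."""
        b = self.backends[name]
        if status < 500:
            b.consecutive_errors = 0
            return
        b.consecutive_errors += 1
        if b.consecutive_errors < self.consecutive_errors or self.is_ejected(name, now):
            return
        # Never eject more than max_ejection_percent of the pool at the same time.
        ejected = sum(self.is_ejected(n, now) for n in self.backends)
        if ejected >= len(self.backends) * self.max_ejection_percent // 100:
            return
        b.ejection_count += 1
        # Repeat offenders stay out longer: base time x number of ejections.
        b.ejected_until = now + self.base_ejection_time * b.ejection_count
        b.consecutive_errors = 0

    def healthy(self, now):
        """Backends that would currently receive traffic."""
        return [n for n in self.backends if not self.is_ejected(n, now)]


# Hypothetical backends: one serverless NEG per region.
detector = OutlierDetector(["us-central1", "europe-west1"], consecutive_errors=3)
for t in range(3):
    detector.record("us-central1", 503, now=float(t))
print(detector.healthy(now=3.0))   # ['europe-west1']
print(detector.healthy(now=40.0))  # ['us-central1', 'europe-west1']
```

The `max_ejection_percent` cap keeps the detector from ejecting every backend when failures are widespread, which is one reason this kind of mechanism may not help much if the failure is in the load balancer layer rather than in individual backends.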
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- lifecycle/stale was applied, lifecycle/rotten is applied
- lifecycle/rotten was applied, the issue is closed
You can:
- /remove-lifecycle stale
- /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
We should probably still do this; it's just hard to prioritize versus migrating the rest of the infra into the community. We rarely have outages as-is, and it's not 100% clear whether this would solve the problem(s). It needs more investigation; I just haven't had time, and nobody else seems to have looked yet.
We should enable outlier detection (https://cloud.google.com/load-balancing/docs/https/setting-up-global-traffic-mgmt#configure_outlier_detection) in the oci-proxy Terraform module:
https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/modules/oci-proxy
I think the last time I looked into this, I got hung up on migrating to current Terraform/module versions in k8s.io and then got interrupted by other priorities, but it might be possible to do this without updating those first.
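To show the shape of the setting independently of the Terraform module versions, here is a Python sketch (not the module's actual code) that builds the `outlierDetection` portion of a Compute Engine `backendServices.patch` request body and the REST URL it would be sent to. The numeric values are illustrative placeholders, not tuned recommendations; the project and backend service names are hypothetical; and which options actually apply to serverless NEGs should be checked against the linked documentation.

```python
import json


def outlier_detection_patch(consecutive_errors=5, interval_s=1,
                            base_ejection_s=30, max_ejection_percent=50):
    """Return a backendServices.patch body that only touches outlierDetection.

    Values are illustrative placeholders, not tuned recommendations.
    """
    if not 0 <= max_ejection_percent <= 100:
        raise ValueError("max_ejection_percent must be between 0 and 100")
    return {
        "outlierDetection": {
            # Eject an endpoint after this many consecutive 5xx responses.
            "consecutiveErrors": consecutive_errors,
            # How often ejection analysis runs; Duration seconds are int64,
            # which Google's JSON APIs encode as strings.
            "interval": {"seconds": str(interval_s)},
            # Base ejection time, multiplied by the number of ejections.
            "baseEjectionTime": {"seconds": str(base_ejection_s)},
            # Upper bound on the share of endpoints that can be ejected at once.
            "maxEjectionPercent": max_ejection_percent,
        }
    }


def patch_url(project, backend_service):
    """REST endpoint for patching a global backend service."""
    return ("https://compute.googleapis.com/compute/v1/projects/"
            f"{project}/global/backendServices/{backend_service}")


# Hypothetical names; the real ones live in the oci-proxy Terraform module.
print("PATCH", patch_url("example-project", "example-backend-service"))
print(json.dumps(outlier_detection_patch(), indent=2))
```

Sending the request is deliberately left out: in practice the setting belongs in Terraform, so that the live backend service doesn't drift from the module's state.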
xref: https://github.com/kubernetes/registry.k8s.io/issues/274#issuecomment-1944454342, previously https://github.com/kubernetes/registry.k8s.io/issues/234