kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Premature readiness probe success due to race condition in check for backends initialization, causing 404s #8449

Open · bnu0 opened 2 years ago

bnu0 commented 2 years ago

NGINX Ingress controller version

```
bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.1.2
  Build:         bab0fbab0c1a7c3641bd379f27857113d574d904
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------
```

Kubernetes version (use kubectl version): 1.21

Environment:

What happened:

One of our ingress classes has ~3k associated Ingress objects. When a new ingress pod for this class starts up, it returns 404s for existing backends for a brief period, even after passing the readiness probe. We have increased the readinessProbe initialDelaySeconds to 40, which helps, but that feels like a band-aid.

What you expected to happen:

The readiness probe should not pass until the upstreams are fully synchronized.

How to reproduce it:

I am working on a reproducer, but I think the actual issue is the following:

  1. The readiness probe checks that the backends data is set via configuration.get_backends_data().
  2. When the backends are POSTed by the controller, this variable is set directly, but an asynchronous synchronization loop only later applies these backends to the underlying nginx proxy upstreams.
  3. This sync runs every second, but with 3000+ ingresses, many with multiple hosts (multiple server blocks in the resulting nginx config), I am not actually sure how long a single sync takes (I guess it could be many seconds?).
  4. During the gap between these two steps, the pod reports ready but serves 404s. Readiness adds the pod to the Service endpoints and advertises it with BGP in our datacenter, so clients get 404s 😭. (The two code paths are sketched below.)
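
Roughly, the two code paths involved look something like the following. This is a simplified sketch assuming an OpenResty-style Lua environment; handle_backends_post is an illustrative name, and none of this is the actual ingress-nginx source:

```lua
-- Simplified sketch of the race, assuming an OpenResty-style Lua environment.
-- handle_backends_post is an illustrative name; this is not the actual
-- ingress-nginx source.

local backends_data              -- written synchronously by the POST handler
local BACKENDS_SYNC_INTERVAL = 1 -- seconds

-- Path 1: the controller POSTs the backends and the payload is stored at once.
local function handle_backends_post(body)
  backends_data = body
end

-- The readiness check only verifies that a payload has been received.
local function is_dynamic_lb_initialized()
  return backends_data ~= nil
end

-- Path 2: a timer applies the payload to the live upstream balancers up to
-- one interval later (longer still if a single sync takes several seconds).
local function sync_backends()
  if not backends_data then
    return
  end
  -- ...decode backends_data and rebuild the per-upstream balancers here...
end

ngx.timer.every(BACKENDS_SYNC_INTERVAL, sync_backends)

-- Between handle_backends_post() and the first completed sync_backends() run,
-- the pod reports ready while nginx still returns 404 for hosts it has not
-- yet wired up.
```

If a single sync over thousands of server blocks takes several seconds, that window can easily outlast the readiness probe period.
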
bnu0 commented 2 years ago

I was able to reproduce this easily in kind by creating 3000 Ingresses pointing to 3000 Services and looping curl over one of the ingress hosts while doing a kubectl rollout restart on the ingress controller Deployment. The new pod returns 404 for a period of time after reporting ready.

longwuyuan commented 2 years ago

Ah ok, at 3000 ingress objects and 3000 services, it's likely you are experiencing a real problem. Was that kind on a laptop, or kind on a host with 8+ cores and 32+ GB RAM? Assuming a single-node kind cluster here.

bnu0 commented 2 years ago

It was a 3-worker kind cluster on a host with 8 cores / 16 threads and 64 GB of memory; not a laptop, but nothing crazy. I am not sure I really need 3000 Ingresses to reproduce, but that is how many we have in production, so it is the number I started with.

I am planning to try changing the is-dynamic-lb-initialized probe to return false until sync_backends has run at least once after the backends are POSTed by the controller (rough sketch below). But if someone is running this controller with zero Ingresses, I am worried it will never report ready 😧
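
The kind of change I have in mind looks roughly like this. This is a hypothetical sketch; the generation counters and handle_backends_post are illustrative names, not existing ingress-nginx code:

```lua
-- Hypothetical sketch of the proposed readiness change (illustrative names,
-- not existing ingress-nginx code).

local backends_data
local backends_received_generation = 0 -- bumped when the controller POSTs backends
local backends_applied_generation  = 0 -- bumped after sync_backends applies a payload

local function handle_backends_post(body)
  backends_data = body
  backends_received_generation = backends_received_generation + 1
end

local function sync_backends()
  if backends_received_generation == backends_applied_generation then
    return -- nothing new to apply
  end
  local generation = backends_received_generation
  -- ...apply backends_data to the upstream balancers...
  backends_applied_generation = generation
end

-- Readiness: only report ready once every received payload has been applied.
-- Open question: with zero Ingresses the controller may never POST backends,
-- so backends_received_generation could stay 0 and the pod would never
-- report ready; this check would need a special case for the empty config.
local function is_dynamic_lb_initialized()
  return backends_received_generation > 0
     and backends_applied_generation == backends_received_generation
end
```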

longwuyuan commented 2 years ago

I think I will need to know more details, but if it is possible for you to simulate 300 ingress objects, then you could explore make dev-env (https://kubernetes.github.io/ingress-nginx/developer-guide/getting-started/#local-build), which produces a controller build just for your test server environment.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

amirschw commented 2 years ago

/remove-lifecycle rotten

amirschw commented 2 years ago

We're experiencing the exact same issue with "just" ~200 ingresses in our clusters.

longwuyuan commented 2 years ago

How many replicas of the ingress-nginx-controller pod, how many instances of the ingress-nginx-controller, and how many nodes in the cluster?

amirschw commented 2 years ago

It mostly happens in our busier clusters. In one of the latest examples I checked, there were 600 replicas of the ingress controller and 900 nodes in the cluster.

longwuyuan commented 2 years ago

Is it 600 replicas of one single instance of the controller? How many instances of the controller are in this 900-node cluster?

amirschw commented 2 years ago

What do you mean by instance? IngressClass? If so, then the answer is yes: 600 replicas of one instance.

longwuyuan commented 2 years ago

One instance is one installation of the ingress-nginx-controller, so thanks, yes: one IngressClass would imply one installation of the ingress-nginx-controller.

This has been reported before, and there is highest-priority work in progress that includes attention to this, besides security. However, the new design is likely to be released only at the end of the current stabilization work in progress.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

walthowd commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

amirschw commented 1 year ago

/remove-lifecycle rotten

jan-kantert commented 10 months ago

We are also affected by this. We run NGINX Ingress with HPA, and it happens regularly on scale-up. We currently have around 700 ingress objects.

@longwuyuan any update on the design work?

longwuyuan commented 5 months ago

The design is basically a new approach that splits the control plane from the data plane. Much progress has been made, and the development work has reached a testing stage. You can search for the PR in progress (about the control-plane/data-plane split).

/triage accepted
/priority important-longterm