bnu0 opened this issue 2 years ago (status: Open)
I was able to reproduce this easily in kind by creating 3000 Ingresses pointing to 3000 Services, and looping over one of the ingress hosts with curl while running a kubectl rollout restart on the ingress controller deployment. The new pod returns 404 for a period of time after reporting ready.
Ah ok, at 3000 Ingress objects and 3000 Services, it's likely you are experiencing a real problem. Was that kind on a laptop, or kind on a host with 8+ cores and 32+ GB RAM? Assuming a single-node kind cluster here.
It was a 3-worker kind cluster on a host with 8 cores / 16 threads and 64 GB of memory; not a laptop, but nothing crazy. I am not sure I really need 3000 Ingresses to reproduce, but that is how many we have in production, so it is the number I started with.
I am planning to try changing the is-dynamic-lb-initialized probe to return false until sync_backends has run at least once after backends are POSTed by the controller. But if someone is running this controller with zero Ingresses, I am worried it will never report ready 😧
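The proposed gating can be sketched as follows. This is a hypothetical illustration of the ordering, not how ingress-nginx actually works: it assumes a sentinel file that the sync loop would write after its first completed sync_backends run, and the comments note how the zero-Ingress concern could be handled.

```shell
# Hypothetical readiness check: fail until the first backend sync has
# completed. SYNC_SENTINEL would be written by the sync loop after its
# first successful sync_backends run -- including a run that applies an
# empty backend list, so a controller with zero Ingresses would still
# eventually become ready.
is_ready() {
  sentinel="${SYNC_SENTINEL:-/tmp/first-sync-done}"
  if [ -f "$sentinel" ]; then
    echo "ready"
    return 0
  fi
  echo "not ready"
  return 1
}
```

The design choice here is that readiness keys off "a sync has happened" rather than "the backend list is non-empty", which sidesteps the zero-Ingress case.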
I think I will need to know more details, but if it is possible for you to simulate 300 Ingress objects, then you can explore `make dev-env` (https://kubernetes.github.io/ingress-nginx/developer-guide/getting-started/#local-build), which produces a controller build just for your test server environment.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
We're experiencing the exact same issue with "just" ~200 ingresses in our clusters.
How many replicas of the ingress-nginx-controller pod, how many installations of the ingress-nginx-controller, and how many nodes in the cluster?
It mostly happens in our busier clusters. In one of the latest examples that I checked there were 600 replicas of ingress controller and 900 nodes in the cluster.
Is it 600 replicas of one single installation of the controller? How many installations of the controller are in this 900-node cluster?
What do you mean by instance? IngressClass? If so, then the answer is yes: 600 replicas of one instance.
One instance is one installation of the ingress-nginx-controller, so yes: one IngressClass would imply one installation of the ingress-nginx-controller.
This has been reported before, and there is work in progress, at the highest priority alongside security, that addresses this. The release of the new design is likely to emerge at the end of the current stabilization work in progress.
/remove-lifecycle stale
/remove-lifecycle rotten
We are also affected by this. We run ingress-nginx with HPA, and it happens regularly on scale-up. We currently have around 700 Ingress objects.
@longwuyuan any update on the design work?
The design is basically a new approach that splits the control plane from the data plane. Much progress has been made, and the development work has reached some testing stage. You can search for the PR in progress (about the control-plane/data-plane split).
/triage accepted
/priority important-longterm
NGINX Ingress controller version:
Kubernetes version (use `kubectl version`): 1.21
Environment:
- Kernel (e.g. `uname -a`): 5.10.0-13-amd64

What happened:
One of our ingress classes has ~3k associated ingress objects. When a new ingress pod for this class starts up, it returns 404s for backends for a brief period of time, even after passing the readiness probe. We have increased the readinessProbe initialDelaySeconds to 40, which helps, but feels like a band-aid.
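The band-aid described above is an ordinary probe override on the controller container. A minimal sketch, assuming the controller's default health endpoint; only the initialDelaySeconds value comes from the report, the rest is illustrative:

```yaml
# Fragment of the controller container spec in the Deployment.
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
  initialDelaySeconds: 40
  periodSeconds: 10
```

The drawback, as noted, is that this only delays the probe by a fixed amount rather than tying readiness to the actual backend sync.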
What you expected to happen:
The readiness probe should not pass until the upstreams are fully synchronized.
How to reproduce it:
I am working on a reproducer, but I think the actual issue is here: `configuration.get_backends_data()`.