Closed. davinkevin closed this issue 1 week ago.
@davinkevin: This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/remove-kind bug
Hi, there is some customization visible, like resources and topology settings, in addition to a minimum of 2 replicas for the controller. In that context, the actual events and reasons are not very explicit based on the connection error to port 8443 alone. If enough resources were available and the replicas were not overly loaded, validation of the JSON posted from the Helm chart to the API server should succeed, as you noted.
So more information is needed on multiple objects: the controller pod load, the network load, the output of kubectl get events -A, the kubectl describe of the ingress objects in question, and so on.
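A minimal sketch of commands that would gather that kind of data (the namespace and label selector assume a default ingress-nginx Helm install, and the names in angle brackets are placeholders):

```shell
# Controller pod load around the time of the upgrade (requires metrics-server)
kubectl top pods -n ingress-nginx

# Recent events across all namespaces, newest last
kubectl get events -A --sort-by=.lastTimestamp

# State of one of the ingress objects that failed validation
kubectl describe ingress <ingress-name> -n <app-namespace>

# Logs from the previous instance of a controller pod that restarted during the upgrade
kubectl logs -n ingress-nginx \
  -l app.kubernetes.io/component=controller --previous
```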
Some data from the moment I did the upgrade:
CPU Usage:
Memory usage:
Bandwidth:
The usage seen here is just from the upgrade process, because users are not online when we do the upgrade.
I'm not able to extract events, and the ingresses in question weren't updated because the upgrade process was blocked by this error, so they stayed the same, without any events associated with them.
Do you see exhausted resources? Not just CPU or memory, but in the context of the limits you have set in Kubernetes and the limits from the OS, like file handles, inodes, etc. That connection error is basically saying that a socket that was working at 8443 before the upgrade went belly up during the upgrade.
Also, it would not hurt to bump the ingress-nginx version. You are on an old one and several PRs have been merged since (not implying that any of them relates to your issue).
Because it's a managed system, we didn't configure much at the OS or Kubernetes level. But from the monitoring system, we don't have any alerts related to file descriptor or inode exhaustion. Additionally, the pods run on dedicated nodes and have an auto-scaling setup, which usually works well when we have a traffic spike.
Here, I think the admission controller just consumed too much RAM, got killed by k8s (OOM), and this led to the connection error… because nothing was able to answer at that specific moment.
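A quick way to confirm that theory (the pod name is a placeholder) is to look at the container's last terminated state:

```shell
# Prints "OOMKilled" if the previous container instance was killed
# for exceeding its memory limit
kubectl get pod <controller-pod> -n ingress-nginx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```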
Of course, upgrading the ingress is planned, but not before Christmas 😇, and I need to test it first, especially to confirm values.yaml compatibility.
I am not sure if it's a new behaviour after changes or not, but an OOM kill would be related to the 250Mi resource config. So with this data it looks less like a problem with the controller and more like a problem with your config. Have you tried removing the resource limits temporarily?
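As a hedged sketch, raising the controller's memory limit temporarily through the chart could look like this (release name, namespace and values are assumptions, not taken from your setup):

```shell
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.resources.requests.memory=256Mi \
  --set controller.resources.limits.memory=512Mi
```

Removing the limits entirely would instead mean dropping the controller.resources.limits block from your values.yaml for a few upgrade cycles.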
It's unfortunate, because we would need to permanently increase the memory available for every instance to handle a surge that only happens during upgrades?
Is there a way to get some kind of control plane (like many ingress controllers have), with a pod dedicated to this kind of action (only updating nginx rules) and with more resources than the others? It could prevent any downtime due to manifest publication…
In our case, because of all these OOM kills, we lost all traffic in the cluster during the upgrade period, which is really not safe for us.
We will for sure upgrade ingress-nginx, but we would really like to make our system more resilient to this kind of problem too, so any advice is welcome! 😇
Thanks
I think you have the option to base your decision on data. For example, you can remove the limits and let it run for a few cycles, then review the resource usage during normal and eventful use. Once you have that data, you will be in a better position to make an informed decision on limits.
Of course, we did that for the current setup. We also have an HPA enabled to absorb the extra load.
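For context, a sketch of what that looks like using the chart's built-in autoscaling block (the numbers are illustrative, not our actual values):

```shell
# Check whether the HPA actually scaled during the upgrade window
kubectl get hpa -n ingress-nginx
kubectl describe hpa -n ingress-nginx

# The chart's autoscaling block, expressed as --set flags
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.autoscaling.enabled=true \
  --set controller.autoscaling.minReplicas=2 \
  --set controller.autoscaling.maxReplicas=6 \
  --set controller.autoscaling.targetMemoryUtilizationPercentage=70
```

One caveat is that memory-based scaling may react too slowly to catch a short admission-webhook spike, which could explain why the HPA did not help here.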
I plan to upgrade the ingress controller this week, check whether that fixes the problem, and evaluate alternative solutions after that.
@davinkevin did you find a solution to this?
cc @deepy, who is closer to this subject than me at the moment.
Upgrading the controller might have helped: we no longer trigger this all the time in the small cluster, and despite the big cluster having grown we're seeing fewer failures (though only once in a blue moon do we get 0 failures). We get a brief moment of throttling, so there's still some tweaking to do, but we haven't run out of memory in a while despite still using 250Mi as the limit.
This error
dial tcp 10.80.126.183:8443: connect: connection refused
is a classic one when what happened is exactly as shown. This can happen momentarily for any pod, not just the ingress-nginx controller pod. It is suspected that either the load on the host, the load on the network, running out of inodes, or similar transitional states can temporarily cause a failure to create a new connection to a pod. The mitigation is to lean on the observability stack for input on these multiple factors, including inodes, and adjust resource availability accordingly.
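As a rough illustration, the node-level factors mentioned here can be spot-checked like this (run on the node itself or from a node debug shell; standard Linux paths are assumed and the node name is a placeholder):

```shell
# Inode usage per filesystem
df -i

# Allocated vs. maximum open file handles on the node
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max

# Spawn a debug shell on a node if you cannot SSH to it
kubectl debug node/<node-name> -it --image=busybox
```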
No action item to track here, so closing this issue.
/close
@longwuyuan: Closing this issue.
What happened:
We have approximately 20 instances of the same app in a cluster. We've deployed a new version of the app for all of them using gitops + fluxcd + helm. FYI, each app has 2 ingress definitions, one for http and another for grpc. When the whole system started to deploy the new version of the manifests, we had a lot of errors like this:
What you expected to happen:
Of course, I would have expected to see no errors during the reconciliation process. It's the second time this problem has occurred in the cluster, always after upgrading a significant number of instances.
NGINX Ingress controller version:
Kubernetes version (use kubectl version): 1.22
Environment:
helm:
values.yaml:
ingressclasses
Describe pods:
During the operation, pods consumed way more than the limit and got killed multiple times by the orchestrator.