kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Ingress nginx scaling to max due to memory #12167

Open sivamalla42 opened 4 days ago

sivamalla42 commented 4 days ago

Hi All,

We are observing strange behaviour with the ingress-nginx pods in our production cluster: the pods have started scaling to the maximum due to memory usage. EKS: 1.29

helm list -n ingress-nginx:
NAME: ingress-nginx
NAMESPACE: ingress-nginx
REVISION: 1
UPDATED: 2024-05-01 11:27:32.802401 +0530 IST
STATUS: deployed
CHART: ingress-nginx-4.8.3
APP VERSION: 1.9.4

Not sure why we suddenly started observing this behaviour; there is no clue as to why it started or how to fix it. If we increase the number of pods, the memory is still being consumed and the pods scale up again.

(memory usage graphs attached)

Any help is very much appreciated.

Thanks Siva

k8s-ci-robot commented 4 days ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 4 days ago

You can check the logs of the controller pods and hardcode the number of workers

longwuyuan commented 4 days ago

/kind support

tao12345666333 commented 4 days ago

Can you observe your request traffic? Have you encountered more requests or are there many large requests?
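
(If the controller's Prometheus metrics are enabled and scraped, request volume and size can be sanity-checked with queries along these lines; this assumes the standard controller metric names and a working scrape setup.)

```promql
# Overall request rate handled by the controller (last 5 minutes)
sum(rate(nginx_ingress_controller_requests[5m]))

# Rough inbound request body volume, to spot unusually large requests
sum(rate(nginx_ingress_controller_request_size_sum[5m]))
```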

sivamalla42 commented 4 days ago

The controller pod logs contain the request data, but nothing specific about failures or OOM errors.
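
(For what it's worth, OOM kills usually show up in the pod status and namespace events rather than in the controller's own logs; a quick check, assuming the release lives in the ingress-nginx namespace:)

```sh
# Look for "OOMKilled" in the last terminated state of the controller containers
kubectl describe pod -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | grep -A5 "Last State"

# Recent events in the namespace (OOM kills, restarts, HPA scaling)
kubectl get events -n ingress-nginx --sort-by=.lastTimestamp
```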

sivamalla42 commented 4 days ago

@tao12345666333, we do not observe any abnormal traffic coming into the ingress layer; it looks like regular traffic.

sivamalla42 commented 4 days ago

> You can check the logs of the controller pods and hardcode the number of workers

@longwuyuan, can you please elaborate a bit more on what needs to be done to hardcode the number of workers?

sivamalla42 commented 4 days ago

Also attaching the network-level in/out metrics.

(network in/out graph attached)

longwuyuan commented 4 days ago

> You can check the logs of the controller pods and hardcode the number of workers
>
> @longwuyuan, can you please elaborate a bit more on what needs to be done to hardcode the number of workers?

https://github.com/kubernetes/ingress-nginx/issues/8166
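
(For reference, the linked issue boils down to pinning worker_processes via the controller ConfigMap instead of the default "auto", which equals the node's CPU count. A minimal sketch, assuming the ConfigMap name and namespace created by a default Helm install:)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name assumed from the default Helm release
  namespace: ingress-nginx
data:
  worker-processes: "8"            # default is "auto" (= number of CPUs on the node)
```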

sivamalla42 commented 4 days ago

> You can check the logs of the controller pods and hardcode the number of workers
>
> @longwuyuan, can you please elaborate a bit more on what needs to be done to hardcode the number of workers?
>
> #8166

@longwuyuan, I see the following in the running ingress pod: `worker_processes 16;`

Should this value be sufficient to continue with?

I tried manually reducing worker_processes to 8 on a few nodes and observed that memory consumption appeared to be reduced.

Please suggest
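
(Worth noting: editing nginx.conf inside a running pod does not survive a restart. If the lower value helps, it can be made persistent through the chart's controller.config values, which feed the same ConfigMap as above; a sketch with an illustrative value:)

```yaml
# values.yaml snippet for the ingress-nginx Helm chart (sketch only)
controller:
  config:
    worker-processes: "8"
```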

Gacko commented 4 days ago

There are a few things coming into play here.

The static memory consumption of the Ingress NGINX Controller partially depends on your cluster size, i.e. the number of nodes and pods, and the number of Ingress resources.

In the past I observed Ingress NGINX Controller pods to consume up to 4 GB of memory right after startup because the cluster contained both a lot of nodes/pods and around 2,500 Ingress resources.

This memory consumption still does not take actual traffic into account and is a design flaw of our current implementation: the control plane, which consumes the memory for internal operations, lives in the same container as the data plane, which is actually doing the heavy lifting.

If you now use an HPA to scale your deployment and expect it to do so depending on the actual load produced by traffic, you might hit your target average memory utilization just from the static data produced by your environment (again, the number of nodes, pods and Ingresses influences this).

This can especially become a problem when you start with resource and HPA settings sized for a smaller setup and then slowly grow to the aforementioned point.

Is the actual memory consumption this big right after pod startup or does it grow with time? The former would confirm my assumption while the latter could be caused by a memory leak.

For the former you will probably need to tweak your resource requests and/or HPA settings. Sadly we cannot overcome this design flaw at the moment, but we are planning to split the controller into a control plane and a data plane in the future.
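
(For illustration only, an HPA scaling on memory utilization looks roughly like this; the names, replica counts and threshold are placeholders, and raising the container's memory request is what keeps the static baseline from permanently sitting above the target.)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller    # placeholder names
  namespace: ingress-nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization          # percentage of the pod's memory *request*
          averageUtilization: 80
```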

For the latter, I'd recommend you update to the latest stable release of our controller first, if you're not already on it, and verify again.

Regards Marco

longwuyuan commented 4 days ago

@sivamalla42 since your graph shows the increase started after 9/24, you have no other choice but to first look at all other helpful graphs and correlate them with the log message timestamps. The idea is to find out whether the memory increased for handling requests or not.
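
(Two commands that can help with that correlation, assuming the default namespace; the pod name and timestamp are placeholders, with the timestamp matching roughly where the graphs show the increase:)

```sh
# Current memory usage per controller pod (requires metrics-server)
kubectl top pod -n ingress-nginx

# Controller logs starting from around when the increase began
kubectl logs -n ingress-nginx <controller-pod> --since-time=2024-09-24T00:00:00Z
```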

sivamalla42 commented 3 days ago

@Gacko, currently we are on EKS 1.29, chart ingress-nginx-4.8.3, app version 1.9.4. Which version would you suggest upgrading to? Please suggest.

Gacko commented 3 days ago

Hey,

sorry, I missed this information in your initial issue description.

Well, at best you'd upgrade to v1.11.3. But it would be interesting to know if the memory consumption rises over time or is high from the very beginning.

Regards Marco

sivamalla42 commented 3 days ago

@Gacko, the pods were consuming the memory over time. When they are restarted, it takes a while before memory consumption builds up, but when we add more pods, they start consuming the memory right away. We would like to try upgrading to v1.11.3, but rather than going to the latest version and possibly running into new issues, we would prefer to upgrade to the latest release in the v1.10.x line. So please suggest on this.

Gacko commented 2 days ago

Hello,

> but when we add more pods, they start consuming the memory right away.

This sounds like your cluster is just big and Ingress NGINX is therefore consuming a comparatively large amount of static memory.

v1.10.x is out of support. You can of course just use v1.10.5, but this is up to you. We cannot make recommendations about versions to use other than the latest stable one.
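
(For reference, the upgrade itself is a plain chart upgrade; chart 4.11.3 should be the version shipping controller v1.11.3, and the release/namespace/repo names below assume the layout from the helm list output above plus the chart repo having been added as "ingress-nginx".)

```sh
helm repo update
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --version 4.11.3 \
  --reuse-values
```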

Regards Marco