Closed: ParagPatil96 closed this issue 1 year ago.
@ParagPatil96: The label(s) triage/support cannot be applied, because the repository doesn't have them.
/remove-kind bug /kind support
What other info is available? You can get some info if you use the documented Prometheus+Grafana config.
Does it happen on controller version 0.50.X?
What do the logs contain that is relevant to this?
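For anyone unsure what "the documented prometheus+grafana config" means in practice, here is a minimal, hedged sketch of exposing the controller's metrics through the Helm chart; the release name and namespace are assumptions, and the ServiceMonitor option requires the Prometheus Operator CRDs to be installed.

    # Hedged sketch: expose controller metrics so Prometheus/Grafana can graph memory.
    # Release name "ingress-nginx" and namespace "ingress-nginx" are assumptions.
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --reuse-values \
      --set controller.metrics.enabled=true \
      --set controller.metrics.serviceMonitor.enabled=true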
NGINX Ingress controller
Release: v1.1.1
Build: a17181e43ec85534a6fea968d95d019c5a4bc8cf
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.9
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"9e5f344f6cdbf2eaa7e450d5acd8fd0b7f669bf9", GitTreeState:"clean", BuildDate:"2021-05-19T04:34:27Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
We are also facing this memory issue. Initially, when we set the memory limit to 2GB, the ingress controller kept restarting due to OOM (dmesg logs attached for reference). We then increased the memory limit to 6GB, and the attached Grafana metrics show that the pod constantly consumes close to 4GB of memory. We were previously using the following version, where memory consumption stabilised at less than 2GB:
Release: v0.48.1
Build: git-1de9a24b2
Repository: git@github.com:kubernetes/ingress-nginx.git
nginx version: nginx/1.20.1
It looks like this version of the ingress controller consumes far more memory than earlier versions.
Attachments: ingress-controller_pod1_dmesg.log, ingress-controller_pod1.log, ingress-controller_pod2_dmesg.log, ingress-controller_pod2.log
@longwuyuan I posted the details here because this issue looks similar to ours. Let me know if you would like me to open a new one with all the details.
Can you upgrade to the latest release and kindly answer the questions I asked earlier? Thank you.
@longwuyuan I am a co-worker of @rmathagiarun . The details shared by @rmathagiarun were from the latest released controller version v1.1.1, and we see this issue happening frequently.
Some of the questions I asked earlier have not been answered. Basically, some digging is required: determine specifically which process was using the most memory, and check node resources at that point in time.
@longwuyuan
I can see that you have suggested that we test using controller version 0.50.x. We have been using v0.48.1 for a long time and have never faced this issue. We had to upgrade to v1.1.1 because v0.48.1 (and even the suggested v0.50.x) is not compatible with K8s versions 1.22 and 1.23.
The core components of our product have remained the same on both versions (v0.48.1 and v1.1.1), and we are facing this memory issue only with v1.1.1.
Unfortunately, the logs don't have much info on this memory leak. We were able to find the OOM issue only by using the dmesg command inside the pod. I have already shared the ingress logs and Grafana screenshots for the same.
The nodes had sufficient resources all along, and as you can see from the Grafana screenshot, the pod constantly consumes close to 4GB; the spike is not tied only to certain operations.
@longwuyuan Any update on this? I have answered all the queries that you posted; let me know if you need any additional info.
"What other info is available?" Every time we add or remove an ingress, the ingress controller reloads (due to configuration changes). During this reload, memory utilisation shoots up.
I0218 13:29:45.249555 8 controller.go:155] "Configuration changes detected, backend reload required"
I0218 13:30:03.025708 8 controller.go:172] "Backend successfully reloaded"
Even after the pod restarts it keeps crashing, because the pod continues to hold the memory. Only a deployment rollout or deleting the pod releases the memory.
"You can get some info if you use the documented prometheus+grafana config." Screenshots already shared.
"Does it happen on controller version 0.50.X?" As stated in my previous comment, we have been using v0.48.X for a long time, and we want to upgrade to v1.1.1 to make ourselves compatible with K8s versions 1.22 and 1.23.
"What do the logs contain relevant to this?" We have already shared the logs, but they don't have much info on this memory leak. We were able to find the OOM issue only by using the dmesg command inside the pod.
@longwuyuan Any update on this? We have been hitting this issue quite frequently.
Hi, it's unfortunate that you have a problem. I have made some attempts to drive this to a resolution but failed, so the next attempt I can make is to suggest some info gathering.
You will require some background, or will have to get someone with a history of performance-related troubleshooting. Basically, there are preparation steps to capture information about the relevant processes, and then steps to run against the live processes. It is all rather elaborate, or "unixy", so to speak.
Please look at the history of issues worked on here. We did discover some issues in Lua and nginx, and some significant changes were made to the base image and components like Lua. Please find those issues and check out the procedures described there. They included attaching debuggers to specific processes and regular trace/ptrace of the controller process.
Also, I have no clue how many replicas you are running and/or whether you are using a DaemonSet. Maybe it's in the info in this issue, but it's not obvious.
To summarise my comments: gather info such as traces and debugger output, and then relate that to statistics from monitoring. Once you can provide precise step-by-step instructions here, someone can try to reproduce the problem.
Hi @longwuyuan,
The issue can be reproduced by following the below steps,
Install ingress-nginx using Helm chart helm-chart-4.0.17. We updated the ingressClass to APP-nginx-ingress-controller before installing.
Execute the following to create multiple ingresses and observe the memory spike in the Prometheus or Grafana GUI each time the backend config is reloaded (a sketch for watching pod memory while this runs is included after these reproduce notes).
for i in {1..20}; do
  kubectl -n APP run pod$i --image=nginx:latest
  kubectl -n APP expose pod pod$i --port=80 --name=service$i
  kubectl create -n APP ingress secure-ingress-$i --class=APP-nginx-ingress-controller --rule="APP.com/service$i=service$i:80"
done
In our case, the memory spiked from around 0.75GB to 2GB. We also noticed the nginx process continued to retain the 2GB of memory without releasing it after successful reloads (Prometheus screenshot attached for reference). To release the memory, we had to scale the deployment down and back up:
kubectl -n APP scale deploy ingress-controller --replicas=0
kubectl -n APP scale deploy ingress-controller --replicas=1
In our case, the memory was back at 0.75GB after scaling. However, nginx required multiple restarts as it wasn't able to load the backend configs. Sample logs and a Prometheus screenshot are attached for reference.
nginx_ingress_after_restart.log
We tried simulating the same scenario with the latest release, helm-chart-4.0.18, and the result was the same.
Notably, the spike was observed only with backend reloads and without sending any load to endpoints.
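As referenced above, a small sketch (not part of the original report) for watching controller pod memory while the loop runs; the namespace and label selector are assumptions based on the default Helm chart labels, and kubectl top requires metrics-server in the cluster.

    # Hedged sketch: sample controller memory every 10s while ingresses are created.
    while true; do
      kubectl top pod -n ingress-nginx \
        -l app.kubernetes.io/name=ingress-nginx --containers
      sleep 10
    done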
Hi @rmathagiarun ,
A spike in memory usage is expected with the procedure you have described. I don't see a memory leak or a bug in that procedure.
Would you like to make the reproduce steps more elaborate and detailed? Something closer to real use would help the developers.
@longwuyuan
Agreed that a spike in memory usage is expected due to multiple backend reloads. Our concern is why the pod does not release the memory even after a successful reload. The pod continues to hold the memory until it is restarted.
This is a real-world scenario, as multiple ingresses can be created over a period of time on a cluster. Meanwhile, the pod will continue to hold the memory it consumed during each reload, ultimately causing the pod to crash due to OOM.
I am not able to have discussions that resemble diving down a rabbit hole, so I can mostly discuss data and prefer to avoid theories, as the ingress-controller code is clearly capable of some things and incapable of others.
Based on the reproduce procedure you described, the behaviour observed is expected.
Why the memory is not released cannot be answered, because there is no data on where the memory is being used.
Your expectation is understood.
I would gladly be proved wrong, but I don't believe there is a single piece of software out there that will allocate and also release memory at your expected rate while processing an infinite for loop without sleep, at the speed of recent multicore Intel/AMD CPUs on Linux, malloc'ing the exact same amount of memory for exactly the same functionality as the ingress-nginx controller.
That means someone needs to compare other ingress controllers with the exact same use case and see what happens.
I will be happy to contribute or comment if your reproduce procedure can be made more real-world.
We fixed a performance issue some months ago and I don't see those patterns here. Also, we are not getting this report from users with different use cases. Hence I think we need more precise and detailed reproduce steps.
@rmathagiarun @longwuyuan we see the same behavior after upgrading the ingress helm chart to 4.0.17.
I am not certain how to proceed. We need a step-by-step procedure to reproduce the problem and then gather information as to where the memory is getting used.
@rmathagiarun Can you try to disable the metrics by setting --enable-metrics=false
You can use kubectl top to check pod memory usage
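A hedged sketch of both suggestions; the extraArgs key, release name, and namespace are assumptions to adapt, and kubectl top needs metrics-server in the cluster.

    # Hedged sketch: turn off the metrics endpoint via the chart's extraArgs...
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --reuse-values \
      --set controller.extraArgs.enable-metrics=false
    # ...then check pod memory usage directly.
    kubectl -n ingress-nginx top pod --containers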
@bmv126 We already tried this and it didn't help. I can see a few others have also reported this issue, but the response from the community always seems to be a refusal to accept that nginx has an issue during backend reload.
Alternatively, We are in the process of migrating to envoy based ingress controllers.
@rmathagiarun a backend reload is an event that occurs on vanilla nginx as well as on the nginx component of the ingress-nginx controller. The size of the data that needs to be read and reloaded has a direct bearing on CPU/memory usage, which ultimately impacts the user experience.
If you add configuration to vanilla nginx, without Kubernetes, in an infinite while-true loop, with each iteration adding a new virtual host with custom directives, I am sure you will experience the same issue.
Hence, the deep dive into a reproduce procedure becomes critically important to make progress here.
The project faced performance issues earlier, and that was addressed by clearly arriving at a reproduce procedure. We even got core dumps, created during those reproduce efforts.
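To illustrate that comparison, a rough sketch (an assumption, not something from this thread) of exercising repeated reloads on vanilla nginx outside Kubernetes; it assumes nginx.conf includes /etc/nginx/conf.d/*.conf and that port 8080 is free.

    # Hedged sketch: add one server block per iteration and reload, watching
    # worker RSS between iterations to see whether memory is ever released.
    i=0
    while true; do
      i=$((i+1))
      cat >> /etc/nginx/conf.d/generated.conf <<EOF
    server {
        listen 8080;
        server_name host$i.example.com;
        location / { return 200 'ok'; }
    }
    EOF
      nginx -s reload
      sleep 5   # unlike the tight loops discussed above, give each reload time to settle
    done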
@longwuyuan would it be possible to add the steps for getting core dumps and attaching debuggers to get traces to the Troubleshooting Guide, so that people facing such problems can give the community adequate information to debug and analyze?
@ramanNarasimhan77 I think that requires some deep-dive work. It would help someone who doesn't know how to get info out of a core file, but I am not sure one doc here would apply to all scenarios.
This is not a trivial process. I think that anyone taking up this task will already know how to work with core files and very likely have some dev skills in C/C++, etc.
On a different note, the reproduce procedure described earlier was generating ingress objects in a while loop with no sleep. That seems too far from a real use case. If we can come up with a practical, real-use test, it could help make progress. One observation, rather obvious, is that if there is a bug, then many people report experiences related to it, and personally I don't see several reports of this high memory usage. That is why I keep commenting about coming up with a practical, real-use test.
I did a search on the issue list and found at least one issue which describes some steps on how others worked on core files. For example https://github.com/kubernetes/ingress-nginx/issues/7080 . There are other issues as well.
Hey guys, thank you for all the information provided in this issue! We are currently facing the same high memory consumption issue reported by @rmathagiarun, @ParagPatil96 and @pdefreitas (in #8362). There's a clear memory-leak pattern in 2 different Kubernetes clusters, and both are running on top of:
NGINX Ingress controller
Release: v1.1.1
Build: a17181e
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.9
Helm Chart version: v4.0.17
Cluster 1 Version:
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.16-eks-25803e", GitCommit:"25803e8d008d5fa99b8a35a77d99b705722c0c8c", GitTreeState:"clean", BuildDate:"2022-02-16T23:37:16Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Cluster 2 Version:
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-04-05T13:26:12Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Find below the average memory consumption over a 4-day period. The peaks in both graphs represent NGINX config reloads, while the growing baseline indicates a memory leak.
Cluster 1 Average Memory Consumption
Cluster 2 Average Memory Consumption
Just wanted to highlight this while we get more info as mentioned by @longwuyuan. If anyone has found some reasonable explanation on this, could you please update us?
Thank you for all your support!
Came across this issue as I recognized a similar problem after upgrading k8s to 1.22 and nginx-ingress-controller to 1.1.2.
Workaround: we just added resource limits to the containers, so they clean up memory when they come close to their limit. This helped avoid OOM kills in our case.
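A hedged sketch of that workaround applied through the Helm chart; the release name, namespace, and sizes are assumptions to adjust for your environment.

    # Hedged sketch: set requests/limits on the controller so OOM behaviour is bounded.
    helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
      --namespace ingress-nginx --reuse-values \
      --set controller.resources.requests.memory=1Gi \
      --set controller.resources.limits.memory=2Gi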
Hi. I have a similar issue.
We have a synthetic test that checks all components of our solution, including creating and removing ingresses.
In the screenshot below you can see how the memory was quite stable; then I started the test to run overnight. Memory increased steadily until I disabled the test. It has been 3 hours since then, and nginx has not cleared the memory yet.
I'll update this post if it happens without pod restart.
Helm Chart version: v4.0.18
can you copy/paste your test code here
Not really, it uses an SDK built by our developers, but it creates and deletes ingresses like this one:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/auth-url: http://auth.development.svc.cluster.local/api/authorization
    nginx.ingress.kubernetes.io/proxy-body-size: 50m
    nginx.ingress.kubernetes.io/proxy-read-timeout: "240"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
    nginx.ingress.kubernetes.io/whitelist-source-range: xxx.xxx.xxx.xxx/32
  creationTimestamp: "2022-01-13T13:00:14Z"
  generation: 4
  labels:
    somekey: somelabel
  name: name-of-the-ingress
  namespace: development
  resourceVersion: "78541539"
  uid: 12c18f9c-1923-415c-a312-6f9c6b106a15
spec:
  rules:
  - host: subdomain.domain.com
    http:
      paths:
      - backend:
          service:
            name: gateway-service-manualtest
            port:
              number: 80
        path: /gateway-service-manualtest/(.*)
        pathType: ImplementationSpecific
  - host: subdomain.olddomain.com
    http:
      paths:
      - backend:
          service:
            name: gateway-service-manualtest
            port:
              number: 80
        path: /gateway-service-manualtest/(.*)
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - subdomain.olddomain.com
    secretName: tls-secret-api-dev-olddomain
  - hosts:
    - subdomain.domain.com
    secretName: tls-secret-api-dev-newdomain
status:
  loadBalancer:
    ingress:
    - ip: xxx.xxx.xxx.xxx
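A rough shell approximation of such a synthetic create/delete test (an assumption, not the reporter's SDK): it assumes a cleaned-up copy of the manifest above, with status, resourceVersion, uid, and creationTimestamp removed, is saved as test-ingress.yaml.

    # Hedged sketch: repeatedly create and delete one ingress with a pause,
    # closer to real use than a tight loop, while watching controller memory.
    for i in $(seq 1 100); do
      kubectl apply -f test-ingress.yaml
      sleep 60
      kubectl delete -f test-ingress.yaml
      sleep 60
    done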
Looks like the memory was never released. Pods got restarted and usage went back down to normal.
I asked because we need to pinpoint the use of the memory. Without knowing where the memory was used, there is not much to discuss.
Someone else was creating ingress objects in a bash while-true loop with no sleep and no break. While that may seem logical to the person doing it, it does not form a basis for discussion, simply because nobody will support that kind of usage for free.
The project devs are doing some work on performance, and in the past year there was a fix for a performance issue.
We are seeing the same issue after upgrading to k8s 1.22. Our nginx worked with 2G of memory when we were using v0.48.1, but after the upgrade it is not even stabilizing at 4G.
I've seen this happen on a few clusters within the last few weeks; like others in this thread, we are using v4.0.17 of the Helm chart to deploy v1.1.1 of the Ingress Controller. We've been running this version without issue for several months, but these issues seem to have started only once we upgraded to 1.22. However, I'm not sure whether this is related or just coincidental. One coincidence I've not seen mentioned elsewhere in this thread is that it may be related to the tracing functionality.
In our case, it does not (always?) appear to be triggered by backend reloads. In fact, backend reloads can sometimes solve the issue. Take the following charts for example - we see CPU and memory rise until a reload is triggered by editing the controller's configmap. No Ingress rule additions, removals or changes had been performed during this time and the logs show no backend reloads;
As CPU and memory rose, we started to see intermittent timeouts - which is reflected in the error rate for traffic passing through it. It's still quite low overall, but enough to trigger our monitoring since it was above the baseline and rising;
During this period, we also stopped receiving traces from the Ingress Controller. As soon as the configmap change was made to trigger a reload, tracing was restored. We configure the Ingress Controller to send traces to a host-local collector using Jaeger over HTTP;
enable-opentracing: 'true'
jaeger-endpoint: http://$HOST_IP:14268/api/traces
jaeger-service-name: ingress-nginx-external
HOST_IP is an envvar derived from the pod's status.hostIP value, referring to the node itself rather than a specific pod.
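For context, this kind of envvar is typically wired up with the downward API; a sketch (not the reporter's exact Deployment) is below.

    # Hedged sketch of the HOST_IP environment variable on the controller container.
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP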
The relation to tracing is further corroborated in a different cluster that experienced this issue more recently. On this occasion, we resolved it by performing a rolling restart of the pods. No backend reloads had been experienced recently on this controller either.
The initial climb in CPU usage above happens immediately after we deployed an upgrade of our trace collector agents (OpenTelemetry) on each host. This would have meant that for a short period (<60s), the Ingress Controller would have been unable to send traces while the agent pod was terminated and re-scheduled on each host. However, although the CPU usage is immediately high, it takes several days for the memory usage to also start climbing from normal.
There are some other clusters where we have started to see CPU climb immediately following this trace collector agent upgrade, but we haven't yet seen the memory rise. Although part of the climb is based on the load at the time of day, by extending the data back 2 weeks you can see that the magnitude is completely different;
If I jump into a controller's pod and look at the processes, I can see that the memory usage of the controller and the nginx master process is roughly what I'd expect; however, it's the workers that are showing excessive memory usage. This output is from one of the pods in the very first screenshot;
PID USER VSZ RSS COMMAND COMMAND
1 www-data 216 8 dumb-init /usr/bin/dumb-init -- /nginx-ingress-controller --default-backend-service=ingress/external-ingress-nginx-defaultbackend --publish-service=ingress/external-ingress-nginx-controll
7 www-data 798m 105m nginx-ingress-c /nginx-ingress-controller --default-backend-service=ingress/external-ingress-nginx-defaultbackend --publish-service=ingress/external-ingress-nginx-controller --election-id=ingre
28 www-data 201m 89m nginx nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
4515 www-data 1.5g 1.2g nginx nginx: worker process
4516 www-data 1.3g 1.0g nginx nginx: worker process
4517 www-data 1.5g 1.2g nginx nginx: worker process
4518 www-data 1.5g 1.2g nginx nginx: worker process
4519 www-data 894m 727m nginx nginx: worker process
4552 www-data 1.4g 1.2g nginx nginx: worker process
4553 www-data 1.7g 1.4g nginx nginx: worker process
4586 www-data 1.4g 1.2g nginx nginx: worker process
4632 www-data 200m 80m nginx nginx: cache manager process
Finally, although I have managed to get a core dump from this pod (by sending a SIGSEGV to one of the worker processes above), I've not touched gdb in years, and never on someone else's code, so I'm a little lost! Are there any gdb commands I can run that might help narrow things down @longwuyuan? Running maintenance info sections, for example, shows a large number of loadnnnn sections going up to load2477, most of which are ALLOC LOAD HAS_CONTENTS.
@KingJ, thanks for the update. Obviously the information you have posted is progress. I appreciate it, because it takes effort and commitment to get to that stage.
I don't have gdb commands as I am not a developer. But "normal" gdb (from google search) is what I would do if I got my hands on that core dump. If possible, please upload the core dump to this issue.
Secondly, there are some gdb commands available in an old issue related to performance, but there the software component and even the function(s) to be traced were known to the (non-project) developer. https://github.com/kubernetes/ingress-nginx/issues/6896
Can you please comment on whether there is any chance that anyone else could reproduce this?
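As a starting point, a few standard gdb commands for a first look at such a core file; the binary path matches the master process in the ps output above, and the core file name is an assumption based on the worker PID.

    # Hedged sketch: open the core against the nginx binary and inspect threads/stacks.
    gdb /usr/local/nginx/sbin/nginx core.4515
    (gdb) info threads
    (gdb) thread apply all bt
    (gdb) bt full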
/kind stabilization
@longwuyuan: The label(s) kind/stabilization cannot be applied, because the repository doesn't have them.
/area stabilization
/project Stabilisation Project
@longwuyuan: You must be a member of the kubernetes/ingress-nginx github team to set the project and column.
@KingJ Could you provide me with the core file so I could look into debugging this? You can also reach me directly on the Kubernetes Slack (slack username: Ismayil)
/priority critical-important /triage accepted
@strongjz: The label(s) priority/critical-important cannot be applied, because the repository doesn't have them.
/priority critical-urgent
Hello, I have the same issue. Release: v1.2.1. I tested disabling traffic on it and the memory is still growing... Any idea?
Jeff
The information posted in this issue is not very productive or helpful in the context of the problem description. For example, one way to produce this problem is with a for loop: if you send JSON payloads, for whatever purpose, in a for loop without even a second of sleep, then high CPU/memory usage to handle that kind of train of events is expected.
The developers are already aware of one scenario where the volume of change is unusually high, such as thousands of ingress objects. This leads to performance issues during a reload of nginx.conf.
So in this issue it helps to track that some users have a performance problem, but progress is going to be really slow because precise information is lacking. A description of the problem to be solved and some sort of reasonable reproduce procedure are much needed here.
Is that the overall usage of the Ingress NGINX pods?
Hello, I'm trying to dump memory but I'm having some trouble doing it with gdb (permission issue). For the moment, the only thing I know is: it is not linked directly to the volume of ingress rules, because I have an ingress class with 1901 rules that has the issue, and another class with 2329 rules in the same cluster that does not show memory growth.
Still working on it...
Jeff
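On the gdb permission issue: attaching to or dumping another process inside a container usually needs the SYS_PTRACE capability; below is a sketch of granting it on the controller container (an assumption to adapt to your Deployment, and some clusters restrict this via admission policies).

    # Hedged sketch: add to the controller container spec in the Deployment.
    securityContext:
      capabilities:
        add: ["SYS_PTRACE"]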
Just a +1: we have seen memory increasing constantly since upgrading to ingress v1. This is easy to show because we still have the 0.x ingress controllers in the same cluster, and those pods are stable. The leak is much more visible on the controllers with modsecurity enabled, which we have resorted to killing once they reach 4GB. Sample graph for a pod over 24 hours:
I suspect modsecurity too...
memory_spike or memory_leak should be solved sooner than later (to state the obvious). 👍
"There is a report of a memory leak even after killing load in this issue." Confirmed for us too: the ingress controller without modsecurity also leaks, but much more slowly, so it is not a concern at this time.
"There is a report of a memory spike on enabling modsecurity. A memory spike would be expected on enabling modsecurity due to higher inspection, etc." Correct, and it shows in the graph I posted above. Not high enough to be a concern.
But the request is to kindly help with information like:
- A step-by-step procedure that can be copy/pasted on a minikube/kind cluster to reproduce the problem. In our case, the install is done on EKS 1.21 by this terraform module: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller called from https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/aws-accounts/cloud-platform-aws/vpc/eks/components/components.tf#L111. Not tested, but it should work on kind too.
- Uploading core dumps if you have them. I don't think we can do that, sorry, as there may be confidential information in memory.
- Any other info that a developer can use as a valid scenario that results in either a spike or a leak.
Sorry, that came out wrong about the core dump. The intent is to get your core dump analyzed for bugs, and the thought was more along the lines of: if a developer asks, are you willing to privately and securely provide the core dump to that developer?
While the code linked above is all MIT and you can even watch it being modified for any request, I don't think live systems data can be shared, sorry.
NGINX Ingress controller version
NGINX Ingress controller Release: v1.1.1 Build: a17181e43ec85534a6fea968d95d019c5a4bc8cf Repository: https://github.com/kubernetes/ingress-nginx nginx version: nginx/1.19.9
Kubernetes version
Environment:
Cloud provider or hardware configuration: GCP
OS : Container-Optimized OS from Google
Kernel : 5.4.129+
How was the ingress-nginx-controller installed:
We used Helm to install the NGINX Ingress Controller; the following are the values we provided:
Current State of the controller:
kubectl describe ingressclasses
Current state of ingress object, if applicable:
kubectl -n <appnamespace> describe ing <ingressname>
What happened: Our ingress controller is serving ~1500 RPS, but over time the controller's memory continuously increases and never goes down; when it crosses the node limit (~15GB) the pod gets evicted.
What you expected to happen: We expect memory to stabilise at some point.
profiling heap export:
high_mem.txt
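For reference, a hedged sketch of how such a heap profile can be pulled from the controller when the --profiling flag is enabled; the port, path, and pod name are assumptions based on the controller's default health/metrics port.

    # Hedged sketch: fetch and summarise the Go heap profile from the controller.
    kubectl -n ingress-nginx port-forward <controller-pod> 10254:10254 &
    curl -s http://127.0.0.1:10254/debug/pprof/heap -o heap.pprof
    go tool pprof -top heap.pprof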
For now we are manually restarting the worker processes in order to release the memory.