Closed (tsk9 closed this issue 2 months ago)
This issue is currently awaiting triage.
If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Any thoughts on identifying processes, or even comparing processes for CPU/memory?
My first thought is that it could be added processing related to security. Several CVE fixes came in, and more fixes are still needed on the Alpine libs, at least for TLS.
Or do you see errors or a high volume of repeated log messages?
The memory consumption is the same as before. I checked the processes inside the pod and only nginx has a significant CPU usage.
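For reference, a check like that can be scripted roughly as follows; a minimal sketch, assuming the helm chart's default labels and the ingress-nginx namespace (both are assumptions, adjust to your install):
# pick one controller pod
POD=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller -o name | head -n 1)
# per-process CPU/memory inside the controller container (busybox top, single batch iteration)
kubectl exec -n ingress-nginx "$POD" -- top -b -n 1
# pod-level usage, if metrics-server is available
kubectl top pod -n ingress-nginx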
From the Alpine release notes I saw the change to OpenSSL 3.0 as the default, which could be related (we are doing SSL termination at the ingress controller). But this is only a wild guess without any evidence.
The logs are showing no related error messages.
Hello. I've also faced a similar high CPU utilization issue when using 1.7.0 with the OpenTelemetry module enabled. When I disable the OpenTelemetry module, CPU utilization goes back to normal levels. I found that the "__vdso_clock_gettime" kernel function is what consumes the CPU.
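For reference, a hot function like that can be spotted by profiling a busy nginx worker from the node; a minimal sketch, assuming perf is installed on the worker node and run with root privileges (the PID 12345 is a placeholder taken from top):
perf top -p 12345                      # live view of the hottest functions in that worker
perf record -g -p 12345 -- sleep 10    # or record a 10-second call-graph profile
perf report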
This issue is marked as a bug because there seems to be enough data here to investigate the performance. Please wait until the developers can free up capacity from the even higher-priority stabilization problems.
We're also seeing high CPU usage if OpenTelemetry is enabled. We've also had to disable OpenTelemetry again because there seems to be something broken: the controller used 0.16 CPUs before, and with OpenTelemetry it goes on to use all available resources (5 CPUs in my case) even at 1% sampling.
@thomaschaaf is your controller installed as a DaemonSet?
@longwuyuan no, it's installed as a Deployment with an HPA.
I am seeing the same thing with my cluster. After enabling OpenTelemetry, my Ingress Controller pod CPU utilization jumped to 100%
My override values.yaml:
controller:
  extraModules: []
  opentelemetry:
    enabled: true
  config:
    otlp-collector-host: obs-otel-collector.obs
    enable-opentelemetry: "true"
    otel-sampler: AlwaysOn
    otel-sampler-ratio: "1.0"
  admissionWebhooks:
    enabled: false
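The CPU jump itself can be watched while toggling the module; a rough sketch, assuming metrics-server is installed and the chart's default ingress-nginx namespace:
kubectl top pod -n ingress-nginx --containers
# or refresh every few seconds while enabling/disabling OpenTelemetry
watch -n 5 kubectl top pod -n ingress-nginx --containers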
cc @esigo
Maybe the high CPU load with OpenTelemetry enabled is a different issue, because in the setup described here it was disabled (the default).
@thomaschaaf would it be possible to share the number of requests per second? thanks
In my case, my cluster is for a very low volume of testing and it jumped to 100% CPU utilization with very few requests (maybe 10). However, it doesn't even seem tied to request volume as even after the Ingress Controller was reloaded, it still immediately jumped to high CPU usage.
Hello, using Kubernetes 1.24.10 and the nginx helm chart 4.6.0 (nginx v1.7.0), I also experience the same VERY high CPU usage. Here is the configuration I use to enable OpenTelemetry:
config:
  otlp-collector-host: tempo.tempo # using grpc endpoint
  enable-opentelemetry: "true"
  otel-sampler: AlwaysOn
  otel-sampler-ratio: "1.0"
Each nginx process consumes almost 99% CPU on my worker node... I had to disable OpenTelemetry at this point, and the nginx processes went back to "normal".
Here is the top output:
top - 15:24:50 up 5 days, 20:41, 1 user, load average: 11.27, 3.35, 1.39
Tasks: 360 total, 1 running, 359 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.1 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 16000.5 total, 2099.1 free, 4718.4 used, 9183.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 10933.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
233385 systemd+ 20 0 173496 47768 9260 S 99.7 0.3 2:48.55 nginx
233382 systemd+ 20 0 173496 47768 9260 S 97.7 0.3 2:45.30 nginx
233349 systemd+ 20 0 173496 47768 9260 S 97.3 0.3 2:43.78 nginx
233386 systemd+ 20 0 159944 33788 4192 S 94.7 0.2 2:50.83 nginx
233347 systemd+ 20 0 173752 48616 9808 S 88.0 0.3 2:49.96 nginx
233346 systemd+ 20 0 173496 48096 9532 S 86.7 0.3 2:49.19 nginx
233348 systemd+ 20 0 173496 47768 9260 S 78.7 0.3 2:42.19 nginx
233354 systemd+ 20 0 176240 52660 11808 S 64.1 0.3 2:39.53 nginx
233350 systemd+ 20 0 173496 47648 9140 S 60.8 0.3 2:49.57 nginx
OpenTelemetry support is closer to being complete, but technically it is still a work in progress.
/assign
Hello, after some "research" I found this article, and when I apply the config described there:
controller:
  config:
    enable-opentelemetry: "true"
    opentelemetry-config: "/etc/nginx/opentelemetry.toml"
    opentelemetry-operation-name: "HTTP $request_method $service_name $uri"
    opentelemetry-trust-incoming-span: "true"
    otlp-collector-host: "otel-collector.grafana.svc.cluster.local"
    otlp-collector-port: "4317"
    otel-max-queuesize: "2048"
    otel-schedule-delay-millis: "5000"
    otel-max-export-batch-size: "512"
    otel-service-name: "nginx-proxy" # OpenTelemetry resource name
    otel-sampler: "AlwaysOn" # Also: AlwaysOff, TraceIdRatioBased
    otel-sampler-ratio: "1.0"
    otel-sampler-parent-based: "false"
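For completeness, values like these would typically be applied through the official chart, roughly along these lines (a sketch; release name, namespace, and the otel-values.yaml file name are placeholders):
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  -f otel-values.yaml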
my CPUs remain calm. No idea for the moment which parameter causes the high CPU load.
@albundy83 I just tried specifying slightly different config options and the CPU is also doing fine so far.
My version is the same as yours but with this one difference:
opentelemetry-operation-name: "HTTP $request_method $service_name"
Through some experimentation, it seems like this particular setting is what makes the difference:
otel-schedule-delay-millis: "5000"
When I comment this out, the resulting /etc/nginx/opentelemetry.toml contents in the Ingress Controller container look like this:
[processors.batch]
max_queue_size = 2048
schedule_delay_millis = 0 # <--- it gets set to 0
max_export_batch_size = 512
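That rendered file can be checked directly inside the controller container; a sketch, with the same pod-name and namespace assumptions as earlier:
POD=$(kubectl get pods -n ingress-nginx -l app.kubernetes.io/component=controller -o name | head -n 1)
kubectl exec -n ingress-nginx "$POD" -- cat /etc/nginx/opentelemetry.toml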
This orange line shows the CPU usage for my Ingress Controller in my cluster when I had schedule_delay_millis set to 5000 (flat), then 0 (spike), and then back to 5000 (flat again):
So it seems pretty clear that the otel-schedule-delay-millis setting is key here.
Yes, I was adding and removing parameters to see which one was eating CPU, and it is exactly otel-schedule-delay-millis. Once I remove otel-schedule-delay-millis: "5000", the graphs touch the sky.
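If anyone wants to apply just that one workaround without maintaining a full values file, a sketch with the chart (release and namespace names are assumptions):
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --reuse-values \
  --set-string controller.config.otel-schedule-delay-millis=5000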
I had the same problem, but OpenTelemetry was not enabled. Version 1.6.4 showed the same behavior.
For security compliance, we were able to upgrade the version of curl and libcurl used in nginx controller v1.5.1 to v8.0.1-r0. This solution works as expected.
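For context, that kind of patch can be layered on top of the published image; a rough sketch, assuming apk is still usable in the image and with a placeholder target registry (package versions depend on the Alpine base):
cat > Dockerfile <<'EOF'
FROM registry.k8s.io/ingress-nginx/controller:v1.5.1
USER root
# upgrade curl/libcurl from the Alpine repositories
RUN apk add --no-cache --upgrade curl libcurl
USER www-data
EOF
docker build -t my-registry.example.com/ingress-nginx-controller:v1.5.1-curl8 .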
@longwuyuan any update on this? My team has to patch another vulnerability in the 1.5.1 image because later images are not usable.
We can't backport fixes to v1.5.1. The OpenTelemetry work is in progress at a good pace, so if the Alpine lib updates for CVE fixes are the requirement, they are only available in v1.7.1 of the controller.
If you can state why you cannot upgrade beyond controller v1.5.1, maybe others will have comments that offer insight.
@longwuyuan We have the same problem others were describing, where the nginx-ingress-controller was consuming far too much CPU, but, unlike others, we don't have OpenTelemetry enabled. There has been no workaround mentioned so far, so we are still stuck on 1.5.1.
@ejsealms I can join a Zoom session to get more info, or you could do some investigative drill-down on the process and the call that is using the high CPU.
To take any action, we need a reproduction procedure.
I see that an updated version was released; was this solved? Also, is there performance degradation as well, or only high resource usage?
@ravidbro version 1.8.0 of the controller contained a fix and we are now running the latest without problems. The high CPU utilization led to a service outage for my team.
@tsk9 one user reported that their problem was solved. Since there has been no activity for such a long time, it seems that either the problem was solved or the system was upgraded and the issue no longer applies.
In any case, it's hard to keep an issue open without a tracking action item. The project has started deprecating unsupportable features due to lack of resources. We also need to allocate available resources to security and the Gateway-API implementation.
Since there is no action item being tracked here, I will close the issue.
/close
@longwuyuan: Closing this issue.
What happened: After updating the ingress controller from v1.5.1 to v1.7.0, the CPU load increased significantly for the same amount of requests, and under higher load the readiness/liveness probes frequently do not respond in time. This leads to stability issues, because sometimes the ingress controller is not able to recover under load.
What you expected to happen: The CPU load should be the same as before.
Kubernetes version (use kubectl version): v1.24.8
Others: After testing various versions and combinations of rootfs images, I found out that the higher CPU load is a result of the updated Alpine base image. Therefore I built two different versions of the ingress controller v1.7.0, one with Alpine 3.17.3 and one with 3.16.5. The graph shows the difference between the two versions (3.17.3 -> 3.16.5) during testing with a fixed set of requests.
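A comparison like that can be reproduced by pointing a constant-rate load generator at each build and watching the controller CPU; for example, a sketch with hey (the tool choice, host name, and rates are assumptions, not from the report):
# 2 minutes of load, 10 concurrent workers, each rate-limited to 100 req/s
hey -z 2m -q 100 -c 10 https://test.example.com/
# in parallel, compare CPU of the controller pods running the two builds
kubectl top pod -n ingress-nginx --containers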