kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Higher CPU load and stability issues after update to v1.7.0 #9848

Closed · tsk9 closed this issue 2 months ago

tsk9 commented 1 year ago

What happened: After updating the ingress controller from v1.5.1 to v1.7.0, the CPU load increased significantly for the same volume of requests, and under higher load the readiness/liveness probes frequently fail to respond in time. This leads to stability issues, because sometimes the ingress controller is not able to recover under load.

What you expected to happen: The CPU load should be the same as before.

Kubernetes version (use kubectl version): v1.24.8

Others: After testing various versions and combinations of rootfs images, I found that the higher CPU load is a result of the updated alpine base image. Therefore I built two different versions of the ingress controller v1.7.0, one with alpine 3.17.3 and one with 3.16.5. The graph shows the difference between the two versions (3.17.3 -> 3.16.5) during testing with a fixed set of requests.

[graph: controller CPU load during the fixed request test, dropping after switching the base image from alpine 3.17.3 to 3.16.5]

diff --git a/images/nginx/rootfs/Dockerfile b/images/nginx/rootfs/Dockerfile
index 3279af5d5..faef632f1 100644
--- a/images/nginx/rootfs/Dockerfile
+++ b/images/nginx/rootfs/Dockerfile
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-FROM alpine:3.17.2 as builder
+FROM alpine:3.16.5 as builder

 COPY . /

@@ -21,7 +21,7 @@ RUN apk update \
   && /build.sh

 # Use a multi-stage build
-FROM alpine:3.17.2
+FROM alpine:3.16.5

 ENV PATH=$PATH:/usr/local/luajit/bin:/usr/local/nginx/sbin:/usr/local/nginx/bin
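For anyone who wants to reproduce the comparison, here is a rough sketch of how the two base images can be built locally (image tags are placeholders, and the project's official build uses its own scripts and build args, so treat this only as an approximation):

# build the nginx base image from the patched rootfs Dockerfile (alpine 3.16.5)
docker build -t nginx-base:alpine-3.16.5 images/nginx/rootfs/
# check out the unpatched Dockerfile and build again to get the 3.17.x variant,
# then rebuild the controller image on top of each base and run the same load test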
k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

Any thoughts on identifying the processes involved, or even comparing per-process CPU/memory usage between the two versions?

My first thought is that this could be added processing related to security. Several CVE fixes came in, and more fixes are needed on the alpine libs, at least for TLS.

Or do you see errors or a high volume of repeated log messages?

tsk9 commented 1 year ago

The memory consumption is the same as before. I checked the processes inside the pod and only nginx has a significant CPU usage.

From the alpine release notes I saw the change to OpenSSL 3.0 as the default, which could be related (we are doing SSL termination at the ingress controller). But this is only a wild guess without any evidence.

The logs are showing no related error messages.
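For reference, a quick way to look at per-process usage is something like this (namespace and pod name below are just my setup):

# overall pod usage (requires metrics-server)
kubectl top pod -n ingress-nginx
# per-process view inside the controller pod (busybox top in the alpine image)
kubectl exec -n ingress-nginx <controller-pod> -- top -b -n 1 | head -n 20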

wardob commented 1 year ago

Hello. I've also faced a similar high CPU utilization issue when using 1.7.0 with the OpenTelemetry module enabled. When I disable the OpenTelemetry module, the CPU utilization goes back to normal levels. I've found that the "__vdso_clock_gettime" kernel function is what consumes the CPU.
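In case someone wants to reproduce this, the hotspot can be sampled with perf on the node running the controller (the PID below stands for one of the busy nginx worker processes):

# sample the busiest nginx worker for ~10 seconds and print the hottest symbols
perf record -F 99 -g -p <nginx-worker-pid> -- sleep 10
perf report --stdio | head -n 30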

longwuyuan commented 1 year ago

This issue is marked as a bug because there seems to be some data here to investigate the performance. Please wait until the developers can free up capacity from the even higher-priority stabilization problems.

thomaschaaf commented 1 year ago

We're also seeing high CPU usage when OpenTelemetry is enabled, and we've had to disable it again because something seems to be broken. The controller used 0.16 CPUs before; with OpenTelemetry enabled and 1% sampling it goes on to use all available resources (5 CPUs in my case).

longwuyuan commented 1 year ago

@thomaschaaf is your controller installed as a daemonset?

thomaschaaf commented 1 year ago

@longwuyuan no it's installed as a deployment with hpa.

js8080 commented 1 year ago

I am seeing the same thing with my cluster. After enabling OpenTelemetry, my Ingress Controller pod CPU utilization jumped to 100%

My override values.yaml:

controller:
  extraModules: []
  opentelemetry:
    enabled: true

  config:
    otlp-collector-host: obs-otel-collector.obs
    enable-opentelemetry: "true"
    otel-sampler: AlwaysOn
    otel-sampler-ratio: "1.0"

  admissionWebhooks:
    enabled: false
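In case it helps, this is roughly how I apply the override (release name, namespace, and values file name are just what I use; adjust as needed):

helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  -f values.yaml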
longwuyuan commented 1 year ago

cc @esigo

tsk9 commented 1 year ago

Maybe the high CPU load with OpenTelemetry enabled is a different issue, because in the setup I described OpenTelemetry was disabled (the default).

esigo commented 1 year ago

> We're also seeing high CPU usage when OpenTelemetry is enabled, and we've had to disable it again because something seems to be broken. The controller used 0.16 CPUs before; with OpenTelemetry enabled and 1% sampling it goes on to use all available resources (5 CPUs in my case).

@thomaschaaf would it be possible to share the number of requests per second? thanks

js8080 commented 1 year ago

In my case, the cluster handles a very low volume of test traffic, and it jumped to 100% CPU utilization with very few requests (maybe 10). However, it doesn't even seem tied to request volume: even after the Ingress Controller was reloaded, it immediately jumped back to high CPU usage.

albundy83 commented 1 year ago

Hello, using Kubernetes 1.24.10 and the ingress-nginx helm chart 4.6.0 (controller v1.7.0), I also experience the same VERY high CPU usage. Here is the configuration I use to enable OpenTelemetry:

config:
    otlp-collector-host: tempo.tempo # using grpc endpoint
    enable-opentelemetry: "true"
    otel-sampler: AlwaysOn
    otel-sampler-ratio: "1.0"

Each nginx process consumes almost 99% CPU on my worker node... I had to disable OpenTelemetry at this point, and the nginx processes went back to "normal".
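(To double-check which of these keys actually reached the controller, it can help to look at the rendered ConfigMap; the ConfigMap name below is the chart default and may differ in your install:)

kubectl get configmap ingress-nginx-controller -n ingress-nginx -o yaml | grep -iE 'otel|opentelemetry'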

albundy83 commented 1 year ago

Here is the top output:

top - 15:24:50 up 5 days, 20:41,  1 user,  load average: 11.27, 3.35, 1.39
Tasks: 360 total,   1 running, 359 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.1 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  16000.5 total,   2099.1 free,   4718.4 used,   9183.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  10933.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 233385 systemd+  20   0  173496  47768   9260 S  99.7   0.3   2:48.55 nginx
 233382 systemd+  20   0  173496  47768   9260 S  97.7   0.3   2:45.30 nginx
 233349 systemd+  20   0  173496  47768   9260 S  97.3   0.3   2:43.78 nginx
 233386 systemd+  20   0  159944  33788   4192 S  94.7   0.2   2:50.83 nginx
 233347 systemd+  20   0  173752  48616   9808 S  88.0   0.3   2:49.96 nginx
 233346 systemd+  20   0  173496  48096   9532 S  86.7   0.3   2:49.19 nginx
 233348 systemd+  20   0  173496  47768   9260 S  78.7   0.3   2:42.19 nginx
 233354 systemd+  20   0  176240  52660  11808 S  64.1   0.3   2:39.53 nginx
 233350 systemd+  20   0  173496  47648   9140 S  60.8   0.3   2:49.57 nginx
longwuyuan commented 1 year ago

OpenTelemetry support is getting closer to completion, but technically it is still a work in progress.

/assign

albundy83 commented 1 year ago

Hello, after some "research" I found this article, and when I apply the config described there:

controller:
  config:
    enable-opentelemetry: "true"
    opentelemetry-config: "/etc/nginx/opentelemetry.toml"
    opentelemetry-operation-name: "HTTP $request_method $service_name $uri"
    opentelemetry-trust-incoming-span: "true"
    otlp-collector-host: "otel-collector.grafana.svc.cluster.local"
    otlp-collector-port: "4317"
    otel-max-queuesize: "2048"
    otel-schedule-delay-millis: "5000"
    otel-max-export-batch-size: "512"
    otel-service-name: "nginx-proxy" # Opentelemetry resource name
    otel-sampler: "AlwaysOn" # Also: AlwaysOff, TraceIdRatioBased
    otel-sampler-ratio: "1.0"
    otel-sampler-parent-based: "false"

My CPUs remain calm; no idea for the moment which parameter causes the high CPU load.
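To see how these keys end up in the module configuration, you can also dump the generated file from inside a controller pod (the deployment name below is the chart default):

kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- cat /etc/nginx/opentelemetry.toml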

js8080 commented 1 year ago

@albundy83 I just tried specifying slightly different config options and the CPU is also doing fine so far.

My version is the same as yours but with this one difference:

    opentelemetry-operation-name: "HTTP $request_method $service_name"
js8080 commented 1 year ago

Through some experimentation, it seems like this particular setting is what makes the difference:

otel-schedule-delay-millis: "5000"

When I comment this out, the resulting /etc/nginx/opentelemetry.toml contents in the Ingress Controller container look like this:

[processors.batch]
max_queue_size = 2048
schedule_delay_millis = 0   # <--- it gets set to 0
max_export_batch_size = 512

This orange line shows the CPU usage for my Ingress controller in my cluster when I had schedule_delay_millis set to 5000 (flat), then 0 (spike), and then back to 5000 (flat again):

[graph: controller CPU usage over time: flat with schedule_delay_millis = 5000, spiking at 0, flat again after restoring 5000]

So it seems pretty clear that the otel-schedule-delay-millis setting is the key here.
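A quick way to verify which value the controller actually ended up with (deployment name as in the default chart install):

kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- \
  grep schedule_delay_millis /etc/nginx/opentelemetry.toml
# expect 5000 when otel-schedule-delay-millis: "5000" is set; 0 is what correlated with the spike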

albundy83 commented 1 year ago

Yes, I was adding and removing parameters to see which one is eating CPU, and it's exactly otel-schedule-delay-millis.

Once I remove otel-schedule-delay-millis: "5000", the graphs shoot through the roof:

[graph: controller CPU usage spiking after removing otel-schedule-delay-millis]

ejsealms commented 1 year ago

I had the same problem, but OpenTelemetry was not enabled. Version 1.6.4 showed the same behavior.

For security compliance, we were able to upgrade the version of curl and libcurl used in the nginx controller v1.5.1 image to 8.0.1-r0. This solution works as expected.
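For anyone needing to do the same, the patch is conceptually just an apk upgrade layered on top of the published image; a simplified sketch (the upstream tag, target registry, and non-root user are assumptions about the published image, so verify them for your setup):

cat > Dockerfile.curl-patch <<'EOF'
FROM registry.k8s.io/ingress-nginx/controller:v1.5.1
USER root
# pull the fixed curl/libcurl packages from the alpine repositories
RUN apk add --no-cache --upgrade curl libcurl
# drop back to the unprivileged user the controller normally runs as
USER www-data
EOF
docker build -t my-registry/ingress-nginx-controller:v1.5.1-curl-patch -f Dockerfile.curl-patch .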

ejsealms commented 1 year ago

@longwuyuan any update on this? My team has to patch another vulnerability in the 1.5.1 image because later images are not usable.

longwuyuan commented 1 year ago

We can't backport fixes to v1.5.1. The OpenTelemetry work is progressing at a good pace, so if the alpine lib updates for the CVE fixes are the requirement, they are only available in v1.7.1 of the controller.

If you can state why you cannot upgrade beyond controller v1.5.1, maybe others will have comments or insight.

ejsealms commented 1 year ago

@longwuyuan We have the same problem others were describing, where the nginx-ingress-controller consumes far too much CPU, but unlike the others we don't have OpenTelemetry enabled. No workaround has been mentioned so far, so we are still stuck on 1.5.1.

longwuyuan commented 1 year ago

@ejsealms I can join a Zoom session to get more info, or you could do some investigative drill-down on the process and the call that is using the high CPU.

To take any action, we need a reproduce procedure.

ravidbro commented 1 year ago

I see that an updated version was released; was this solved? Also, is there performance degradation as well, or only high resource usage?

ejsealms commented 1 year ago

@ravidbro version 1.8.0 of the controller contained a fix and we are now running the latest without problems. The high CPU utilization led to a service outage for my team.
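For others landing here, upgrading is a normal chart bump; I believe the 4.7.x chart line is the one that ships controller v1.8.0, but double-check the chart-to-controller mapping in the release notes:

helm repo update
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --version 4.7.0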

longwuyuan commented 2 months ago

@tsk9 one user reported that their problem was solved. Since there has been no activity for such a long time, it seems that either the problem was solved or the system was upgraded and the issue no longer applies.

In any case, it's hard to keep an issue open without a tracking action item. The project has started deprecating unsupportable features due to a lack of resources, and we also need to allocate the available resources to security and the Gateway API implementation.

Since there is no action item being tracked here, I will close the issue.

/close

k8s-ci-robot commented 2 months ago

@longwuyuan: Closing this issue.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/9848#issuecomment-2345242528):

> @tsk9 one user reported that their problem was solved. Since there has been no activity for such a long time, it seems that either the problem was solved or the system was upgraded and the issue no longer applies.
>
> In any case, it's hard to keep an issue open without a tracking action item. The project has started deprecating unsupportable features due to a lack of resources, and we also need to allocate the available resources to security and the Gateway API implementation.
>
> Since there is no action item being tracked here, I will close the issue.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.