newrelic / helm-charts

Helm charts for New Relic applications
Apache License 2.0

[newrelic-logging] Default resource limits cause out of memory errors #1500

Open hero-david opened 1 month ago

hero-david commented 1 month ago

Description

An issue has been opened about this before, and the reporter was instructed to make sure they had upgraded their chart so that the memory limit configuration on the Fluent Bit input was present.

https://github.com/newrelic/helm-charts/blob/ab2d1bab9f09d94ea6ca56fed807dd20eae5444e/charts/newrelic-logging/values.yaml#L104

We have been struggling with OOM kills and restarts on our pods despite having this configuration present and despite upping the pod's memory allowances. We run about 50 pods per node.
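
For context, the input-level memory cap referenced above is Fluent Bit's Mem_Buf_Limit option on the tail input. A minimal sketch of how it surfaces in the chart values, assuming the fluentBit.config.inputs layout used by recent chart versions; the 7MB figure is illustrative rather than necessarily the chart's default:

newrelic-logging:
  fluentBit:
    config:
      # layout and values assumed for illustration; check your chart version's values.yaml
      inputs: |
        [INPUT]
            Name              tail
            Path              /var/log/containers/*.log
            Mem_Buf_Limit     7MB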

(Screenshots: fluent-bit OOM kills and restarts)

The Helm config provided for this was:

newrelic-logging:
  enabled: true
  fluentBit:
    criEnabled: true
  lowDataMode: false
  resources:
    limits:
      memory: 256Mi
  tolerations:
  - effect: NoSchedule
    key: role
    operator: Exists
Date Message
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1360652 (flb-pipeline) total-vm:1307336kB, anon-rss:259736kB, file-rss:19648kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1400772 (fluent-bit) total-vm:1311176kB, anon-rss:259508kB, file-rss:19084kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1400790 (flb-pipeline) total-vm:1311176kB, anon-rss:259652kB, file-rss:19468kB, shmem-rss:0kB, UID:0 pgtables:1028kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1360626 (fluent-bit) total-vm:1307336kB, anon-rss:259624kB, file-rss:19264kB, shmem-rss:0kB, UID:0 pgtables:1104kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1201131 (flb-pipeline) total-vm:1483464kB, anon-rss:259504kB, file-rss:19828kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1201113 (fluent-bit) total-vm:1483464kB, anon-rss:259392kB, file-rss:19444kB, shmem-rss:0kB, UID:0 pgtables:1324kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1266468 (flb-pipeline) total-vm:1487560kB, anon-rss:259188kB, file-rss:19628kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1324063 (fluent-bit) total-vm:1487560kB, anon-rss:259368kB, file-rss:19368kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1324081 (flb-pipeline) total-vm:1487560kB, anon-rss:259476kB, file-rss:19752kB, shmem-rss:0kB, UID:0 pgtables:1348kB oom_score_adj:996
2024-10-08 05:11:23 Memory cgroup out of memory: Killed process 1266420 (fluent-bit) total-vm:1487560kB, anon-rss:259084kB, file-rss:19244kB, shmem-rss:0kB, UID:0 pgtables:1344kB oom_score_adj:996
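
For reference, "upping the memory allowances of the pod" amounts to a plain resources override in the same values block; a minimal sketch with illustrative figures rather than recommended ones:

newrelic-logging:
  resources:
    # figures are illustrative only; tune to your log volume
    requests:
      memory: 128Mi
    limits:
      memory: 1Gi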

Versions

Helm: v3.14.4
Kubernetes (AKS): 1.29.2
Chart: nri-bundle-5.0.81
FluentBit: newrelic/newrelic-fluentbit-output:2.0.0

What happened?

The fluent-bit pods were repeatedly killed for using more memory than their limit, which is set very low. Their CPU was never highly utilised, which suggests the memory increase was not due to throttling or an inability to keep up.

What you expected to happen?

Fluent Bit should have little to no restarts, and it should never reach 1.5 GB of memory used per container.

How to reproduce it?

Using the same versions listed above and the same Helm values.yaml, deploy an AKS cluster with 50 production workloads per node (2 vCPU, 8 GB) and observe whether there are memory issues.

workato-integration[bot] commented 1 month ago

https://new-relic.atlassian.net/browse/NR-323574

JS-Jake commented 2 weeks ago

@hero-david Did you have any luck resolving this? We're seeing the same problem with AKS

hero-david commented 1 week ago

@hero-david Did you have any luck resolving this? We're seeing the same problem with AKS

No, we have simply upped our VM SKU to 16 GB (required for some of our workloads moving forward anyway).