open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Stricter memory limiter #8694

Open cergxx opened 1 year ago

cergxx commented 1 year ago

Is your feature request related to a problem? Please describe. In my collector I have a linear pipeline: receiver -> memory_limiter -> batch -> exporter, and the exporter has its queue enabled. The queue size is set to a large number so that data is never dropped. Instead, I rely on the memory limiter to refuse data when there is congestion on the exporter side (the queue grows, and memory usage grows along with it). The existing memory limiter fails to do this reliably: sometimes it does not stop accepting new records fast enough, because it first waits for the GC to free some memory, and the collector gets OOM-killed.

Describe the solution you'd like I have 2 proposals:

  1. Add a "strict" mode to the memory limiter, so that it starts refusing incoming data as soon as memory usage goes above the limit, rather than only after running the GC.
  2. Add some sort of limiter that inspects the exporter queue size and refuses data in the receiver when the queue grows beyond a limit. Or, at least, expose the exporter queue size so that a custom receiver/processor can do the refusing itself.

Describe alternatives you've considered Proposal 1 – fork the existing memory limiter. Proposal 2 – I could achieve the same behaviour by implementing such an approach myself in a custom processor.
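
For illustration only, a minimal sketch of what such a manual check could look like: a wrapper around the next consumer in the pipeline that refuses traces once heap usage crosses a hard ceiling. The hard-limit constant, package name, and error value here are hypothetical, not part of the collector's memory_limiter implementation.

package strictlimiterexample

import (
	"context"
	"errors"
	"runtime"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// hardLimitBytes is a hypothetical hard ceiling; above it, data is refused
// immediately instead of waiting for the next check interval or a GC cycle.
const hardLimitBytes = 3 << 30 // 3 GiB, illustrative only

var errMemoryPressure = errors.New("refusing data: heap usage above hard limit")

// strictLimiter wraps the next consumer in the pipeline and rejects traces
// whenever the current heap usage exceeds the hard limit.
type strictLimiter struct {
	next consumer.Traces
}

func (s *strictLimiter) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (s *strictLimiter) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	// ReadMemStats briefly stops the world; a production processor would
	// sample memory on a background ticker instead of on every request.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapAlloc > hardLimitBytes {
		// Returning an error propagates back pressure to the receiver,
		// which can then reject the incoming request.
		return errMemoryPressure
	}
	return s.next.ConsumeTraces(ctx, td)
}

Returning an error from ConsumeTraces is what pushes back pressure toward the receiver; whether the data is retried or dropped then depends on the receiver and the client.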

mx-psi commented 1 year ago

I think #8632 would help address this

dloucasfx commented 1 year ago

I was recently debugging a similar issue and my analysis led me to a slightly different conclusion:

In our case, relevant logs look like this:

2023/10/23 15:03:49 settings.go:392: Set config to [/conf/relay.yaml]
2023/10/23 15:03:49 settings.go:445: Set ballast to 1351 MiB
2023/10/23 15:03:49 settings.go:461: Set memory limit to 3686 MiB
2023-10-23T15:03:49.923Z        info    service/telemetry.go:84 Setting up own telemetry...
......
2023-10-23T15:03:49.925Z    info    memorylimiterprocessor@v0.82.0/memorylimiter.go:102 Memory limiter configured   {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "limit_mib": 3686, "spike_limit_mib": 737, "check_interval": 2}
.......
2023-10-23T15:03:51.938Z    debug   memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory.  {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 673}
.......
2023-10-23T15:03:54.102Z    debug   memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory.  {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1296}
........
2023-10-23T15:03:56.553Z    debug   memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory.  {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 969}

Pod restarting:

lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2023-10-23T15:03:49Z'
          finishedAt: '2023-10-23T15:03:59Z'

You can see that the memory limiter's explicit GC never ran, and that a burst of data drove memory usage up between 15:03:56.553Z (when usage was 969 MiB) and the next memory limiter poll. In addition, the regular GC would not run either: with the default GOGC setting it only triggers once the live heap roughly doubles, and doubling 969 MiB plus the 1351 MiB ballast exceeds the container limit of 4 GiB.
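
To make the arithmetic explicit (assuming the default GOGC=100, under which the next collection triggers at roughly twice the live heap; the numbers are taken from the logs above):

package main

import "fmt"

func main() {
	// Rough GC-trigger arithmetic under the default GOGC=100: the next
	// collection runs when the heap reaches roughly 2x the live heap.
	liveHeapMiB := 969 + 1351          // observed heap + ballast, in MiB
	nextGCTargetMiB := 2 * liveHeapMiB // 4640 MiB
	containerLimitMiB := 4 * 1024      // 4 GiB pod memory limit

	fmt.Printf("next GC target: %d MiB, container limit: %d MiB\n",
		nextGCTargetMiB, containerLimitMiB)
	// 4640 MiB > 4096 MiB, so the kernel OOM-kills the pod before the
	// Go GC ever gets a chance to run.
}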

The proposed solution is to get rid of the ballast and start using GOMEMLIMIT. Keep the memory limiter, but modify it so that it is only used for stopping/delaying/dropping data, as its GC functionality will no longer be needed.
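
For anyone trying this out, here is an illustrative Kubernetes container snippet for setting GOMEMLIMIT on the collector; the specific values are examples only, the general idea being to keep the soft limit comfortably below the hard container limit:

containers:
  - name: otel-collector
    resources:
      limits:
        memory: 4Gi
    env:
      # Soft limit for the Go runtime: the GC becomes more aggressive as
      # heap usage approaches this value. Keep it below the container limit
      # so non-heap memory (goroutine stacks, cgo, etc.) has headroom.
      - name: GOMEMLIMIT
        value: "3276MiB"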

mx-psi commented 1 year ago

@dloucasfx For the ballast, we are considering removal in #8343. I am interested in having end users test GOMEMLIMIT and provide feedback on the transition, so that we can make an informed decision. The official Helm chart now has a built-in option for this. If you are trying this out, please leave feedback on either #8343 or open-telemetry/opentelemetry-helm-charts/issues/891 so we can move forward with this :)