cergxx opened 1 year ago
I think #8632 would help address this
I was recently debugging a similar issue and my analysis led me to a slightly different conclusion:
In our case, relevant logs look like this:
```
2023/10/23 15:03:49 settings.go:392: Set config to [/conf/relay.yaml]
2023/10/23 15:03:49 settings.go:445: Set ballast to 1351 MiB
2023/10/23 15:03:49 settings.go:461: Set memory limit to 3686 MiB
2023-10-23T15:03:49.923Z info service/telemetry.go:84 Setting up own telemetry...
......
2023-10-23T15:03:49.925Z info memorylimiterprocessor@v0.82.0/memorylimiter.go:102 Memory limiter configured {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "limit_mib": 3686, "spike_limit_mib": 737, "check_interval": 2}
.......
2023-10-23T15:03:51.938Z debug memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 673}
.......
2023-10-23T15:03:54.102Z debug memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1296}
........
2023-10-23T15:03:56.553Z debug memorylimiterprocessor@v0.82.0/memorylimiter.go:273 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 969}
```
Pod restarting:
```yaml
lastState:
  terminated:
    exitCode: 137
    reason: OOMKilled
    startedAt: '2023-10-23T15:03:49Z'
    finishedAt: '2023-10-23T15:03:59Z'
```
You can see that the explicit GC from the memory limiter never ran: a burst of data drove memory usage up between 15:03:56.553Z, when usage was 969 MiB, and the next memory limiter poll two seconds later. The regular Go GC would not have kicked in either, because with the default GOGC=100 it only triggers once the heap doubles, and 2 × (969 MiB + 1351 MiB of ballast) ≈ 4640 MiB, which is above the 4 GiB container limit.
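For clarity, here is that arithmetic as a small sketch; the constants come from the logs above, and GOGC=100 is the Go default:

```go
package main

import "fmt"

func main() {
	const (
		ballastMiB   = 1351 // "Set ballast to 1351 MiB"
		liveMiB      = 969  // cur_mem_mib at 15:03:56.553Z
		containerMiB = 4096 // container memory limit (4 GiB)
	)
	// With GOGC=100 the next GC triggers when the heap (including the
	// ballast) doubles relative to the last live set.
	nextGC := 2 * (liveMiB + ballastMiB)
	fmt.Printf("next GC at ~%d MiB, container limit %d MiB\n", nextGC, containerMiB)
	// Output: next GC at ~4640 MiB, container limit 4096 MiB
	// -> the kernel OOM-kills the pod before the GC ever runs.
}
```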
Proposed solution: get rid of the ballast and start using GOMEMLIMIT. Keep the memory limiter, but modify it so it is only used for stopping/delaying/dropping data, as its GC functionality will no longer be needed.
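For reference, a minimal sketch of that replacement, assuming Go 1.19+; the 3686 MiB value just mirrors the limit from the logs above:

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to starting the process with GOMEMLIMIT=3686MiB.
	// Instead of waiting for the heap to double, the runtime collects
	// more aggressively as usage approaches this soft limit.
	debug.SetMemoryLimit(3686 << 20) // limit in bytes

	// ... start the collector service as usual ...
}
```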
@dloucasfx For the ballast, we are considering removal in #8343. I am interested in having end users test GOMEMLIMIT and provide feedback on the transition, so that we can make an informed decision.

The official Helm chart now has a built-in option for this. If you are trying this out, please leave feedback on either #8343 or open-telemetry/opentelemetry-helm-charts/issues/891 so we can move forward with this :)
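If you want to try it via the chart, a values.yaml sketch would look roughly like this; the flag name is taken from the chart's documentation, so double-check it against the chart version you are running:

```yaml
# Replaces the memory_ballast extension with a GOMEMLIMIT env var
# derived from the container memory limit.
useGOMEMLIMIT: true
resources:
  limits:
    memory: 4Gi
```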
**Is your feature request related to a problem? Please describe.**
In my collector I have a linear pipeline:

```
receiver -> memory_limiter -> batch -> exporter
```

and the exporter has an enabled queue. The queue size is set to a large number so that data is never dropped. Instead, I use the memory limiter to refuse data when there is congestion on the exporter side (the queue grows, and memory usage grows along with it). The existing memory limiter fails to do this properly: sometimes it does not stop accepting new records fast enough, since it waits for the GC to free some memory, and the collector hits an OOM.
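A sketch of the setup described above, assuming OTLP on both ends; all numbers are illustrative, not taken from the issue:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3686
    spike_limit_mib: 737
  batch: {}

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      queue_size: 100000  # large, so data is never dropped

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```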
**Describe the solution you'd like**
I have 2 proposals:

**Describe alternatives you've considered**
Proposal 1 – fork the existing memory limiter.
Proposal 2 – I could achieve the same behaviour with another approach: modify the last processor in the pipeline (batch) to return an error there if the queue is full (sketched below). But this solution looks too complex.
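The shape of that alternative would be roughly the following; the queue interface and all names here are hypothetical, not actual collector APIs:

```go
package main

import (
	"context"
	"errors"
)

var errQueueFull = errors.New("downstream queue is full, refusing data")

// queue is a stand-in for whatever exposes the exporter queue's fill level.
type queue interface {
	Len() int
	Cap() int
}

// backpressureConsumer is a pass-through step that rejects data while the
// downstream queue is full, so backpressure reaches the receiver instead
// of piling up in memory.
type backpressureConsumer struct {
	q    queue
	next func(ctx context.Context, payload any) error
}

func (c *backpressureConsumer) Consume(ctx context.Context, payload any) error {
	if c.q.Len() >= c.q.Cap() {
		return errQueueFull // the caller should retry or push back on the client
	}
	return c.next(ctx, payload)
}

// fixedQueue is a toy implementation used only to demonstrate the behaviour.
type fixedQueue struct{ n, c int }

func (q fixedQueue) Len() int { return q.n }
func (q fixedQueue) Cap() int { return q.c }

func main() {
	c := &backpressureConsumer{
		q:    fixedQueue{n: 100, c: 100}, // a full queue
		next: func(context.Context, any) error { return nil },
	}
	if err := c.Consume(context.Background(), "payload"); err != nil {
		println(err.Error()) // downstream queue is full, refusing data
	}
}
```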