open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts
https://opentelemetry.io
Apache License 2.0
396 stars 485 forks source link

[opentelemetry-collector] Remove `memory_ballast` extension from default template and replace by `GOMEMLIMIT` #891

Open mx-psi opened 1 year ago

mx-psi commented 1 year ago

We want to remove the memory_ballast extension in favor of Go 1.19+ GOMEMLIMIT environment variable as stated in open-telemetry/opentelemetry-collector/issues/8343.

The first step to verify that we can effectively remove the extension is to change this on the OpenTelemetry Collector Helm chart. We can replace the extension by a value of GOMEMLIMIT that is something like 80-90% of the total memory limit for the Collector container. Once we release this change and wait some time to gather feedback, we can move on with deprecation of the memory ballast extension.

mx-psi commented 1 year ago

cc @dmitryax @TylerHelmuth

TylerHelmuth commented 1 year ago

@mx-psi @dmitryax i think we should implement this via a feature-gate-like config.

The first PR can add the option to use the Go setting instead of the extension, but disabled by default. After some time, we can enable it by default, but still allow users who want to use the extension to disable it. Then we'd wait until the extension is deprecated to remove the option.

mx-psi commented 1 year ago

i think we should implement this via a feature-gate-like config.

Have we done anything similar before on the Collector Helm chart? The plan sounds reasonable to me

mx-psi commented 1 year ago

@TylerHelmuth should this be a usual configuration option or is there an alternative mechanism in the Helm chart for temporary toggles?

TylerHelmuth commented 1 year ago

We dont have anything like feature gates in the chart, we normally do it via a configuration option (or 2).

povilasv commented 1 year ago

Question: what are we planning to do we do when request limits are not set?

ATM we default memorybalast to (https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_config.tpl#L25C1-L27C1) :

      memory_ballast:
        size_in_percentage: 40

and requests / limits to empty:

resources: {}
# resources:
#   limits:
#     cpu: 250m
#     memory: 512Mi

I think the only thing we can do is not set the GOMEMLIMIT env?

This would be a different behaviour from current one.

mx-psi commented 1 year ago

I think the only thing we can do is not set the GOMEMLIMIT env?

I am personally fine with that, if you explicitly unset resource limits then you should be on your own to configure this

TylerHelmuth commented 1 year ago

I have started work on this

mx-psi commented 1 year ago

How much should we wait until we enable the GOMEMLIMIT option by default? Is something like one month enough?

TylerHelmuth commented 1 year ago

@mx-psi is this helm chart feature critical for the decision on whether or not to deprecate the extension or has the decision already been made to deprecate it?

mx-psi commented 1 year ago

@mx-psi is this helm chart feature critical for the decision on whether or not to deprecate the extension or has the decision already been made to deprecate it?

I would want to validate the approach with the Helm chart before committing to deprecate the extension. I don't feel like I personally have enough knowledge about the Go garbage collector to ensure this replacement will be satisfactory for all use cases (it probably will, that's why I pushed for this, but I want to be sure before deprecating the extension).

If anybody else feels like this is clear cut, then we don't have to wait for this, but personally I feel like this is safer.

TylerHelmuth commented 1 year ago

In that case I think we need to leave it around as long as it take to make a decision. We won't move forward with turning it always on until we're sure we are deprecating the extension.

I can enable it on our internal use of the collector, but we'll need other examples as well.

mx-psi commented 1 year ago

@JaredTan95 based on #912 it seems like you are testing this out. Would you mind sharing any findings about the setting (memory and CPU usage patterns before/after?)

I also posted a call out for more testing on a few channels on the CNCF Slack. Once we have at least two I think we can proceed with the deprecation and enable this by default.

mx-psi commented 1 year ago

We have been running Collectors using the Helm chart with useGOMEMLIMIT enabled for a few weeks on Datadog's infrastructure with a similar CPU usage as with the memory ballast and lower memory consumption (because of the memory ballast).

krisztianfekete commented 1 year ago

Can you share some stats/numbers on the improvements?

TylerHelmuth commented 1 year ago

We see similar stats at Honeycomb

JaredTan95 commented 1 year ago

@JaredTan95 based on #912 it seems like you are testing this out. Would you mind sharing any findings about the setting (memory and CPU usage patterns before/after?)

I also posted a call out for more testing on a few channels on the CNCF Slack. Once we have at least two I think we can proceed with the deprecation and enable this by default.

We switched to this(useGOMEMLIMIT) configuration because of an issue with memory ballast's compatibility with our arm architecture. We haven't verified this configuration in production, but in our test environment, this feature solves the arm compatibility problem and the effect is similar to that of the original memory ballast.

So, I support it.

TylerHelmuth commented 1 year ago

@mx-psi I'd still like to see the extension officially deprecated in Core before we make this a default value in the helm chart.

dmitryax commented 1 year ago

@JaredTan95 can you please share more details about memory ballast's compatibility issues with our arm architecture? It be nice to post them in https://github.com/open-telemetry/opentelemetry-collector/issues/8343 as well

mx-psi commented 1 year ago

Based on the discussion on https://github.com/open-telemetry/opentelemetry-collector/issues/8343 and this issue, I filed open-telemetry/opentelemetry-collector#8803 to deprecate the memory ballast extension.

dmitryax commented 11 months ago

I'd prefer the next step before deprecating the memory_ballast extension to be switching the useGOMEMLIMIT feature gate to true in the helm chart. That way we are not forced to do it and will have a chance to get some more feedback before the actual deprecation. I'd think we should keep useGOMEMLIMIT: true for a couple of versions, then deprecate memory_ballast extension. @mx-psi @JaredTan95 @povilasv @TylerHelmuth WDYT?

mx-psi commented 11 months ago

I am fine with that, although I think @TylerHelmuth wanted to go the other way around :smile: Up to the Helm chart maintainers I guess :)

TylerHelmuth commented 11 months ago

Ya I was imagining after the initial testing that the helm chart would follow the guidance from Core. My though being that I didn't want to disrupt users until the component was actually deprecated.

But we could enable it by default and use helm chart users as testers (unless they switch it back to false).

dmitryax commented 11 months ago

But we could enable it by default and use helm chart users as testers (unless they switch it back to false).

That's what I thought. I wanted to gather as much feedback as possible before the deprecation. I'm thinking of useGOMEMLIMIT as an improvement anyway

JaredTan95 commented 11 months ago

I'd think we should keep useGOMEMLIMIT: true for a couple of versions, then deprecate memory_ballast extension

I agree! 👏

JaredTan95 commented 11 months ago

@JaredTan95 can you please share more details about memory ballast's compatibility issues with our arm architecture? It be nice to post them in open-telemetry/opentelemetry-collector#8343 as well

The compatibility problem we encountered was only tested on the Chinese operating system(Kylin OS), so I think it has little reference value for users using international operating systems, so I did not put it up :-P

mx-psi commented 10 months ago

open-telemetry/opentelemetry-collector/pull/8803 got merged, we can revert before Jan 8th if we find issues