open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

Prevent instrumentation resource limits from being re-configured during Operator upgrade #2483

Closed: andresm53 closed this issue 9 months ago

andresm53 commented 10 months ago

Component(s)

operator, instrumentation

Describe the issue you're reporting

Hi All,

My Instrumentation resource looks like this:

  java:
    env:
      - name: OTEL_METRICS_EXPORTER
        value: none
    image: 'public.ecr.aws/aws-observability/adot-autoinstrumentation-java:v1.28.1'
    resources:
      limits:
        cpu: 500m
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi

Those resources are applied to the injected init container, and because that init container has a CPU limit, my pod metrics chart in OpenShift looks like this:

[screenshot: OpenShift pod metrics chart showing a CPU limit reported for the pod]

The problem is that my application container doesn't have a CPU limit, but as you can see in the screenshot above, OpenShift thinks it does (because the init container has one) and reports the pod limit metric incorrectly. (It's not really OpenShift's fault; it is just using the kube_pod_resource_limit metric.)
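For context, this is roughly what the injected pod spec ends up containing; the init container name and layout here are illustrative, not copied verbatim from the operator:

    # Sketch of the pod spec after auto-instrumentation injection (names are illustrative)
    initContainers:
      - name: opentelemetry-auto-instrumentation   # added by the operator's webhook
        image: public.ecr.aws/aws-observability/adot-autoinstrumentation-java:v1.28.1
        resources:                                  # copied from the Instrumentation resource above
          limits:
            cpu: 500m
            memory: 64Mi
          requests:
            cpu: 50m
            memory: 64Mi
    containers:
      - name: app                                   # the application container; no CPU limit here
        # ... application container spec omitted ...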

To fix this, I can simply remove the CPU limit from the Instrumentation resource:

  java:
    env:
      - name: OTEL_METRICS_EXPORTER
        value: none
    image: 'public.ecr.aws/aws-observability/adot-autoinstrumentation-java:v1.28.1'
    resources:
      limits:
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi

The problem is that during the OpenTelemetry Operator's automatic upgrades, the CPU limit is added back. Is there a way to prevent this? I tried setting .Spec.UpgradeStrategy to none in the Instrumentation resource, but it doesn't support that field.

Note that this issue doesn't affect my pod's performance; as I mentioned, the app container doesn't have a CPU limit. It is more of a cosmetic issue, affecting the OpenShift CPU metrics charts.

Thanks.

changexd commented 10 months ago

Hi @jaronoff97, I'm interested in this issue. If UpgradeStrategy is expected in the InstrumentationSpec, I can help out by implementing upgrade logic similar to what OpenTelemetryCollector already has.

pavolloffay commented 10 months ago

The problem is that during OpenTelemetry Operator automatic upgrades, the cpu limits are added back again

I didn't find a code occurrence where the CPU limit is defaulted when it is nil. The limits are only defaulted when they are missing entirely, for both CPU and memory: https://github.com/open-telemetry/opentelemetry-operator/blob/f2b5e3c6c9bbc626af32101b2a69c22c6e7c732e/apis/v1alpha1/instrumentation_webhook.go#L106
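To illustrate that behavior (a sketch; the default values shown are assumptions taken from the snippet at the top of this issue, which appears to match what the webhook fills in): a spec with no limits block at all gets both CPU and memory defaults, while a limits block that already exists, even without a cpu entry, is left alone.

    # Spec as submitted: no limits block at all
    resources:
      requests:
        cpu: 50m
        memory: 64Mi

    # After the webhook defaults it (assumed default values)
    resources:
      limits:
        cpu: 500m
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi

    # A limits block with only memory set is left untouched
    resources:
      limits:
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi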

pavolloffay commented 10 months ago

Maybe you can try setting the CPU limit to zero. The operator should not override it.

From the article linked below: "If it is set to 0, it means no resource limit for the container. In particular, if you set limits without specifying requests, Kubernetes considers the value of requests to be the same as that of limits by default."

https://kubesphere.io/blogs/understand-requests-and-limits-in-kubernetes/#:~:text=If%20it%20is%20set%20to,that%20of%20limits%20by%20default.
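A minimal sketch of that workaround, assuming the Instrumentation CRD accepts an explicit zero quantity and the webhook leaves a non-nil limits block alone (untested here):

    java:
      resources:
        limits:
          cpu: "0"       # explicit zero: no CPU limit enforced, but the field is set, so it should not be defaulted
          memory: 64Mi
        requests:
          cpu: 50m
          memory: 64Mi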

jaronoff97 commented 10 months ago

@changexd sure, if you want to take a look here, that would be great. I think Pavol's suggestion is accurate; maybe start by testing that?

changexd commented 10 months ago

@andresm53 Is it possible to provide more detailed steps to reproduce this issue? I've tried several ways to reproduce it, but they all lead to what @pavolloffay mentioned: the instrumentation resource is only defaulted when resource.limit == nil.

This means that after the upgrade, nothing in the Instrumentation is changed with this configuration:

    resources:
      limits:
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi

andresm53 commented 10 months ago

@changexd I just tried again, and I wasn't able to reproduce the issue either. Here is what I did (this is OpenShift 4.12):

  1. Install OpenTelemetry Operator v0.83

  2. Configure the Instrumentation resource like you did:

    resources:
      limits:
        memory: 64Mi
      requests:
        cpu: 50m
        memory: 64Mi

  3. Upgrade the Operator to 0.84, then 0.85, and so on up to 0.90.

  4. The CPU limit was not re-configured during or after the upgrade process.

So, going back to my original problem: something else must have re-configured the CPU limit, and it now seems it wasn't the Operator upgrade. I can't think of anything else in my environment, besides the Operator, that could have re-configured it...

andresm53 commented 10 months ago

Just in case, I tried again, and (again) I wasn't able to reproduce this. Sorry about that. I think I can close this issue as "not reproducible", if you agree.

changexd commented 9 months ago

Since this seems to be non-reproducible on both sides, we can probably close this issue. Feel free to open a new issue in the future if you encounter anything. Thank you :)

jaronoff97 commented 9 months ago

@changexd @andresm53 thank you both for working on this! :bow: