open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0
1.21k stars 439 forks source link

spec.podAnnotations entries can not be deleted in reconcile #2795

Open THUzxj opened 7 months ago

THUzxj commented 7 months ago

Component(s)

collector

What happened?

Description

Hi, I tested opentelemetry operator with testing tool acto, which automatically generates otelcol instance definition yaml files and performs end-to-end testing. Configurations of acto to test opentelemetry operator are here.

I found that spec.podAnnotations entries can not be deleted when the user applies new definitions to reconcile.

Steps to Reproduce

  1. create a kind cluster
  2. install cert-manager and opentelemetry-operator with kubectl apply -f
  3. create and change the definition of opentelemetry collector one by one as follows (mutation 1 and 2 are the most important):
# mutation 0
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
# mutation 1
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations:
    actokey: actokey
# mutation 2
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations: {}
# mutation 3
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations:
    actokey2: actokey2

Expected Result

In the state mutation 2 and mutation 3, the spec.podAnnotations entry actokey: actokey in user specification is been deleted, and the annotations of pod and deployment should be changed as what spec.podAnnotations defines.

Actual Result

When I describe the pod and deployment test-cluster-collector after applying mutation 2 and mutation 3, the actokey: actokey still exists. In mutation 3 I can see both actokey: actokey and actokey2: actokey2 in the annotations.

Kubernetes Version

1.28.0

Operator version

0.95.0

Collector version

0.95.0

Environment information

Environment

OS: Ubuntu 22.04

Log output

No response

Additional context

When logging the variables related to the annotations in the operator (like desiredObjects in this line(https://github.com/open-telemetry/opentelemetry-operator/blob/2ddfe3193a2dad57836ebbca5fb92481c35dbeff/controllers/common.go#L81)), we can find that the data are changed as expected. Thus maybe the problem is related to the k8s controller-runtime API in this line(https://github.com/open-telemetry/opentelemetry-operator/blob/2ddfe3193a2dad57836ebbca5fb92481c35dbeff/controllers/common.go#L101).

swiatekm commented 7 months ago

I didn't manage to attend the Acto session at KubeCon EU, but it sounded really interesting. Great to see it can actually find bugs in our operator!

With that said, this is something of a misbehaving feature, in that we intentionally keep the annotation. See: https://github.com/open-telemetry/opentelemetry-operator/blob/dab898f6bb45d654fb138eb6c4860e15ee5eb59b/internal/manifests/mutate.go#L63. The intent of that logic is to avoid deleting annotations and labels added manually by users, but we can't easily distinguish those from metadata added by the operator itself in a previous reconcilation, and this is the result. I'm not sure we can do anything about this, other than actually deleting the user metadata. WDYT @jaronoff97 ?

jaronoff97 commented 7 months ago

hm... i have to think more on this, but i think this may be a legit bug as this is from the podAnnotations field, not the annotations on the CRD itself. I was under the impression that this should be overridden by this logic (in deployment for example) which does a merge with override on the template. When we generate the initial manifest we're not pulling in from existing there so that shouldn't effect it. I don't think the annotation wonkiness above would affect this because this merging would be the result of manifest generation gone awry or bad merging on the spec.

I'll need to investigate this further, we should be able to reproduce in an integration test with envtest locally.