open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts
https://opentelemetry.io
Apache License 2.0

[operator] Update to 0.99 breaks certificates of both webhooks #1187

Open jan-kantert opened 6 months ago

jan-kantert commented 6 months ago

The update breaks the certificates for both the ValidatingWebhookConfiguration and the MutatingWebhookConfiguration: the secretName in the Certificate does not match the cert-manager.io/inject-ca-from annotation on the ValidatingWebhookConfiguration and MutatingWebhookConfiguration:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  labels:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: opentelemetry-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-operator
    app.kubernetes.io/version: 0.99.0
    helm.sh/chart: opentelemetry-operator-0.58.0
    helm.toolkit.fluxcd.io/name: opentelemetry-operator
    helm.toolkit.fluxcd.io/namespace: opentelemetry
  name: opentelemetry-operator-serving-cert
  namespace: opentelemetry
spec:
  dnsNames:
  - opentelemetry-operator-webhook.opentelemetry.svc
  - opentelemetry-operator-webhook.opentelemetry.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: opentelemetry-operator-selfsigned-issuer
  secretName: opentelemetry-operator-controller-manager-service-cert      # NOTE THIS SECRET NAME
  subject:
    organizationalUnits:
    - opentelemetry-operator
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert     # COMPARE TO SECRET HERE
    controller-gen.kubebuilder.io/version: v0.14.0
  creationTimestamp: null
  labels:
    app.kubernetes.io/name: opentelemetry-operator
    helm.toolkit.fluxcd.io/name: opentelemetry-operator
    helm.toolkit.fluxcd.io/namespace: opentelemetry
  name: opampbridges.opentelemetry.io
spec:
  # ... (rest of the CRD spec omitted)
---
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert    # COMPARE TO SECRET HERE
  labels:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: opentelemetry-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-operator
    app.kubernetes.io/version: 0.99.0
    helm.sh/chart: opentelemetry-operator-0.58.0
    helm.toolkit.fluxcd.io/name: opentelemetry-operator
    helm.toolkit.fluxcd.io/namespace: opentelemetry
  name: opentelemetry-operator-mutation
webhooks:
  # ... (webhook list omitted)
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert   # COMPARE TO SECRET HERE
  labels:
    app.kubernetes.io/component: webhook
    app.kubernetes.io/instance: opentelemetry-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-operator
    app.kubernetes.io/version: 0.99.0
    helm.sh/chart: opentelemetry-operator-0.58.0
    helm.toolkit.fluxcd.io/name: opentelemetry-operator
    helm.toolkit.fluxcd.io/namespace: opentelemetry
  name: opentelemetry-operator-validation
webhooks:
  # ... (webhook list omitted)

This is the reason why other people report that their migrations fail. I am not sure whether it is correct that those webhooks only handle v1alpha1 and not v1beta1. Are they just for the migration, or do we need to worry that they do not work at all?

jan-kantert commented 6 months ago

We found this by monitoring the Prometheus metric certmanager_cainjector_missing_cert from cert-manager, which instantly highlights the issue. Otherwise it's quite hard to see unless you check the kube-apiserver logs.
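
For anyone who wants to catch this automatically, here is a minimal PrometheusRule sketch for that metric, assuming the Prometheus Operator is installed and scrapes cert-manager; the rule name, namespace, and threshold are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-cainjector-alerts   # illustrative name
  namespace: cert-manager                # assumes the default cert-manager namespace
spec:
  groups:
  - name: cert-manager-cainjector
    rules:
    - alert: CertManagerCAInjectorMissingCert
      # fires when the cainjector reports a missing certificate for a resource it should inject into
      expr: certmanager_cainjector_missing_cert > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: cert-manager cainjector cannot find the certificate it is supposed to inject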

mtruscello commented 6 months ago

The cert-manager.io/inject-ca-from annotation is meant to point to the name of a Certificate (compared to cert-manager.io/inject-ca-from-secret which is meant to point to a Secret).

The webhooks use cert-manager.io/inject-ca-from so the value must be a Certificate name, not a Secret name. In this case, the value they're using (opentelemetry/opentelemetry-operator-serving-cert) appears to be correct.
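
For illustration, the two annotation forms look like this; the first is what the chart actually renders, the second is commented out and shown only to highlight the difference:

metadata:
  annotations:
    # expects <namespace>/<Certificate name>
    cert-manager.io/inject-ca-from: opentelemetry/opentelemetry-operator-serving-cert
    # expects <namespace>/<Secret name> (alternative annotation, for comparison only)
    # cert-manager.io/inject-ca-from-secret: opentelemetry/opentelemetry-operator-controller-manager-service-cert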

TylerHelmuth commented 6 months ago

/cc @swiatekm-sumo

TylerHelmuth commented 6 months ago

/cc @Allex1

jabdoa2 commented 6 months ago

The cert-manager.io/inject-ca-from annotation is meant to point to the name of a Certificate (compared to cert-manager.io/inject-ca-from-secret which is meant to point to a Secret).

The webhooks use cert-manager.io/inject-ca-from so the value must be a Certificate name, not a Secret name. In this case, the value they're using (opentelemetry/opentelemetry-operator-serving-cert) appears to be correct.

Well, Kubernetes disagrees. It does not work that way, and cert-manager complains about this setup. The upgrade webhooks do not work.

We manually upgraded the CRDs and it works now, but the chart upgrade itself completely failed on both kOps (AWS) and Azure AKS clusters.
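
For anyone else stuck here, the manual route is roughly the following sketch. It assumes the open-telemetry Helm repo is already added and that this chart version ships its CRD manifests in a crds/ directory, so verify before running:

helm pull open-telemetry/opentelemetry-operator --version 0.58.2 --untar
ls opentelemetry-operator/crds/        # confirm the CRDs are actually shipped here for this chart version
kubectl replace -f opentelemetry-operator/crds/   # replace avoids the annotation size limit that client-side apply can hit on large CRDs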

swiatekm commented 6 months ago

I'd like to understand what the problem is exactly. Is it just the conversion webhook misbehaving, or all of them? The latter would be very confusing, as the changes in #1175 and #1176 haven't actually changed anything about the cert-manager setup for the ValidatingWebhookConfiguration and MutatingWebhookConfiguration. The CRDs have changed, and I can see the conversion webhook causing problems, but everything else should be fine.

Can everyone experiencing problems post more details of their environment? The following would help:

  1. Operator Helm chart values.yaml
  2. Previous operator Helm chart version
  3. cert-manager version
  4. K8s platform and version

parkedwards commented 5 months ago

@swiatekm-sumo I'm not the OP, but I am experiencing a cert-related issue with the conversion webhook specifically

After upgrading to operator 0.99.0 / operator chart 0.58.2 and following the upgrade instructions here, we're noticing this specific dry-run failure:

failed to prune fields: failed add back owned items: failed to convert pruned object at version opentelemetry.io/v1beta1: conversion webhook for opentelemetry.io/v1alpha1, Kind=OpenTelemetryCollector returned invalid metadata: invalid metadata of type <nil> in input object
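
For completeness, this surfaces during the GitOps server-side dry-run; a roughly equivalent manual check would be something like the following, where the manifest path is just our local file and purely illustrative:

kubectl apply --server-side --dry-run=server -f otelcol-loadbalancer.yaml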

When I look at the operator pod's logs, I see many TLS handshake errors:

...
2024/05/29 16:52:13 http: TLS handshake error from 127.0.0.6:47149: EOF
2024/05/29 16:52:13 http: TLS handshake error from 127.0.0.6:60753: EOF
2024/05/29 16:59:28 http: TLS handshake error from 127.0.0.6:43895: EOF
2024/05/29 17:06:25 http: TLS handshake error from 127.0.0.6:43385: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:40295: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:56065: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:59265: EOF
2024/05/29 17:11:11 http: TLS handshake error from 127.0.0.6:45403: EOF
...

Your requested info:

Operator Helm Chart values.yaml

➜ helm get values otel-operator
USER-SUPPLIED VALUES:
kubeRBACProxy:
  enabled: false
manager:
  collectorImage:
    repository: otel/opentelemetry-collector-contrib
nodeSelector:
  node_pool: observability
replicaCount: 2
topologySpreadConstraints:
- labelSelector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: otel-operator
      app.kubernetes.io/name: opentelemetry-operator
  maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
- labelSelector:
    matchLabels:
      app.kubernetes.io/component: controller-manager
      app.kubernetes.io/instance: otel-operator
      app.kubernetes.io/name: opentelemetry-operator
  maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway

Previous Operator Helm Chart version

0.57.0 -> 0.58.2

cert-manager version

quay.io/jetstack/cert-manager-controller:v1.14.5

K8s Platform and version

GKE / 1.28.9

swiatekm commented 5 months ago

@parkedwards just to be sure I understand correctly, you upgrade the operator chart to 0.58.2, and then you get the error when trying to apply v1alpha1.OpenTelemetryCollector resources? Can you also post your v1alpha1.OpenTelemetryCollector resource that you get the error for?

jabdoa2 commented 5 months ago

@parkedwards just to be sure I understand correctly, you upgrade the operator chart to 0.58.2, and then you get the error when trying to apply v1alpha1.OpenTelemetryCollector resources? Can you also post your v1alpha1.OpenTelemetryCollector resource that you get the error for?

Exactly. The webhook TLS configuration is not right. The kube-apiserver complains that it cannot load the cert. cert-manager complains as well and reports it as unready.

parkedwards commented 5 months ago

@parkedwards just to be sure I understand correctly, you upgrade the operator chart to 0.58.2, and then you get the error when trying to apply v1alpha1.OpenTelemetryCollector resources? Can you also post your v1alpha1.OpenTelemetryCollector resource that you get the error for?

@swiatekm-sumo - small correction - we performed the chart upgrade to 0.58.2 AND the v1beta1.OpenTelemetryCollector version update in the manifests (we use GitOps to manage them), as described in these instructions:

  1. Update any manifests you have stored outside the cluster, for example in your infrastructure git repository.
  2. Apply them, so they're all stored as v1beta1.
  3. Update the OpenTelemetryCollector CRD to only store v1beta1 (see the sketch below)
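
For reference, step 3 also involves the CRD's storedVersions: once every object has been re-applied as v1beta1, the old version can be dropped from status.storedVersions. A rough sketch, assuming kubectl 1.24+ for the --subresource flag:

# see which versions the API server has recorded as stored
kubectl get crd opentelemetrycollectors.opentelemetry.io -o jsonpath='{.status.storedVersions}'

# after every OpenTelemetryCollector has been rewritten as v1beta1, drop v1alpha1 from storedVersions
kubectl patch crd opentelemetrycollectors.opentelemetry.io --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1beta1"]}}'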

I can share one of my v1beta1 CR manifests here:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  creationTimestamp: "2023-09-26T14:44:25Z"
  generation: 76
  labels:
    app: otelcol-loadbalancer
    app.kubernetes.io/managed-by: opentelemetry-operator
    kustomize.toolkit.fluxcd.io/name: opentelemetry-collector
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: otelcol-loadbalancer
  namespace: opentelemetry
  resourceVersion: "543736253"
  uid: f4886fef-31a5-49cc-9f5a-3418913cdc5a
spec:
  config:
    exporters:
      loadbalancing:
        protocol:
          otlp:
            sending_queue:
              num_consumers: 100
              queue_size: 300
            timeout: 30s
            tls:
              insecure: true
        resolver:
          dns:
            hostname: otelcol-gateway-collector.opentelemetry.svc.cluster.local
        routing_key: traceID
      prometheus:
        add_metric_suffixes: false
        endpoint: 0.0.0.0:9090
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      prometheus/internal:
        config:
          scrape_configs:
          - job_name: otelcol-loadbalancer
            scrape_interval: 15s
            static_configs:
            - targets:
              - 0.0.0.0:8888
    service:
      pipelines:
        metrics/prometheus:
          exporters:
          - prometheus
          processors: []
          receivers:
          - prometheus/internal
        traces:
          exporters:
          - loadbalancing
          processors: []
          receivers:
          - otlp
      telemetry:
        logs:
          level: panic
          output_paths:
          - stdout
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  managementState: managed
  mode: deployment
  nodeSelector:
    node_pool: application
  observability:
    metrics: {}
  podAnnotations:
    sidecar.istio.io/proxyCPU: 850m
    sidecar.istio.io/proxyCPULimit: 1000m
    sidecar.istio.io/proxyMemory: 400Mi
    sidecar.istio.io/proxyMemoryLimit: 700Mi
  podDisruptionBudget:
    maxUnavailable: 1
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  replicas: 10
  resources:
    limits:
      cpu: 2500m
      memory: 1Gi
    requests:
      cpu: 1500m
      memory: 600Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: otelcol-loadbalancer
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      podMonitorSelector: {}
      scrapeInterval: 30s
      serviceMonitorSelector: {}
    resources: {}
  upgradeStrategy: automatic
status:
  image: otel/opentelemetry-collector-contrib:0.99.0
  scale:
    replicas: 10
    selector: app=otelcol-loadbalancer,app.kubernetes.io/component=opentelemetry-collector,app.kubernetes.io/instance=opentelemetry.otelcol-loadbalancer,app.kubernetes.io/managed-by=opentelemetry-operator,app.kubernetes.io/name=otelcol-loadbalancer-collector,app.kubernetes.io/part-of=opentelemetry,app.kubernetes.io/version=latest,kustomize.toolkit.fluxcd.io/name=opentelemetry-collector,kustomize.toolkit.fluxcd.io/namespace=flux-system
    statusReplicas: 10/10
  version: 0.99.0

And for context, this is the v1alpha1 version (you can ignore the templated variables and be assured that they're interpolated properly at apply time):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otelcol-loadbalancer
  namespace: opentelemetry
  labels:
    app: otelcol-loadbalancer
spec:
  mode: deployment
  nodeSelector:
    node_pool: application
  podAnnotations:
    sidecar.istio.io/proxyCPU: 850m
    sidecar.istio.io/proxyCPULimit: 1000m
    sidecar.istio.io/proxyMemory: 400Mi
    sidecar.istio.io/proxyMemoryLimit: 700Mi
  replicas: ${otelcol_loadbalancer_replica_count:=1}
  serviceAccount: otelcol-loadbalancer
  config: |

    receivers:
      otlp:
        protocols:
          grpc:
          http:
      prometheus/internal:
        config:
          scrape_configs:
            - job_name: otelcol-loadbalancer
              scrape_interval: 15s
              static_configs:
                - targets: [0.0.0.0:8888]
    exporters:
      prometheus:
        endpoint: 0.0.0.0:9090
        add_metric_suffixes: false
      loadbalancing:
        # ensures that spans are consistently routed to
        # the same gateway replica based on traceID
        routing_key: traceID
        protocol:
          # all configuration options from the OTLP exporter
          # apply here (except the endpoint)
          otlp:
            timeout: 30s
            sending_queue:
              queue_size: ${otelcol_loadbalancer_exporter_queue_size:=500}
              num_consumers: ${otelcol_loadbalancer_exporter_num_consumers:=10}
            tls:
              insecure: true
        # uses the DNS resolver to find the gateway collector pods
        resolver:
          dns:
            hostname: otelcol-gateway-collector.opentelemetry.svc.cluster.local
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [loadbalancing]
        metrics/prometheus:
          receivers: [prometheus/internal]
          processors: []
          exporters: [prometheus]
      telemetry:
        logs:
          # https://pkg.go.dev/go.uber.org/zap/zapcore#Level
          level: ${otelcol_log_level:=panic}
          output_paths: [stdout]
  resources:
    requests:
      cpu: ${otelcol_loadbalancer_cpu_req}
      memory: ${otelcol_loadbalancer_mem_req}
    limits:
      cpu: ${otelcol_loadbalancer_cpu_limit}
      memory: ${otelcol_loadbalancer_mem_limit}
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault

swiatekm commented 5 months ago

It would really help me if someone were able to reproduce this problem in a KiND cluster and post their reproduction steps. I've tried several combinations, including upgrading 0.56.0 -> 0.57.0 -> 0.58.0, with cert-manager 1.6.1 and 1.14.1, and always ended up in a valid state with the conversion webhook working correctly.

@parkedwards just to be sure I understand correctly, you upgrade the operator chart to 0.58.2, and then you get the error when trying to apply v1alpha1.OpenTelemetryCollector resources? Can you also post your v1alpha1.OpenTelemetryCollector resource that you get the error for?

Exactly. The webhook TLS configuration is not right. The kube-apiserver complains that it cannot load the cert. cert-manager complains as well and reports it as unready.

Thing is, this configuration hasn't changed and, as far as I can see, has always been this way. It would have to be a bug in CA injection specifically for CRDs (our current use of the cert-manager annotation is correct per their documentation), likely only affecting specific cert-manager versions, and possibly only a specific platform.

This is why I'm asking for as much information as possible about your environments. I'm hoping to see some commonalities that would let me get to the bottom of things. I'd also really like to see the error logs you and @jan-kantert mentioned.
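
If it helps, these are roughly the commands I'd use to gather that; namespaces and resource names are assumed from a default install and this chart's release name, so adjust to your environment:

# cainjector logs usually say why CA injection was skipped or failed
kubectl logs -n cert-manager deploy/cert-manager-cainjector

# check whether a CA bundle was actually injected into the webhook configurations and the CRD conversion webhook
kubectl get validatingwebhookconfiguration opentelemetry-operator-validation -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 40
kubectl get mutatingwebhookconfiguration opentelemetry-operator-mutation -o jsonpath='{.webhooks[*].clientConfig.caBundle}' | head -c 40
kubectl get crd opentelemetrycollectors.opentelemetry.io -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' | head -c 40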

And on a related note, I see @parkedwards using istio. Anyone else in this thread also in that boat?

jabdoa2 commented 5 months ago

It would really help me if someone were able to reproduce this problem in a KiND cluster and post their reproduction steps. I've tried several combinations, including upgrading 0.56.0 -> 0.57.0 -> 0.58.0, with cert-manager 1.6.1 and 1.14.1, and always ended up in a valid state with the conversion webhook working correctly.

@parkedwards just to be sure I understand correctly, you upgrade the operator chart to 0.58.2, and then you get the error when trying to apply v1alpha1.OpenTelemetryCollector resources? Can you also post your v1alpha1.OpenTelemetryCollector resource that you get the error for?

Exactly. The webhook TLS configuration is not right. The kube-apiserver complains that it cannot load the cert. cert-manager complains as well and reports it as unready.

Thing is, this configuration hasn't changed and, as far as I can see, has always been this way. It would have to be a bug in CA injection specifically for CRDs (our current use of the cert-manager annotation is correct per their documentation), likely only affecting specific cert-manager versions, and possibly only a specific platform.

This is why I'm asking for as much information as possible about your environments. I'm hoping to see some commonalities that would let me get to the bottom of things. I'd also really like to see the error logs you and @jan-kantert mentioned.

And on a related note, I see @parkedwards using istio. Anyone else in this thread also in that boat?

We use Linkerd. I have to check the cert-manager version; we run Kubernetes 1.26 and the latest compatible cert-manager version.
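
For the record, a quick way to check the cert-manager version, assuming it runs in its default namespace:

kubectl get deploy -n cert-manager -o jsonpath='{.items[*].spec.template.spec.containers[*].image}'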