open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Cannot deploy OpenTelemetryCollector to operator #3452

Open: frmrm opened this issue 4 days ago

frmrm commented 4 days ago

Component(s)

No response

What happened?

Description

We are trying to upgrade our OpenTelemetry Operator deployment from 0.92 to 0.113. An OpenTelemetryCollector that works fine with the older operator fails to deploy with the newer one. The error message is quite opaque, and so far reading through the source hasn't given me much insight into what's going on.

I updated the collector definition slightly to match the v1beta1 API syntax, but otherwise left it unchanged from the version that deploys fine on 0.92.

Steps to Reproduce

  1. Deploy the OpenTelemetry Operator at version 0.113.
  2. Attempt to apply the collector below:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: cluster
  namespace: otel
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      zipkin:
      jaeger:
        protocols:
          grpc:
          thrift_binary:
          thrift_compact:
          thrift_http:
    processors:
      filter/spans:
        spans:
          exclude:
            match_type: regexp
            services:
              - "[redacted]"
    exporters:
      logging:
      otlp/tempo:
        endpoint: "[redacted]"
        tls:
          insecure_skip_verify: true
    service:
      telemetry:
        logs:
          level: "debug"
      pipelines:
        traces/internal:
          receivers: [otlp, zipkin, jaeger]
          processors: []
          exporters: [logging]
        traces/tempo:
          receivers: [otlp, zipkin, jaeger]
          processors: []
          exporters: [otlp/tempo]

Expected Result

We expected this to work, since the same collector deploys fine with 0.92.

Actual Result

We see the following error when attempting to apply the collector:

admission webhook "mopentelemetrycollectorbeta.kb.io" denied the request: src and dst must not be nil
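The message `src and dst must not be nil` is the text of `ErrNilArguments` in the Go merge library mergo (`dario.cat/mergo`, formerly `github.com/imdario/mergo`), returned when `Merge` receives a nil argument. Assuming, without confirmation from this thread, that the webhook merges default settings (such as listen endpoints) into each receiver protocol's config, a protocol written as a bare key like `grpc:` would reach that merge as nil. The stdlib-only sketch below illustrates that failure mode; `mergeDefaults`, `protocolConfig`, and `errNilArguments` are hypothetical stand-ins, not operator or mergo code, and the endpoint values are illustrative.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errNilArguments reuses the message from the webhook error; it is a local
// stand-in, not an import of the real library.
var errNilArguments = errors.New("src and dst must not be nil")

// mergeDefaults is a hypothetical helper that copies default keys into a
// protocol config map without overwriting user-set keys. Like the error
// message suggests, it refuses nil inputs.
func mergeDefaults(dst, src map[string]interface{}) error {
	if dst == nil || src == nil {
		return errNilArguments
	}
	for k, v := range src {
		if _, ok := dst[k]; !ok {
			dst[k] = v
		}
	}
	return nil
}

// protocolConfig returns the config map for a protocol, or nil when the key
// has no value (YAML `grpc:`, JSON `"grpc": null`).
func protocolConfig(protocols map[string]interface{}, name string) map[string]interface{} {
	m, _ := protocols[name].(map[string]interface{})
	return m
}

func main() {
	// `grpc` mimics a bare YAML key (null); `http` mimics `http: {}`.
	raw := `{"grpc": null, "http": {}}`
	var protocols map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &protocols); err != nil {
		panic(err)
	}
	// Illustrative per-protocol defaults (standard OTLP ports).
	defaults := map[string]map[string]interface{}{
		"grpc": {"endpoint": "0.0.0.0:4317"},
		"http": {"endpoint": "0.0.0.0:4318"},
	}
	for _, name := range []string{"grpc", "http"} {
		err := mergeDefaults(protocolConfig(protocols, name), defaults[name])
		fmt.Printf("%s: %v\n", name, err)
	}
}
```

Run as-is, it reports the nil error for `grpc` (null) and merges cleanly for `http` (an empty map).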

Kubernetes Version

1.30.5

Operator version

0.113

Collector version

0.113

Environment information

Environment

Deployed from Helm charts.

Log output

2024/11/12 19:43:39 http: TLS handshake error from X.X.8.21:59048: EOF
2024/11/12 19:40:00 http: TLS handshake error from X.X.8.21:47244: EOF
2024/11/12 19:30:00 http: TLS handshake error from X.X.8.21:58650: EOF
2024/11/12 19:30:00 http: TLS handshake error from X.X.8.21:58648: EOF
{"level":"ERROR","timestamp":"2024-11-12T19:19:55Z","message":"Reconciler error","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","OpenTelemetryCollector":{"name":"cluster","namespace":"otel"},"namespace":"otel","name":"cluster","reconcileID":"9f59c733-a65c-4934-a42b-be87f4a26896","error":"admission webhook \"mopentelemetrycollectorbeta.kb.io\" denied the request: src and dst must not be nil","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:224"}
2024/11/12 19:10:00 http: TLS handshake error from X.X.12.16:40416: EOF
{"level":"ERROR","timestamp":"2024-11-12T19:03:15Z","message":"Reconciler error","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","OpenTelemetryCollector":{"name":"cluster","namespace":"otel"},"namespace":"otel","name":"cluster","reconcileID":"8be80529-01a7-4228-ac07-ac05514b63c6","error":"admission webhook \"mopentelemetrycollectorbeta.kb.io\" denied the request: src and dst must not be nil","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.1/pkg/internal/controller/controller.go:224"}
2024/11/12 19:00:00 http: TLS handshake error from X.X.5.10:36190: EOF
2024/11/12 19:00:00 http: TLS handshake error from X.X.8.18:33402: EOF
2024/11/12 18:50:00 http: TLS handshake error from X.X.11.8:36554: EOF
2024/11/12 18:50:00 http: TLS handshake error from X.X.13.21:32836: EOF

Additional context

No response

swiatekm commented 3 days ago

@iblancasa could this be related to #3281?

iblancasa commented 3 days ago

I don't think so. @frmrm is this happening with all your collectors or just with a set of them?

frmrm commented 2 days ago

> is this happening with all your collectors or just with a set of them?

This is the only collector we deploy directly as an OpenTelemetryCollector resource. The rest are injected as sidecars and ship their data to this collector, which fans out to several services and does some filtering before telemetry leaves the cluster. Because we couldn't deploy it, we had to roll back, which got us working again, so this was almost certainly introduced in one of the recent versions (or is something specific to upgrading between them; I'm not sure which).