open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

opentelemetry-operator manager crashes during instrumentation injection attempt #3303

Open sergeykad opened 1 day ago

sergeykad commented 1 day ago

Component(s)

auto-instrumentation

What happened?

Description

The opentelemetry-operator manager crashes when it attempts to inject Java auto-instrumentation into a pod.

Steps to Reproduce

  1. Install opentelemetry-operator on a Kubernetes cluster
  2. Restart a pod whose metadata includes the following annotation:
     annotations:
       instrumentation.opentelemetry.io/inject-java: "true"

Expected Result

A sidecar is added to the pod and the service is instrumented with OpenTelemetry.

Actual Result

opentelemetry-operator crashes with the log seen below.

Kubernetes Version

1.25

Operator version

v0.109.0

Collector version

v0.69.0

Environment information

Environment

OS: Rocky Linux 9.3

Log output

{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Ingress"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodDisruptionBudget"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceMonitor"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodMonitor"}
{"level":"INFO","timestamp":"2024-09-24T13:57:31Z","message":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"collector-upgrade","message":"no instances to upgrade"}
{"level":"DEBUG","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"INFO","timestamp":"2024-09-24T13:57:32Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"INFO","timestamp":"2024-09-24T13:57:36Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}
{"level":"INFO","timestamp":"2024-09-24T13:57:36Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"DEBUG","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"INFO","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"certificate event","event":"CHMOD     \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"INFO","timestamp":"2024-09-24T13:58:52Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"annotation not present in deployment, skipping sidecar injection","namespace":"optimus","name":""}
{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"injecting Java instrumentation into pod","otelinst-namespace":"optimus","otelinst-name":"instrumentation"}

Additional context

There are no additional log messages. The manager just disappears.
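Since the manager restarts, the output of the crashed container can usually still be fetched with `kubectl logs --previous` on the manager pod. The sketch below is only an illustration of how such a dump could be checked for anything after the last structured record, for example a plain-text Go panic trace. The function name `summarize_tail` is hypothetical, and the sample input reuses the last two log lines from this report.

```python
import json


def summarize_tail(log_text):
    """Return (last JSON log record, raw lines that follow it).

    Non-JSON lines after the final structured record (for example a
    plain-text stack trace) are returned untouched in the second element.
    """
    last_record = None
    trailing = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            trailing.append(line)
            continue
        if isinstance(record, dict):
            # A new structured record resets the "trailing" buffer.
            last_record = record
            trailing = []
        else:
            trailing.append(line)
    return last_record, trailing


# The last two log lines from this report.
sample = "\n".join([
    '{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"annotation not present in deployment, skipping sidecar injection","namespace":"optimus","name":""}',
    '{"level":"DEBUG","timestamp":"2024-09-24T13:59:26Z","message":"injecting Java instrumentation into pod","otelinst-namespace":"optimus","otelinst-name":"instrumentation"}',
])

record, tail = summarize_tail(sample)
print(record["timestamp"], record["message"])
print("unstructured lines after it:", tail)
```

For the log in this report, the trailing list is empty, which matches the observation that nothing is written after the injection message.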

jaronoff97 commented 6 hours ago

Does the manager pod report any reason for its crash? OOMKilled, maybe? I haven't been able to reproduce this.
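One way to look for a recorded reason is to read `status.containerStatuses[].lastState.terminated` from the pod object, for example from the output of `kubectl get pod <manager-pod> -o json`. The following is a minimal sketch, assuming the pod JSON has already been loaded into a dict. The helper name `last_terminations` and the sample status are hypothetical and are not taken from this issue.

```python
def last_terminations(pod):
    """List (container name, reason, exit code) for every container
    that has a recorded previous termination in the pod status."""
    results = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            results.append(
                (cs.get("name"), terminated.get("reason"), terminated.get("exitCode"))
            )
    return results


# Hypothetical pod status for illustration only (not from this issue).
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "manager",
                "restartCount": 1,
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    }
}

for name, reason, exit_code in last_terminations(sample_pod):
    print(f"{name}: reason={reason} exitCode={exit_code}")
```

An empty result would mean the kubelet has no previous termination recorded for any container in that pod.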

sergeykad commented 5 hours ago

There was no reason at all. It just died and a new pod started. It looks like something crashed during instrumentation injection since it's the last message and it never added the sidecar.

I performed a similar deployment on Minikube and it works fine, but it crashes on our production Kubernetes cluster. If there is an option to enable more detailed logs or run some other test, I can try it.

jaronoff97 commented 5 hours ago

You can follow the guide at https://github.com/open-telemetry/opentelemetry-operator/blob/main/DEBUG.md to enable debug logs. Is it possible that the operator doesn't have permission to perform the mutation?