open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0
1.23k stars 442 forks

Report deployment, sts, ds status up to OpenTelemetry Collector for health checks. #3372

Open adrielp opened 1 month ago

adrielp commented 1 month ago

Component(s)

collector

Describe the issue you're reporting

TL;DR - Add status conditions to the OpenTelemetryCollector resource so that, on deployment, native Kubernetes features can check the health of the custom resource and confirm that it was deployed successfully. This check would happen before any collector telemetry is exported.

At the time of writing, a collector can be deployed and its pod can get stuck in CrashLoopBackOff. The "Deployment" resource will show as unhealthy, but the OpenTelemetryCollector resource will still show as healthy and deployed.

In the screenshot below, a collector has failed and produces no telemetry (I purposefully broke the config so that it would get into this state). A GitOps tool that handles k8s syncing of resources (in this case Argo) will view this as a "successful deployment" because it synced correctly and the app health check against the custom resource passed. Therefore, it's shown as synced and healthy. You can see, though, that the underlying Deployment does correctly reflect its health, but that health is not bubbled up to the custom resource. If it were bubbled up, the resource would have a set of status conditions like:

status:
  conditions:
    - lastTransitionTime: '2024-10-16T16:26:42Z'
      lastUpdateTime: '2024-10-16T16:26:42Z'
      message: >-
        ReplicaSet "otel-delivery-collector-5467cfd54f" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
    - lastTransitionTime: '2024-10-16T16:26:43Z'
      lastUpdateTime: '2024-10-16T16:26:43Z'
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: 'False'
      type: Available

From there, you can perform health checks on sync/deployment and alert early on failure, when there won't be any telemetry from the latest deployment. See how other resource health checks work in this Argo doc.
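To make the idea concrete, here is a minimal sketch of the kind of check a sync tool could run once those conditions exist on the custom resource. It is only an illustration of the logic: the function names (`find_condition`, `evaluate_health`) and the mapping policy are hypothetical, the status values `Healthy`, `Progressing`, and `Degraded` are borrowed from Argo's health vocabulary, and an actual Argo custom health check would be written in Lua rather than Python.

```python
from typing import Dict, List, Optional

Condition = Dict[str, str]


def find_condition(conditions: List[Condition], cond_type: str) -> Optional[Condition]:
    """Return the first condition of the given type, or None if absent."""
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def evaluate_health(status: dict) -> Dict[str, str]:
    """Map status.conditions to a coarse health result (hypothetical policy)."""
    conditions = status.get("conditions") or []
    available = find_condition(conditions, "Available")
    if available is None:
        # Nothing reported yet: assume the rollout is still in progress.
        return {"status": "Progressing", "message": "Waiting for an Available condition"}
    if available.get("status") == "True":
        return {"status": "Healthy", "message": available.get("message", "")}
    # Available is present but not True: surface the workload's own message.
    return {"status": "Degraded", "message": available.get("message", "")}


# Conditions taken from the example status above (timestamps omitted).
example_status = {
    "conditions": [
        {
            "type": "Progressing",
            "status": "True",
            "reason": "NewReplicaSetAvailable",
            "message": 'ReplicaSet "otel-delivery-collector-5467cfd54f" has successfully progressed.',
        },
        {
            "type": "Available",
            "status": "False",
            "reason": "MinimumReplicasUnavailable",
            "message": "Deployment does not have minimum availability.",
        },
    ]
}

result = evaluate_health(example_status)
print(result["status"], "-", result["message"])
# prints: Degraded - Deployment does not have minimum availability.
```

With the conditions from the broken deployment, the check reports `Degraded` instead of the current "healthy and deployed", which is the early signal being asked for here.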

image

iblancasa commented 1 month ago

I'll take it.

adrielp commented 1 month ago

Appreciate the quick response, thanks @iblancasa!