open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Reconciliation of DaemonSets errors frequently in highly volatile environments #3205

Open diranged opened 1 month ago

diranged commented 1 month ago

Component(s)

collector

What happened?

Description

In our largest environment, we scale thousands (sometimes tens of thousands) of nodes up and down each day. We use the OTel Operator to manage several OTel components, including a DaemonSet named otel-collector-agent-collector. Periodically we get reconciliation errors that increment the controller_runtime_reconcile_total{result="error"} metric, but these errors are not correlated with anything persistent. They appear to be transient.

Looking into the logs, this is what we typically see:

{
  "level": "ERROR",
  "timestamp": "2024-08-07T15:08:34Z",
  "logger": "controllers.OpenTelemetryCollector",
  "message": "failed to configure desired",
  "opentelemetrycollector": {
    "name": "otel-collector-agent",
    "namespace": "otel"
  },
  "object_name": "otel-collector-agent-collector",
  "object_kind": "&TypeMeta{Kind:,APIVersion:,}",
  "error": "Operation cannot be fulfilled on daemonsets.apps \"otel-collector-agent-collector\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "github.com/open-telemetry/opentelemetry-operator/controllers.reconcileDesiredObjects\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/controllers/common.go:166\ngithub.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).Reconcile\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/controllers/opentelemetrycollector_controller.go:286\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"
}
{
  "level": "ERROR",
  "timestamp": "2024-08-07T15:08:35Z",
  "message": "Reconciler error",
  "controller": "opentelemetrycollector",
  "controllerGroup": "opentelemetry.io",
  "controllerKind": "OpenTelemetryCollector",
  "OpenTelemetryCollector": {
    "name": "otel-collector-agent",
    "namespace": "otel"
  },
  "namespace": "otel",
  "name": "otel-collector-agent",
  "reconcileID": "66647d40-faff-4b4c-a125-0b2f6875bb1e",
  "error": "failed to create objects for otel-collector-agent: Operation cannot be fulfilled on daemonsets.apps \"otel-collector-agent-collector\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"
}
{
  "level": "ERROR",
  "timestamp": "2024-08-07T15:08:45Z",
  "logger": "controllers.OpenTelemetryCollector",
  "message": "failed to configure desired",
  "opentelemetrycollector": {
    "name": "otel-collector-agent",
    "namespace": "otel"
  },
  "object_name": "otel-collector-agent-collector",
  "object_kind": "&TypeMeta{Kind:,APIVersion:,}",
  "error": "Operation cannot be fulfilled on daemonsets.apps \"otel-collector-agent-collector\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "github.com/open-telemetry/opentelemetry-operator/controllers.reconcileDesiredObjects\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/controllers/common.go:166\ngithub.com/open-telemetry/opentelemetry-operator/controllers.(*OpenTelemetryCollectorReconciler).Reconcile\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/controllers/opentelemetrycollector_controller.go:286\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"
}
{
  "level": "ERROR",
  "timestamp": "2024-08-07T15:08:45Z",
  "message": "Reconciler error",
  "controller": "opentelemetrycollector",
  "controllerGroup": "opentelemetry.io",
  "controllerKind": "OpenTelemetryCollector",
  "OpenTelemetryCollector": {
    "name": "otel-collector-agent",
    "namespace": "otel"
  },
  "namespace": "otel",
  "name": "otel-collector-agent",
  "reconcileID": "2171a8c3-1382-4237-9e19-7eb80df9958a",
  "error": "failed to create objects for otel-collector-agent: Operation cannot be fulfilled on daemonsets.apps \"otel-collector-agent-collector\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222"
}

Steps to Reproduce

Create a DaemonSet in an environment where nodes come and go all the time - then monitor that metric. Look for bursts like this:

image

Expected Result

This shouldn't result in an error. My only guess is that a patch conflict occurs while the cluster is changing the number of running nodes.
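The suspected conflict can be illustrated with a small, self-contained Go sketch. It does not use the operator's code or the Kubernetes client. The `object`, `store`, and `updateWithRetry` names are hypothetical stand-ins, and the model assumes only what the error message states: a write based on an outdated version of the object is rejected. In real code the comparable helper would likely be `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" response.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// object is a hypothetical, stripped-down stand-in for a DaemonSet.
type object struct {
	resourceVersion int
	image           string // a field the operator wants to set
	desiredNodes    int    // a field another controller changes as nodes come and go
}

// store mimics the optimistic concurrency check performed on every write.
type store struct{ current object }

func (s *store) get() object { return s.current }

// update rejects any write that was prepared from a stale resourceVersion.
func (s *store) update(o object) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

// updateWithRetry re-reads the object and reapplies mutate whenever the write
// conflicts, giving up after maxAttempts. beforeWrite is a test hook that lets
// the caller simulate a concurrent writer; it may be nil.
func updateWithRetry(s *store, maxAttempts int, beforeWrite func(attempt int), mutate func(*object)) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		o := s.get()
		mutate(&o)
		if beforeWrite != nil {
			beforeWrite(attempt)
		}
		if err = s.update(o); err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errConflict) {
			return attempt, err // not retryable
		}
	}
	return maxAttempts, err
}

func main() {
	s := &store{current: object{resourceVersion: 1, image: "collector:old", desiredNodes: 100}}

	// Simulate a node joining between the operator's read and its write,
	// on the first attempt only.
	nodeJoins := func(attempt int) {
		if attempt == 1 {
			o := s.get()
			o.desiredNodes++
			_ = s.update(o)
		}
	}

	attempts, err := updateWithRetry(s, 3, nodeJoins, func(o *object) { o.image = "collector:new" })
	fmt.Println(attempts, err, s.get())
}
```

In this model the first write is rejected because the node count changed underneath it, and the second attempt succeeds after a fresh read. That matches the transient pattern in the logs, although it is only one plausible explanation.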

Actual Result

Kubernetes Version

1.30.0

Operator version

0.106.0

Collector version

0.106.xx

Environment information

Environment

Kubernetes: EKS 1.30.0
Nodes: Bottlerocket OS

Log output

No response

Additional context

No response

iblancasa commented 1 month ago

This shouldn't result in an error. My only guess is that a patch conflict occurs while the cluster is changing the number of running nodes.

You mean that it should not be counted as an error in the metrics, right?

LaikaN57 commented 3 weeks ago

You mean that it should not be counted as an error in the metrics, right?

@iblancasa Correct, since this looks like a retryable event. If there is another way to tune this out, please advise.

cc: @diranged @schahal

iblancasa commented 3 weeks ago

The metric comes from controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/blob/e6c3d139d2b6c286b1dbba6b6a95919159cfe655/pkg/internal/controller/metrics/metrics.go#L30-L33

I'm not sure we can do much about that. It may be worth opening an issue in their repository.