Open diranged opened 1 month ago
You mean that it should not be counted as an error in the metrics, right?
@iblancasa Correct, since this seems like a retryable event. If there is some other way to tune this out, please advise.
cc: @diranged @schahal
The metric comes from controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/blob/e6c3d139d2b6c286b1dbba6b6a95919159cfe655/pkg/internal/controller/metrics/metrics.go#L30-L33
I'm not sure there is much we can do about that. Maybe you should open the issue in their repository.
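For context, controller-runtime increments the `result="error"` counter whenever `Reconcile` returns a non-nil error, so one common pattern is to treat known-transient failures (such as 409 patch conflicts) as a requeue rather than an error. Below is a minimal, self-contained sketch of that counting logic; `conflictErr`, `errorTotal`, and `reconcile` are illustrative stand-ins, not real controller-runtime or client-go APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// conflictErr stands in for the 409 Conflict that apierrors.IsConflict
// would detect after a real Update/Patch call (hypothetical stand-in).
var conflictErr = errors.New("conflict: object was modified")

// result stands in for ctrl.Result.
type result struct{ requeue bool }

// errorTotal stands in for controller_runtime_reconcile_total{result="error"}.
var errorTotal int

// reconcile simulates a reconciler that hits a patch conflict while
// the cluster is adding/removing nodes.
func reconcile() (result, error) {
	err := conflictErr // pretend the Patch call failed with a 409
	if errors.Is(err, conflictErr) {
		// Treat the conflict as retryable: requeue without returning the
		// error, so the controller does not count it as an error.
		return result{requeue: true}, nil
	}
	return result{}, err
}

func main() {
	res, err := reconcile()
	if err != nil {
		// This is roughly where controller-runtime bumps the error counter.
		errorTotal++
	}
	fmt.Println(res.requeue, errorTotal)
}
```

The point of the sketch is only that the metric tracks the returned error, so whether a transient conflict shows up as `result="error"` is a decision made inside the operator's reconciler, not something tunable from the metric side.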
Component(s)
collector
What happened?
Description
In our largest environment, we scale up and down thousands (sometimes tens of thousands) of nodes a day. We use the Otel Operator to manage several Otel components, including a DaemonSet named `otel-collector-agent-collector`. Periodically we get reconciliation errors that increment the `controller_runtime_reconcile_total{result="error"}` metric, but these errors are not correlated with anything persistent; they seem to be very transient. Looking into the logs, this is what we typically see:
Steps to Reproduce
Create a DaemonSet in an environment where nodes come and go all the time - then monitor that metric. Look for bursts like this:
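One way to watch for these bursts is to query the rate of the error counter from Prometheus; the endpoint URL below is a placeholder for your own Prometheus instance:

```shell
# Placeholder Prometheus URL; adjust to your environment.
# Transient bursts show up as short-lived spikes in this rate.
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(controller_runtime_reconcile_total{result="error"}[5m]))'
```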
Expected Result
This shouldn't result in an error. My only thought is that a patch conflict is somehow happening while the cluster is changing the number of nodes that are running.
Actual Result
Kubernetes Version
1.30.0
Operator version
0.106.0
Collector version
0.106.xx
Environment information
Environment
Kubernetes: EKS 1.30.0 Nodes: BottleRocket OS
Log output
No response
Additional context
No response