Problem description / Motivation
When reconcile failures show up in the metrics, we currently have no way (other than log aggregation) to tell whether they are caused by Kubernetes object update conflicts or by something more severe.
We expect some base rate of update conflicts, but no base rate of other failures. Because the metrics do not distinguish the two, it is hard to notice when something is actually wrong.
Feature idea(s) / DoD
It should be possible, from metrics alone, to see the rate of reconcile failures over time with update conflicts excluded.
Implementation ideas
A few options come to mind:
Completely ignore conflicts in the metrics
Use a separate metric for conflicts
Add a label within the existing reconcile failures metric