neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

neonvm-controller reconcile failure metrics should differentiate k8s conflicts #918

Closed sharnoff closed 2 months ago

sharnoff commented 2 months ago

Problem description / Motivation

When we see reconcile failures in the metrics, we currently have no way (other than log aggregation) to determine whether they're due to Kubernetes object update conflicts or something more severe.

Basically: we expect some base rate of update conflicts. We don't expect any base rate of other failures, but because the metrics don't distinguish the two, it's difficult to notice when something's wrong.

Feature idea(s) / DoD

It should be possible, from metrics alone, to see the rate of reconcile failures over time, excluding update conflicts.

Implementation ideas

A few options come to mind:

  1. Completely ignore conflicts in the metrics
  2. Use a separate metric for conflicts
  3. Add a label within the existing reconcile failures metric