neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

neonvm: separate maps / counts for failure vs conflict #932

Closed Omrigan closed 1 month ago

Omrigan commented 2 months ago

This might help to debug issues when we have a lot of VMs failing to reconcile. However, it is unclear whether repeated conflicts for the same VM are a likely failure scenario.

_Originally posted by @sharnoff in https://github.com/neondatabase/autoscaling/pull/920#discussion_r1595561357_

sharnoff commented 2 months ago

To add onto this, I think this would help in particular with making our alerting more sensitive: having 10 minutes of >1 VM failing to reconcile may be expected, as there's always something affected by conflicts, but having 10 minutes of >1 VM truly failing may not be expected.
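To make the idea concrete, here is a minimal Go sketch of tracking conflicts and genuine failures in separate sets, so that each can back its own count. It is not the neonvm implementation: `ErrConflict`, `Outcome`, `Classify` and `ReconcileTracker` are hypothetical names introduced only for illustration. A real controller would more likely classify errors with `apierrors.IsConflict` from `k8s.io/apimachinery/pkg/api/errors`; the sentinel error here only keeps the sketch dependency-free.

```go
package main

import (
	"errors"
	"sync"
)

// ErrConflict is a hypothetical sentinel standing in for an update conflict.
var ErrConflict = errors.New("update conflict")

// Outcome classifies the result of a single reconcile attempt.
type Outcome int

const (
	OutcomeOK Outcome = iota
	OutcomeConflict
	OutcomeFailure
)

// Classify maps a reconcile error to an Outcome.
func Classify(err error) Outcome {
	switch {
	case err == nil:
		return OutcomeOK
	case errors.Is(err, ErrConflict):
		return OutcomeConflict
	default:
		return OutcomeFailure
	}
}

// ReconcileTracker keeps separate sets of VMs whose latest reconcile
// attempt ended in a conflict versus a genuine failure.
type ReconcileTracker struct {
	mu          sync.Mutex
	conflicting map[string]struct{}
	failing     map[string]struct{}
}

// NewReconcileTracker returns an empty tracker.
func NewReconcileTracker() *ReconcileTracker {
	return &ReconcileTracker{
		conflicting: make(map[string]struct{}),
		failing:     make(map[string]struct{}),
	}
}

// Record replaces the stored state for vm with the latest outcome.
// A successful reconcile removes the VM from both sets.
func (t *ReconcileTracker) Record(vm string, o Outcome) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.conflicting, vm)
	delete(t.failing, vm)
	switch o {
	case OutcomeConflict:
		t.conflicting[vm] = struct{}{}
	case OutcomeFailure:
		t.failing[vm] = struct{}{}
	}
}

// Counts returns how many VMs are currently in each set.
func (t *ReconcileTracker) Counts() (conflicting, failing int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.conflicting), len(t.failing)
}
```

With this split, the two values returned by `Counts` could be exported as two separate gauges, and an alert could fire on the failure count alone while conflicts are only observed.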

Alternatively, something I'd discussed as part of #757 is that we may be better off having metrics like "number of VMs failing reconcile for N seconds" or similar. That's probably much easier to build higher-quality alerting on than the binary "is it stuck" gauge we currently have.
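One possible shape for a "number of VMs failing reconcile for N seconds" metric is sketched below, assuming we remember when each VM first started failing and count the VMs whose failure streak has lasted at least a threshold. The `FailureDurations` type and its methods are hypothetical names, not existing neonvm code, and the caller passes the current time explicitly to keep the sketch deterministic.

```go
package main

import (
	"sync"
	"time"
)

// FailureDurations remembers when each VM started failing to reconcile.
type FailureDurations struct {
	mu           sync.Mutex
	failingSince map[string]time.Time
}

// NewFailureDurations returns an empty tracker.
func NewFailureDurations() *FailureDurations {
	return &FailureDurations{failingSince: make(map[string]time.Time)}
}

// Observe records the outcome of one reconcile attempt for vm at time now.
// The start of a failure streak is kept until the VM reconciles successfully.
func (f *FailureDurations) Observe(vm string, failed bool, now time.Time) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if !failed {
		delete(f.failingSince, vm)
		return
	}
	if _, ok := f.failingSince[vm]; !ok {
		f.failingSince[vm] = now
	}
}

// CountFailingFor returns the number of VMs that, as of now, have been
// failing continuously for at least d.
func (f *FailureDurations) CountFailingFor(d time.Duration, now time.Time) int {
	f.mu.Lock()
	defer f.mu.Unlock()
	n := 0
	for _, since := range f.failingSince {
		if now.Sub(since) >= d {
			n++
		}
	}
	return n
}
```

Calling `CountFailingFor` with a few fixed thresholds (for example 30 seconds, 5 minutes, 10 minutes) on each metrics scrape would give one gauge per threshold. Brief conflicts would then never reach the longer thresholds, while VMs that are truly stuck would.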