neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Epic: Tracking issue for problems noticed by neonvm-controller metrics #777

Closed by sharnoff 8 months ago

sharnoff commented 8 months ago

Related PRs and discussions:

## Tasks
- [ ] https://github.com/neondatabase/autoscaling/pull/773
- [ ] https://github.com/neondatabase/autoscaling/pull/779
- [ ] https://github.com/neondatabase/autoscaling/pull/783
- [ ] ~~Alerting for reconcile worker saturation~~
- [ ] ~~Alerting for reconcile error rate~~
- [x] Alerting for (a) many objects failing to reconcile, or (b) extended period of object(s) failing to reconcile
- [ ] Alerting for p90 workqueue wait duration
- [ ] Investigate why increasing max reconcile workers [decreases p50-p90 reconcile durations](https://neondb.slack.com/archives/C03TN5G758R/p1706652954423079?thread_ts=1706160071.213319&cid=C03TN5G758R)
- [ ] Consistent baseline of reconcile operations taking 1s (probably related to sleeps during memory unplug?)

sharnoff commented 8 months ago

Some of these items are already done. The remaining ones are mostly covered by the alerting referenced here: https://github.com/neondatabase/cloud/issues/9629#issuecomment-1938070183