neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Epic: Scaling latency metrics #594

Open sharnoff opened 8 months ago

sharnoff commented 8 months ago

Motivation

Two reasons:

  1. We're currently flying blind w.r.t. how long scaling takes
  2. Scaling latency should be part of any autoscaling SLOs

DoD

We should have histogram metrics recording:

Implementation ideas

AFAICT the basic idea is to store some extra info in agent/core.State and add some extra callbacks to agent/core.Config that record metrics once we determine that individual parts of scaling (and the scaling operation as a whole) have completed.

More design work is required, because the edge cases are quite subtle.

Tasks

# Blockers
- [ ] #453
- [ ] #462
- [ ] #592
# Implementation
- [ ] Decide how we should measure latency (difficult due to eventual consistency)
- [ ] agent/core: Measure scaling latency

Other related tasks, Epics, and links

sharnoff commented 8 months ago

Latency metrics measured by pkg/agent/core.State would have allowed us to notice the effects of #614 (i.e. core.State believed there was a super long-running request).

sharnoff commented 3 weeks ago

Status: waiting on @sharnoff and @stradig to review the internal RFC.