Epic: Scaling latency metrics

sharnoff commented 8 months ago

Motivation

Two reasons:

- We're currently flying blind w.r.t. how long scaling takes
- Scaling latency should be part of any autoscaling SLOs

DoD

We should have histogram metrics recording:

- end-to-end latency of scaling (down and up; cpu and memory)
- latency of all the components:
  - requests to scheduler plugin (including retries)
  - requests to vm-monitor (including retries)
  - delay between initial NeonVM patch request and when status was updated

Implementation ideas

AFAICT the basic idea is to store some extra info in pkg/agent/core.State and add callbacks to pkg/agent/core.Config that record metrics once we determine that the individual parts of scaling (and the scaling operation as a whole) have completed.

More design work is required, because the edge cases are quite subtle.

Tasks

# Blockers
- [ ] #453
- [ ] #462
- [ ] #592

# Implementation
- [ ] Decide how we should measure latency (difficult due to eventual consistency)
- [ ] agent/core: Measure scaling latency

Other related tasks, Epics, and links

Proposed RFC

sharnoff commented 8 months ago

Latency metrics measured by pkg/agent/core.State would have allowed us to notice the effects of #614 (i.e. core.State believed there was a super long-running request).

sharnoff commented 3 weeks ago

Status: waiting on @sharnoff and @stradig to review the internal RFC.

neondatabase / autoscaling