neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Bug: autoscaler-agent scaling algorithm is too volatile for larger computes #729

Open sharnoff opened 10 months ago

sharnoff commented 10 months ago

Problem description / Motivation

One of the blockers for allowing larger computes (ref neondatabase/cloud#9103) is improving the scaling algorithm.

Currently, the scaling algorithm (a) recalculates the "goal" CU every 5s from updated metrics, and (b) does not take past metrics into account when calculating the "goal" CU. As a result:

  1. It's easy to cause the goal CU to oscillate, resulting in a lot of effort spent scaling, with little net benefit
  2. As computes get larger, the same percentage change in metrics is more likely to produce a change in (integer) goal CU, meaning that each 5s metrics update is more likely to prompt scaling, and by a larger amount

In a perfect world, maybe this'd be fine. But in practice, the process of scaling actually consumes resources, and so is generally something we want to avoid doing frivolously.

See also: https://neondb.slack.com/archives/C03ETHV2KD1/p1704319422570509?thread_ts=1704316837.680979

Feature idea(s) / DoD

Scaling algorithm should be more stable over some time period, under some conditions.

This isn't a super well-defined goal — so this issue mostly just exists to track some improvement.

Implementation ideas

There are a couple of directions we could take this.

One is to still not include any scaling history, and instead limit the size of each change (e.g., to no more than 1 CU at a time) and introduce rate-limiting on scaling. This wouldn't necessarily stop oscillation, but may reduce its impact.

The other is to include some history of recent metrics so that we have a longer time period to use for decision-making. This solution would probably be harder to implement, but likely easier to reason about and more likely to produce better outcomes.

One possibly annoying piece of this is that we may need to change a substantial portion of the tests for pkg/agent/core. We probably want a way to override the "goal CU" and directly provide that.

Tasks

# Pre-requisites
- [ ] #1129
# Implementation
- [ ] ... add tasks here as they come up
# Follow-ups
- [ ] ... add tasks here as they come up

sharnoff commented 2 weeks ago

There's an open RFC that will partially address this issue here: https://www.notion.so/neondatabase/131f189e004780b2915ef2fdb95bae6a

In short: the approach should reduce volatility by ~60% from what we have today, but it's only a fractional decrease — probably insufficient for very volatile workloads on much larger computes.