sharnoff opened 10 months ago
There's an open RFC that will partially address this issue here: https://www.notion.so/neondatabase/131f189e004780b2915ef2fdb95bae6a
In short: the approach should reduce volatility by ~60% from what we have today, but that's only a partial reduction, and probably insufficient for very volatile workloads on much larger computes.
Problem description / Motivation
One of the blockers for allowing larger computes (ref neondatabase/cloud#9103) is improving the scaling algorithm.
Currently, the scaling algorithm (a) recalculates the "goal" CU every 5s from the latest metrics, and (b) does not take past metrics into account when calculating the "goal" CU. As a result, the goal CU (and therefore the compute size we request) can swing back and forth every few seconds whenever the metrics hover near a scaling threshold.
In a perfect world, maybe this'd be fine. But in practice, the process of scaling actually consumes resources, and so is generally something we want to avoid doing frivolously.
See also: https://neondb.slack.com/archives/C03ETHV2KD1/p1704319422570509?thread_ts=1704316837.680979
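To make the problem concrete, here's a minimal sketch (in Go, with made-up names and thresholds, not the actual pkg/agent/core code) of a memoryless goal-CU calculation; because each goal comes from a single metrics sample, consecutive samples straddling a threshold flip the goal back and forth:

```go
package main

import (
	"fmt"
	"math"
)

// metrics is a stand-in for whatever the agent collects from the VM.
type metrics struct {
	loadAverage1Min float64
	memUsageBytes   float64
}

// goalCU derives the desired CU from a single metrics sample; nothing about
// previous samples or the previous goal is consulted.
func goalCU(m metrics, loadPerCU, bytesPerCU float64) uint32 {
	fromLoad := math.Ceil(m.loadAverage1Min / loadPerCU)
	fromMem := math.Ceil(m.memUsageBytes / bytesPerCU)
	return uint32(math.Max(math.Max(fromLoad, fromMem), 1))
}

func main() {
	// Three consecutive 5s samples with load hovering around 1.0:
	fmt.Println(goalCU(metrics{loadAverage1Min: 1.02, memUsageBytes: 1 << 30}, 1.0, 4<<30)) // 2
	fmt.Println(goalCU(metrics{loadAverage1Min: 0.98, memUsageBytes: 1 << 30}, 1.0, 4<<30)) // 1
	fmt.Println(goalCU(metrics{loadAverage1Min: 1.01, memUsageBytes: 1 << 30}, 1.0, 4<<30)) // 2
}
```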
Feature idea(s) / DoD
Scaling algorithm should be more stable over some time period, under some conditions.
This isn't a super well-defined goal — so this issue mostly just exists to track some improvement.
Implementation ideas
There are a couple of directions we could take this.
One is to still not include any scaling history, and instead limit the size of each change (e.g., to no more than 1 CU at a time) and introduce rate-limiting on scaling. This wouldn't necessarily stop oscillation, but it may reduce the impact.
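A rough sketch of what that first direction could look like, assuming hypothetical types (this isn't the existing agent code): clamp each change to 1 CU and require a minimum interval between changes.

```go
package main

import (
	"fmt"
	"time"
)

type rateLimitedScaler struct {
	current     uint32        // current CU
	lastChange  time.Time     // when we last changed the CU
	minInterval time.Duration // minimum time between changes
}

// next returns the CU we should actually request, given a freshly computed
// goal CU, limiting both the step size and the frequency of changes.
func (s *rateLimitedScaler) next(goal uint32, now time.Time) uint32 {
	if goal == s.current {
		return s.current
	}
	if now.Sub(s.lastChange) < s.minInterval {
		return s.current // too soon since the last change; hold steady
	}
	// move by at most 1 CU towards the goal
	if goal > s.current {
		s.current++
	} else {
		s.current--
	}
	s.lastChange = now
	return s.current
}

func main() {
	s := &rateLimitedScaler{current: 4, minInterval: 30 * time.Second}
	now := time.Now()
	fmt.Println(s.next(8, now))                     // 5: one step up
	fmt.Println(s.next(8, now.Add(5*time.Second)))  // 5: rate-limited, no change
	fmt.Println(s.next(8, now.Add(40*time.Second))) // 6: another step
}
```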
The other is to include some history of recent metrics so that we have a longer time window to base decisions on. This would probably be harder to implement, but the resulting behavior would likely be easier to understand, and it should be easier to produce better outcomes.
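A sketch of the second direction, again with hypothetical names: keep recent goal-CU computations in a trailing window and scale to the maximum within it, so a single low sample doesn't immediately trigger a downscale.

```go
package main

import (
	"fmt"
	"time"
)

type sample struct {
	at   time.Time
	goal uint32
}

type windowedGoal struct {
	window  time.Duration
	samples []sample
}

// observe records a new goal-CU computation and returns the smoothed goal:
// the maximum goal seen within the trailing window.
func (w *windowedGoal) observe(goal uint32, now time.Time) uint32 {
	w.samples = append(w.samples, sample{at: now, goal: goal})
	// drop samples that have aged out of the window
	cutoff := now.Add(-w.window)
	for len(w.samples) > 0 && w.samples[0].at.Before(cutoff) {
		w.samples = w.samples[1:]
	}
	smoothed := uint32(0)
	for _, s := range w.samples {
		if s.goal > smoothed {
			smoothed = s.goal
		}
	}
	return smoothed
}

func main() {
	w := &windowedGoal{window: time.Minute}
	now := time.Now()
	fmt.Println(w.observe(4, now))                     // 4
	fmt.Println(w.observe(2, now.Add(10*time.Second))) // 4: recent peak still in window
	fmt.Println(w.observe(2, now.Add(90*time.Second))) // 2: the peak has aged out
}
```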
One possibly annoying piece of this is that we may need to change a substantial portion of the tests for pkg/agent/core. We probably want a way to override the "goal CU" computation and provide the value directly; a rough sketch of what that hook could look like is below.
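Purely as an illustration of that override (the names here are hypothetical, not the current pkg/agent/core API), the goal-CU computation could be a function the caller supplies, so tests can pin the goal directly:

```go
// Illustrative only; the package, type, and field names are hypothetical.
package agentcore

// Metrics is a stand-in for the metrics the scaling algorithm consumes.
type Metrics struct {
	LoadAverage float64
	MemUsage    uint64
}

// GoalCUFunc computes the goal CU from a metrics sample.
type GoalCUFunc func(m Metrics) uint32

// Config carries the knobs for the core scaling logic. Tests can set GoalCU
// to a fixed function instead of constructing metrics that happen to produce
// the goal they want.
type Config struct {
	GoalCU GoalCUFunc // nil means "use the default calculation"
}

// State is a stand-in for the core scaling state machine.
type State struct {
	config Config
}

func (s *State) goalCU(m Metrics) uint32 {
	if s.config.GoalCU != nil {
		return s.config.GoalCU(m)
	}
	return defaultGoalCU(m)
}

// defaultGoalCU is where the normal metrics-based calculation would live.
func defaultGoalCU(m Metrics) uint32 {
	// ... elided ...
	return 1
}
```

A test could then set Config{GoalCU: func(Metrics) uint32 { return 3 }} and exercise the rest of the state machine without reverse-engineering which metrics produce a goal of 3.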
Tasks