neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

agent/core: Use cached LFC for memory scaling signal #1031

Closed sharnoff closed 2 months ago

sharnoff commented 3 months ago

In short: In addition to scaling when there's a lot of memory used by postgres, we should also scale up to make sure that enough of the LFC is able to fit into the page cache alongside it.

To decide how much of the LFC is "enough", this PR takes the minimum of two values: the estimated LFC working set size (from window size) and the cached memory (the Cached field of /proc/meminfo, collected via vector's host metrics).
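As an illustration of the "minimum of the two" rule above, here is a minimal Go sketch. It is not code from this PR: the PR gets Cached through vector's host metrics, whereas this sketch parses /proc/meminfo-style text directly, and the names `parseCachedKiB` and `lfcToKeepCachedKiB` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCachedKiB extracts the "Cached" value (reported in kB) from text in
// /proc/meminfo format. Hypothetical helper, for illustration only.
func parseCachedKiB(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Lines look like "Cached:   2097152 kB". Matching the exact key
		// avoids picking up the separate "SwapCached:" line.
		if len(fields) >= 2 && fields[0] == "Cached:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("no Cached field found")
}

// lfcToKeepCachedKiB returns the smaller of the estimated LFC working set
// size and the memory currently reported as Cached (assumed rule, taken from
// the PR description rather than from the actual implementation).
func lfcToKeepCachedKiB(lfcWorkingSetKiB, cachedKiB uint64) uint64 {
	if lfcWorkingSetKiB < cachedKiB {
		return lfcWorkingSetKiB
	}
	return cachedKiB
}

func main() {
	sample := "MemTotal: 8388608 kB\nCached: 2097152 kB\nSwapCached: 12 kB\n"
	cached, err := parseCachedKiB(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Working set estimate (4 GiB) exceeds what is cached (2 GiB), so the
	// cached amount is the limiting value.
	fmt.Println(lfcToKeepCachedKiB(4194304, cached), "KiB")
}
```

Taking the minimum means the signal never asks for more page cache than the LFC working set needs, and never counts LFC data that is not actually cached.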

This PR also adds the memoryTotalFractionTarget field to the scaling config, serving a similar purpose to memoryUsageFractionTarget, but applying to the total of memory usage and cached data.
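To make the relationship between the two targets concrete, here is a rough Go sketch of how they might combine into a memory goal. The struct, the function `memoryGoalGiB`, and the formula (divide by each target fraction, take the larger result) are assumptions for illustration; the actual logic lives in `pkg/agent/core` and may differ.

```go
package main

import (
	"fmt"
	"math"
)

// scalingConfig holds the two fraction targets named in the PR description.
// The Go field names here are hypothetical.
type scalingConfig struct {
	MemoryUsageFractionTarget float64 // applies to memory usage alone
	MemoryTotalFractionTarget float64 // applies to usage plus cached LFC data
}

// memoryGoalGiB returns the amount of memory (in GiB) that would satisfy both
// targets under the assumed formula.
func memoryGoalGiB(cfg scalingConfig, usageGiB, lfcWorkingSetGiB, cachedGiB float64) float64 {
	// Only count as much of the LFC as is both needed and actually cached.
	lfcCached := math.Min(lfcWorkingSetGiB, cachedGiB)

	usageGoal := usageGiB / cfg.MemoryUsageFractionTarget
	totalGoal := (usageGiB + lfcCached) / cfg.MemoryTotalFractionTarget

	// The stricter of the two targets determines the goal.
	return math.Max(usageGoal, totalGoal)
}

func main() {
	cfg := scalingConfig{MemoryUsageFractionTarget: 0.5, MemoryTotalFractionTarget: 0.75}
	// 3 GiB used by postgres, 5 GiB estimated LFC working set, 3 GiB cached.
	fmt.Printf("%.1f GiB\n", memoryGoalGiB(cfg, 3, 5, 3))
}
```

In this sketch a workload with modest memory usage but a large cached LFC is driven by the total target, while a workload with high usage and little cached data is still driven by the usage target.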

This PR is part of #1030 and must be deployed before the vm-monitor changes in order to make sure we don't regress performance for workloads that are both memory-heavy and rely on LFC being in the VM's page cache.

For more info, see: https://www.notion.so/neondatabase/0f75b15d47ad479094861302a99114af.


Notes for review:

  1. This PR is broken into two commits -- the first is a refactor to make the second one easier to read.
  2. Planning to test on staging with neondatabase/neon#8668 using some familiar workloads (in particular: LFC-aware scaling tests and pgvector index build).

github-actions[bot] commented 2 months ago

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages Coverage Δ :robot:
github.com/neondatabase/autoscaling/pkg/agent/core 68.76% (+0.93%) :thumbsup:
github.com/neondatabase/autoscaling/pkg/api 3.06% (-0.10%) :thumbsdown:

Coverage by file

### Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | :robot: |
|--------------|------------|-------|---------|--------|---------|
| github.com/neondatabase/autoscaling/pkg/agent/core/goalcu.go | 87.76% (**+87.76%**) | 49 (+49) | 43 (+43) | 6 (+6) | :star2: |
| github.com/neondatabase/autoscaling/pkg/agent/core/metrics.go | 1.28% (**-0.02%**) | 78 (+1) | 1 | 77 (+1) | :thumbsdown: |
| github.com/neondatabase/autoscaling/pkg/agent/core/state.go | 87.84% (**+0.41%**) | 329 (-29) | 289 (-24) | 40 (-5) | :thumbsup: |
| github.com/neondatabase/autoscaling/pkg/api/vminfo.go | 1.55% (**-0.09%**) | 579 (+31) | 9 | 570 (+31) | :thumbsdown: |

_Please note that the "Total", "Covered", and "Missed" counts above refer to ***code statements*** instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code._

### Changed unit test files

- github.com/neondatabase/autoscaling/pkg/agent/core/state_test.go
