neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Memory-heavy workloads may be scaled too high #1030

Closed: sharnoff closed this issue 1 month ago

sharnoff commented 3 months ago

Problem description / Motivation

Currently, the vm-monitor:

  1. Reserves ~75% of memory for the LFC
  2. Requests upscaling when postgres' memory usage exceeds the remainder, without checking how much memory the LFC is actually using (a sketch follows this list)
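
For concreteness, here's a minimal sketch of that heuristic. The function and constant names are illustrative assumptions, not the actual vm-monitor code; it only shows the shape of the check.

```go
package main

import "fmt"

// lfcReserveFraction is the illustrative ~75% of VM memory reserved for the LFC.
const lfcReserveFraction = 0.75

// needsUpscale sketches the current vm-monitor heuristic: postgres is compared
// against whatever is left after the fixed LFC reservation, regardless of how
// much of that reservation the LFC is actually using.
func needsUpscale(totalMemBytes, postgresMemBytes uint64) bool {
	reservedForLFC := uint64(float64(totalMemBytes) * lfcReserveFraction)
	availableForPostgres := totalMemBytes - reservedForLFC
	return postgresMemBytes > availableForPostgres
}

func main() {
	const gib = 1 << 30
	// 8 GiB VM: postgres gets ~2 GiB before upscaling is requested, even if
	// the LFC is using far less than its ~6 GiB reservation.
	fmt.Println(needsUpscale(8*gib, 3*gib)) // true
}
```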

This works well enough as a naive solution for most OLTP workloads, but it means that certain memory-heavy workloads can be scaled higher than they need to be. This applies specifically to workloads whose memory usage is not LFC cache usage but allocations elsewhere, like a pgvector index build.

Meanwhile, the autoscaler-agent only triggers memory-based upscaling when postgres' memory usage exceeds 75% of total memory, so in practice upscaling is almost always handled by the vm-monitor first (see the sketch below).
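
For comparison, the agent-side condition amounts to something like the check below (again an illustrative sketch, not the real autoscaler-agent code). Since 75% of total memory is a strictly higher bar than the ~25% remainder the vm-monitor checks, the vm-monitor's request wins in practice.

```go
package main

import "fmt"

// agentMemThreshold mirrors the autoscaler-agent's memory signal in this
// sketch: request upscaling once postgres uses more than 75% of total memory.
const agentMemThreshold = 0.75

func agentWantsUpscale(totalMemBytes, postgresMemBytes uint64) bool {
	return float64(postgresMemBytes) > agentMemThreshold*float64(totalMemBytes)
}

func main() {
	const gib = 1 << 30
	// Same 8 GiB VM with postgres at 3 GiB: the agent does not trigger (its
	// threshold is 6 GiB), while the vm-monitor sketch above already asked to upscale.
	fmt.Println(agentWantsUpscale(8*gib, 3*gib)) // false
}
```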

This came up in this thread: https://neondb.slack.com/archives/C03TN5G758R/p1723127762991289

Feature idea(s) / DoD

We should be more careful about how we treat memory usage as a scaling signal, so that memory-heavy workloads are no longer scaled up beyond what they actually need, while making sure we don't hurt performance for workloads that are memory-heavy and also rely on the LFC being resident in the OS page cache. (An illustrative variant of the check is sketched below.)
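
Purely as an illustration of the direction described above (the real design is in the linked Notion doc and neondatabase/neon#8668), an LFC-aware variant of the earlier sketch would subtract only the memory the LFC actually uses rather than a fixed reservation:

```go
package main

import "fmt"

// needsUpscaleLFCAware is a hypothetical variant of the earlier sketch: instead
// of reserving a fixed ~75% for the LFC, it counts only the memory the LFC is
// actually using, so allocations like a pgvector index build don't trigger
// upscaling prematurely.
func needsUpscaleLFCAware(totalMemBytes, postgresMemBytes, lfcUsedBytes uint64) bool {
	availableForPostgres := totalMemBytes - lfcUsedBytes
	return postgresMemBytes > availableForPostgres
}

func main() {
	const gib = 1 << 30
	// 8 GiB VM, postgres at 3 GiB, LFC actually caching only 1 GiB:
	// no upscale is needed, unlike with the fixed 75% reservation.
	fmt.Println(needsUpscaleLFCAware(8*gib, 3*gib, 1*gib)) // false
}
```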

Implementation ideas

See https://www.notion.so/neondatabase/0f75b15d47ad479094861302a99114af

- [ ] #1031
- [ ] neondatabase/neon#8668
- [ ] neondatabase/docs#222
sharnoff commented 1 month ago

Now that neondatabase/neon#8668 has been merged, this will be fixed in the next compute release that includes it.

sharnoff commented 1 month ago

This has since been released, and should be fixed for new computes.