This is the VM monitor implementation of the RFC at https://github.com/neondatabase/neon/pull/8111.
I tried to keep the user-visible behavior unchanged from what we have today. Improving the autoscaling algorithm is a separate topic; the point of this work is just to move the algorithm from the autoscaler agent to the VM monitor. That lays the groundwork for improving it later, based on more metrics and signals inside the VM.
Some notable changes:
I removed all the cgroup management code. Instead of polling the cgroup memory threshold, the VM monitor now polls the overall system memory usage.
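As a rough illustration of what polling overall system memory usage can look like, here is a minimal sketch that derives a usage fraction from the text of `/proc/meminfo`. The function name `memory_usage_fraction` and the choice of `MemAvailable` as the basis are assumptions made for this example, not necessarily what the VM monitor code in this PR does.

```rust
/// Hypothetical helper: parse the contents of /proc/meminfo and return the
/// fraction of memory in use, computed as 1 - MemAvailable / MemTotal.
/// Returns None if either field is missing or malformed.
fn memory_usage_fraction(meminfo: &str) -> Option<f64> {
    let mut total_kib: Option<u64> = None;
    let mut available_kib: Option<u64> = None;

    for line in meminfo.lines() {
        // Lines look like "MemTotal:       16384256 kB".
        let mut parts = line.split_whitespace();
        match parts.next() {
            Some("MemTotal:") => total_kib = parts.next().and_then(|v| v.parse().ok()),
            Some("MemAvailable:") => available_kib = parts.next().and_then(|v| v.parse().ok()),
            _ => {}
        }
    }

    let total = total_kib?;
    let available = available_kib?;
    if total == 0 {
        return None;
    }
    // Clamp so a bogus MemAvailable > MemTotal cannot yield a negative result.
    Some(1.0 - available.min(total) as f64 / total as f64)
}
```

In a polling loop, the input would typically come from `std::fs::read_to_string("/proc/meminfo")`, read at a fixed interval.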
The scaling algorithm is based on a sliding window of load average and memory usage over the last minute. I'm not sure how close that is to the algorithm used by the autoscaler agent, because I couldn't find a description of exactly what the agent does. I think this is close, but if not, it can be changed to match the agent's current algorithm more closely. I copied the LoadAverageFractionTarget and MemoryUsageFractionTarget settings from the autoscaler agent, with the defaults I found in the repo, but I'm not sure whether we use different settings in production.
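To make the role of the two target settings concrete, here is a minimal sketch of one plausible way to turn a load average and a memory usage figure into a desired size. The struct `ScalingConfig`, its `cpus_per_unit` and `bytes_per_unit` fields, and the function `desired_compute_units` are hypothetical names for this example; only the two target settings come from the description above, and the formula is an assumption rather than a statement of what the agent or this PR actually computes.

```rust
/// Hypothetical settings for the sketch. The two target fields mirror
/// LoadAverageFractionTarget and MemoryUsageFractionTarget; the per-unit
/// sizes are assumptions made for illustration.
struct ScalingConfig {
    load_average_fraction_target: f64,
    memory_usage_fraction_target: f64,
    /// vCPUs provided by one compute unit (assumed).
    cpus_per_unit: f64,
    /// Bytes of memory provided by one compute unit (assumed).
    bytes_per_unit: f64,
}

/// Return the smallest whole number of compute units (at least 1) such that
/// the load average stays at or below the CPU target fraction and memory
/// usage stays at or below the memory target fraction.
fn desired_compute_units(load_avg: f64, mem_used_bytes: f64, cfg: &ScalingConfig) -> u32 {
    let cpu_goal = load_avg / (cfg.load_average_fraction_target * cfg.cpus_per_unit);
    let mem_goal = mem_used_bytes / (cfg.memory_usage_fraction_target * cfg.bytes_per_unit);
    // Whichever resource is under more pressure decides the size.
    cpu_goal.max(mem_goal).ceil().max(1.0) as u32
}
```

For example, with illustrative targets of 0.9 and 0.75, one vCPU and 4 GiB per unit, a load average of 1.5 and 2 GiB of used memory would ask for 2 units, driven by the CPU side.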
I also didn't fully understand how the memory history logging in the VM monitor, which was used to trigger upscaling, worked. There is only one memory scaling codepath now, based on the max over a 1-minute sliding window.
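The "max over a 1-minute sliding window" idea can be sketched as follows. This is an illustrative structure, not the code from this PR: the type `SlidingWindowMax` and its methods are hypothetical, and timestamps are plain seconds so the example stays self-contained.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding window that remembers recent samples and reports
/// the largest one still inside the window.
struct SlidingWindowMax {
    window_secs: u64,
    /// (timestamp in seconds, value) pairs, oldest first.
    samples: VecDeque<(u64, f64)>,
}

impl SlidingWindowMax {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, samples: VecDeque::new() }
    }

    /// Record a sample taken at `now_secs`, then drop every sample that is
    /// `window_secs` or more older than `now_secs`.
    fn push(&mut self, now_secs: u64, value: f64) {
        self.samples.push_back((now_secs, value));
        while let Some(&(ts, _)) = self.samples.front() {
            if now_secs.saturating_sub(ts) >= self.window_secs {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Largest value currently in the window, or None if it is empty.
    fn max(&self) -> Option<f64> {
        self.samples.iter().map(|&(_, v)| v).reduce(f64::max)
    }
}
```

Scaling on the window maximum rather than the latest sample means a short dip in usage does not immediately trigger a downscale; a spike keeps influencing the decision until it ages out of the window.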