Support resizeable swap

neondatabase / autoscaling

Postgres vertical autoscaling in k8s

Apache License 2.0

Problem description / Motivation

For neondatabase/neon#7239, we'd like to resize swap when compute_ctl receives the compute spec. We'd still like to keep swap on a separate disk, like the current implementation.

The current neonvm swap implementation mounts an entire disk as swap (see #801 for more). This cannot easily be resized by the guest.

Feature idea(s) / DoD

Swap enabled with .spec.guest.settings.swap can be resized from within the VM.

Implementation ideas

To decrease the size of swap from within the VM, we have a few options:

Use a swapfile
Mount the swap as a resizeable partition (and then to resize, we swapoff, shrink the partition, mkswap, and swapon)
Mount individual disks for swap "chunks" so they can be individually turned off or on
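
For the partition option, the swapoff / shrink / mkswap / swapon sequence could look roughly like the sketch below. It only builds the command list, so it can be inspected before anything runs. The device naming (`/dev/vdb` plus a partition-number suffix), the use of `sfdisk -N` with a `,<sectors>` line on stdin to set the partition size, and the helper names `resize_swap_plan` / `run_plan` are all assumptions for illustration, not something neonvm does today; `sfdisk` would also need to be available in the guest.

```python
import subprocess


def resize_swap_plan(disk: str, partno: int, new_sectors: int) -> list[dict]:
    """Return the ordered steps to resize a swap partition. Nothing is executed here.

    Assumes the partition device is named <disk><partno> (e.g. /dev/vdb1),
    which holds for virtio disks but not for e.g. NVMe devices.
    """
    part = f"{disk}{partno}"
    return [
        # 1. Stop using the partition as swap (pages get moved back to RAM).
        {"argv": ["swapoff", part], "stdin": None},
        # 2. Change only this partition's size; sfdisk reads ",<size in sectors>".
        {"argv": ["sfdisk", "-N", str(partno), disk], "stdin": f",{new_sectors}\n"},
        # 3. Write a fresh swap header matching the new size.
        {"argv": ["mkswap", part], "stdin": None},
        # 4. Re-enable swap on the resized partition.
        {"argv": ["swapon", part], "stdin": None},
    ]


def run_plan(plan: list[dict]) -> None:
    """Execute the steps in order, stopping at the first failure (needs root)."""
    for step in plan:
        subprocess.run(step["argv"], input=step["stdin"], text=True, check=True)
```

Something like `run_plan(resize_swap_plan("/dev/vdb", 1, 2097152))` would then perform the resize. Note that swapoff can fail if there isn't enough free memory to absorb what is currently swapped out, so a real implementation would need to handle that case.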

Swapfiles apparently have a performance overhead compared to a "raw" swap disk. We could go with the "chunks" method, especially if we wanted to control swap amounts at runtime — but this is quite complex, and AFAIK we don't really need this (at least, not right now).

With the partition approach, there are separate issues around initialization. Basically: how do we mkswap the partition inside the file? I think this is possible by getting the partition offset within the file and dd'ing a separate mkswap'd file into place, but tbh I'm not entirely sure. It may also be possible with GNU mkswap (which supports the -o/--offset flag), but there doesn't appear to be an easy way to get GNU mkswap into alpine, which is neonvm-runner's base image.
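
As a rough illustration of the "get the offset, then dd into place" idea, the sketch below reads the partition's start offset from the disk image and copies an already-mkswap'd file to that offset (the equivalent of `dd` with `seek=` and `conv=notrunc`). It assumes an MBR partition table and 512-byte sectors, neither of which is stated above; the function names are hypothetical. The mkswap'd file would need to be the same size as the partition, since the swap header records the size of the area.

```python
import os
import shutil
import struct

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def mbr_partition_offset(image_path: str, index: int = 0) -> tuple[int, int]:
    """Return (byte offset, byte length) of primary partition `index` in an MBR image."""
    with open(image_path, "rb") as f:
        mbr = f.read(SECTOR_SIZE)
    if len(mbr) < SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    # The four 16-byte partition entries start at byte 446; within an entry,
    # the start LBA and sector count are little-endian u32s at bytes 8 and 12.
    entry = mbr[446 + 16 * index : 446 + 16 * (index + 1)]
    start_lba, num_sectors = struct.unpack_from("<II", entry, 8)
    if num_sectors == 0:
        raise ValueError(f"partition {index} is empty")
    return start_lba * SECTOR_SIZE, num_sectors * SECTOR_SIZE


def copy_into_partition(image_path: str, swap_image_path: str, index: int = 0) -> int:
    """Copy a pre-formatted swap image into the partition; return the byte offset used."""
    offset, length = mbr_partition_offset(image_path, index)
    if os.path.getsize(swap_image_path) > length:
        raise ValueError("swap image is larger than the partition")
    # "r+b" keeps the rest of the disk image intact (no truncation).
    with open(swap_image_path, "rb") as src, open(image_path, "r+b") as dst:
        dst.seek(offset)
        shutil.copyfileobj(src, dst)
    return offset
```

Copying the whole file this way writes every byte, so it would not preserve sparseness of the disk image on the host; a real version would probably want to skip holes or copy only the header page.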

We could also initialize the swap from within the VM.

Carrying discussion from #887 over here.

@Omrigan

Are you sure we can't just use the swapfile residing on a main disk? This would drop the need for this patch (and, actually, #801) and the entire thing would be replaced by a fallocate and mkswap in compute_ctl. Right?

A few things:

Swapfiles may have different performance characteristics, so I viewed a swap partition as something that's more likely to be stable. As far as I could find, the most concrete thing is that swapon will take longer to build a map from logical location to physical location, but I would also imagine that because swapfiles are not guaranteed to be contiguous, it may be slower due to cache issues (because we tell QEMU to use a 2MiB cluster size). Either way, it's different enough from a swap disk that we'd have to test it more.
In general, swapfiles are not supposed to be sparse. If we did fallocate and mkswap by compute_ctl (and needed it to not be sparse), we'd be asking it to write many GiBs of data, whereas a swap disk can be sparse on the host. (Maybe there are ways around this, I'm not sure)
We still need some program that compute_ctl is allowed to run as root (or maybe the program has the setuid bit, or something). IMO it's cleaner if we (autoscaling / neonvm) provide that, rather than that being added into the compute image in neondatabase/neon.
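
To make the sparseness point concrete, here is a minimal sketch of what the swapfile route might involve: allocate the file, check that it has no holes, then mkswap and swapon. The helper names are hypothetical, and `os.posix_fallocate` stands in for the `fallocate` mentioned above. Whether swapon accepts a preallocated (but never written) file depends on the filesystem, so this is not a claim that it works everywhere.

```python
import os
import subprocess


def is_sparse(path: str) -> bool:
    """True if the file occupies fewer blocks on disk than its apparent size."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_blocks * 512 < st.st_size


def allocate_swapfile(path: str, size_bytes: int) -> None:
    """Create `path` and reserve `size_bytes` of real disk space for it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Unlike truncate(), this allocates blocks rather than leaving a hole.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)


def enable_swapfile(path: str) -> None:
    """Format and enable the swapfile (needs root, as noted above)."""
    if is_sparse(path):
        raise RuntimeError(f"{path} has holes; refusing to use it as swap")
    subprocess.run(["mkswap", path], check=True)
    subprocess.run(["swapon", path], check=True)
```

Even when the allocation itself is cheap in the guest, the blocks are reserved on the guest filesystem, which is a different trade-off from a dedicated swap disk that can stay sparse on the host.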