neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0
150 stars 20 forks source link

neonvm: Support resizeable swap via sized mkswap #887

Closed sharnoff closed 4 months ago

sharnoff commented 5 months ago

This commit adds support for resizeable swap in NeonVM, with a breaking change to the CRD (changed type of .guest.settings.swap) and changes to vm-builder and neonvm-runner to provide the necessary support.

Original PR description — now outdated
This consists of an opt-in flag to mount the swap as a GPT partition, as well as a convenience script inside the VM to resize the partition and re-enable swap. ## Detailed summary of changes 1. Change type of `.guest.settings.swap` from `resource.Quantity` to a struct, with fields: `size` (`resource.Quantity`), `shrinkable`, `skipSwapon`. The new type also accepts the same syntax as `resource.Quantity`, and acts as if only the size was specified. This is temporary, and exists only for backwards compatibility. 2. If `shrinkable = true`, neonvm-runner instead creates a disk with GPT and a single swap partition. It will also include `/neonvm/runtime/resize-swap.sh` as the underlying implementation of `/neonvm/bin/resize-swap`. `resize-swap.sh` doesn't depend on any runtime information, but because the swap disk is supplied externally and set up via `/neonvm/runtime/mounts.sh`, my view is that it makes more sense for the management script to be injected by neonvm-runner as well (instead of being added by vm-builder). 3. If `skipSwapon = true`, `/neonvm/runtime/mounts.sh` will not include the typical `swapon`. This is meant for cases where we don't need swap to be available until we later call `/neonvm/bin/resize-swap`, like what's described in neondatabase/neon#7239. 4. Because `resize-swap.sh` requires `parted` for editing the partitions and fetching associated info, we have to add that in vm-builder now. It has some shared libraries (check out `ldd /usr/sbin/parted`), and won't start without them, so those have been added to `/neonvm/lib`. 5. This commit also includes added tests for the various swap parameterizations. These are, briefly: - non-resizeable: Check that it's the expected size - resizeable: Check that it's initially the expected size, and we can downscale it. - resizeable, with `skipSwapon`: Check that it's initially 0, and we can resize it to the desired size (same as above). I'm not sure if we want to support non-resizeable swap long-term, or if `skipSwapon` is really necessary, but in the meantime we should definitely have tests. Closes #886. See also #801, which initially added swap support. ---

This consists of changing neonvm-runner to inject a new resize-swap.sh script in /neonvm/runtime, changing vm-builder to add a new resize-swap wrapper in /neonvm/bin, and adding a new flag to skip enabling the swap on startup (for when it'll always be resized later).

Detailed summary of changes

  1. Change type of .guest.settings.swap from resource.Quantity to a struct, with fields: size (resource.Quantity), skipSwapon (bool).

    The new type also accepts the same syntax as resource.Quantity, and acts as if only the size was specified. This is temporary, and exists only for backwards compatibility.

  2. If skipSwapon = true, /neonvm/runtime/mounts.sh will not include the typical swapon. This is meant for cases where we don't need swap to be available until we later call /neonvm/bin/resize-swap, like what's described in neondatabase#neon#7239.

  3. This commit also includes added tests for swap. These are, briefly:

    • normal: Check that it's initially the expected size, and we can shrink it.
    • with skipSwapon: Check that it's initially 0, and we can resize it to the desired size.

    I'm not sure if skipSwapon is really necessary, but in the meantime we should definitely have tests.

We may also, in the future, want to change the underlying implementation of swap resizing (e.g. instead of swapoff-mkswap-swapon, we could offer control at runtime with many smaller swap regions that could be individually enabled or disabled). Those ideas can be discussed separately, if/when they are needed.

Closes #886.

See also #801, which initially added swap support.

sharnoff commented 5 months ago

Discussed with @Omrigan, going to make the following changes:

  1. Always make swap resizeable
  2. Use resize-swap.sh during vminit if SkipSwapon is not true
  3. Make the "raw" disk slightly bigger than we need, so that we can specify exact sizes and don't need to parse the output of parted. This is potentially more fragile but should be much simpler and easier to understand.
sharnoff commented 5 months ago

Ok, looked into this more, and... a couple things:

First: If we want to skip parsing parted output in resize-swap.sh, we need to know the new end position of the partition. This is equal to start_position + partition_size - 1 sector. For some reason. Now, inside the VM it is possible to know (and specify) the sector size, because the disk is emulated — see e.g. here https://www.mail-archive.com/qemu-discuss@nongnu.org/msg06618.html.

But that doesn't help us outside the VM, because there parted will use the sector size from the host when creating the GPT table.

Now, in the host we can in theory just set the partition to whatever size we want and get away with it by resizing in the guest, but AFAICT we'll still be stuck with sector sizes from the host, and so we'd still need that information somehow. (I wouldn't feel comfortable assuming it's always 512.)

So maybe this is all fine and what we should do is actually still parse the output of parted on the host but then use that information in resize-swap.sh instead of also running parted there. If we do that though, the plumbing is tricky - currently, resize-swap.sh is populated by createISO9660runtime, which runs in parallel with swap creation (part of buildQEMUCmd).


Second: If we do still want to run parted in the guest, using resize-swap.sh in /neonvm/runtime/mounts.sh is a little trickier — if we directly used the size specified in the VM spec, we'd need to make sure that the partition is a little bit bigger. This is also entirely possible, but... yet another thing.


So, to recap. AFAICT, our options are:

  1. Keep as it is — swap available in the VM is ~3MiB less than what the spec says, and is enabled with plain swapon.
  2. Use resize-swap.sh, but keep parsing parted output — swap partition gets some extra space (doesn't matter how much, as long as it's >3MiB) and we resize-swap.sh on startup, which ends up parsing parted output to determine sector size. Maybe partition start location is fixed.
  3. Parse parted on the host to pass sector size and partition start to the guest, so it doesn't need to parse parted output.
    • This seems probably ok to me? I'd like to get a +1 on this to agree this should be sound (right now I suspect it should be fine, but I'm not confident)
  4. Instead of parsing parted on the host, we just assume that sector size is always 512 bytes.
    • As mentioned above, want to avoid this if possible. But technically it is an option.
  5. Instead of resizing a GPT partition, we just make a swapfile in the guest.
    • As mentioned elsewhere, want to avoid this if possible. But technically it is also an option.
  6. Instead of resizing a GPT partition, we just split the swap into N chunks, each 1GiB in size, so they can be individually turned on and off to resize.
    • This would be quite complicated, and probably result in an even more complex resize-swap.sh, but we don't have to deal with partitions, and we get to resize again over time :shrug:

Also, because mounts.sh and resize-swap.sh are both injected, we can also start with one solution and change to another later as a follow-up. Switching to/from (6) is probably the hardest, but switching between any of them is entirely feasible.

LMK what you think @Omrigan.

Omrigan commented 5 months ago

@sharnoff Option 3 sounds reasonable to me.

Have you looked into something more convenient than parted output? Like https://pkg.go.dev/github.com/diskfs/go-diskfs@v1.4.0/partition/gpt or sfdisk.

sharnoff commented 5 months ago

Have you looked into something more convenient than parted output?

I tried sgdisk but found it was quite slow (took ~1s to write a partition).

After looking at the Go package, it seems to still require the user to specify the sector size, so we'd need some way to get that. But otherwise it looks close to what we'd need.

After your comment, I tried sfdisk — it calls sync(2) after writing the partition table, which... is not necessary in this case, and on my machine takes ~0.3s. No clue what it'd take in prod, but... it's entirely possible we'd have a lot of activity on the kubernetes nodes' root filesystems, which IIUC is what'd be syncing here. Given the timing difference (parted is practically instant), I think we should keep with parted for configuring (at least in the host), but use sfdisk to get the sector size, because yeah, its output is much better.

(it's quite unfortunate actually — sfdisk's input is also much better than parted... if only it had a way to skip the sync() call)

IIUC sync() should be quick in the guest, when referring to a device, so maybe we could use sfdisk there. For now, I'm inclined to just stick to configuring with parted in both the guest and the host.

cicdteam commented 5 months ago

@sharnoff

Have you tried something like go-diskfs?

sharnoff commented 5 months ago

I looked into it. One of the difficulties in this PR is that we need to know the sector size when resizing the swap partition inside the guest (parted because the "end" position is actually start + size - sector size, and sfdisk because it uses units of sectors) — and to avoid parsing command output in the guest, we'd like to pass that in from the host.

AFAICT go-diskfs doesn't provide a way to fetch the recommended sector size, but maybe I missed it.

sharnoff commented 5 months ago

todo:

sharnoff commented 4 months ago

Just realized busybox mkswap does have a way to specify the size :facepalm:

For a certain size $size in bytes, you can just:

mkswap <path> $(( size / 1024 ))

Oh to live in a world where busybox has complete documentation...


Anyways I think a lot of this PR can be ripped out and replaced with the previous version - basically exclusively making the changes inside the VM... :disappointed: At least we caught it now, I guess.

I'll try to get that done, so we can do final review on a (hopefully) much smaller PR.

sharnoff commented 4 months ago

Ok, opened #899 — it borrows a lot from this PR, but I felt like the approach was different enough that it should be its own PR.

I'm happy to merge it into this PR or merge that one directly into main, whichever you think makes the most sense @Omrigan.

Omrigan commented 4 months ago

@sharnoff Can you move contents of a #899 here? It still shares a lot of code and if we started the review here, easier to finish it here IMO

sharnoff commented 4 months ago

Talked to @Omrigan - final actions:

  1. Rename the internal resize-swap.sh to resize-swap-internal.sh
  2. Given the mess with compute_ctl restarts, it's ok to keep --once for now, but it's not ideal.
  3. Make the swap disk names global constants where applicable