oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

`omicron-package uninstall` can fail deleting swapfile #3651

Closed jordanhendricks closed 1 year ago

jordanhendricks commented 1 year ago

After #3571, the real sled agent now configures a swap device on the system when it starts up. The device lives on the zpool on the M.2 that the sled booted from, at oxi_<uuid>/swap:

# swap -l
/dev/zvol/dsk/oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/swap 170,1         8 536870904 536870904

# zfs list -r -o name,type,used,avail,mountpoint oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2
NAME                                              TYPE         USED  AVAIL  MOUNTPOINT
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2          filesystem   646M   724G  /oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/cluster  filesystem    96K   724G  /pool/int/43b21587-4614-45dc-a4c7-89f7ebc203c2/cluster
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/config   filesystem   116K   724G  /pool/int/43b21587-4614-45dc-a4c7-89f7ebc203c2/config
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/crash    filesystem    96K   724G  /pool/int/43b21587-4614-45dc-a4c7-89f7ebc203c2/crash
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/debug    filesystem    96K   100G  /pool/int/43b21587-4614-45dc-a4c7-89f7ebc203c2/debug
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/install  filesystem   643M   724G  /pool/int/43b21587-4614-45dc-a4c7-89f7ebc203c2/install
oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/swap     volume        88K   724G  -

omicron-package uninstall, which developers running a real sled agent use to tear down system state, indiscriminately tries to delete most things on that zpool.

If the zvol has been added as a swap device, the uninstall will fail because the device is in use. One can work around this by deleting the swapfile manually with swap -d /dev/zvol/dsk/oxi_<uuid>/swap, then running the uninstall script again.

If the swapfile has been used for memory, though, it cannot be deleted. This is kind of annoying to get out of: the only reliable way I know to work around it is to reboot the machine.
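The manual workaround above could be scripted roughly as follows. This is a minimal sketch, not code from omicron: the function names oxi_swap_devices and swap_delete_argv are hypothetical, and the code only parses text in the shape of the swap -l output shown earlier and builds the swap -d command line rather than running it.

```rust
/// Hypothetical helper: given the text printed by `swap -l`, return the swap
/// device paths that live on an `oxi_*` zpool, i.e. paths of the form
/// `/dev/zvol/dsk/oxi_<uuid>/swap`.
fn oxi_swap_devices(swap_l_output: &str) -> Vec<String> {
    swap_l_output
        .lines()
        // The device path is the first whitespace-separated column.
        .filter_map(|line| line.split_whitespace().next())
        .filter(|path| path.starts_with("/dev/zvol/dsk/oxi_") && path.ends_with("/swap"))
        .map(String::from)
        .collect()
}

/// Hypothetical helper: build the argv for removing one swap device,
/// mirroring the manual `pfexec swap -d <device>` step.
fn swap_delete_argv(device: &str) -> Vec<String> {
    vec![
        "pfexec".to_string(),
        "swap".to_string(),
        "-d".to_string(),
        device.to_string(),
    ]
}

fn main() {
    // Sample line copied from the `swap -l` output earlier in this issue.
    let sample = "/dev/zvol/dsk/oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/swap 170,1         8 536870904 536870904\n";
    for device in oxi_swap_devices(sample) {
        // Print the command instead of executing it.
        println!("{}", swap_delete_argv(&device).join(" "));
    }
}
```

Even if something like this ran before the uninstall, swap -d would presumably still fail in the case described above where the swap device has already been used for memory, so it would not replace the reboot.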

This regression only impacts developers running a sled-agent that has been told through its config.toml to configure a swap device (via setting swap_device_size_gb). In practice, that is only people running omicron using the gimlet config.toml file.

rcgoodfellow commented 1 year ago

This is going to become a more pronounced issue now that using a reservoir in propolis is required, which requires swap space. For developers using the swap_device_size_gb setting in /smf/sled-agent/non-gimlet/config.toml, this issue will come up on every omicron-package uninstall.

andrewjstone commented 1 year ago

I'm reopening this in line with Ry's comment that it is consistently hit on every omicron-package uninstall, as I just discovered.

There needs to be some mechanism for dealing with this, whether that's deleting the swap itself inside the uninstall, as I did manually with pfexec swap -d /dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap, or something else.

I'm not sure what the right answer is. Maybe that answer is to ensure a swap device on non-gimlet machines. I'm not sure of the optimal setting for that or how to ensure it's persisted, though. Instructions on how to set up non-gimlet machines seem especially useful here.

jordanhendricks commented 1 year ago

Thanks for reporting @andrewjstone. I'm a bit surprised to hear that though -- it looks like we are still ignoring the swap dataset in the datasets omicron-package uninstall tries to delete:

https://github.com/oxidecomputer/omicron/blob/3780628669615e0d3f8fd58d6645e00d980b0727/illumos-utils/src/zfs.rs#L569-L576
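For readers who don't follow the link, the idea of skipping the swap dataset during uninstall can be illustrated roughly like this. It is a sketch under assumptions, not the code in zfs.rs: the function name datasets_to_destroy and the rule "skip any dataset whose last path component is swap" are made up for illustration.

```rust
/// Hypothetical filter: from a list of dataset names, keep the ones an
/// uninstall would destroy, skipping any dataset whose final path
/// component is `swap`. The real logic in illumos-utils may differ.
fn datasets_to_destroy<'a>(all: &[&'a str]) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|name| name.rsplit('/').next() != Some("swap"))
        .collect()
}

fn main() {
    // Dataset names taken from the `zfs list` output earlier in this issue.
    let all = [
        "oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/cluster",
        "oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/config",
        "oxi_43b21587-4614-45dc-a4c7-89f7ebc203c2/swap",
    ];
    for name in datasets_to_destroy(&all) {
        println!("would destroy {name}");
    }
}
```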

What system(s) did you see this on?

andrewjstone commented 1 year ago

So, I think I may have mistakenly re-opened this. I saw it on my local helios box. However, what I saw was not that uninstall failed but that destroy_virtual_hardware.sh failed to remove the zpool containing the swap. I guess that is a separate issue that is already covered elsewhere, so this can be closed. Sorry for the hassle.

jordanhendricks commented 1 year ago

@andrewjstone looks like we have #4245 for that issue