oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Sled Agent must manage durable storage for configs, zones, explicitly #2888

Open smklein opened 1 year ago

smklein commented 1 year ago

See also: RFD 118

Sled Agent currently configures a few pieces of data outside datasets explicitly allocated in pools:

Q: So, why is this bad?

A: All those paths are currently backed by a ramdisk -- specifically rpool -- on gimlets.

This means that when we reboot, much of the configuration needed to launch the sled is lost. Furthermore, for the zonepath filesystems, a significant portion of user RAM will be dedicated to zone-based filesystems, which we'd prefer to move to disk-backed storage.
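As a rough illustration of the concern above, here is a minimal sketch that classifies paths by the backing of the most specific mountpoint containing them, and reports which ones would not survive a reboot. Everything in it is hypothetical: the `Backing` and `Mount` types, the function names, and any paths passed in are assumptions made for the example, not Sled Agent's actual types or layout.

```rust
use std::path::{Path, PathBuf};

/// How a mounted filesystem is backed (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backing {
    /// Contents are lost on reboot.
    Ramdisk,
    /// Contents live in a disk-backed dataset and survive reboot.
    Durable,
}

/// A mountpoint and its backing (hypothetical type for illustration).
struct Mount {
    mountpoint: PathBuf,
    backing: Backing,
}

/// Returns the backing of the most specific mount that contains `path`,
/// or `None` if no mount contains it.
fn backing_for(path: &Path, mounts: &[Mount]) -> Option<Backing> {
    mounts
        .iter()
        // `Path::starts_with` compares whole components, so "/a/b" is not
        // treated as a prefix of "/a/bc".
        .filter(|m| path.starts_with(&m.mountpoint))
        // The deepest matching mountpoint wins.
        .max_by_key(|m| m.mountpoint.components().count())
        .map(|m| m.backing)
}

/// Returns the subset of `paths` that currently sit on a ramdisk.
fn lost_on_reboot<'a>(paths: &[&'a Path], mounts: &[Mount]) -> Vec<&'a Path> {
    paths
        .iter()
        .copied()
        .filter(|p| backing_for(p, mounts) == Some(Backing::Ramdisk))
        .collect()
}
```

With a ramdisk mounted at `/` and a durable dataset mounted at some hypothetical `/data/durable`, a config file under `/var` would be reported by `lost_on_reboot`, while a zone filesystem under `/data/durable` would not.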

Here's a list of some of the work we need to accomplish to mitigate this in a production environment:

andrewjstone commented 1 year ago

I just remembered that https://github.com/oxidecomputer/omicron/pull/3007 adds some support for persisting zone filesystems under crypt/zone.

This code has been ready to go for a few weeks now and has been tested numerous times. I've done cold boot testing on paris with it, but since paris has no persistent install of software on the M.2s the way dogfood does, I have to re-run omicron-package install after each reboot to rediscover and mount the encrypted U.2 datasets. The PR has the details about this. While I'd like to test with a proper install on dogfood, I'm at about the point where I'd like to just merge this and see what happens.

smklein commented 9 months ago

FYI: I've been punting this a bit, because we clear these zones out on reboot, and the system is functional with self-managed storage of these zones. It's possible to put Nexus more explicitly in control of this "zone filesystem management", but not urgent.