RFC: whole-disk support with bootable GPT layout

sempervictus commented 3 years ago

Context

EFI boot requires a specific layout for GPT incompatible with the layout produced by ZFS when utilizing an entire disk
- An appropriately labeled partition is required at the start of the disk for the EFI bootloader binary
- The partition must be FAT-formatted, which puts it outside of the DMU's control
Cloud bootloaders often require a "readable to them" /boot partition to identify the kernel they're to paravirtualize
- These, along with legacy systems, tend to boot from an MBR block embedded in the first 1MB of the GPT-labeled disk and specially marked (legacy bootloaders blindly read from the start of the boot media, so they don't care that it's GPT instead of MBR so long as the offset has bootloader bytes)
Block devices managed via zpool create in their entirety are therefore not viable to boot many systems, or the user has to build block device partition tables in their specific OS paradigm to make a bootable and portable pool on one of the partitions.
There are too many standards bodies, motherboard manufacturers, hypervisors, and disk image processing tools to update with libzfs hooks to permit native access to ZFS-managed devices and the relevant bits stored on them (potentially all over the place).
- EFI loaders are signed, which can slow down integration, and they're not likely to run a full libzfs, so compatibility with downstream features will be hard to maintain

Compatible Partition layout

A 1MB MBR loader block
At least 128MB of FAT-formatted EFI space (before this is implemented, the spec needs to be checked for the minimum value, since it's bloody space-consuming and wasteful when dealing with images shuttled around IaaS pipelines)
Sometimes, an EXT-or-amenable-to-IaaS-vendor /boot FS with no specific minimal space requirement, just enough to pick up boot and take it into rootfs.
The ZFS managed partition which is really the only one which should see IO at runtime since nobody should be writing tons of data to /boot or the MBR block (or any, really).
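A minimal sketch of how the four regions above could be laid out on a disk. The 1 MiB gaps reserved for the primary and backup GPT, the role names, and the `plan_layout` helper are illustrative assumptions, not part of any existing OpenZFS code. The 128 MiB EFI size is the unverified figure from the list above.

```python
MIB = 1024 * 1024

def plan_layout(disk_bytes, esp_bytes=128 * MIB, boot_bytes=0):
    """Return (role, start, size) tuples in on-disk order; all values in bytes."""
    # Regions that precede the ZFS partition, in the order listed in the proposal.
    wanted = [("bios-loader", 1 * MIB), ("efi-system", esp_bytes)]
    if boot_bytes:
        wanted.append(("boot", boot_bytes))  # optional /boot for IaaS loaders

    layout = []
    offset = 1 * MIB  # assumed gap for the protective MBR and primary GPT
    for role, size in wanted:
        layout.append((role, offset, size))
        offset += size

    usable_end = disk_bytes - 1 * MIB  # assumed reserve for the backup GPT
    if offset >= usable_end:
        raise ValueError("disk too small for the requested layout")

    # ZFS takes everything that is left.
    layout.append(("zfs", offset, usable_end - offset))
    return layout
```

Calling `plan_layout(1024 * MIB)` yields three regions (loader, EFI, ZFS); passing a non-zero `boot_bytes` adds the optional /boot region ahead of the ZFS partition.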

Proposed Solution

Implement the above "compatible layout" for zpool commands
- zpool create or zpool add with -o bootcompat=full, or -o bootcompat=loaderonly to skip the third partition (/boot)
Teach ZFS to "understand" this layout as one where it controls all of the disk IO at runtime
Retain backwards compatibility with the old GPT layout
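The proposed option values could map to partition sets roughly as sketched below. Note that `bootcompat` exists only as a proposal in this issue; the role names and the `roles_for` helper are hypothetical, and leaving the option unset is assumed to keep the current -part1/-part9 layout.

```python
# Hypothetical mapping: "bootcompat" is only proposed in this issue,
# it is not an existing zpool property.
BOOTCOMPAT_ROLES = {
    "full": ["bios-loader", "efi-system", "boot", "zfs"],
    "loaderonly": ["bios-loader", "efi-system", "zfs"],  # skips /boot
}

# Current whole-disk layout: the vdev on -part1 plus the reserved -part9.
LEGACY_ROLES = ["zfs", "reserved"]

def roles_for(bootcompat=None):
    """Return the partition roles implied by a bootcompat value."""
    if bootcompat is None:
        return list(LEGACY_ROLES)  # backwards-compatible default
    try:
        return list(BOOTCOMPAT_ROLES[bootcompat])
    except KeyError:
        raise ValueError(f"unknown bootcompat value: {bootcompat!r}") from None
```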

How will this feature improve OpenZFS?

This should increase portability and simplify adoption/use
Should reduce issues with pools already configured on such a data-on-boot-disk partition, which stem from ZFS not treating the vdev as a whole disk

sempervictus commented 3 years ago

To deal with this issue for MBR, where GRUB can play nice with ZFS, we've internally had "Add V_BIOSBOOT for GPT BIOS Boot Partition" and "Create BIOS Boot Partition for grub2" in our tree since they got PR'd. The EFI partition poses a separate problem since it can't be done with trickery in the early bits of the block device and must have a specific FS format.

amotin commented 3 years ago

Why should a file system care about this? As the next step you will propose that ZFS install a specific boot loader into these partitions, understanding FAT32 to do so.

sempervictus commented 3 years ago

Given that the bootloader in the MBR sense is effectively a pointer construct, ZFS couldn't do that without knowing where to point the next step of the boot process after the loader. The space for it, however, is needed along with an appropriate partition code. The same goes for the EFI partition: the EFI loaders written there are files of executable code, not block data written right into a partition, and they also have contextual relevance for the OS/kernel/etc. The file system isn't what "cares" about this; the hypervisors and motherboard boot semantics "care," while the sub-filesystem (DMU) cares about whether it controls an entire block device or not. From a ZFS perspective, having a modern bootable volume alters how the pipeline interprets its control over the device; and from a systems perspective, the layouts produced by ZFS do not match the standards of operation in modern systems. The whole "partition 1 and partition 9" thing that ZFS does isn't workable for motherboards and hypervisors as they are today. Booting from ZFS directly is rough on bootloaders: they have to keep up with ZFS changes, and when ZFS releases an update consumers have to fear that every subsequent boot may not work (or we have to keep forks maintained for ZFS and pray the relevant upstreams pull often and don't diverge too much, which won't help the hypervisor issue). Hence the proposal to use a GPT layout compatible with all bootloaders: bend the tree to the ground instead of trying to pile up earth 20 ft high to reach its fruit.

amotin commented 3 years ago

I am not against fixing the layout; I am against wider use of the deeply broken concept of "raw" disks, which are not really raw in the end. On FreeBSD all this code was simply absent for years (never ported from Illumos), and ZFS lived happily either in real raw form, if the admin so decides, or on top of any partitioning scheme. Just create whatever partitioning/loader scheme you like and put ZFS on its own partition.

sempervictus commented 3 years ago

I have to grudgingly admit that GEOM has its advantages. Probably to be expected given that it was built on spec with objectives while Linux is like watching natural evolution on a compressed timescale (everything eating everything else :) ). Off the top of my head i see no issues with implementing that way either - remove the "considerations" around underlying device type other than interfaces (maybe even give some vdev tunables for queue depth, dispatch size, etc), and add the convenience option to make a universally compatible/bootable GPT layout on zpool <add|create> invocations. A potential improvement on that would be to let all of the /efi and /boot partitions operate as mirrors such that a chassis with 100 disks could boot from any one of them when it reboots (ensuring fault tolerance in all phases of the operations cycle).

jumbi77 commented 3 years ago

FYI: referencing PR 6277, which is a bit old but still open (the PR doesn't deal with the GPT layout specifically but with partitioning in general, so maybe it's helpful. A real "raw" disk w/o partitioning could be useful too).

Edit: referencing issues regarding (GPT) partitioning: #94 and #3452

sempervictus commented 3 years ago

@jumbi77, thanks, i was trying to remember how to find that :). Chunwei handled the weird p1,p9 formatting scheme issue, but i'm not sure that it addressed the "whole disk" handling bit. Will look at the source this afternoon to see if he got that far. @tuxoko: if you're out there lurking in the ether, wanna weigh in with any insights you picked up 4y ago? :)

jumbi77 commented 3 years ago

Just found draft #11029 and want to reference it here. Although I agree with amotin that the current implementation of "whole/raw disk" in Linux is kinda misleading, I still like the idea of implementing a property so that the user can finally choose between:

- real raw disk w/o any partition/GPT (e.g. for non-boot pools, data pools only)
- new EFI partition style with an option for custom partition size as proposed in #11029 (for rpool / boot pools)
- legacy style, as currently implemented, for backward compatibility (-part1 and -part9)
....

sempervictus commented 3 years ago

@jumbi77: thanks for adding that. @behlendorf: is something like this viable, or does the design approach infringe on some internal semantics of how ZFS deals with block devices? I kind of like the idea of just treating every disk as a full-bore consumer of all IO available to the blockdev (all disks are raw disks); that's how all other consumers treat it, and "being considerate" to them isn't helping anyone (schedulers don't assume polite users either). But the actual building of universally bootable devices has benefits beyond the scheduler concern.

jumbi77 commented 3 years ago

> @jumbi77: thanks for adding that. @behlendorf: is something like this viable, or does the design approach infringe on some internal semantics of how ZFS deals with block devices? I kind of like the idea of just treating every disk as a full-bore consumer of all IO available to the blockdev (all disks are raw disks); that's how all other consumers treat it, and "being considerate" to them isn't helping anyone (schedulers don't assume polite users either). But the actual building of universally bootable devices has benefits beyond the scheduler concern.

@behlendorf can you give feedback if this approach is viable? As @sempervictus mentioned, the design approach should be clarified first. There are several PRs and issues linked, each with a different design approach. It would be great to unify this in OpenZFS >= 2.x. To get closer to the other OSes, the new default behavior (for Linux) should be no GPT labeling (real raw) at all, with the ability to use -B (with -o bootsize) for boot pools for convenience. For any other/custom config it's the admin's responsibility to do the partitioning and other setup as desired before using zpool. We should probably make it possible to use the legacy (aka current) behavior though...

sempervictus commented 3 years ago

I don't think we can avoid either making a GPT or faking the relevant pieces/position of its metadata, due to EFI depending on it strictly; ditto the FAT32 space at the beginning, since that's all the firmware on the motherboard knows how to read (and validate for attestation). Direct-EFI load from Linux works great (my personal system has no bootloader installed on the NVMe housing the OS-relevant partitions; the laptop knows to use the signed EFI loader generated every time i update kernel/initram images).

jumbi77 commented 2 years ago

> I don't think we can avoid either making a GPT or faking the relevant pieces/position of its metadata, due to EFI depending on it strictly; ditto the FAT32 space at the beginning, since that's all the firmware on the motherboard knows how to read (and validate for attestation).

I guess this is true for boot pools (and this is why we should get functionality similar to Illumos's -B). But for non-boot pools, at least as far as I know, a GPT is not 100% mandatory (if ashift is set correctly, alignment such as 4k should be fine).... At least in my testing, FreeNAS doesn't GPT-label and/or partition the disks at all when doing zpool create on a data pool. That is what is meant by "real raw".

behlendorf commented 2 years ago

> can you give feedback if this approach is viable?

What you're suggesting in the top comment sounds like a pretty workable design to me. The current GPT layout is simply the result of preserving the original illumos layout. Internally, OpenZFS doesn't care too much about the layout beyond 1) being aware the devices are partitions (and not raw disks), and 2) that the vdev partition is the first partition (-part1). As long as this remains true in the updated layout I wouldn't expect significant backwards compatibility issues.

We would just need someone willing to put together a PR for this which passes the CI, and then some folks to review and test it. I also agree with @jumbi77 that it would be nice to support the three use cases he mentions in this comment: https://github.com/openzfs/zfs/issues/11408#issuecomment-767791056.

jumbi77 commented 2 years ago

Just want to ping @sempervictus @nabijaczleweli because of the recent answer from behlendorf. Maybe you are interested in implementing the proposal in https://github.com/openzfs/zfs/issues/11408#issuecomment-767791056 ? Since I am not a dev I can't really do it myself, but I can test a PR etc. In any case, many thanks for your input.

nabijaczleweli commented 2 years ago

it's on my list, because i do kinda want a hole in whole-disk, and why i paid special attention to libefi (read: gutted a header instead of just reducing it) in #12996; can't comment any further

pepa65 commented 1 year ago

> EFI boot requires a specific layout for GPT incompatible with the layout produced by ZFS when utilizing an entire disk

When utilizing a whole disk, it is simply not going to be usable by UEFI. No problem, that is the nature of it: no GPT and no EFI System partition means the disk is not usable for UEFI boot. The same goes for other schemes: when ZFS uses the full device, legacy booting is also not going to work. Not all disks are for booting; actually, most disks are not meant to be bootable.

> Block devices managed via zpool create in their entirety are therefore not viable to boot many systems, or the user has to build block device partition tables in their specific OS paradigm to make a bootable and portable pool on one of the partitions.

Exactly, there are dedicated utilities and tools to create and manage the various schemes. Please leave any overlapping GPT/MBR partitioning functionality outside of ZFS as much as possible.

This is all there is to say about this, other than to ask that the adoption of partitioning code in the project be slowly reverted as much as possible. ZFS is for managing ZFS, and disk partitioning should be left out of it as much as possible. Please don't go further down the road of duplicating efforts. People who can manage ZFS can manage to make bootable the disks that need to be bootable, and storage devices can then be used to the full extent of their capacity, purely managed by ZFS.

IvanVolosyuk commented 1 year ago

Looks like Solaris ZFS added EFI support already: https://docs.oracle.com/cd/E36784_01/html/E36834/disks-1.html

ZFS is about simplifying management of storage devices, and there is some code in OpenZFS for the legacy Solaris boot scheme. It's a matter of updating it or removing it.

sempervictus commented 1 year ago

I think that having legacy compatibility is a good thing, so long as it doesn't introduce security concerns or drag down the pace of progress. If Oracle did it already, i take that as a sign that we're looking in the right direction :smile:.

ghfields commented 1 year ago

Could this be simplified if zpool create had an option that allowed "X amount of unused space" before the zfs partition? The user would then be able to manually do anything they wish with that space, including creating an EFI partition. (ntfs, ext4, a 'boot pool', or several of these possibilities)

With at least the UEFI implementations I've used, the EFI partition needs to be the first partition on the disk; however, it doesn't have to be entry #1 in the GPT. (You can check this by using gdisk's expert menu 'x', then 't' to transpose entries.)
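The distinction between a partition's slot in the GPT and its position on disk can be modelled as below. This is only an illustration of the observation above, with made-up LBA values and a hypothetical `Entry` type; whether a given firmware accepts such a table is as reported by the commenter, not something this sketch verifies.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    number: int     # slot in the GPT entry array (sda1, -part1, ...)
    role: str
    first_lba: int  # where the partition physically starts

def physical_order(entries):
    """Sort entries by where they sit on disk, ignoring their table slot."""
    return sorted(entries, key=lambda e: e.first_lba)

# Illustrative table: the ZFS vdev keeps slot 1 while the EFI system
# partition occupies the lowest LBAs on the disk.
table = [
    Entry(number=1, role="zfs", first_lba=266240),
    Entry(number=2, role="efi-system", first_lba=2048),
]

first_on_disk = physical_order(table)[0]
```

Here `first_on_disk.role` is the EFI system partition even though slot 1 still belongs to the ZFS partition.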

pepa65 commented 1 year ago

> Could this be simplified if zpool create had an option that allowed "X amount of unused space" before the zfs partition? The user would then be able to manually do anything they wish with that space, including creating an EFI partition. (ntfs, ext4, a 'boot pool', or several of these possibilities)

If you want to create a partition (or more than one), isn't it much simpler to just use the dedicated partitioner that every installer comes with? It is well tested and well-maintained code.

If this gets done within OpenZFS, it will need to stay maintained as a separate piece of crucial code for no good reason. It will greatly pollute the ZFS interface by needing to provide all kinds of options to configure the parameters properly, and this also needs to be maintained.

If this doesn't get to be part of OpenZFS, then using a full disk device is simply that, no worries about needing to make partitions, how big, where, of which type, etc. This will keep the project agile and focussed on the core competencies.

ghfields commented 1 year ago

Sorry... Little more reading would have helped me. While Oracle doesn't have this, if using "-B", Illumos allows setting a "-o bootsize=" property as well, defaulting to 256m. https://illumos.org/man/8/zpool

This should be fine since a "tweaker" could just declare the size they wish, delete the produced EFI partition, and replace it with whatever they wish (for example, a swap partition). As long as there's a way to define the amount of space before the start of the ZFS partition, there is plenty of flexibility.
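A rough sketch of what "define the amount of space before the start of the ZFS partition" could mean in sector terms. The `zfs_start_lba` helper, the 1 MiB alignment, and the default first usable LBA of 34 (typical for 512-byte sectors) are assumptions for illustration and are not taken from the illumos or OpenZFS sources.

```python
MIB = 1024 * 1024

def zfs_start_lba(reserve_bytes, sector_size=512, align_bytes=MIB,
                  first_usable_lba=34):
    """First LBA of the ZFS partition after reserving space in front of it."""
    # The reserved space begins right after the primary GPT structures.
    raw_offset = first_usable_lba * sector_size + reserve_bytes
    # Round up to the next alignment boundary (ceiling division).
    aligned = -(-raw_offset // align_bytes) * align_bytes
    return aligned // sector_size

# With the 256m default mentioned above, on 512-byte sectors:
start = zfs_start_lba(256 * MIB)
```

Whatever the user later puts into the reserved region (EFI, swap, ext4), the ZFS partition's start stays fixed and aligned.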

That said... the more exotic one becomes, the more they should have just entirely custom partitioned themselves.... They just see (sda'1' or wwn-xxxxxxxxxxxxxxxx-'part1').

pepa65 commented 1 year ago

If Oracle doesn't have it, all the more reason not to keep it in OpenZFS.

pepa65 commented 1 year ago

> Could this be simplified if zpool create had an option that allowed "X amount of unused space" before the zfs partition?

There should be a clear option to not use any partitioning at all, to just use the whole device as data storage.
