oxidecomputer / omicron

Omicron: Oxide control plane

sled-agent should format new NVMe drives with proper sector size #1895

Open leftwo opened 2 years ago

leftwo commented 2 years ago

New NVMe drives should be formatted with a 4096-byte logical sector size.

smklein commented 1 year ago

Following up on this issue:

As of https://github.com/oxidecomputer/omicron/pull/2176, the sled agent issues `zpool create` on U.2 devices that are missing a GPT, or that have a GPT but no zpool in the first partition.

What's the recommended mechanism for sled agent to enforce sector size?

I found the following in the zpool documentation:

     The following properties can be set at creation time and import time, and
     later changed with the zpool set command:

     ashift=ashift
             Pool sector size exponent, to the power of 2 (internally referred
             to as ashift ). Values from 9 to 16, inclusive, are valid; also,
             the value 0 (the default) means to auto-detect using the kernel's
             block layer and a ZFS internal exception list.  I/O operations will
             be aligned to the specified size boundaries.  Additionally, the
             minimum (disk) write size will be set to the specified size, so
             this represents a space vs performance trade-off.  For optimal
             performance, the pool sector size should be greater than or equal
             to the sector size of the underlying disks.  The typical case for
             setting this property is when performance is important and the
             underlying disks use 4KiB sectors but report 512B sectors to the OS
             (for compatibility reasons); in that case, set ashift=12 (which is
             1<<12 = 4096). When set, this property is used as the default hint
             value in subsequent vdev operations (add, attach and replace).
             Changing this value will not modify any existing vdev, not even on
             disk replacement; however it can be used, for instance, to replace
             a dying 512B sectors disk with a newer 4KiB sectors device: this
             will probably result in bad performance but at the same time could
             prevent loss of data.
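
If ashift is the right knob here, I assume the invocation would look something like this (pool and device names below are just placeholders):

    # Create a pool whose minimum block size is 4096 bytes (1<<12),
    # regardless of the logical sector size the disk reports.
    zpool create -o ashift=12 examplepool c1t0d0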

I also found the following in the nvmeadm documentation:

nvmeadm format ctl[/ns] [lba-format]
       Formats the specified namespace or all namespaces of the specified
       controller.  This command implies a nvmeadm detach and subsequent nvmeadm
       attach of the specified namespace(s), which will cause a changed LBA
       format to be detected.  If no LBA format is specified the LBA format
       currently used by the namespace will be used.  When formatting all
       namespaces without specifying a LBA format the LBA format of namespace 1
       will be used.  A list of LBA formats supported by a namespace can be
       queried with the nvmeadm identify command.

       Note that not all devices support formatting individual or all
       namespaces, or support formatting at all.

       LBA formats using a non-zero metadata size are not supported by nvmeadm
       or nvme(4D).

       The list of supported LBA formats on a namespace can be retrieved with
       the nvmeadm identify command.
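
So presumably, picking and applying a 4 KiB LBA format would look roughly like this (the controller/namespace name is only an example, and the right format index has to come from the identify output):

    # List the LBA formats this namespace supports.
    nvmeadm identify nvme0/1
    # Reformat the namespace to the chosen 4096-byte LBA format
    # (this destroys any data in the namespace).
    nvmeadm format nvme0/1 <lba-format>
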
wesolows commented 10 months ago

> What's the recommended mechanism for sled agent to enforce sector size? ...

> * Is this issue suggesting that zpools should be formatted to use 4KiB sectors atop whatever the disk reports (e.g., using ashift=12)?
>
> * Is this issue suggesting that we should be formatting the NVMe drive using the equivalent of that nvmeadm command?

The original purpose of allowing one to specify ashift at creation time was to provide an upgrade path from 512n or 512e devices to 4kn devices without destroying the pool and restoring from backup. The more general advice I would offer here is that ashift should be set at creation time iff there is a product lifecycle requirement for data to be preserved across future vdev replacement where the replacement vdev may have a larger minimum block size than the vdev in use at creation time. By far the most common case, past and present, is where an existing device has a minimum LB size of 512 and a future replacement may have a minimum LB size of 4k, but other cases are also possible. If this requirement exists, ashift should be set to reflect the largest minimum LB size of any future replacement device that must be supported.
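
As a concrete illustration of that upgrade path (pool and device names are placeholders): a pool on 512-byte-sector devices that was created with ashift=12 can later do the following, which the default ashift of 9 would refuse:

    # The replacement device only supports 4 KiB blocks; this works
    # because the pool was already 4 KiB-aligned at creation time.
    zpool replace tank c1t0d0 c2t0d0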

> * If so, is there a stable interface for us to use? The output of `nvmeadm` implies it was not intended to be parsed by automated tools, but this seems necessary for "picking a format that has the right LBA data size". (I'm kinda assuming "no" on the stable interface to NVMe devices, but figured I'd ask)

At present, no. There is active work under way to provide a Committed library interface.

> * Is this issue suggesting we should do both?

Practically speaking, yes. Assuming that we are willing to guarantee that all SSDs we will ever support in a particular chassis will support the 4k block size, it is correct when creating pools in that chassis to set ashift=12. Assuming that all devices in the pool at creation time also support the 4k block size, each device's controller should be configured to use that block size (via nvmeadm format or, preferably, a future Committed library interface) prior to partitioning the device, creating pools, or otherwise storing any data on it.
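
Under those assumptions, a sketch of the per-device sequence (controller/namespace, LBA format index, and pool/device names are placeholders):

    # 1. Check that the namespace supports a 4096-byte LBA data size.
    nvmeadm identify nvme0/1
    # 2. Configure the controller to use it (destroys existing data).
    nvmeadm format nvme0/1 <lba-format>
    # 3. Only then partition the device and create the 4 KiB-aligned pool.
    zpool create -o ashift=12 <pool> <device>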

To the best of my knowledge, that matches the current reality: all current and future devices that will be supported do and will support the 4k block size. In general, we should examine the supported set of block sizes for each device we are about to configure for the first time (or reconfigure with data loss as an acceptable outcome) and ensure that the actual block size we are configuring is (a) supported, (b) reported as "Best" by the controller, or else considered a hard requirement on our part regardless of possible performance degradation, and (c) the same as that of the other devices, if any, that will be part of the pool. If these constraints can't be satisfied, or the block size we want/need is not supported, we need to fail and identify the device in question as faulty or unsupported.

In addition, ashift should be set to the larger of the largest block size among the devices forming the pool and the largest block size of any future replacement device that must be supported. By default, ashift will be the largest actual block size among the devices in the pool, which will in turn have been determined by the NVMe controller's configuration for each namespace. If we have not expressly configured the namespace's LB size, it will be whatever was set by the vendor at the factory, which is often 512 and therefore not what we want. The result will be a pool that looks like this:

            ashift=0000000000000009

If such a pool has multiple vdevs, replacing one of the vdevs with a device that does not support, or is not configured to use, 512-byte blocks will not be possible. In practice, if the pool has only a single vdev anyway, replacement always entails total data loss so this may not be a significant consideration; in that case, the configured NVMe namespace block size and zpool ashift property are relevant primarily for reasons of performance and possibly write amplification/flash lifetime.
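
For completeness, a quick way to confirm the end state on a configured device (names are placeholders):

    # The namespace's active LBA format shows up in the identify output.
    nvmeadm identify nvme0/1
    # The pool's ashift should read 12 rather than the 9 shown above.
    zpool get ashift examplepool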