leftwo opened 2 years ago
Following up on this issue:

As of https://github.com/oxidecomputer/omicron/pull/2176, the sled agent issues `zpool create` on U.2 devices that are missing a GPT, or that have a GPT but lack a zpool in the first partition.

What's the recommended mechanism for sled agent to enforce sector size?
I found the following in the zpool documentation:

```
The following properties can be set at creation time and import time, and
later changed with the zpool set command:

ashift=ashift
    Pool sector size exponent, to the power of 2 (internally referred
    to as ashift). Values from 9 to 16, inclusive, are valid; also,
    the value 0 (the default) means to auto-detect using the kernel's
    block layer and a ZFS internal exception list. I/O operations will
    be aligned to the specified size boundaries. Additionally, the
    minimum (disk) write size will be set to the specified size, so
    this represents a space vs performance trade-off. For optimal
    performance, the pool sector size should be greater than or equal
    to the sector size of the underlying disks. The typical case for
    setting this property is when performance is important and the
    underlying disks use 4KiB sectors but report 512B sectors to the OS
    (for compatibility reasons); in that case, set ashift=12 (which is
    1<<12 = 4096). When set, this property is used as the default hint
    value in subsequent vdev operations (add, attach and replace).
    Changing this value will not modify any existing vdev, not even on
    disk replacement; however it can be used, for instance, to replace
    a dying 512B sectors disk with a newer 4KiB sectors device: this
    will probably result in bad performance but at the same time could
    prevent loss of data.
```
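For concreteness, a minimal sketch of setting this property at pool creation (the pool name and device path here are made up for illustration):

```sh
# Force a 4 KiB minimum block size (2^12) regardless of what sector
# size the device reports. Per the text above, setting ashift later
# will not modify any vdev that already exists, so do it at creation.
zpool create -o ashift=12 oxp_example c1t0d0
```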
I also found the following in the nvmeadm documentation:

```
nvmeadm format ctl[/ns] [lba-format]
    Formats the specified namespace or all namespaces of the specified
    controller. This command implies a nvmeadm detach and subsequent
    nvmeadm attach of the specified namespace(s), which will cause a
    changed LBA format to be detected. If no LBA format is specified
    the LBA format currently used by the namespace will be used. When
    formatting all namespaces without specifying a LBA format the LBA
    format of namespace 1 will be used. A list of LBA formats supported
    by a namespace can be queried with the nvmeadm identify command.

    Note that not all devices support formatting individual or all
    namespaces, or support formatting at all.

    LBA formats using a non-zero metadata size are not supported by
    nvmeadm or nvme(4D).

    The list of supported LBA formats on a namespace can be retrieved
    with the nvmeadm identify command.
```
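As a sketch of how those two commands fit together (the controller/namespace name and the format index are illustrative; the correct index must be read from the identify output):

```sh
# List the LBA formats the namespace supports (output is device-specific).
nvmeadm identify nvme0/1

# Reformat namespace 1 to the LBA format whose data size is 4096 bytes.
# The index (3 here) is hypothetical; pick it from the identify output.
# NOTE: formatting destroys all data in the namespace.
nvmeadm format nvme0/1 3
```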
* Is this issue suggesting that zpools should be formatted to use 4KiB sectors atop whatever the disk reports (e.g., using `ashift=12`)?
* Is this issue suggesting that we should be formatting the NVMe drive using the equivalent of that `nvmeadm` command?
* If so, is there a stable interface for us to use? The output of `nvmeadm` implies it was not intended to be parsed by automated tools, but this seems necessary for "picking a format that has the right LBA data size". (I'm kinda assuming "no" on the stable interface to NVMe devices, but figured I'd ask)
* Is this issue suggesting we should do both?

> What's the recommended mechanism for sled agent to enforce sector size? ...
>
> * Is this issue suggesting that zpools should be formatted to use 4KiB sectors atop whatever the disk reports (e.g., using `ashift=12`)?
> * Is this issue suggesting that we should be formatting the NVMe drive using the equivalent of that `nvmeadm` command?
The original purpose of allowing one to specify `ashift` at creation time was to provide an upgrade path from 512n or 512e devices to 4kn devices without destroying the pool and restoring from backup. The more general advice I would offer here is that `ashift` should be set at creation time iff there is a product lifecycle requirement for data to be preserved across future vdev replacement where the replacement vdev may have a larger minimum block size than the vdev in use at creation time. By far the most common case, past and present, is where an existing device has a minimum LB size of 512 and a future replacement may have a minimum LB size of 4k, but other cases are also possible. If this requirement exists, `ashift` should be set to reflect the largest minimum LB size of any future replacement device that must be supported.
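To illustrate that upgrade path concretely (pool and device names are hypothetical):

```sh
# Create the pool with 4 KiB alignment even though c0t0d0 is a 512e
# device, so that a 4Kn replacement remains possible later.
zpool create -o ashift=12 tank c0t0d0

# Later, replace the aging 512e device with a 4Kn device. This
# succeeds only because ashift was already >= 12 at creation time.
zpool replace tank c0t0d0 c0t5d0
```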
> * If so, is there a stable interface for us to use? The output of `nvmeadm` implies it was not intended to be parsed by automated tools, but this seems necessary for "picking a format that has the right LBA data size". (I'm kinda assuming "no" on the stable interface to NVMe devices, but figured I'd ask)
At present, no. There is active work under way to provide a Committed library interface.
> * Is this issue suggesting we should do both?
Practically speaking, yes. Assuming that we are willing to guarantee that all SSDs we will ever support in a particular chassis will support the 4k block size, it is correct when creating pools in that chassis to set `ashift=12`. Assuming that all devices in the pool at creation time also support the 4k block size, each device's controller should be configured to use that block size (as via `nvmeadm format`, or for preference a future Committed library interface) prior to partitioning the device, creating pools, or otherwise storing any data on it.
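A sketch of that ordering, reusing the hypothetical names from the earlier snippets (the LBA-format index must come from `nvmeadm identify`):

```sh
# 1. Set the namespace to 4096-byte blocks before any data lands on
#    the device (destructive; format index 3 is illustrative only).
nvmeadm format nvme0/1 3

# 2. Only afterwards partition the device and create the pool,
#    pinning ashift for the benefit of future replacement devices.
zpool create -o ashift=12 oxp_example c1t0d0
```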
To the best of my knowledge, that matches the current reality: all current and future devices that will be supported do and will support the 4k block size. In general, we should examine the supported set of block sizes for each device we are about to configure for the first time (or reconfigure with data loss as an acceptable outcome) and ensure that the block size we are actually configuring is (a) supported, (b) reported as "Best" by the controller, or else considered a hard requirement regardless of possible performance degradation, and (c) the same as on the other devices, if any, that will be part of the pool. If these constraints can't be satisfied or the block size we want/need is not supported, we need to fail and identify the device in question as faulty or unsupported. In addition, `ashift` should be set to the larger of the largest block size among the devices forming the pool and the largest block size of any future replacement device that must be supported. By default, it will be set to the largest actual block size among the devices in the pool, which will in turn have been determined by the NVMe controller's configuration for that namespace. If we have not expressly configured the namespace's LB size, it will be whatever the vendor set at the factory, which is often 512 and therefore not what we want. The result will be a pool that looks like this:
```
ashift=0000000000000009
```
If such a pool has multiple vdevs, replacing one of the vdevs with a device that does not support, or is not configured to use, 512-byte blocks will not be possible. In practice, if the pool has only a single vdev anyway, replacement always entails total data loss, so this may not be a significant consideration; in that case, the configured NVMe namespace block size and the zpool `ashift` property are relevant primarily for reasons of performance and possibly write amplification/flash lifetime.
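Since the documentation quoted above treats `ashift` as an ordinary pool property, the outcome can be checked after the fact (pool name is an example):

```sh
# Report the pool's configured ashift: 9 would mean 512-byte minimum
# blocks; 12 means 4096-byte minimum blocks.
zpool get ashift oxp_example
```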
New NVMe drives should be formatted with a 4096-byte logical sector size.