openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

autoexpand doesn't expand grown AWS EC2 NVME drives #12505

Open grahamc opened 3 years ago

grahamc commented 3 years ago

System information

Type Version/Name
Distribution Name NixOS
Distribution Version 21.11pre
Kernel Version 5.10.60
Architecture x86_64
OpenZFS Version 2.1.0-1

Describe the problem you're observing

My apologies for the shape of this issue: it started as a report that online EBS volume resizing doesn't trigger autoexpand, and along the way I seem to have uncovered a handful of related bugs.

I have an AWS EC2 AMI with two EBS volumes:

  1. a partitioned boot disk whose FAT-formatted partition is mounted at /boot
  2. a zpool that owns the entire second disk and is used for /
[root@ip-172-31-43-243:~]# lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0    1G  0 disk 
├─nvme0n1p1 259:2    0    1M  0 part 
└─nvme0n1p2 259:3    0  997M  0 part /boot
nvme1n1     259:1    0   10G  0 disk 
├─nvme1n1p1 259:4    0    2G  0 part 
└─nvme1n1p9 259:5    0    8M  0 part 

The AMI is created with a 2G root disk, and then spawned as an EC2 instance with a 9G root:

[root@ip-172-31-43-243:~]# zpool list -v
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank       1.88G   880M  1.02G        -        7G     7%    45%  1.00x    ONLINE  -
  nvme1n1  1.88G   880M  1.02G        -        7G     7%  45.8%      -    ONLINE

I set autoexpand on the pool as part of creating the AMI:

[root@ip-172-31-43-243:~]# zpool get autoexpand tank
NAME  PROPERTY    VALUE   SOURCE
tank  autoexpand  on      local

I expected the zpool to autoexpand on startup / import, but it doesn't seem to do that.

Having looked into this before with a zpool on a partition, I suspected this is because autoexpand is normally triggered by udev and ZED rather than at import time. Hoping that was the case, I live-expanded the EBS volume while the instance was running.

After I initiated the expansion in the AWS console, I saw kernel events in journalctl -f:

Aug 23 19:52:47 ip-172-31-43-243.us-east-2.compute.internal kernel: nvme nvme1: rescanning namespaces.
Aug 23 19:52:47 ip-172-31-43-243.us-east-2.compute.internal kernel: nvme1n1: detected capacity change from 9663676416 to 10737418240
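
A quick way to double-check that the block layer itself picked up the new size (a sketch; output omitted):

# size in bytes as the kernel now sees it
blockdev --getsize64 /dev/nvme1n1

# or via sysfs, in 512-byte sectors
cat /sys/block/nvme1n1/size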

I also monitored udev for events and saw:

[root@ip-172-31-43-243:~]# udevadm monitor -p
monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[861.549142] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
SUBSYSTEM=block
RESIZE=1
DEVNAME=/dev/nvme1n1
DEVTYPE=disk
SEQNUM=5996
MAJOR=259
MINOR=1

UDEV  [861.931766] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
SUBSYSTEM=block
RESIZE=1
DEVNAME=/dev/nvme1n1
DEVTYPE=disk
SEQNUM=5996
USEC_INITIALIZED=1130787
PATH=/nix/store/2gpx9lp9r5qmlc24fnam2sv26xq1cc1w-udev-path/bin:/nix/store/2gpx9lp9r5qmlc24fnam2sv26xq1cc1w-udev-path/sbin
ID_SERIAL_SHORT=vol01deea71c31ed963a
ID_WWN=nvme.1d0f-766f6c3031646565613731633331656439363361-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001
ID_MODEL=Amazon Elastic Block Store
ID_REVISION=1.0
ID_SERIAL=Amazon Elastic Block Store_vol01deea71c31ed963a
ID_PATH=pci-0000:00:1f.0-nvme-1
ID_PATH_TAG=pci-0000_00_1f_0-nvme-1
ID_PART_TABLE_UUID=bee5d98c-a046-cc49-9cac-026f8320fffb
ID_PART_TABLE_TYPE=gpt
.ID_FS_TYPE_NEW=
ID_FS_TYPE=
MAJOR=259
MINOR=1
DEVLINKS=/dev/xvdb /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol01deea71c31ed963a-ns-1 /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol01deea71c31ed963a /dev/disk/by-id/nvme-nvme.1d0f-766f6c3031646565613731633331656439363361-416d617a6f6e20456c617374696320426c6f636b2053746f7265-00000001 /dev/disk/by-path/pci-0000:00:1f.0-nvme-1
TAGS=:systemd:
CURRENT_TAGS=:systemd:

However, my pool is still not expanded.

My next inclination was to have a oneshot systemd unit expand the disks on startup; however, this doesn't seem to be straightforward either: I think I need to manually enumerate every disk that may be expandable.
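
For reference, the kind of oneshot unit I have in mind would look roughly like this (a sketch only: the unit name and script path are made up, and on NixOS this would really be generated from the module system):

[Unit]
Description=Expand ZFS vdevs after the underlying disks have grown
# run after the pools have been imported
After=zfs-import.target
Wants=zfs-import.target

[Service]
Type=oneshot
# hypothetical script that enumerates expandable vdevs and onlines them
ExecStart=/etc/zfs/expand-pools.sh

[Install]
WantedBy=multi-user.target

Whatever that ExecStart script is, it still has to enumerate the vdevs itself.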

I have this snippet that I've used in the past when hosting ZFS pools on partitions:

for pool in $(zpool list -H | awk '{print $1}'); do
    for vdev in $(zpool list -vLPH "$pool" | awk '($1 ~ "^/dev" && $10 == "ONLINE") {print $1;}'); do
        zpool online -e "$pool" "$vdev";
    done
done

but this fails when an entire disk is dedicated to the pool, because -P seems to erroneously resolve the whole-disk vdev's path to its first partition:

[root@ip-172-31-43-243:~]# zpool list -vLH
tank    9.88G   880M    9.02G   -   -   1%  8%  1.00x   ONLINE  -
    nvme1n1 9.88G   880M    9.02G   -   -   1%  8.70%   -   ONLINE

[root@ip-172-31-43-243:~]# zpool list -vLHP
tank    9.88G   880M    9.02G   -   -   1%  8%  1.00x   ONLINE  -
    /dev/nvme1n1p1  9.88G   880M    9.02G   -   -   1%  8.70%   -   ONLINE

and zpool online -e chokes on that:

[root@ip-172-31-43-243:~]# zpool online -e tank /dev/nvme1n1

[root@ip-172-31-43-243:~]# zpool online -e tank /dev/nvme1n1p1
cannot expand /dev/nvme1n1p1: cannot relabel '/dev/nvme1n1p1': unable to read disk capacity

At this point I'm feeling stuck, without a reliable and generic way to make autoexpand happen on these machines.

Describe how to reproduce the problem

  1. Create an AMI with a zpool on dedicated whole disks and autoexpand=on (a minimal command sketch follows this list)
  2. Launch an instance from the AMI and check the size of the pool
  3. Grow the disks while the instance is online and note that the pool still does not grow
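
A minimal sketch of the pool side of step 1, using the same names as in the report (all of the EBS/AMI plumbing is elided):

# give the whole second disk to the pool and enable autoexpand up front
zpool create -o autoexpand=on tank /dev/nvme1n1

# after launching from the AMI with a larger volume, the extra space shows up
# under EXPANDSZ but never gets folded into SIZE
zpool list -v tank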
grahamc commented 3 years ago

As a bit more detail, I am curious about https://github.com/openzfs/zfs/commit/d441e85dd754ecc15659322b4d36796cbd3838de. It says:

... The ZED, which is monitoring udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

However, the disk that belongs to ZFS and receives the growth notification doesn't identify as zfs_member, although its partition does:

[root@ip-172-31-43-243:~]# blkid /dev/nvme1n1
/dev/nvme1n1: PTUUID="bee5d98c-a046-cc49-9cac-026f8320fffb" PTTYPE="gpt"

[root@ip-172-31-43-243:~]# blkid /dev/nvme1n1p1
/dev/nvme1n1p1: LABEL="tank" UUID="17154194770612704342" UUID_SUB="11351383998605785512" BLOCK_SIZE="4096" TYPE="zfs_member" PARTLABEL="zfs-7483cdb691eca6d9" PARTUUID="1b48f286-278a-e04d-8a50-46e43221766d"

It isn't clear to me (my own ignorance, my apologies) whether that commit handles this case.

grahamc commented 3 years ago

It looks like my best bet for getting a list of devices to pass to zpool online -e is:

$ zpool list -vH "$pool" | awk '($10 == "ONLINE") {print $1;}'
rpool
raidz1
ata-ST12000NM0008-2H3101_ZL001AMF
ata-ST2000DM001-1ER164_W4Z16WVT-part1
ata-ST2000DM001-1ER164_W4Z16WWV-part1
raidz1
ata-WDC_WD100EMAZ-00WJTA0_2YK1581D
ata-WDC_WD100EMAZ-00WJTA0_JEHR59DZ
ata-WDC_WD100EMAZ-00WJTA0_JEHRNYUZ

This is going to pass invalid entries (raidz1 and the pool name itself), but at least it properly lists every whole disk and partition.
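
One way to drop those non-device entries (a rough sketch, not something I've tested broadly) is to keep only the names that resolve to an actual block device node before handing them to zpool online -e:

for pool in $(zpool list -H -o name); do
    zpool list -vH "$pool" | awk '($10 == "ONLINE") {print $1;}' | while read -r name; do
        # leaf vdevs show up as short /dev names or by-id names; the pool row
        # and raidz/mirror rows resolve to neither, so they get skipped
        if [ -b "/dev/$name" ] || [ -b "/dev/disk/by-id/$name" ]; then
            zpool online -e "$pool" "$name"
        fi
    done
done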

grahamc commented 3 years ago

I've found another approach:

[grahamc@kif:~/.zpool.d]$ zpool status -vp -c zzzbogus rpool | awk '($2 == "ONLINE" && $6 == "THIS-IS-A-DEVICE-37108bec-aff6-4b58-9e5e-53c7c9766f05") {print $1;}'
ata-ST12000NM0008-2H3101_ZL001AMF
ata-ST2000DM001-1ER164_W4Z16WVT-part1
ata-ST2000DM001-1ER164_W4Z16WWV-part1
ata-WDC_WD100EMAZ-00WJTA0_2YK1581D
ata-WDC_WD100EMAZ-00WJTA0_JEHR59DZ
ata-WDC_WD100EMAZ-00WJTA0_JEHRNYUZ

[grahamc@kif:~/.zpool.d]$ cat zzzbogus 
#!/bin/sh

echo "THIS-IS-A-DEVICE-37108bec-aff6-4b58-9e5e-53c7c9766f05"
allanjude commented 3 years ago

I think what you want is:

for pool in $(zpool list -H | awk '{print $1}'); do
    for vdev in $(zpool list -H -vg "$pool" | awk '($10 == "ONLINE") {print $1;}'); do
        zpool online -e "$pool" "$vdev";
    done
done

Using the vdev GUID will avoid any ambiguity about the device name.

grahamc commented 3 years ago

I don't think that works out. For example:

[grahamc@kif:~]$ zpool list -H -vg "rpool"
rpool   32.7T   22.4T   10.3T   -   -   53% 68% 1.00x   ONLINE  -
    13394846054374141118    5.44T   5.06T   386G    -   -   68% 93.1%   -   ONLINE
    3919657051718159816 -   -   -   -   -   -   -   -   ONLINE
    3484453173638860084 -   -   -   -   -   -   -   -   ONLINE
    7241866703652226212 -   -   -   -   -   -   -   -   ONLINE
    2902602422871981357 27.3T   17.4T   9.90T   -   -   51% 63.7%   -   ONLINE
    15011291225760097991    -   -   -   -   -   -   -   -   ONLINE
    3017261739652091422 -   -   -   -   -   -   -   -   ONLINE
    3794566743564545168 -   -   -   -   -   -   -   -   ONLINE

[grahamc@kif:~]$ sudo zpool online -e rpool 2902602422871981357
cannot expand 2902602422871981357: operation not supported on this type of pool

[grahamc@kif:~]$ sudo zpool online -e rpool 15011291225760097991
cannot expand 15011291225760097991: no such device in pool
grahamc commented 3 years ago

On IRC, AllanJude suggested ZPOOL_VDEV_NAME_GUID=YES, but that doesn't appear to do it either:

[root@kif:~]# ZPOOL_VDEV_NAME_GUID=YES zpool online -e rpool 15011291225760097991
cannot expand 15011291225760097991: no such device in pool

[root@kif:~]# ZPOOL_VDEV_NAME_GUID=YES zpool online -e rpool 2902602422871981357
cannot expand 2902602422871981357: operation not supported on this type of pool

Note that the kernel and ZFS versions have bumped slightly since this ticket was opened:

[root@kif:~]# uname -a
Linux kif 5.14.14 #1-NixOS SMP Wed Oct 20 09:57:59 UTC 2021 x86_64 GNU/Linux

[root@kif:~]# zfs version
zfs-2.1.1-1
zfs-kmod-2.1.1-1
stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.