openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS Pool gets stuck in I/O error state after disk resizing #9830

Closed by sonicaj 2 years ago

sonicaj commented 4 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version bullseye/sid
Linux Kernel 5.2.0-3-amd64
Architecture x86_64
ZFS Version 0.8.2-3
SPL Version 0.8.2-3

Describe the problem you're observing

My objective is to increase the partition size for disks I have in a pool and have ZFS register the new size. I followed these steps:

root@localhost:~# fdisk -l /dev/sde
Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BF76D87C-A9D4-4C77-BEFC-6BA405AE9AA0

Device       Start      End  Sectors Size Type
/dev/sde1      128  6291456  6291329   3G Linux swap
/dev/sde2  6291584 23068672 16777089   8G Solaris /usr & Apple ZFS
root@localhost:~#
root@localhost:~# zpool create works /dev/sde2
root@localhost:~# zpool list works
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
works  7.50G   114K  7.50G        -         -     0%     0%  1.00x    ONLINE  -
root@localhost:~# zpool status works
  pool: works
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    works       ONLINE       0     0     0
      sde2      ONLINE       0     0     0

errors: No known data errors
root@localhost:~# sgdisk -d 2 -n 2:0:0 -t 2:BF01 /dev/sde
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@localhost:~# partprobe /dev/sde
root@localhost:~# zpool online -e works 11751735547669965113
root@localhost:~# zpool list works
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
works  16.5G   138K  16.5G        -         -     0%     0%  1.00x    ONLINE  -
root@localhost:~#

The above steps worked: ZFS successfully picked up the updated partition size. Moving on, I tried the following steps, which differ only in how I created the pool:

root@localhost:~# fdisk -l /dev/sdb
Disk /dev/sdb: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: VMware Virtual S
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 343A3861-9E31-47C3-8ECD-834F6505F1C5

Device       Start      End  Sectors Size Type
/dev/sdb1      128  6291456  6291329   3G Linux swap
/dev/sdb2  6291584 23068672 16777089   8G Solaris /usr & Apple ZFS
root@localhost:~#
root@localhost:~#
root@localhost:~# zpool create fail /dev/disk/by-partuuid/f8370575-5810-42ba-843d-03c906c6728a
root@localhost:~# zpool get autoexpand fail
NAME  PROPERTY    VALUE   SOURCE
fail  autoexpand  off     default
root@localhost:~#
root@localhost:~# sgdisk -d 2 -n 2:0:0 -t 2:BF01 /dev/sdb
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.
root@localhost:~# partprobe /dev/sdb
root@localhost:~# zpool online -e fail 12431288832545957551

The last zpool online command hangs forever. Running zpool status in another terminal shows:

root@localhost:~# zpool status -v fail
  pool: fail
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

    NAME                                    STATE     READ WRITE CKSUM
    fail                                    UNAVAIL      0     0     0  insufficient replicas
      f8370575-5810-42ba-843d-03c906c6728a  UNAVAIL      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
root@localhost:~#

Can someone please advise if I am handling this the wrong way? It works when I specify the partition path, but not when I use the partition UUID.

I also see similar results when autoexpand is on.

Describe how to reproduce the problem

1) Format the disk with GPT partitions
2) zpool create failed_pool /dev/disk/by-partuuid/uuid_of_partition
3) sgdisk -d 2 -n 2:0:0 -t 2:BF01 -u 2:existing_uuid_of_partition /dev/sdb (edit the disk path and partition number to match your system)
4) partprobe /dev/sdb
5) zpool online -e failed_pool GUID_of_device

Looking forward to hearing from you guys, thank you!

sonicaj commented 4 years ago

Hi. Can someone please point me in the right direction if I am perhaps doing this the wrong way? 🤔

Any help would be much appreciated. Thank you

drescherjm commented 4 years ago

While this could be a bug, the mailing list is for user support questions.

sonicaj commented 4 years ago

@drescherjm this looks like a bug, which is why I posted it as an issue :)

sonicaj commented 4 years ago

This works on FreeBSD; it broke for me when I made the shift to ZoL.

behlendorf commented 4 years ago

It sounds like this only occurs for the by-partuuid paths, and not the other by-* paths. What I'd suggest checking is that the uuid_of_partition remains the same after running sgdisk and partprobe. During the zpool online -e the device will need to be closed and reopened. If the path to the device has changed due to a new UUID, it will be considered unavailable.
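One way to run that check is sketched below. This is a hedged sketch, not an official procedure: /dev/sdb and /dev/sdb2 are hypothetical, and the disk-touching commands require root and a real device, so they appear as comments.

```shell
# Helper: report whether two PARTUUID strings match.
same_uuid() { [ "$1" = "$2" ]; }

# Sketch of the check (requires root; /dev/sdb and /dev/sdb2 are hypothetical):
#   before=$(blkid -s PARTUUID -o value /dev/sdb2)
#   sgdisk -d 2 -n 2:0:0 -t 2:BF01 -u 2:"$before" /dev/sdb
#   partprobe /dev/sdb
#   after=$(blkid -s PARTUUID -o value /dev/sdb2)
#   if same_uuid "$before" "$after"; then echo preserved; else echo changed; fi
```

If the two values differ, the by-partuuid symlink that ZFS recorded would point at nothing after the rewrite, which would explain an unavailable vdev.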

sonicaj commented 4 years ago

@behlendorf I ensured that the UUID of the partition in question remains the same before and after the sgdisk operation; indeed, it wouldn't be available to ZFS otherwise. If there's anything else you would like me to check, please feel free to ask. Thank you for your response.

behlendorf commented 4 years ago

@sonicaj after the failure can you please check the /proc/spl/kstat/zfs/dbgmsg log. It should provide some additional information about why the device could not be reopened.

sonicaj commented 4 years ago

1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 create pool version 5000; software version unknown; uts localhost 5.2.0-3-amd64 #1 SMP Debian 5.2.17-1 (2019-09-26) x86_64
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@async_destroy=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@empty_bpobj=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@lz4_compress=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@multi_vdev_crash_dump=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@spacemap_histogram=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@enabled_txg=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@hole_birth=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@extensible_dataset=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@embedded_data=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@bookmarks=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@filesystem_limits=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@large_blocks=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@large_dnode=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@sha512=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@skein=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@edonr=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@userobj_accounting=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@encryption=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@project_quota=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@device_removal=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@obsolete_counts=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@zpool_checkpoint=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@spacemap_v2=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@allocation_classes=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@resilver_defer=enabled
1579278631   spa_history.c:319:spa_history_log_sync(): txg 4 set feature@bookmark_v2=enabled
1579278631   mmp.c:249:mmp_thread_start(): MMP thread started pool 'failed_pool' gethrtime 25110035470858
1579278631   spa_history.c:306:spa_history_log_sync(): command: zpool create -f failed_pool /dev/disk/by-partuuid/6b1673f3-1b0a-42ca-9d31-97bdede50948
1579278687   metaslab.c:2667:metaslab_condense(): condensing: txg 29161, msp[8] ffff9a2bcb0b1800, vdev id 0, spa freenas-boot, smp size 16488, segments 503, forcing condense=FALSE
1579278706   vdev.c:124:vdev_dbgmsg(): disk vdev '/dev/disk/by-partuuid/6b1673f3-1b0a-42ca-9d31-97bdede50948': open error=2 count=100
1579278706   spa.c:7577:spa_async_request(): spa=failed_pool async request task=1
1579278706   zio.c:3479:zio_dva_allocate(): failed_pool: metaslab allocation failure: zio ffff9a2bd8fbea90, size 1024, error 28
1579278706   zio.c:3479:zio_dva_allocate(): failed_pool: metaslab allocation failure: zio ffff9a2bd2b3cd80, size 2048, error 28
1579278706   zio.c:3479:zio_dva_allocate(): failed_pool: metaslab allocation failure: zio ffff9a2bd2b3a6c0, size 1024, error 28
1579278706   zio.c:3479:zio_dva_allocate(): failed_pool: metaslab allocation failure: zio ffff9a2bd2b3c3d0, size 512, error 28

Kindly let me know if you would like more output. This is from the end of the log; the entries before it are older, from before I tried the reproduction case again.

sonicaj commented 4 years ago

I think sgdisk deletes the partition entry and then recreates it with the same partition UUID. Do you think the delay in between causes this ZFS behaviour?

behlendorf commented 4 years ago

It's definitely possible. The reopen logic expects the symlink to be removed and recreated, and will wait up to a full second for this to happen if needed. Here's the relevant line from the log. After trying to open /dev/disk/by-partuuid/6b1673f3-1b0a-42ca-9d31-97bdede50948 repeatedly for a full second, and always failing with ENOENT, the reopen was allowed to fail.

1579278706   vdev.c:124:vdev_dbgmsg(): disk vdev '/dev/disk/by-partuuid/6b1673f3-1b0a-42ca-9d31-97bdede50948': open error=2 count=100
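The retry-until-timeout behaviour described above can be sketched in shell. This is a simplified model of the wait, not the actual kernel code; the 1000 ms default mirrors zfs_vdev_open_timeout_ms, and the ~10 ms interval matches the count=100 attempts seen in the log.

```shell
# Poll for a device path until it appears or the timeout expires,
# mirroring the vdev reopen logic's wait for the udev symlink.
# Returns 0 if the path shows up, 2 (ENOENT-style) on timeout.
wait_for_path() {
    path=$1
    timeout_ms=${2:-1000}   # default mirrors zfs_vdev_open_timeout_ms
    waited_ms=0
    while [ ! -e "$path" ]; do
        if [ "$waited_ms" -ge "$timeout_ms" ]; then
            return 2        # gave up: path never reappeared
        fi
        sleep 0.01          # ~10 ms between open attempts
        waited_ms=$((waited_ms + 10))
    done
    return 0
}
```

If sgdisk plus partprobe leaves the by-partuuid symlink absent for longer than the timeout, every open attempt fails with ENOENT and the reopen gives up, which matches the open error=2 count=100 line above.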
sonicaj commented 4 years ago

But why doesn't that happen with the other by-* paths? When sgdisk removes the partition entry, the /dev/sdb2 node should also disappear since the partition no longer exists, or maybe Linux holds on to it for a while? By the way, this works on FreeBSD. Do you think ZoL could incorporate handling this case gracefully?

behlendorf commented 4 years ago

But why doesn't that happen with other by-* paths?

That's a good question. ZoL doesn't treat the other by-* paths any differently, so I'm not sure why you would only see this with by-partuuid. It may be that something differs in the udev rules which generate these links. You may get a better idea of what's going on by using udevadm monitor for visibility into how the rules are executed.

If you're comfortable rebuilding ZFS from source, you could increase the allowed timeout to see whether waiting longer helps, or whether there's some other issue. If you clone the master source from GitHub, you would need to increase the zfs_vdev_open_timeout_ms value in vdev_disk.c:

/*
 * Wait up to zfs_vdev_open_timeout_ms milliseconds before determining the
 * device is missing. The missing path may be transient since the links
 * can be briefly removed and recreated in response to udev events.
 */
static unsigned zfs_vdev_open_timeout_ms = 1000;

The way each platform handles block devices is very different. We've made the code as similar as possible, but FreeBSD and Linux each have their own challenges.

sonicaj commented 4 years ago

@behlendorf I'll try the suggested fix this weekend and see if I still hit the issue. I am thinking of changing it to 3-4 seconds, roughly the time sgdisk takes. Will let you know how it goes, thank you.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

sonicaj commented 2 years ago

@behlendorf thank you for looking into it. I was testing this on the latest Debian bullseye version and have not been able to reproduce it anymore. Closing.