openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

`zpool clear` heroically dies restoring the pool, but doesn't save your previously queued IO #16175

Open rincebrain opened 6 months ago

rincebrain commented 6 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.10.0-28-amd64 |
| Architecture | x86_64 |
| OpenZFS Version | 04bae5ec9 |

Describe the problem you're observing

If I create a pool on a USB hard drive using the name /dev/disk/by-id/mydrive-part1 (pointing to /dev/sda1), then unplug it and replug it, ZFS will say the pool is SUSPENDED because the disk is gone.

Great. The device got the name sdb, since sda was still being held open when the disk reappeared. I zpool clear that device (by guid, since the device's name won't work, which is a separate complaint), and after one or two iterations of zpool clear pool or zpool clear pool guid it marks the pool and device ONLINE, hurray.

...except the clear command never returns, IO from before the hotplug is hung forever, the old device name is still present (i.e. something is still holding references to it), and the stack trace is in zio_resume.

New writes to the pool appear to succeed, but old IO is stuck forever. zfs_ioc_clear is blocked in zio_resume, which is blocked in zio_wait, and txg_sync and cp say they're blocked on:

```
[  605.497469] INFO: task txg_sync:3137 blocked for more than 241 seconds.
[  605.498342]       Tainted: P           OE     5.10.0-28-amd64 #1 Debian 5.10.209-2
[  605.499169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.499995] task:txg_sync        state:D stack:    0 pid: 3137 ppid:     2 flags:0x00004000
[  605.500846] Call Trace:
[  605.501692]  __schedule+0x282/0x870
[  605.502530]  schedule+0x46/0xb0
[  605.503371]  schedule_timeout+0x8b/0x150
[  605.504213]  ? __next_timer_interrupt+0x110/0x110
[  605.505049]  io_schedule_timeout+0x4c/0x80
[  605.505905]  __cv_timedwait_common+0x12f/0x170 [spl]
[  605.506761]  ? add_wait_queue_exclusive+0x70/0x70
[  605.507623]  __cv_timedwait_io+0x15/0x20 [spl]
[  605.508542]  zio_wait+0x13c/0x2d0 [zfs]
[  605.509455]  dsl_pool_sync+0xf5/0x530 [zfs]
[  605.510354]  spa_sync+0x55a/0x1000 [zfs]
[  605.511175]  ? mutex_lock+0xe/0x30
[  605.512041]  ? spa_txg_history_init_io+0x105/0x110 [zfs]
[  605.512917]  txg_sync_thread+0x212/0x3b0 [zfs]
[  605.513789]  ? txg_quiesce_thread+0x330/0x330 [zfs]
[  605.514607]  thread_generic_wrapper+0x6c/0x80 [spl]
[  605.515409]  ? spl_taskq_fini+0x80/0x80 [spl]
[  605.516199]  kthread+0x118/0x140
[  605.516954]  ? __kthread_bind_mask+0x60/0x60
[  605.517718]  ret_from_fork+0x1f/0x30
[  605.518492] INFO: task cp:3254 blocked for more than 241 seconds.
[  605.519266]       Tainted: P           OE     5.10.0-28-amd64 #1 Debian 5.10.209-2
[  605.520061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.520870] task:cp              state:D stack:    0 pid: 3254 ppid:  3253 flags:0x00004004
[  605.521702] Call Trace:
[  605.522523]  __schedule+0x282/0x870
[  605.523339]  ? __schedule+0x28a/0x870
[  605.524151]  schedule+0x46/0xb0
[  605.524958]  io_schedule+0x42/0x70
[  605.525766]  cv_wait_common+0xac/0x130 [spl]
[  605.526561]  ? add_wait_queue_exclusive+0x70/0x70
[  605.527424]  txg_wait_synced_impl+0xcd/0x120 [zfs]
[  605.528289]  txg_wait_synced+0xc/0x40 [zfs]
[  605.529159]  dmu_tx_wait+0x209/0x400 [zfs]
[  605.530022]  dmu_tx_assign+0x16b/0x4e0 [zfs]
[  605.530880]  zfs_write+0x4be/0xd10 [zfs]
[  605.531687]  ? xas_load+0x5/0x80
[  605.532475]  ? current_time+0x2b/0xf0
[  605.533251]  ? touch_atime+0x33/0x140
[  605.534069]  zpl_iter_write+0x103/0x170 [zfs]
[  605.534839]  new_sync_write+0x119/0x1b0
[  605.535590]  vfs_write+0x1ce/0x260
[  605.536337]  ksys_write+0x5f/0xe0
[  605.537065]  do_syscall_64+0x30/0x80
[  605.537783]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
```

So I assume it's some dance where nothing told the old zios that their reference to the disk they're trying to write to is stale. (zpool reopen doesn't save you.)

Describe how to reproduce the problem

Above.
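
To make the reproduction concrete, here is a rough sketch of the sequence described above, with a hypothetical pool name (usbpool) and a hypothetical by-id path; adjust both to the actual hardware:

```sh
# Hypothetical names throughout; substitute your actual by-id path.
zpool create usbpool /dev/disk/by-id/usb-SomeVendor_SomeDrive-0:0-part1

# Start some writes so there is IO in flight when the device disappears.
cp -a /some/big/tree /usbpool/ &

# Physically unplug the USB drive, wait a moment, then plug it back in.
# The disk typically comes back as sdb (sda is still held open) and the
# pool shows SUSPENDED:
zpool status usbpool

# Clearing by the old device name doesn't work; find the vdev guid and
# clear by guid instead (may take a couple of attempts):
zpool status -g usbpool
zpool clear usbpool <vdev-guid>

# Pool and vdev now report ONLINE, but this zpool clear never returns and
# the cp started before the unplug stays blocked forever.
```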

Include any warning/errors/backtraces from the system logs

Above.
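
For completeness, the blocked-task stacks above can be re-captured on a hung system with something like the following (requires root; the pgrep-based pid lookup is just an example):

```sh
# Dump stacks of all uninterruptible (D-state) tasks to the kernel log.
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Or read the current stack of a specific blocked task directly.
cat /proc/$(pgrep -x txg_sync)/stack
cat /proc/$(pgrep -x cp)/stack
```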

GregorKopka commented 5 months ago

As a dirty hack: what about making zpool clear generally re-queue all in-flight ZIO for the pool?

rincebrain commented 5 months ago

It turns out to be very messy and buggy to do that, when I tried it. (Also, I think some IOs aren't necessarily safe to requeue - you'd basically need to reissue the entire txg's worth of IOs to the device, I think? e.g. if you had more than one queue's worth of IOs to issue, and the device went away while they were in its write cache, whether they actually landed depends on how "away" the device went... and you might no longer have them in queue to reissue, since nominally it claimed they went out.)

It's also somewhat difficult to test, since Linux's behavior on hot-removing a device in use is not...always consistent, and you're not always at the same point in the IO pipeline when you do it.
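
(As an aside, a somewhat more repeatable stand-in for yanking a USB cable is a scsi_debug device; a rough sketch follows, where sdX and hostN are placeholders for whatever the fake disk shows up as, and this only approximates a real USB hot-unplug:)

```sh
# Create a small fake SCSI disk backed by RAM.
modprobe scsi_debug dev_size_mb=512

# Identify the new disk (model "scsi_debug"), e.g.:
grep -l scsi_debug /sys/block/sd*/device/model

# Build a pool on it and start some IO, then simulate removal:
echo 1 > /sys/block/sdX/device/delete

# Simulate re-plugging by rescanning the SCSI host it was attached to:
echo "- - -" > /sys/class/scsi_host/hostN/scan
```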

GregorKopka commented 5 months ago

ZIOs no longer in the queue don't matter: they can no longer block, and we don't have the information to recreate them anyway...

But everything still tracked by ZFS could simply be restarted (as it's not on disk yet and/or something is waiting for it) to remove the deadlock and actually persist all data we still have access to... possibly this restart operation could auto-trigger a partial scrub (from the last known-good TXG, or a few back to be sure), to automatically check for writes that might have made it out of the ZIO pipeline but got lost on the way to stable storage.

🤔
