openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

After a recent upgrade Loaded module v2.2.99-244_g014265f4e6, and 6.6.3, encountered an uncorrectable I/O failure #15628

Open skinkie opened 10 months ago

skinkie commented 10 months ago

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version
Kernel Version        6.6.3-arch1-1
Architecture          x86_64
OpenZFS Version       zfs-kmod-2.2.99-244_g014265f4e6, 2.2.99-244_g014265f4e6

Describe the problem you're observing

Yesterday I upgraded the Arch Linux zfs-dkms-git package. Almost immediately, pools on USB disks became unavailable. The whole system hung during a send/receive run by syncoid. After a reboot in the morning, the disks on this system were no longer available due to "encountered an uncorrectable I/O failure and has been suspended". The NVMe disk has not (yet) had issues. A different system (6.0.7-gentoo, zfs-2.2.1-r0-gentoo) is able to import the disks.

Describe how to reproduce the problem

Unknown.

Include any warning/errors/backtraces from the system logs

zpool status -v
  pool: ahn4
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

    NAME                                  STATE     READ WRITE CKSUM
    ahn4                                  UNAVAIL      0     0     0  insufficient replicas
      ata-WDC_WD120EDBZ-11B1HA0_5QJ6BBMB  UNAVAIL      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended

  pool: roel
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

    NAME                                  STATE     READ WRITE CKSUM
    roel                                  UNAVAIL      0     0     0  insufficient replicas
      ata-WDC_WD140EDGZ-11B2DA2_2CGLS4HN  UNAVAIL      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
[ 2276.992051] WARNING: Pool 'roel' has encountered an uncorrectable I/O failure and has been suspended.

[ 2304.376399] WARNING: Pool 'ahn4' has encountered an uncorrectable I/O failure and has been suspended.

[ 2334.233429] INFO: task txg_sync:943 blocked for more than 122 seconds.
[ 2334.233435]       Tainted: P           OE      6.6.3-arch1-1 #1
[ 2334.233437] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2334.233438] task:txg_sync        state:D stack:0     pid:943   ppid:2      flags:0x00004000
[ 2334.233441] Call Trace:
[ 2334.233442]  <TASK>
[ 2334.233445]  __schedule+0x3e7/0x1410
[ 2334.233451]  schedule+0x5e/0xd0
[ 2334.233454]  schedule_timeout+0x98/0x160
[ 2334.233456]  ? __pfx_process_timeout+0x10/0x10
[ 2334.233459]  io_schedule_timeout+0x50/0x80
[ 2334.233462]  __cv_timedwait_common+0x12f/0x170 [spl 2d31695334a957bedb64286d3d5af142dc2b81d7]
[ 2334.233474]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 2334.233478]  __cv_timedwait_io+0x19/0x20 [spl 2d31695334a957bedb64286d3d5af142dc2b81d7]
[ 2334.233491]  zio_wait+0x14d/0x2d0 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.233702]  dsl_pool_sync+0x45d/0x520 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.233873]  spa_sync+0x596/0x1070 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.234043]  ? spa_txg_history_init_io+0x117/0x120 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.234202]  txg_sync_thread+0x1fe/0x3a0 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.234363]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs 7c89bfc0a48673f2c39e94366aa13a249df4f4cf]
[ 2334.234577]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl 2d31695334a957bedb64286d3d5af142dc2b81d7]
[ 2334.234588]  thread_generic_wrapper+0x5b/0x70 [spl 2d31695334a957bedb64286d3d5af142dc2b81d7]
[ 2334.234600]  kthread+0xe5/0x120
[ 2334.234603]  ? __pfx_kthread+0x10/0x10
[ 2334.234605]  ret_from_fork+0x31/0x50
[ 2334.234608]  ? __pfx_kthread+0x10/0x10
[ 2334.234610]  ret_from_fork_asm+0x1b/0x30
[ 2334.234615]  </TASK>

robn commented 10 months ago

Any errors from the kernel (dmesg) at the same time? Maybe a timeout?

Only the USB devices failing, right?

You say you upgraded, do you know the version of ZFS you upgraded from?
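
A minimal sketch of the dmesg check asked about above. The function name storage_hints and the keyword list are illustrative assumptions, not an official or exhaustive set. Note that the hung-task hint line in the report mentions hung_task_timeout_secs and would also match.

```shell
# Keep only kernel log lines that hint at USB or block-layer trouble.
# storage_hints is a hypothetical helper; the keywords are a guess at
# useful terms, not a complete list.
storage_hints() {
    grep -iE 'usb|uas|reset|timed out|timeout|i/o error'
}

# Typical use (needs permission to read the kernel log):
#   dmesg | storage_hints
```

Lines that survive the filter around the timestamps of the "has been suspended" warnings are the interesting ones.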

skinkie commented 10 months ago

> Any errors from the kernel (dmesg) at the same time? Maybe a timeout?

None, only the above.

> Only the USB devices failing, right?

Exactly.

> You say you upgraded, do you know the version of ZFS you upgraded from?

My hunch would be:

[2023-12-03T18:20:13+0100] [ALPM-SCRIPTLET] ==> WARNING: `dkms install --no-depmod digimend-kernel-drivers/11.r0.gae07a3d -k 6.6.3-arch1-1' exited 10
[2023-12-03T18:20:13+0100] [ALPM-SCRIPTLET] ==> dkms install --no-depmod zfs/2.2.99.r205.g3a8d9b8487 -k 6.6.3-arch1-1

robn commented 10 months ago

So something between 3a8d9b848 and 014265f4e. Nothing jumps out at me as obvious. Are you able to bisect?
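
For a rough idea of the effort involved: the version strings in this thread embed commit counts (244 in 2.2.99-244_g014265f4e6 and 205 in 2.2.99.r205.g3a8d9b8487). Assuming both counts are relative to the same tag and the history between them is close to linear, about 39 commits separate the two builds. The sketch below estimates the worst-case number of build-and-test rounds as the smallest k with 2^k >= n. The function name bisect_steps is made up for illustration, and git's own estimate may differ slightly.

```shell
# Estimate the worst-case number of bisect rounds for n candidate commits,
# i.e. the smallest k such that 2^k >= n.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# 244 and 205 are the commit counts embedded in the two version strings.
bisect_steps $((244 - 205))
```

Under those assumptions, this works out to about six rounds.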

skinkie commented 10 months ago

> So something between 3a8d9b8 and 014265f. Nothing jumps out at me as obvious. Are you able to bisect?

I first want to be sure there was no related kernel update. Some tips for bisecting ZFS would be appreciated. Is it just modprobe -r zfs spl and try again?
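
One way to check for a kernel update is to look at pacman's upgrade records. This is a sketch, assuming the default log location /var/log/pacman.log and the usual "[ALPM] upgraded <package> (<old> -> <new>)" line format. The function name kernel_zfs_upgrades is made up for illustration.

```shell
# Print upgrade records for packages whose names start with "linux" or "zfs".
# Reads pacman log lines on stdin. kernel_zfs_upgrades is a hypothetical helper.
kernel_zfs_upgrades() {
    grep -E '\[ALPM\] upgraded (linux|zfs)[^ ]* '
}

# Assumed default log location on Arch Linux; skipped if not readable.
if [ -r /var/log/pacman.log ]; then
    kernel_zfs_upgrades < /var/log/pacman.log | tail -n 20
fi
```

If a linux package shows an upgrade in the same transaction as zfs-dkms-git, the kernel is a second variable to rule out before bisecting ZFS alone.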

robn commented 10 months ago

Yeah, something like that. Off the top of my head:

# bisect start (bad commit first, then good)
git bisect start 014265f 3a8d9b8

# stop zfs
zpool export -a
modprobe -r zfs spl

# build and install
# (or whatever, if you've got a package manager etc. doing it)
./autogen.sh && ./configure && make && make install

# start zfs
modprobe zfs
zpool import -a

# do the thing that makes it break or not; if it breaks, reboot probably :(

# record the result
git bisect good   # or: git bisect bad

# repeat from "stop zfs" until it can pin one commit down

It's potentially a lot of pool crashing, though, so the usual warnings apply about critical data, backups, etc.
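
Given that risk, it may help to rehearse one round without touching anything. The sketch below wraps the stop/build/start commands from this comment in a function that only prints them while DRY_RUN is 1 (the default here). The names run, bisect_round and DRY_RUN are illustrative, not part of ZFS or git.

```shell
#!/bin/sh
# One bisect round as a function. While DRY_RUN is 1 the commands are only
# printed, so pools and kernel modules are left alone.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bisect_round() {
    # stop zfs
    run zpool export -a || return 1
    run modprobe -r zfs spl || return 1
    # build and install the commit git bisect has checked out
    run ./autogen.sh || return 1
    run ./configure || return 1
    run make || return 1
    run make install || return 1
    # start zfs again
    run modprobe zfs || return 1
    run zpool import -a || return 1
}

bisect_round
```

Running it with DRY_RUN=0 as root from the ZFS source tree would execute the commands for real. The reproduction attempt and the git bisect good or bad call stay manual.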