openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

`metaslab_force_ganging` + encryption = suspended pool #14416

Open rincebrain opened 1 year ago

rincebrain commented 1 year ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 11 |
| Kernel Version | 5.10.0-19-amd64 |
| Architecture | amd64 |
| OpenZFS Version | 2.1.7/2.1.8 at least |

Describe the problem you're observing

While trying to reproduce #14413, I set this tunable in lieu of forcibly fragmenting the pool, and sure enough, the pool suspends almost immediately once an encrypted dataset is present.

Going back to 2.1.7 to confirm it's not an issue there (per that report's guess at which commit to blame), and... it still suspends.

Uh-oh.

Describe how to reproduce the problem

("somefile" below is a 100G sparse file; works fine with actual disks too)

$ sudo zpool create testme /somefile -o ashift=9 -f
$ sudo zfs create -o encryption=aes-256-gcm testme/badplan3 -o keyformat=raw -o keylocation=file:///some/key
$ echo 1 | sudo tee /sys/module/zfs/parameters/metaslab_force_ganging
$ sudo touch /testme/badplan3/meow
$ sudo touch /testme/badplan3/meow2
$ sudo zpool status -v
  pool: testme
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

        NAME              STATE     READ WRITE CKSUM
        testme            ONLINE       0     0     0
          /somefile       ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended

And even if you reset the tunable and try `zpool clear`, the pool will not, in my experience, come back.

Include any warning/errors/backtraces from the system logs

dmesg now has a few lovely entries like:

[   57.426180] WARNING: Pool 'testme' has encountered an uncorrectable I/O failure and has been suspended.

[  243.006324] INFO: task txg_sync:2965 blocked for more than 120 seconds.
[  243.006352]       Tainted: P           OE     5.10.0-19-amd64 #1 Debian 5.10.149-2
[  243.006374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.006396] task:txg_sync        state:D stack:    0 pid: 2965 ppid:     2 flags:0x00004000
[  243.006420] Call Trace:
[  243.006435]  __schedule+0x282/0x880
[  243.006448]  schedule+0x46/0xb0
[  243.006460]  schedule_timeout+0x8b/0x150
[  243.006473]  ? __next_timer_interrupt+0x110/0x110
[  243.006488]  io_schedule_timeout+0x4c/0x80
[  243.006506]  __cv_timedwait_common+0x12f/0x170 [spl]
[  243.006524]  ? add_wait_queue_exclusive+0x70/0x70
[  243.006540]  __cv_timedwait_io+0x15/0x20 [spl]
[  243.006606]  zio_wait+0x129/0x2b0 [zfs]
[  243.006661]  dsl_pool_sync+0x467/0x520 [zfs]
[  243.006719]  spa_sync+0x542/0xfa0 [zfs]
[  243.006775]  ? spa_txg_history_init_io+0x105/0x110 [zfs]
[  243.006830]  txg_sync_thread+0x287/0x410 [zfs]
[  243.006880]  ? txg_fini+0x260/0x260 [zfs]
[  243.006896]  thread_generic_wrapper+0x6c/0x80 [spl]
[  243.006914]  ? __thread_exit+0x20/0x20 [spl]
[  243.006930]  kthread+0x118/0x140
[  243.006942]  ? __kthread_bind_mask+0x60/0x60
[  243.006956]  ret_from_fork+0x1f/0x30
[  363.836294] INFO: task txg_sync:2965 blocked for more than 241 seconds.
[  363.837247]       Tainted: P           OE     5.10.0-19-amd64 #1 Debian 5.10.149-2
[  363.838083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.838830] task:txg_sync        state:D stack:    0 pid: 2965 ppid:     2 flags:0x00004000
[  363.839587] Call Trace:
[  363.840368]  __schedule+0x282/0x880
[  363.841229]  schedule+0x46/0xb0
[  363.842081]  schedule_timeout+0x8b/0x150
[  363.842826]  ? __next_timer_interrupt+0x110/0x110
[  363.843567]  io_schedule_timeout+0x4c/0x80
[  363.844325]  __cv_timedwait_common+0x12f/0x170 [spl]
[  363.845121]  ? add_wait_queue_exclusive+0x70/0x70
[  363.845874]  __cv_timedwait_io+0x15/0x20 [spl]
[  363.846667]  zio_wait+0x129/0x2b0 [zfs]
[  363.847434]  dsl_pool_sync+0x467/0x520 [zfs]
[  363.848229]  spa_sync+0x542/0xfa0 [zfs]
[  363.849035]  ? spa_txg_history_init_io+0x105/0x110 [zfs]
[  363.849769]  txg_sync_thread+0x287/0x410 [zfs]
[  363.850485]  ? txg_fini+0x260/0x260 [zfs]
[  363.851163]  thread_generic_wrapper+0x6c/0x80 [spl]
[  363.851846]  ? __thread_exit+0x20/0x20 [spl]
[  363.852540]  kthread+0x118/0x140
[  363.853255]  ? __kthread_bind_mask+0x60/0x60
[  363.853901]  ret_from_fork+0x1f/0x30
rincebrain commented 1 year ago

So, this happens without encryption too, it just happens much faster there.

It seems like the reason is that the code triggered by metaslab_force_ganging has no equivalent of the psize > SPA_MINBLOCKSIZE limiter from zio.c, so it can simulate an allocation failure on blocks that are too small to gang (see the sketch below); the gang allocation then never kicks in, because of the conditional around it, and the allocation simply fails.
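
For reference, the hook lives in metaslab_alloc_dva() and looks roughly like this. This is a paraphrase from memory rather than the literal source (in particular, the name of the random helper varies between releases), so treat it as a sketch:

    /*
     * Paraphrased sketch of the forced-ganging hook in metaslab_alloc_dva():
     * for testing, pretend allocations at or above the tunable must gang by
     * returning ENOSPC from DVA allocation some of the time.
     */
    if (psize >= metaslab_force_ganging && (random_in_range(100) < 3)) {
            metaslab_trace_add(zal, NULL, NULL, psize, d, TRACE_FORCE_GANG,
                allocator);
            return (SET_ERROR(ENOSPC));
    }

Note that there's no lower bound on psize here, so with the tunable set to 1 even a single-sector allocation can be told to fail, and the gang path won't pick it up.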

Simply adding that limiter would make the behavior better, but not quite right, I think: the real goal would be to not trigger on allocations smaller than the smallest block any eligible vdev could allocate (e.g. with an ashift=13 vdev and an ashift=12 special, we probably don't want to try ganging below ashift 12 at the very least, and ideally not below 13 if we have enough information to know the ashift=12 device isn't eligible). In the zio.c case that's not harmful, just wasted effort, unlike here, where it can spuriously cause allocation failures. :/
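
Concretely, the minimal version of the change would just add that floor to the condition above; again, this is a sketch of the intent, not a tested diff:

    /*
     * Sketch of the proposed guard: never force-gang an allocation that is
     * already at the minimum block size, since it cannot be split further.
     */
    if (psize > SPA_MINBLOCKSIZE &&
        psize >= metaslab_force_ganging && (random_in_range(100) < 3)) {
            metaslab_trace_add(zal, NULL, NULL, psize, d, TRACE_FORCE_GANG,
                allocator);
            return (SET_ERROR(ENOSPC));
    }

The ashift-aware version described above would replace SPA_MINBLOCKSIZE with the smallest size any eligible vdev can actually allocate.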

(You could argue this counts as holding it wrong, but I don't think a tunable intended to exercise this codepath should be able to cause allocation failures at all in cases where the non-forced code never could...)

In any event, I'll probably cut a PR shortly to see if people object to at least limiting it to > SPA_MINBLOCKSIZE, since the test suite currently only uses the tunable to force ganging on records of 16k or larger, so it won't break any usage there...

pharshalp commented 1 year ago

Looking at this, would it be better to hold off on upgrading from 2.1.6 to 2.1.8?

rincebrain commented 1 year ago

This isn't new (it happens on 2.0.0 too), and it only affects a debug tunable, so I wouldn't be afraid of it.