openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.54k stars 1.74k forks source link

PANIC at dmu_recv.c:2092:receive_object() when receiving to encrypted dataset with 2.2.6 #16623

Open scratchings opened 1 week ago

scratchings commented 1 week ago

System information

Type Version/Name
Distribution Name Red Hat Enterprise Linux
Distribution Version 9.4
Kernel Version 5.14.0-427.35.1.el9_4.x86_64
Architecture x86_64
OpenZFS Version 2.2.6

Describe the problem you're observing

Panic reported for pool receiving snapshots into encrypted pool

Describe how to reproduce the problem

Use Syncoid/Sanoid to receive (and prune) snapshots from remote servers on a cron schedule. Wait some time, crash will occur.

As requested, this is a new ticket reporting a related issue to the closed ticket #11679.

As described in #11679, this WAS using my exclusive lock python script to prevent multiple receives to the same pool, but as I am transferring data out of this pool to a LUKS encrypted pool on a different JBOD, the effected pool is being used as the source of another internal send/receive via syncoid (and several rsync sources which are also transferring data out) (source dataset is not the same as the dataset being received into). These other egress syncoids are continuing to run and the syncoid that most likely caused the panic remains in the process table, but the dataset is not increasing in size:

sh -c ssh  -c aes128-gcm@openssh.com    -S /tmp/syncoid-syncoid@src.host.com-1728381395-9240 syncoid@src.host.com 'sudo zfs send  -I '"'"'src_pool/fileset'"'"'@'"'"'autosnap_2024-10-08_05:59:03_hourly'"'"' '"'"'src_pool/fileset'"'"'@'"'"'autosnap_2024-10-08_08:59:03_hourly'"'"' | mbuffer  -q -s 128k -m 1G' | mbuffer  -q -s 128k -m 1G |  zfs receive -u -o compression=zstd  -s -F 'dst_pool/src_host_com/fileset' 2>&1

Include any warning/errors/backtraces from the system logs

Oct  6 09:12:23 backup01 kernel: inode_doinit_use_xattr: 2 callbacks suppressed
Oct  8 10:56:46 backup01 kernel: VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Oct  8 10:56:46 backup01 kernel: PANIC at dmu_recv.c:2093:receive_object()
Oct  8 10:56:46 backup01 kernel: Showing stack for process 594865
Oct  8 10:56:46 backup01 kernel: CPU: 19 PID: 594865 Comm: receive_writer Kdump: loaded Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 10:56:46 backup01 kernel: Hardware name: Dell Inc. PowerEdge R6615/09K0D9, BIOS 1.8.3 04/02/2024
Oct  8 10:56:46 backup01 kernel: Call Trace:
Oct  8 10:56:46 backup01 kernel: <TASK>
Oct  8 10:56:46 backup01 kernel: dump_stack_lvl+0x34/0x48
Oct  8 10:56:46 backup01 kernel: spl_panic+0xd1/0xe9 [spl]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? mutex_lock+0xe/0x30
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? aggsum_add+0x173/0x190 [zfs]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: receive_object+0x915/0x9d0 [zfs]
Oct  8 10:56:46 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:56:46 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:56:46 backup01 kernel: ? receive_freeobjects+0xa8/0x110 [zfs]
Oct  8 10:56:46 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:56:46 backup01 kernel: receive_process_record+0x132/0x250 [zfs]
Oct  8 10:56:46 backup01 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Oct  8 10:56:46 backup01 kernel: ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Oct  8 10:56:46 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:56:46 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 10:56:46 backup01 kernel: kthread+0xdd/0x100
Oct  8 10:56:46 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 10:56:46 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 10:56:46 backup01 kernel: </TASK>
Oct  8 10:59:41 backup01 kernel: INFO: task txg_quiesce:59996 blocked for more than 122 seconds.
Oct  8 10:59:41 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 10:59:41 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 10:59:41 backup01 kernel: task:txg_quiesce     state:D stack:0     pid:59996 ppid:2      flags:0x00004000
Oct  8 10:59:41 backup01 kernel: Call Trace:
Oct  8 10:59:41 backup01 kernel: <TASK>
Oct  8 10:59:41 backup01 kernel: __schedule+0x21b/0x550
Oct  8 10:59:41 backup01 kernel: schedule+0x2d/0x70
Oct  8 10:59:41 backup01 kernel: cv_wait_common+0xfa/0x130 [spl]
Oct  8 10:59:41 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 10:59:41 backup01 kernel: txg_quiesce+0x1bf/0x250 [zfs]
Oct  8 10:59:41 backup01 kernel: txg_quiesce_thread+0xd9/0x130 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:59:41 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 10:59:41 backup01 kernel: kthread+0xdd/0x100
Oct  8 10:59:41 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 10:59:41 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 10:59:41 backup01 kernel: </TASK>
Oct  8 10:59:41 backup01 kernel: INFO: task receive_writer:594865 blocked for more than 122 seconds.
Oct  8 10:59:41 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 10:59:41 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 10:59:41 backup01 kernel: task:receive_writer  state:D stack:0     pid:594865 ppid:2      flags:0x00004000
Oct  8 10:59:41 backup01 kernel: Call Trace:
Oct  8 10:59:41 backup01 kernel: <TASK>
Oct  8 10:59:41 backup01 kernel: __schedule+0x21b/0x550
Oct  8 10:59:41 backup01 kernel: schedule+0x2d/0x70
Oct  8 10:59:41 backup01 kernel: spl_panic+0xe7/0xe9 [spl]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? mutex_lock+0xe/0x30
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? aggsum_add+0x173/0x190 [zfs]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: receive_object+0x915/0x9d0 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:59:41 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 10:59:41 backup01 kernel: ? receive_freeobjects+0xa8/0x110 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:59:41 backup01 kernel: receive_process_record+0x132/0x250 [zfs]
Oct  8 10:59:41 backup01 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Oct  8 10:59:41 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 10:59:41 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 10:59:41 backup01 kernel: kthread+0xdd/0x100
Oct  8 10:59:41 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 10:59:41 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 10:59:41 backup01 kernel: </TASK>
Oct  8 11:01:44 backup01 kernel: INFO: task txg_quiesce:59996 blocked for more than 245 seconds.
Oct  8 11:01:44 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:01:44 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:01:44 backup01 kernel: task:txg_quiesce     state:D stack:0     pid:59996 ppid:2      flags:0x00004000
Oct  8 11:01:44 backup01 kernel: Call Trace:
Oct  8 11:01:44 backup01 kernel: <TASK>
Oct  8 11:01:44 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:01:44 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:01:44 backup01 kernel: cv_wait_common+0xfa/0x130 [spl]
Oct  8 11:01:44 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 11:01:44 backup01 kernel: txg_quiesce+0x1bf/0x250 [zfs]
Oct  8 11:01:44 backup01 kernel: txg_quiesce_thread+0xd9/0x130 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:01:44 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:01:44 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:01:44 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:01:44 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:01:44 backup01 kernel: </TASK>
Oct  8 11:01:44 backup01 kernel: INFO: task receive_writer:594865 blocked for more than 245 seconds.
Oct  8 11:01:44 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:01:44 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:01:44 backup01 kernel: task:receive_writer  state:D stack:0     pid:594865 ppid:2      flags:0x00004000
Oct  8 11:01:44 backup01 kernel: Call Trace:
Oct  8 11:01:44 backup01 kernel: <TASK>
Oct  8 11:01:44 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:01:44 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:01:44 backup01 kernel: spl_panic+0xe7/0xe9 [spl]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? mutex_lock+0xe/0x30
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? aggsum_add+0x173/0x190 [zfs]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: receive_object+0x915/0x9d0 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:01:44 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:01:44 backup01 kernel: ? receive_freeobjects+0xa8/0x110 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:01:44 backup01 kernel: receive_process_record+0x132/0x250 [zfs]
Oct  8 11:01:44 backup01 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Oct  8 11:01:44 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:01:44 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:01:44 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:01:44 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:01:44 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:01:44 backup01 kernel: </TASK>
Oct  8 11:03:47 backup01 kernel: INFO: task txg_quiesce:59996 blocked for more than 368 seconds.
Oct  8 11:03:47 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:03:47 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:03:47 backup01 kernel: task:txg_quiesce     state:D stack:0     pid:59996 ppid:2      flags:0x00004000
Oct  8 11:03:47 backup01 kernel: Call Trace:
Oct  8 11:03:47 backup01 kernel: <TASK>
Oct  8 11:03:47 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:03:47 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:03:47 backup01 kernel: cv_wait_common+0xfa/0x130 [spl]
Oct  8 11:03:47 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 11:03:47 backup01 kernel: txg_quiesce+0x1bf/0x250 [zfs]
Oct  8 11:03:47 backup01 kernel: txg_quiesce_thread+0xd9/0x130 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:03:47 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:03:47 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:03:47 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:03:47 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:03:47 backup01 kernel: </TASK>
Oct  8 11:03:47 backup01 kernel: INFO: task receive_writer:594865 blocked for more than 368 seconds.
Oct  8 11:03:47 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:03:47 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:03:47 backup01 kernel: task:receive_writer  state:D stack:0     pid:594865 ppid:2      flags:0x00004000
Oct  8 11:03:47 backup01 kernel: Call Trace:
Oct  8 11:03:47 backup01 kernel: <TASK>
Oct  8 11:03:47 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:03:47 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:03:47 backup01 kernel: spl_panic+0xe7/0xe9 [spl]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? mutex_lock+0xe/0x30
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? aggsum_add+0x173/0x190 [zfs]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: receive_object+0x915/0x9d0 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? receive_freeobjects+0xa8/0x110 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:03:47 backup01 kernel: receive_process_record+0x132/0x250 [zfs]
Oct  8 11:03:47 backup01 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:03:47 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:03:47 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:03:47 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:03:47 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:03:47 backup01 kernel: </TASK>
Oct  8 11:03:47 backup01 kernel: INFO: task zfs:598834 blocked for more than 122 seconds.
Oct  8 11:03:47 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:03:47 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:03:47 backup01 kernel: task:zfs             state:D stack:0     pid:598834 ppid:597711 flags:0x00000002
Oct  8 11:03:47 backup01 kernel: Call Trace:
Oct  8 11:03:47 backup01 kernel: <TASK>
Oct  8 11:03:47 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:03:47 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:03:47 backup01 kernel: io_schedule+0x42/0x70
Oct  8 11:03:47 backup01 kernel: cv_wait_common+0xaa/0x130 [spl]
Oct  8 11:03:47 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 11:03:47 backup01 kernel: txg_wait_synced_impl+0xcb/0x110 [zfs]
Oct  8 11:03:47 backup01 kernel: txg_wait_synced+0xc/0x40 [zfs]
Oct  8 11:03:47 backup01 kernel: dsl_sync_task_common+0x1b0/0x290 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
Oct  8 11:03:47 backup01 kernel: dsl_sync_task+0x16/0x20 [zfs]
Oct  8 11:03:47 backup01 kernel: dsl_dataset_snapshot+0x125/0x350 [zfs]
Oct  8 11:03:47 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? nvt_remove_nvpair+0xcf/0x130 [zfs]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? __kmem_cache_alloc_node+0x1c7/0x2d0
Oct  8 11:03:47 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:03:47 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? __kmalloc_node+0x4e/0x140
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? nvt_lookup_name_type.isra.0+0x66/0xa0 [zfs]
Oct  8 11:03:47 backup01 kernel: zfs_ioc_snapshot+0x21b/0x2a0 [zfs]
Oct  8 11:03:47 backup01 kernel: zfsdev_ioctl_common+0x29d/0x770 [zfs]
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: zfsdev_ioctl+0x53/0xe0 [zfs]
Oct  8 11:03:47 backup01 kernel: __x64_sys_ioctl+0x87/0xc0
Oct  8 11:03:47 backup01 kernel: do_syscall_64+0x59/0x90
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? handle_mm_fault+0xcd/0x290
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? do_user_addr_fault+0x1d6/0x6a0
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:03:47 backup01 kernel: ? exc_page_fault+0x62/0x150
Oct  8 11:03:47 backup01 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Oct  8 11:03:47 backup01 kernel: RIP: 0033:0x7f874db0357b
Oct  8 11:03:47 backup01 kernel: RSP: 002b:00007fff8bccdfd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct  8 11:03:47 backup01 kernel: RAX: ffffffffffffffda RBX: 0000000000005a23 RCX: 00007f874db0357b
Oct  8 11:03:47 backup01 kernel: RDX: 00007fff8bcce000 RSI: 0000000000005a23 RDI: 0000000000000004
Oct  8 11:03:47 backup01 kernel: RBP: 00007fff8bcd15f0 R08: 000000000000002b R09: 0000000000000000
Oct  8 11:03:47 backup01 kernel: R10: 0000000008000000 R11: 0000000000000246 R12: 00007fff8bcce000
Oct  8 11:03:47 backup01 kernel: R13: 0000000000005a23 R14: 00007fff8bcd1601 R15: 0000556e9077e540
Oct  8 11:03:47 backup01 kernel: </TASK>
Oct  8 11:05:50 backup01 kernel: INFO: task txg_quiesce:59996 blocked for more than 491 seconds.
Oct  8 11:05:50 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:05:50 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:05:50 backup01 kernel: task:txg_quiesce     state:D stack:0     pid:59996 ppid:2      flags:0x00004000
Oct  8 11:05:50 backup01 kernel: Call Trace:
Oct  8 11:05:50 backup01 kernel: <TASK>
Oct  8 11:05:50 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:05:50 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:05:50 backup01 kernel: cv_wait_common+0xfa/0x130 [spl]
Oct  8 11:05:50 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 11:05:50 backup01 kernel: txg_quiesce+0x1bf/0x250 [zfs]
Oct  8 11:05:50 backup01 kernel: txg_quiesce_thread+0xd9/0x130 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:05:50 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:05:50 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:05:50 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:05:50 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:05:50 backup01 kernel: </TASK>
Oct  8 11:05:50 backup01 kernel: INFO: task receive_writer:594865 blocked for more than 491 seconds.
Oct  8 11:05:50 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:05:50 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:05:50 backup01 kernel: task:receive_writer  state:D stack:0     pid:594865 ppid:2      flags:0x00004000
Oct  8 11:05:50 backup01 kernel: Call Trace:
Oct  8 11:05:50 backup01 kernel: <TASK>
Oct  8 11:05:50 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:05:50 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:05:50 backup01 kernel: spl_panic+0xe7/0xe9 [spl]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? spl_kmem_cache_free+0xff/0x1b0 [spl]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? mutex_lock+0xe/0x30
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? aggsum_add+0x173/0x190 [zfs]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? dnode_evict_bonus+0x7d/0xa0 [zfs]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? dnode_rele_and_unlock+0x59/0xf0 [zfs]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: receive_object+0x915/0x9d0 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? receive_freeobjects+0xa8/0x110 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:05:50 backup01 kernel: receive_process_record+0x132/0x250 [zfs]
Oct  8 11:05:50 backup01 kernel: receive_writer_thread+0xbb/0x1c0 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Oct  8 11:05:50 backup01 kernel: thread_generic_wrapper+0x56/0x70 [spl]
Oct  8 11:05:50 backup01 kernel: kthread+0xdd/0x100
Oct  8 11:05:50 backup01 kernel: ? __pfx_kthread+0x10/0x10
Oct  8 11:05:50 backup01 kernel: ret_from_fork+0x29/0x50
Oct  8 11:05:50 backup01 kernel: </TASK>
Oct  8 11:05:50 backup01 kernel: INFO: task zfs:598834 blocked for more than 245 seconds.
Oct  8 11:05:50 backup01 kernel:      Tainted: P           OE     -------  ---  5.14.0-427.35.1.el9_4.x86_64 #1
Oct  8 11:05:50 backup01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  8 11:05:50 backup01 kernel: task:zfs             state:D stack:0     pid:598834 ppid:597711 flags:0x00000002
Oct  8 11:05:50 backup01 kernel: Call Trace:
Oct  8 11:05:50 backup01 kernel: <TASK>
Oct  8 11:05:50 backup01 kernel: __schedule+0x21b/0x550
Oct  8 11:05:50 backup01 kernel: schedule+0x2d/0x70
Oct  8 11:05:50 backup01 kernel: io_schedule+0x42/0x70
Oct  8 11:05:50 backup01 kernel: cv_wait_common+0xaa/0x130 [spl]
Oct  8 11:05:50 backup01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Oct  8 11:05:50 backup01 kernel: txg_wait_synced_impl+0xcb/0x110 [zfs]
Oct  8 11:05:50 backup01 kernel: txg_wait_synced+0xc/0x40 [zfs]
Oct  8 11:05:50 backup01 kernel: dsl_sync_task_common+0x1b0/0x290 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_dsl_dataset_snapshot_check+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: ? __pfx_dsl_dataset_snapshot_sync+0x10/0x10 [zfs]
Oct  8 11:05:50 backup01 kernel: dsl_sync_task+0x16/0x20 [zfs]
Oct  8 11:05:50 backup01 kernel: dsl_dataset_snapshot+0x125/0x350 [zfs]
Oct  8 11:05:50 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? nvt_remove_nvpair+0xcf/0x130 [zfs]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? __kmem_cache_alloc_node+0x1c7/0x2d0
Oct  8 11:05:50 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:05:50 backup01 kernel: ? spl_kmem_alloc_impl+0xcc/0x110 [spl]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? __kmalloc_node+0x4e/0x140
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? nvt_lookup_name_type.isra.0+0x66/0xa0 [zfs]
Oct  8 11:05:50 backup01 kernel: zfs_ioc_snapshot+0x21b/0x2a0 [zfs]
Oct  8 11:05:50 backup01 kernel: zfsdev_ioctl_common+0x29d/0x770 [zfs]
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: zfsdev_ioctl+0x53/0xe0 [zfs]
Oct  8 11:05:50 backup01 kernel: __x64_sys_ioctl+0x87/0xc0
Oct  8 11:05:50 backup01 kernel: do_syscall_64+0x59/0x90
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? handle_mm_fault+0xcd/0x290
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? do_user_addr_fault+0x1d6/0x6a0
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Oct  8 11:05:50 backup01 kernel: ? exc_page_fault+0x62/0x150
Oct  8 11:05:50 backup01 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Oct  8 11:05:50 backup01 kernel: RIP: 0033:0x7f874db0357b
Oct  8 11:05:50 backup01 kernel: RSP: 002b:00007fff8bccdfd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct  8 11:05:50 backup01 kernel: RAX: ffffffffffffffda RBX: 0000000000005a23 RCX: 00007f874db0357b
Oct  8 11:05:50 backup01 kernel: RDX: 00007fff8bcce000 RSI: 0000000000005a23 RDI: 0000000000000004
Oct  8 11:05:50 backup01 kernel: RBP: 00007fff8bcd15f0 R08: 000000000000002b R09: 0000000000000000
Oct  8 11:05:50 backup01 kernel: R10: 0000000008000000 R11: 0000000000000246 R12: 00007fff8bcce000
Oct  8 11:05:50 backup01 kernel: R13: 0000000000005a23 R14: 00007fff8bcd1601 R15: 0000556e9077e540
Oct  8 11:05:50 backup01 kernel: </TASK>
amotin commented 1 day ago

@scratchings Have you checked to what replication process 594865 triggering the panic belongs? Is it remote raw/encrypted or local unencrypted one? The error 5 returned by dmu_bonus_hold_by_dnode() is EIO. But bonus buffers reads do not need actual read disk that could fail. And for encrypted receive it should not decrypt, which could produce the EIO error. I have difficulty to understand what may produce the EIO error there.

Is the target pool your local replication receives to also encrypted or plaintext? I wonder if the panic may be happening on that plain-text receive, since one unlike raw receive does decrypt on receive, and it may produce EIO there as part of decryption or authentication.

If those guesses give nothing, I wonder if it is possible to try bpftrace SET_ERROR() calls inside dmu_bonus_hold_by_dnode() with 5 as argument. Stack trace from that, if we could get it could show where the error is coming from.

PS: Just as a random thought: haven't you observed random disconnects and replication restarts just before panics by chance? I suppose for remote system it may happen more often than for local.

scratchings commented 21 hours ago

Sorry, host long since rebooted, but I was convinced at the time that it was a receive on the old pool that this related to, as the send/receive to the new pool was still happily progressing (although even this eventually stopped working a few days later).

The transfers that cause the issue are:

As to the PS, it's not always even a panic, often it can just be that the 'zfs' commands that syncoid issues hang indefinitely, e.g. rollbacks or snapshot enumeration. Worst case, the kernel crashes or the watchdog process reboots the host. This continuation of operation was unusual - don't think I've seen that before, but then I haven't had the second pool for very long. As this usually occurs whilst the machine isn't being actively watched and the syncoid process is running quiet we usually don't even notice for hours or a day or so.

I've still got ca 1 week worth of transfers before I'm completely on the LUKS pool, so if there's anything I can do to capture data then let me know, I'd need guidance on how to use bpftrace.

scratchings commented 7 hours ago

Further update. I forgot to mention that post hang the forced reboot went badly wrong. The boot loader was unable to activate any SAS devices and thus could not load the OS. Reverting to the previous kernel fixed everything, 427.33 and upgrading to the latest kernel similarly was fine. At this point I noticed that the old kernel had lost one of its drives. Are there conditions where a disk write failure would cause an EIO and hang?