Open adamdmoss opened 1 year ago
zb1->zb_level
and zb2->zb_level
are both 0. This is either an empty file or an embedded data, since blocks in regular files will always have a level greater than 0.
There is likely enough information here to fix this after studying the code in more detail.
FWIW I don't think I can repro this again unless my pool corrupts(?) in the same way - I rolled back one dataset (that a scrub was complaining about) to an earlier snap and everything is smooth again! :grimacing: Still, I guess no sort of on-disk corruption should end up with UB inside the kernel. Perhaps very wishful thinking. :)
I got this one too, along with another kernel trace:
[220073.761054] ================================================================================
[20073.761433] UBSAN: shift-out-of-bounds in /home/tom/sources/pve/pve-kernel/proxmox-kernel-6.5.11/modules/pkg-zfs/module/zfs/zio.c:5050:28
[20073.761832] shift exponent -10 is negative
[20073.762141] CPU: 3 PID: 665554 Comm: z_rd_int_1 Tainted: P O 6.5.11-7-pve #1
[20073.762148] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[20073.762151] Call Trace:
[20073.762165] <TASK>
[20073.762190] dump_stack_lvl+0x48/0x70
[20073.762237] dump_stack+0x10/0x20
[20073.762244] __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
[20073.762255] zbookmark_compare.cold+0x20/0x66 [zfs]
[20073.763336] zbookmark_subtree_completed+0x60/0x90 [zfs]
[20073.763593] dsl_scan_check_prefetch_resume+0x82/0xc0 [zfs]
[20073.763919] dsl_scan_prefetch+0x96/0x290 [zfs]
[20073.764298] dsl_scan_prefetch_cb+0x15f/0x350 [zfs]
[20073.764610] arc_read_done+0x2ad/0x4b0 [zfs]
[20073.764900] l2arc_read_done+0x94a/0xaa0 [zfs]
[20073.765180] ? vdev_queue_io_to_issue+0x4a4/0xce0 [zfs]
[20073.765515] zio_done+0x28c/0x10b0 [zfs]
[20073.765826] ? _raw_spin_unlock+0xe/0x40
[20073.765835] ? zio_wait_for_children+0x91/0xd0 [zfs]
[20073.766138] zio_execute+0x8b/0x130 [zfs]
[20073.766433] ? _raw_spin_unlock_irqrestore+0x11/0x60
[20073.766440] taskq_thread+0x282/0x490 [spl]
[20073.766494] ? __pfx_default_wake_function+0x10/0x10
[20073.766505] ? __pfx_zio_execute+0x10/0x10 [zfs]
[20073.766800] ? __pfx_taskq_thread+0x10/0x10 [spl]
[20073.766823] kthread+0xf2/0x120
[20073.766841] ? __pfx_kthread+0x10/0x10
[20073.766847] ret_from_fork+0x47/0x70
[20073.766874] ? __pfx_kthread+0x10/0x10
[20073.766880] ret_from_fork_asm+0x1b/0x30
[20073.766894] </TASK>
[20073.766923] ================================================================================
[641249.027057] INFO: task tokio-runtime-w:857942 blocked for more than 120 seconds.
[641249.027551] Tainted: P O 6.5.11-7-pve #1
[641249.027898] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[641249.028232] task:tokio-runtime-w state:D stack:0 pid:857942 ppid:1 flags:0x00004002
[641249.028241] Call Trace:
[641249.028261] <TASK>
[641249.028294] __schedule+0x3fd/0x1450
[641249.028304] ? __schedule+0x405/0x1450
[641249.028310] ? _raw_spin_unlock_irq+0xe/0x50
[641249.028318] schedule+0x63/0x110
[641249.028323] wb_wait_for_completion+0x89/0xc0
[641249.028342] ? __pfx_autoremove_wake_function+0x10/0x10
[641249.028357] __writeback_inodes_sb_nr+0x9d/0xd0
[641249.028383] writeback_inodes_sb+0x3c/0x60
[641249.028404] sync_filesystem+0x3d/0xb0
[641249.028415] __x64_sys_syncfs+0x49/0xb0
[641249.028421] do_syscall_64+0x5b/0x90
[641249.028428] ? exit_to_user_mode_prepare+0xa5/0x190
[641249.028443] ? syscall_exit_to_user_mode+0x37/0x60
[641249.028450] ? do_syscall_64+0x67/0x90
[641249.028455] ? exc_page_fault+0x94/0x1b0
[641249.028461] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[641249.028484] RIP: 0033:0x7fe8d5d1db57
[641249.028545] RSP: 002b:00007fe7b548d2a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
[641249.028551] RAX: ffffffffffffffda RBX: 00007fe7b548d2f8 RCX: 00007fe8d5d1db57
[641249.028554] RDX: 00007fef568dce4c RSI: 0000000000000007 RDI: 000000000000001c
[641249.028558] RBP: 000000000000001c R08: 0000000000000007 R09: 00007fe8a800cae0
[641249.028560] R10: 6362d3f42356e84f R11: 0000000000000202 R12: 0000000000000001
[641249.028563] R13: 00007fe85c03af90 R14: 000000000000001c R15: 00007fe8a803c120
[641249.028570] </TASK>
[641249.028584] INFO: task tokio-runtime-w:858018 blocked for more than 120 seconds.
[641249.028927] Tainted: P O 6.5.11-7-pve #1
[641249.029260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[641249.029585] task:tokio-runtime-w state:D stack:0 pid:858018 ppid:1 flags:0x00004002
[641249.029592] Call Trace:
[641249.029594] <TASK>
[641249.029598] __schedule+0x3fd/0x1450
[641249.029604] ? __queue_delayed_work+0x83/0xf0
[641249.029622] ? _raw_spin_unlock_irq+0xe/0x50
[641249.029629] schedule+0x63/0x110
[641249.029634] wb_wait_for_completion+0x89/0xc0
[641249.029639] ? __pfx_autoremove_wake_function+0x10/0x10
[641249.029645] sync_inodes_sb+0xd6/0x2c0
[641249.029652] sync_filesystem+0x70/0xb0
[641249.029662] __x64_sys_syncfs+0x49/0xb0
[641249.029696] do_syscall_64+0x5b/0x90
[641249.029701] ? do_syscall_64+0x67/0x90
[641249.029706] ? syscall_exit_to_user_mode+0x37/0x60
[641249.029712] ? do_syscall_64+0x67/0x90
[641249.029717] ? syscall_exit_to_user_mode+0x37/0x60
[641249.029722] ? do_syscall_64+0x67/0x90
[641249.029727] ? exc_page_fault+0x94/0x1b0
[641249.029733] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[641249.029739] RIP: 0033:0x7fe8d5d1db57
[641249.029749] RSP: 002b:00007fe7972df2a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000132
[641249.029754] RAX: ffffffffffffffda RBX: 00007fe7972df2f8 RCX: 00007fe8d5d1db57
[641249.029757] RDX: 00007fef867f3d82 RSI: 0000000000000007 RDI: 000000000000003d
[641249.029759] RBP: 000000000000003d R08: 0000000000000007 R09: 00007fe7d00326f0
[641249.029762] R10: 6362d3f42356e84f R11: 0000000000000202 R12: 0000000000000001
[641249.029782] R13: 00007fe8c806c910 R14: 000000000000001c R15: 00007fe7d00fd500
[641249.029789] </TASK>
System information
Describe the problem you're observing
Kernel spew when mounting(?) ZFS filesystems:
Full spew:
The spew points to this code in zio.c:
... and that final line is line 5009: zb2L0 = (zb2->zb_blkid) * BP_SPANB(ibs2, zb2->zb_level);
BP_SPANB is
spa.h says:
#define SPA_BLKPTRSHIFT 7
so
level * (indblkshift - 7) == -5
I think those IMPLYs are trying to catch such a situation but I'm not on a debug kernel... :)
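To make the arithmetic above concrete, here is a minimal sketch of how the shift exponent can go negative and how a checked version could refuse to shift. It assumes the exponent is level * (indblkshift - SPA_BLKPTRSHIFT), as the calculation above suggests. The helpers span_shift_exponent and checked_span are hypothetical names for illustration, not OpenZFS functions, and indblkshift = 2 with level = 1 is just one combination that produces -5; the values actually on the pool are not known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPA_BLKPTRSHIFT 7 /* value quoted from spa.h above */

/*
 * Hypothetical helper (not in OpenZFS): compute the shift exponent in a
 * signed 64-bit type, so a bogus indblkshift shows up as a negative
 * number instead of being fed straight into a shift.
 */
static int64_t
span_shift_exponent(uint8_t indblkshift, int64_t level)
{
	return (level * ((int64_t)indblkshift - SPA_BLKPTRSHIFT));
}

/*
 * Hypothetical checked span calculation: returns false rather than
 * shifting when the exponent falls outside the range [0, 63] that is
 * defined for a 64-bit left shift. On success, stores 1 << exponent.
 */
static bool
checked_span(uint8_t indblkshift, int64_t level, uint64_t *span)
{
	int64_t exponent = span_shift_exponent(indblkshift, level);

	assert(span != NULL);
	if (exponent < 0 || exponent > 63)
		return (false);
	*span = (uint64_t)1 << exponent;
	return (true);
}
```

With these assumed inputs, span_shift_exponent(2, 1) gives -5 and span_shift_exponent(2, 2) gives -10, while a sane indblkshift of 17 at level 1 gives an exponent of 10. Any indblkshift below 7 combined with a level above 0 is enough to trigger the negative shift, which is the case the IMPLY checks would only catch on a debug build.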
I don't see anything clearly related in recent git history so I guess my pool has some interesting corruption, but I'm reporting it just in case.
Describe how to reproduce the problem
Unsure.