Open arcenik opened 1 month ago
It looks like the cause is that the arc_prune kernel thread exited; see #16324.
Could you please try this patch? https://github.com/vpsfreecz/zfs/commit/74964ac17e0e5f461b74a0a847d87007d21c8d75
We have recently had issues like this on some of our systems as well, and the call traces are similar. In our case the systems run RHEL8 with the 4.18.0 kernel and zfs version 2.1.15. The zfs filesystems are accessed via NFSv3, and in all cases one of the users had copied files from a snapshot not long before the issue occurred.
After that, any zfs or zpool command hangs indefinitely until the system is hard rebooted.
[Thu Nov 7 14:01:07 2024] task:spl_delay_taskq state:D stack:0 pid:2010 ppid:2 flags:0x80004000
[Thu Nov 7 14:01:07 2024] Call Trace:
[Thu Nov 7 14:01:07 2024] __schedule+0x2d1/0x870
[Thu Nov 7 14:01:07 2024] schedule+0x55/0xf0
[Thu Nov 7 14:01:07 2024] schedule_timeout+0x281/0x320
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? try_to_wake_up+0x1b4/0x4b0
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] wait_for_completion+0x96/0x100
[Thu Nov 7 14:01:07 2024] call_usermodehelper_exec+0x132/0x160
[Thu Nov 7 14:01:07 2024] zfsctl_snapshot_unmount+0x109/0x1f0 [zfs]
[Thu Nov 7 14:01:07 2024] snapentry_expire+0x55/0xe0 [zfs]
[Thu Nov 7 14:01:07 2024] taskq_thread+0x2e1/0x510 [spl]
[Thu Nov 7 14:01:07 2024] ? wake_up_q+0x60/0x60
[Thu Nov 7 14:01:07 2024] ? taskq_thread_spawn+0x50/0x50 [spl]
[Thu Nov 7 14:01:07 2024] kthread+0x134/0x150
[Thu Nov 7 14:01:07 2024] ? set_kthread_struct+0x50/0x50
[Thu Nov 7 14:01:07 2024] ret_from_fork+0x35/0x40
[Thu Nov 7 14:01:07 2024] task:rpc.mountd state:D stack:0 pid:7269 ppid:1 flags:0x00004080
[Thu Nov 7 14:01:07 2024] Call Trace:
[Thu Nov 7 14:01:07 2024] __schedule+0x2d1/0x870
[Thu Nov 7 14:01:07 2024] schedule+0x55/0xf0
[Thu Nov 7 14:01:07 2024] taskq_wait_id+0x8e/0xe0 [spl]
[Thu Nov 7 14:01:07 2024] ? finish_wait+0x80/0x80
[Thu Nov 7 14:01:07 2024] taskq_cancel_id+0xce/0x120 [spl]
[Thu Nov 7 14:01:07 2024] zfsctl_snapshot_unmount_cancel+0x37/0x80 [zfs]
[Thu Nov 7 14:01:07 2024] zfsctl_snapshot_unmount_delay+0x3e/0xb0 [zfs]
[Thu Nov 7 14:01:07 2024] zfs_lookup+0x133/0x400 [zfs]
[Thu Nov 7 14:01:07 2024] zpl_xattr_get_dir+0x5b/0x1a0 [zfs]
[Thu Nov 7 14:01:07 2024] ? nvlist_lookup_common+0x32/0x80 [znvpair]
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? zpl_xattr_get_sa+0xcc/0x120 [zfs]
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] zpl_xattr_get+0xd3/0x1d0 [zfs]
[Thu Nov 7 14:01:07 2024] zpl_xattr_security_get+0x3e/0x60 [zfs]
[Thu Nov 7 14:01:07 2024] __vfs_getxattr+0x54/0x70
[Thu Nov 7 14:01:07 2024] get_vfs_caps_from_disk+0x68/0x190
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? terminate_walk+0x7e/0xf0
[Thu Nov 7 14:01:07 2024] audit_copy_inode+0x94/0xd0
[Thu Nov 7 14:01:07 2024] filename_lookup.part.58+0x114/0x170
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? path_get+0x11/0x30
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? audit_alloc_name+0x132/0x150
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? __audit_getname+0x2d/0x50
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] vfs_statx+0x74/0xe0
[Thu Nov 7 14:01:07 2024] __do_sys_newstat+0x39/0x70
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? syscall_trace_enter+0x1ff/0x2d0
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] ? audit_reset_context.part.16+0x26a/0x2d0
[Thu Nov 7 14:01:07 2024] do_syscall_64+0x5b/0x1a0
[Thu Nov 7 14:01:07 2024] entry_SYSCALL_64_after_hwframe+0x66/0xcb
[Thu Nov 7 14:01:07 2024] RIP: 0033:0x7f67b861fb09
[Thu Nov 7 14:01:07 2024] Code: Unable to access opcode bytes at RIP 0x7f67b861fadf.
[Thu Nov 7 14:01:07 2024] RSP: 002b:00007ffe0789cd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[Thu Nov 7 14:01:07 2024] RAX: ffffffffffffffda RBX: 0000555eea3cf290 RCX: 00007f67b861fb09
[Thu Nov 7 14:01:07 2024] RDX: 00007ffe0789cdd0 RSI: 00007ffe0789cdd0 RDI: 0000555eea3dfcd8
[Thu Nov 7 14:01:07 2024] RBP: 0000555eea3dfcd8 R08: 00007ffe0789c241 R09: 0000000000000000
[Thu Nov 7 14:01:07 2024] R10: 00007f67b8672e80 R11: 0000000000000246 R12: 0000555eea3cf2a8
[Thu Nov 7 14:01:07 2024] R13: 0000555eea3cf290 R14: 0000555eea3e7030 R15: 0000555eea3e11a0
[Thu Nov 7 14:01:07 2024] task:umount state:D stack:0 pid:444721 ppid:191258 flags:0x00004080
[Thu Nov 7 14:01:07 2024] Call Trace:
[Thu Nov 7 14:01:07 2024] __schedule+0x2d1/0x870
[Thu Nov 7 14:01:07 2024] ? srso_return_thunk+0x5/0x5f
[Thu Nov 7 14:01:07 2024] schedule+0x55/0xf0
[Thu Nov 7 14:01:07 2024] schedule_preempt_disabled+0xa/0x10
[Thu Nov 7 14:01:07 2024] rwsem_down_write_slowpath+0x370/0x570
[Thu Nov 7 14:01:07 2024] ? mnt_get_count+0x39/0x50
[Thu Nov 7 14:01:07 2024] ? pin_kill+0x70/0x190
[Thu Nov 7 14:01:07 2024] down_write+0x29/0x50
[Thu Nov 7 14:01:07 2024] zfsctl_destroy+0x53/0xd0 [zfs]
[Thu Nov 7 14:01:07 2024] zfs_preumount+0x2a/0x70 [zfs]
[Thu Nov 7 14:01:07 2024] zpl_kill_sb+0xe/0x20 [zfs]
[Thu Nov 7 14:01:07 2024] deactivate_locked_super+0x34/0x70
[Thu Nov 7 14:01:07 2024] cleanup_mnt+0x3b/0x70
[Thu Nov 7 14:01:07 2024] task_work_run+0x8a/0xb0
[Thu Nov 7 14:01:07 2024] exit_to_usermode_loop+0xef/0x100
[Thu Nov 7 14:01:07 2024] do_syscall_64+0x195/0x1a0
[Thu Nov 7 14:01:07 2024] entry_SYSCALL_64_after_hwframe+0x66/0xcb
[Thu Nov 7 14:01:07 2024] RIP: 0033:0x7fcaec2558fb
[Thu Nov 7 14:01:07 2024] Code: Unable to access opcode bytes at RIP 0x7fcaec2558d1.
[Thu Nov 7 14:01:07 2024] RSP: 002b:00007fff09fd8cf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[Thu Nov 7 14:01:07 2024] RAX: 0000000000000000 RBX: 00005632b76f0310 RCX: 00007fcaec2558fb
[Thu Nov 7 14:01:07 2024] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005632b76fbac0
[Thu Nov 7 14:01:07 2024] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fcaec3ac220
[Thu Nov 7 14:01:07 2024] R10: 00005632b76f48b0 R11: 0000000000000246 R12: 00005632b76fbac0
[Thu Nov 7 14:01:07 2024] R13: 00007fcaed0d8184 R14: 00005632b76f45e0 R15: 00000000ffffffff
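For anyone comparing their own traces against the ones above, here is a minimal sketch of a helper that condenses this kind of log into one line per blocked task. It assumes the log uses the same `task:NAME state:D ... pid:N` header and `symbol+0xOFF/0xLEN [module]` frame format shown above; the function names `summarize_hung_tasks` and `first_module_frame` are made up for this example and are not part of any ZFS tooling.

```python
import re

# Hung-task header, e.g.
# "[...] task:umount state:D stack:0 pid:444721 ppid:191258 flags:0x00004080"
TASK_RE = re.compile(r"task:(?P<name>\S+)\s+state:(?P<state>\S+).*?\bpid:(?P<pid>\d+)")

# Stack frame, e.g. "[...] zfsctl_destroy+0x53/0xd0 [zfs]".
# Frames prefixed with "?" are unreliable unwinder guesses and do not match.
FRAME_RE = re.compile(
    r"\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def summarize_hung_tasks(log_text):
    """Group reliable stack frames under the task header that precedes them."""
    tasks = []
    for line in log_text.splitlines():
        header = TASK_RE.search(line)
        if header:
            tasks.append({
                "name": header["name"],
                "state": header["state"],
                "pid": int(header["pid"]),
                "frames": [],
            })
            continue
        frame = FRAME_RE.search(line)
        if frame and tasks:
            # mod is None for frames that belong to the core kernel.
            tasks[-1]["frames"].append((frame["sym"], frame["mod"]))
    return tasks


def first_module_frame(task, modules=("zfs", "spl")):
    """Return the innermost frame that lives in one of the given modules."""
    for sym, mod in task["frames"]:
        if mod in modules:
            return sym, mod
    return None


if __name__ == "__main__":
    import sys
    # Usage: dmesg -T | python3 this_script.py
    for task in summarize_hung_tasks(sys.stdin.read()):
        print(task["name"], task["pid"], task["state"], first_module_frame(task))
```

Feeding it the saved output of `dmesg -T` prints, for each blocked task, the innermost zfs or spl function it is waiting in, which makes it easier to check whether two reports show the same set of stuck tasks.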
System information
Describe the problem you're observing
A zfs umount of an old snapshot (triggered by a zfs destroy command) gets stuck forever, until a hard reboot.
Describe how to reproduce the problem
It just happens.
Include any warning/errors/backtraces from the system logs
First kernel thread [spl_delay_taskq] to get stuck:
Another [spl_delay_taskq] kernel thread:
A umount command, umount -t zfs -n /data/archives/omega-2013/.zfs/snapshot/2013-11-27:
Another one, umount -t zfs -n /data/archives/omega-2013/.zfs/snapshot/2013-11-28: