reefland opened this issue 1 year ago
Got about 12 hours out of it before it got wonky and panicked again.
The first few entries seem to be related to ext4; the system does have one zvol formatted with ext4.
Jul 04 00:44:50 k3s06 kernel: INFO: task python3:584178 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:python3 state:D stack: 0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel: <TASK>
Jul 04 00:44:50 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel: ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:44:50 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel: __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task node:615723 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:node state:D stack: 0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel: <TASK>
Jul 04 00:44:50 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel: ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:44:50 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel: vfs_fsync_range+0x46/0x90
Jul 04 00:44:50 k3s06 kernel: ? __fget_light+0x39/0x90
Jul 04 00:44:50 k3s06 kernel: __x64_sys_fsync+0x38/0x70
Jul 04 00:44:50 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:44:50 k3s06 kernel: </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:WTCheck.tThread state:D stack: 0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel: <TASK>
Jul 04 00:44:50 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel: ? wbc_detach_inode+0x142/0x210
Jul 04 00:44:50 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel: __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel: ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:44:50 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:44:50 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:44:50 k3s06 kernel: </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task python3:584178 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:python3 state:D stack: 0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel: <TASK>
Jul 04 00:46:51 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel: ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:46:51 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel: __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task node:615723 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:node state:D stack: 0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel: <TASK>
Jul 04 00:46:51 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel: ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:46:51 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel: vfs_fsync_range+0x46/0x90
Jul 04 00:46:51 k3s06 kernel: ? __fget_light+0x39/0x90
Jul 04 00:46:51 k3s06 kernel: __x64_sys_fsync+0x38/0x70
Jul 04 00:46:51 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:46:51 k3s06 kernel: </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:WTCheck.tThread state:D stack: 0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel: <TASK>
Jul 04 00:46:51 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel: ? wbc_detach_inode+0x142/0x210
Jul 04 00:46:51 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel: __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel: ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:46:51 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:46:51 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:46:51 k3s06 kernel: </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task jbd2/rbd7-8:571525 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:jbd2/rbd7-8 state:D stack: 0 pid:571525 ppid: 2 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel: <TASK>
Jul 04 00:48:52 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel: jbd2_journal_commit_transaction+0x307/0x17a0
Jul 04 00:48:52 k3s06 kernel: ? __update_idle_core+0x93/0x120
Jul 04 00:48:52 k3s06 kernel: ? psi_task_switch+0xc6/0x220
Jul 04 00:48:52 k3s06 kernel: ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel: kjournald2+0xa9/0x280
Jul 04 00:48:52 k3s06 kernel: ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel: ? load_superblock.part.0+0xc0/0xc0
Jul 04 00:48:52 k3s06 kernel: kthread+0x127/0x150
Jul 04 00:48:52 k3s06 kernel: ? set_kthread_struct+0x50/0x50
Jul 04 00:48:52 k3s06 kernel: ret_from_fork+0x1f/0x30
Jul 04 00:48:52 k3s06 kernel: </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task python3:584178 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:python3 state:D stack: 0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel: <TASK>
Jul 04 00:48:52 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel: ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:48:52 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel: __x64_sys_fdatasync+0x48/0x90
Jul 04 00:48:52 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:48:52 k3s06 kernel: </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task node:615723 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:node state:D stack: 0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel: <TASK>
Jul 04 00:48:52 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel: io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel: wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel: ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel: wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel: wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel: __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel: ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:48:52 k3s06 kernel: file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel: ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel: vfs_fsync_range+0x46/0x90
Jul 04 00:48:52 k3s06 kernel: ? __fget_light+0x39/0x90
Jul 04 00:48:52 k3s06 kernel: __x64_sys_fsync+0x38/0x70
Jul 04 00:48:52 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:48:52 k3s06 kernel: </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task WTJourn.Flusher:616604 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel: Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:WTJourn.Flusher state:D stack: 0 pid:616604 ppid:616510 flags:0x00000000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel: <TASK>
Jul 04 00:48:52 k3s06 kernel: __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel: ? update_sd_lb_stats.constprop.0+0xf2/0x390
Jul 04 00:48:52 k3s06 kernel: schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel: wait_transaction_locked+0x8e/0xd0
Jul 04 00:48:52 k3s06 kernel: ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel: add_transaction_credits+0xdb/0x2d0
Jul 04 00:48:52 k3s06 kernel: start_this_handle+0xfd/0x510
Jul 04 00:48:52 k3s06 kernel: ? kmem_cache_alloc+0x1ab/0x2f0
Jul 04 00:48:52 k3s06 kernel: ? jbd2__journal_start+0x95/0x1f0
Jul 04 00:48:52 k3s06 kernel: jbd2__journal_start+0x103/0x1f0
Jul 04 00:48:52 k3s06 kernel: __ext4_journal_start_sb+0x10b/0x130
Jul 04 00:48:52 k3s06 kernel: ext4_dirty_inode+0x3a/0x80
Jul 04 00:48:52 k3s06 kernel: __mark_inode_dirty+0x5b/0x330
Jul 04 00:48:52 k3s06 kernel: generic_update_time+0x6c/0xd0
Jul 04 00:48:52 k3s06 kernel: file_update_time+0x127/0x140
Jul 04 00:48:52 k3s06 kernel: file_modified+0x27/0x40
Jul 04 00:48:52 k3s06 kernel: ext4_buffered_write_iter+0x5f/0x180
Jul 04 00:48:52 k3s06 kernel: ext4_file_write_iter+0x43/0x60
Jul 04 00:48:52 k3s06 kernel: new_sync_write+0x111/0x1a0
Jul 04 00:48:52 k3s06 kernel: vfs_write+0x1d5/0x270
Jul 04 00:48:52 k3s06 kernel: __x64_sys_pwrite64+0x96/0xc0
Jul 04 00:48:52 k3s06 kernel: do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel: ? restore_fpregs_from_fpstate+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel: ? switch_fpu_return+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel: ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel: ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel: ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007fcd43d7b570 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000500 RSI: 00005630cbdec000 RDI: 0000000000000010
Jul 04 00:48:52 k3s06 kernel: RBP: 00005630cbdec000 R08: 0000000000000000 R09: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000b6d300 R11: 0000000000000293 R12: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000b6d300 R14: 00005630c8968b40 R15: 0000000000b6d300
Jul 04 00:48:52 k3s06 kernel: </TASK>
The ZFS panic does not show up until a bit later:
Jul 04 01:20:35 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=2941f81f000 size=1000)
Jul 04 01:20:35 k3s06 kernel: Showing stack for process 1111
Jul 04 01:20:35 k3s06 kernel: CPU: 18 PID: 1111 Comm: txg_sync Tainted: P O 5.15.0-76-generic #83-Ubuntu
Jul 04 01:20:35 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 04 01:20:35 k3s06 kernel: Call Trace:
Jul 04 01:20:35 k3s06 kernel: <TASK>
Jul 04 01:20:35 k3s06 kernel: show_stack+0x52/0x5c
Jul 04 01:20:35 k3s06 kernel: dump_stack_lvl+0x4a/0x63
Jul 04 01:20:35 k3s06 kernel: dump_stack+0x10/0x16
Jul 04 01:20:35 k3s06 kernel: spl_dumpstack+0x29/0x2f [spl]
Jul 04 01:20:35 k3s06 kernel: vcmn_err.cold+0x60/0x78 [spl]
Jul 04 01:20:35 k3s06 kernel: ? bcpy+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? percpu_counter_add+0xf/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel: ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 04 01:20:35 k3s06 kernel: zfs_panic_recover+0x6d/0x90 [zfs]
Jul 04 01:20:35 k3s06 kernel: range_tree_add_impl+0x183/0x610 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel: range_tree_add+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel: metaslab_free_concrete+0x146/0x270 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel: metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 04 01:20:35 k3s06 kernel: metaslab_free_dva+0x61/0x80 [zfs]
Jul 04 01:20:35 k3s06 kernel: metaslab_free+0x114/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel: zio_free_sync+0xf1/0x110 [zfs]
Jul 04 01:20:35 k3s06 kernel: dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel: bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel: bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? list_head+0xd/0x30 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel: bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel: bpobj_iterate+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel: dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 04 01:20:35 k3s06 kernel: dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? ddt_sync+0xa8/0xd0 [zfs]
Jul 04 01:20:35 k3s06 kernel: spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 04 01:20:35 k3s06 kernel: spa_sync+0x2dc/0x5b0 [zfs]
Jul 04 01:20:35 k3s06 kernel: txg_sync_thread+0x266/0x2f0 [zfs]
Jul 04 01:20:35 k3s06 kernel: ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 04 01:20:35 k3s06 kernel: thread_generic_wrapper+0x61/0x80 [spl]
Jul 04 01:20:35 k3s06 kernel: ? __thread_exit+0x20/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel: kthread+0x127/0x150
Jul 04 01:20:35 k3s06 kernel: ? set_kthread_struct+0x50/0x50
Jul 04 01:20:35 k3s06 kernel: ret_from_fork+0x1f/0x30
Jul 04 01:20:35 k3s06 kernel: </TASK>
I was having similar symptoms on a TrueNAS SCALE system. My backup pool of 2 disks got corrupted somehow. Scrubs come back clean, but any time I push a particular dataset to it, it panics. I was able to reproduce the panic by manually deleting one snapshot from the dataset. Export and reimport work, but deleting snapshots immediately panics the system and requires a hard reset, since a normal shutdown never completes after a ZFS panic.
Since the backup pool is somewhat important to me, I enabled the zfs_recover parameter, deleted the offending dataset, and replicated it from scratch. For now the system appears to behave, but I'm not sure I have confidence in the pool any more. I will keep the system running with zfs_recover enabled for a few days and will wipe/recreate the pool if this reoccurs.
I left it running for about a week with zil_replay_disable and zfs_recover set. In all that time I did not get a single panic or warning message. I then rebooted and completed a full scrub without a panic message.
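In case it's useful to anyone else, these are regular OpenZFS module parameters on Linux and can be set roughly like this (the modprobe.d file name below is arbitrary; adjust to taste):

# Runtime (takes effect immediately, lost on reboot)
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
echo 1 | sudo tee /sys/module/zfs/parameters/zil_replay_disable
# Persistent (applied whenever the zfs module is loaded)
echo "options zfs zfs_recover=1 zil_replay_disable=1" | sudo tee /etc/modprobe.d/zfs-recover.conf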
I've had no more zfs_panic_recover or WARNING: zfs: adding existent segment to range tree messages since then... close to 2 weeks now.
I assume this is from ZFS constantly rewriting spacemaps in its condensing operation. Running the pool in recovery mode for a while eventually removed the duplicated range descriptions during those rewrites... or something like that.
I'm also seeing this on a pool using kernel 6.5-rc6. Here is my zdb output: https://dpaste.com/DL9KS6SQ5
I've set the boot variables vfs.zfs.zil.replay_disable=1 and vfs.zfs.zfs_recover=1 for now, so hopefully that helps?
@reefland Is there a way to force spacemap condensing to happen faster to try to resolve this?
@satmandu - sorry, no idea. I have no clue how to monitor the spacemap condensing operation - trial and error. Hopefully someone more knowledgeable can answer.
scan: scrub repaired 0B in 11:13:59 with 0 errors on Sun Aug 13 11:38:00 2023
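On the spacemap question: if it helps, zdb can at least dump per-metaslab spacemap information read-only, which might be one way to watch things over time (run against my pool here; adding more -m flags adds detail):

# Per-metaslab spacemap / free-space summary (read-only)
zdb -m k3s06
# More spacemap detail
zdb -mm k3s06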
Thanks. My pool has clean scrubs, but my system still panics if I mount the pool without the special ZFS options... though I also haven't had the pool mounted with the special options for more than 16 hours...
If rewriting the spacemaps removes the assertion, that might suggest you still have lurking excitement waiting for you eventually, since in theory the rewrite should be equally upset by a duplicate range insertion, no?
My wild guess would be that when it threw out the log after you imported the pool and unlocked the datasets with replay disabled, it no longer had a pending insertion that was invalid. But that's just a wild guess.
I hit this again... and had to set zfs.zil.replay_disable=1 and zfs.zfs_recover=1 as kernel cmdline arguments to get boot to succeed.
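For reference, on a GRUB-based system that usually means appending the arguments to the kernel command line and regenerating the config, something like the sketch below (I believe the in-module parameter name is zil_replay_disable, so the command-line form would be zfs.zil_replay_disable=1 - double-check against your ZFS version):

# /etc/default/grub -- append to the existing GRUB_CMDLINE_LINUX_DEFAULT value
GRUB_CMDLINE_LINUX_DEFAULT="... zfs.zfs_recover=1 zfs.zil_replay_disable=1"
# then regenerate the grub config and reboot
sudo update-grub    # Debian/Ubuntu; grub2-mkconfig -o /boot/grub2/grub.cfg elsewhere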
I've had this happen multiple times on different systems (it has not recurred on the same system). It's just a process I follow now to recover the system. At least it's a workaround. I'm no closer to finding a cause or pattern.
I destroyed a ZFS subvolume (which may have had this issue, because that's where intense writes from Docker were happening) and still have the issue when importing the parent volume.
I will copy all the data to a fresh volume and recreate the volume.
If I delete all the data and the volume still has these issues, could someone look at an image of the disk to help figure out how to repair it, even if the question of how this damage is being created isn't immediately resolved?
Would that be useful at all?
@satmandu - I can't comment on your disk image question.
But I've had to destroy a number of ZVOLs (extract data, delete the ZVOL, recreate, restore, etc.) because of this as well. I've since switched all ZVOLs to forced sync writes. Yes, there's a performance impact, but I haven't had any ZVOLs get corrupted since.
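In case it's useful to anyone else, that's just the per-dataset sync property, e.g.:

# Force all writes to the zvol to be synchronous (zvol name below is only an example)
zfs set sync=always rpool/my-zvol
# confirm it took
zfs get sync rpool/my-zvol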
I was frequently getting this after a ZFS panic:
errors: Permanent errors have been detected in the following files:
rpool/rancher:<0x1>
Multiple scrubs were not able to fix it. I'm using CRI / K3s instead of Docker, but we're probably using ZVOLs for similar reasons.
I wonder if we can create a test case that amplifies reads/writes to replicate the issue.
Or maybe there is an existing test case we should be trying on our hardware?
This sounds like it should be replicable in a test environment where one is reaching the I/O limits of a disk subsystem or a disk.
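If nothing like that exists, a rough, untested sketch of what I had in mind (pool name, zvol size, and fio parameters below are only placeholders) would be to hammer a throwaway zvol with small synchronous random writes:

# create a scratch zvol and saturate it with small sync random writes
zfs create -V 20G tank/stress-test
fio --name=zvol-stress --filename=/dev/zvol/tank/stress-test \
    --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=4 --fsync=1 \
    --time_based --runtime=600
# clean up afterwards
zfs destroy tank/stress-test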
I've also encountered this when moving an NVMe drive to another system; I don't know what I did wrong. The only thing I can think of is that I inserted it into the new system while the PSU was plugged in but the PC was off - I forgot to unplug it. I've also noticed I/O errors; I don't think the drive is faulty, but I haven't tested it yet. The computer passed memtest.
I've been encountering this on 2.3.0-rc2 after deleting large files (4+GB each) from a pool with dedup active.
PANIC: zfs: adding existent segment to range tree (offset=78cca32000 size=6000)
Showing stack for process 11146
CPU: 0 UID: 0 PID: 11146 Comm: txg_sync Tainted: P OE 6.11.5-1.el8.elrepo.x86_64 #1
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Supermicro AS -2014TP-HTR/H12SST-PS, BIOS 2.8 01/19/2024
Call Trace:
<TASK>
dump_stack_lvl+0x60/0x80
vcmn_err+0xca/0x110 [spl]
? bt_grow_leaf+0xd0/0x160 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? arc_access+0x295/0x4a0 [zfs]
? srso_return_thunk+0x5/0x5f
? arc_buf_access+0xf8/0x180 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? dbuf_read+0x287/0x640 [zfs]
? srso_return_thunk+0x5/0x5f
? mutex_lock+0xe/0x30
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
zfs_panic_recover+0x6d/0x90 [zfs]
range_tree_add_impl+0x244/0xe30 [zfs]
? srso_return_thunk+0x5/0x5f
? dbuf_read+0x287/0x640 [zfs]
? srso_return_thunk+0x5/0x5f
range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
range_tree_remove_xor_add+0x6e/0x170 [zfs]
metaslab_sync+0x277/0x950 [zfs]
? srso_return_thunk+0x5/0x5f
? dmu_tx_create_dd+0x7c/0xc0 [zfs]
? srso_return_thunk+0x5/0x5f
vdev_sync+0x6b/0x1e0 [zfs]
spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
spa_sync+0x4a3/0x980 [zfs]
txg_sync_thread+0x206/0x3a0 [zfs]
? __pfx_txg_sync_thread+0x10/0x10 [zfs]
? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
thread_generic_wrapper+0x59/0x70 [spl]
kthread+0xce/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
After it occurs, zpool scrub hangs and ends up causing these:
INFO: task txg_sync:11146 blocked for more than 122 seconds.
Tainted: P OE 6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:txg_sync state:D stack:0 pid:11146 tgid:11146 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x225/0x5b0
schedule+0x23/0xa0
vcmn_err+0xd7/0x110 [spl]
? bt_grow_leaf+0xd0/0x160 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? arc_access+0x295/0x4a0 [zfs]
? srso_return_thunk+0x5/0x5f
? arc_buf_access+0xf8/0x180 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? dbuf_read+0x287/0x640 [zfs]
? srso_return_thunk+0x5/0x5f
? mutex_lock+0xe/0x30
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
zfs_panic_recover+0x6d/0x90 [zfs]
range_tree_add_impl+0x244/0xe30 [zfs]
? srso_return_thunk+0x5/0x5f
? dbuf_read+0x287/0x640 [zfs]
? srso_return_thunk+0x5/0x5f
range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
range_tree_remove_xor_add+0x6e/0x170 [zfs]
metaslab_sync+0x277/0x950 [zfs]
? srso_return_thunk+0x5/0x5f
? dmu_tx_create_dd+0x7c/0xc0 [zfs]
? srso_return_thunk+0x5/0x5f
vdev_sync+0x6b/0x1e0 [zfs]
spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
spa_sync+0x4a3/0x980 [zfs]
txg_sync_thread+0x206/0x3a0 [zfs]
? __pfx_txg_sync_thread+0x10/0x10 [zfs]
? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
thread_generic_wrapper+0x59/0x70 [spl]
kthread+0xce/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
...
INFO: task zpool:876884 blocked for more than 614 seconds.
Tainted: P OE 6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:zpool state:D stack:0 pid:876884 tgid:876884 ppid:875951 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x225/0x5b0
schedule+0x23/0xa0
io_schedule+0x42/0x70
cv_wait_common+0xab/0x130 [spl]
? __pfx_autoremove_wake_function+0x10/0x10
txg_wait_synced_impl+0xfc/0x140 [zfs]
txg_wait_synced+0xc/0x40 [zfs]
dsl_sync_task_common+0x1bf/0x2a0 [zfs]
? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
dsl_sync_task+0x16/0x20 [zfs]
dsl_scan+0xaa/0x220 [zfs]
zfs_ioc_pool_scrub+0x9c/0x160 [zfs]
zfsdev_ioctl_common+0x297/0x760 [zfs]
? srso_return_thunk+0x5/0x5f
zfsdev_ioctl+0x4f/0xd0 [zfs]
__x64_sys_ioctl+0x93/0xd0
do_syscall_64+0x60/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fd909c3922b
RSP: 002b:00007ffe74bfbde8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000005a57 RCX: 00007fd909c3922b
RDX: 00007ffe74bfbe10 RSI: 0000000000005a57 RDI: 0000000000000004
RBP: 00007ffe74bff3f0 R08: 0000000000000000 R09: 0000000016fb9880
R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe74bfbe10
R13: 0000000000005a57 R14: 0000000016f91c00 R15: 0000000000000000
</TASK>
These continue indefinitely and are only resolved by a reboot, after which a scrub works as normal and doesn't find any problems.
Seeing as a txg sync never happened, though, the pool may be consistent, but doesn't that imply data loss?
Other pool activity appears to continue while those hung tasks remain hung forever.
If it helps at all, there are SHA512 and BLAKE3 ZAPs on that pool, and they're quite big for the pool, so I'm trying to get rid of the old SHA512 data so a big chunk can go away. It was twice as large before I got rid of the SHA256 ZAP that was also there from even earlier.
Here's a zdb -DD on that pool:
DDT-sha512-zap-duplicate: 5950988 entries, size 14914019328 on disk, 4815867904 in core
DDT-sha512-zap-unique: 39829815 entries, size 93688590336 on disk, 30253473792 in core
DDT-blake3-zap-duplicate: 903049 entries, size 413208576 on disk, 268533760 in core
DDT-blake3-zap-unique: 11498977 entries, size 4799926272 on disk, 3097395200 in core
DDT-log-blake3-0: 18590 log entries
DDT-log-blake3-1: 87105 log entries
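(If it's easier to watch over time, I believe zpool status -D also prints a summary histogram of the same DDT without a full zdb walk:)

zpool status -D poolname    # poolname is a placeholder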
Confirmed that the deletion of the two CentOS ISOs (11GB each) that caused the most recent instance of this did not actually get committed to the pool, so it must have been in that hung txg.
Upon further testing, it happens when trying to remove one of those files, in particular.
Reboot after the hang and the files are back.
Removed the "safe" one and that was fine. Ran a zpool sync. No problem.
Removed the other one, and it immediately panicked exactly as above, and that txg never got committed.
Reboot after that, and the problem file is back again. Deleted it, and the same thing happened again. So I can recreate this panic on demand, with next to no load on the system at all.
Anything y'all want me to look at specifically?
Also note that the file appears to be gone after deleting it, even though that txg is uncommitted. It only reappears after reboot and import of the pool.
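Since it's reproducible on demand, I'm happy to capture whatever is useful right after triggering it; for example (assuming these are the right things to grab):

# ZFS event log and internal debug message buffer, grabbed right after the panic fires
zpool events -v > zpool-events.txt
cat /proc/spl/kstat/zfs/dbgmsg > zfs-dbgmsg.txt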
System information
Describe the problem you're observing
Randomly (not at boot or pool import) the system shows panic messages (see below) on the console. The system seemed to be operational. Upon trying to reboot I was unable to get a clean shutdown. I had several Failed unmounting messages with references to /root and /var/log, along with a failure to start the Journal Service. Eventually it hung at systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID ... and I had to kill the power after waiting ~20 minutes.

Upon reboot, everything seemed fine. There were no issues with pool import and zpool status was good. After an hour or so I would see the first panic message, and usually within 24 to 48 hours I had to reboot again.. it would not be a clean shutdown and required a manual power-off.

The PANIC messages I get are very close to #13483; however, that discussion appears to be about pool import issues. I'm not having any issues with pool import, so I'm documenting a new issue, as the panic seems to have a different cause.
Include any warning/errors/backtraces from the system logs
NOTE: The system is using ZFS Boot Menu, so this is ZFS on Root, but it does not require the commonly used separate BOOT and ROOT pools. This method allows for a single ZFS pool (with encryption).
After reboot, status is good:
I tried to issue sudo zpool scrub k3s06 and it hung (did not return to the prompt). I opened another SSH session; zpool status did not indicate a scrub was in progress. I checked the console and it had panic messages scrolling on the screen.

I did another unclean shutdown and rebooted, then tried some of the suggestions from #13483.
Then issued another scrub, which completed successfully:
As expected from the tunables, panic messages now became warning messages:
This seemed to run clean as well:
This time I was able to get a clean shutdown. Rebooted without the tunables.
The system has now been running 3+ hours under normal load and no panic messages have been seen yet. Will keep monitoring.