openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.69k stars 1.75k forks source link

PANIC: zfs: adding existent segment to range tree #15030

Open reefland opened 1 year ago

reefland commented 1 year ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04.1
Kernel Version 5.15.0-75-generic
Architecture x86_64
OpenZFS Version 2.1.5

Describe the problem you're observing

Randomly (not at boot or pool import) system shows panic messages (see below) on the console. The system seemed to be operational. Upon trying to reboot I was unable to get a clean shutdown. Had several Failed unmounting messages with references to /root and /var/log. Along with a failed to start Journal Service. Eventually it hung at a systemd-shutdown[1]: Syncing filesystems and block devices - timed out, issuing SIGKILL to PID ... Had to kill the power after waiting ~20 minutes.

Upon reboot, everything seems fine. No issues with pool import, zpool status was good. After an hour or so I would see the first panic message. Usually within 24 to 48 hours, had to reboot again.. it would not be a clean shutdown, required manual power off.

The PANIC messages I have is very close to #13483 however, that discussion appears to be on pool import issues. I'm not having any issues with pool import so I'm documenting a new issue as the panic seems to have a different cause.

Include any warning/errors/backtraces from the system logs

Jul 02 16:09:43 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 16:09:43 k3s06 kernel: Showing stack for process 1111
Jul 02 16:09:43 k3s06 kernel: CPU: 10 PID: 1111 Comm: txg_sync Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 16:09:43 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 02 16:09:43 k3s06 kernel: Call Trace:
Jul 02 16:09:43 k3s06 kernel:  <TASK>
Jul 02 16:09:43 k3s06 kernel:  show_stack+0x52/0x5c
Jul 02 16:09:43 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 02 16:09:43 k3s06 kernel:  dump_stack+0x10/0x16
Jul 02 16:09:43 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 02 16:09:43 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 16:09:43 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:09:43 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 16:09:43 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 16:09:43 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 16:09:43 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 16:09:43 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 16:09:43 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 16:09:43 k3s06 kernel:  kthread+0x127/0x150
Jul 02 16:09:43 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 16:09:43 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 16:09:43 k3s06 kernel:  </TASK>

Jul 02 16:13:19 k3s06 kernel: INFO: task txg_sync:1111 blocked for more than 120 seconds.
Jul 02 16:13:19 k3s06 kernel:       Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 16:13:19 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 02 16:13:19 k3s06 kernel: task:txg_sync        state:D stack:    0 pid: 1111 ppid:     2 flags:0x00004004
Jul 02 16:13:19 k3s06 kernel: Call Trace:
Jul 02 16:13:19 k3s06 kernel:  <TASK>
Jul 02 16:13:19 k3s06 kernel:  __schedule+0x24e/0x590
Jul 02 16:13:19 k3s06 kernel:  schedule+0x69/0x110
Jul 02 16:13:19 k3s06 kernel:  vcmn_err.cold+0x76/0x78 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 16:13:19 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:13:19 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 16:13:19 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 16:13:19 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 16:13:19 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 16:13:19 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 16:13:19 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 16:13:19 k3s06 kernel:  kthread+0x127/0x150
Jul 02 16:13:19 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 16:13:19 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 16:13:19 k3s06 kernel:  </TASK>

NOTE: System is using ZFS Boot Menu, so this is ZFS on Root, but does not require the commonly used separate BOOT and ROOT pools. This method allows for a single ZFS pool (with encryption).

After reboot, status is good:

$ zpool status
  pool: k3s06
 state: ONLINE
  scan: scrub repaired 0B in 08:41:02 with 0 errors on Sun Jun 11 09:05:03 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        k3s06                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA109YVK-part3  ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA1040NK-part3  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VGJJ0Y5G-part3     ONLINE       0     0     0
            ata-WDC_WD120EFBX-68B0EN0_5QJ6U62B-part3    ONLINE       0     0     0
        cache
          nvme0n1                                       ONLINE       0     0     0

errors: No known data errors

I tried to issue a sudo zpool scrub k3s06 and it hanged (did not return to prompt). I open another SSH session, zpool status does not indicate a scrub is in progress. Checked the console and it had panic message scrolling on the screen.

Jul 02 19:45:26 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 19:45:26 k3s06 kernel: Showing stack for process 1341
Jul 02 19:45:26 k3s06 kernel: CPU: 2 PID: 1341 Comm: txg_sync Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 19:45:26 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 02 19:45:26 k3s06 kernel: Call Trace:
Jul 02 19:45:26 k3s06 kernel:  <TASK>
Jul 02 19:45:26 k3s06 kernel:  show_stack+0x52/0x5c
Jul 02 19:45:26 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 02 19:45:26 k3s06 kernel:  dump_stack+0x10/0x16
Jul 02 19:45:26 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 02 19:45:26 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 19:45:26 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? kmem_cache_free+0x24f/0x290
Jul 02 19:45:26 k3s06 kernel:  ? percpu_counter_dec+0x10/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? spl_kmem_cache_free+0xda/0x140 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:45:26 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 19:45:26 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 19:45:26 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 19:45:26 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 19:45:26 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 19:45:26 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 19:45:26 k3s06 kernel:  kthread+0x127/0x150
Jul 02 19:45:26 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 19:45:26 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 19:45:26 k3s06 kernel:  </TASK>
Jul 02 19:49:02 k3s06 kernel: INFO: task txg_sync:1341 blocked for more than 120 seconds.
Jul 02 19:49:02 k3s06 kernel:       Tainted: P           O      5.15.0-75-generic #82-Ubuntu
Jul 02 19:49:02 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 02 19:49:02 k3s06 kernel: task:txg_sync        state:D stack:    0 pid: 1341 ppid:     2 flags:0x00004004
Jul 02 19:49:02 k3s06 kernel: Call Trace:
Jul 02 19:49:02 k3s06 kernel:  <TASK>
Jul 02 19:49:02 k3s06 kernel:  __schedule+0x24e/0x590
Jul 02 19:49:02 k3s06 kernel:  schedule+0x69/0x110
Jul 02 19:49:02 k3s06 kernel:  vcmn_err.cold+0x76/0x78 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 02 19:49:02 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? kmem_cache_free+0x24f/0x290
Jul 02 19:49:02 k3s06 kernel:  ? percpu_counter_dec+0x10/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? spl_kmem_cache_free+0xda/0x140 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:49:02 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 02 19:49:02 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 02 19:49:02 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 02 19:49:02 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 02 19:49:02 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 02 19:49:02 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 02 19:49:02 k3s06 kernel:  kthread+0x127/0x150
Jul 02 19:49:02 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 02 19:49:02 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 02 19:49:02 k3s06 kernel:  </TASK>

I did another unclean shutdown again and rebooted. Tried some of the suggestions within #13483

echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover

Then issues another scrub, which completed successfully:

 zpool status
  pool: k3s06
 state: ONLINE
  scan: scrub repaired 0B in 09:07:12 with 0 errors on Mon Jul  3 05:54:45 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        k3s06                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA109YVK-part3  ONLINE       0     0     0
            ata-WDC_WD80EFZZ-68BTXN0_WD-CA1040NK-part3  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VGJJ0Y5G-part3     ONLINE       0     0     0
            ata-WDC_WD120EFBX-68B0EN0_5QJ6U62B-part3    ONLINE       0     0     0
        cache
          nvme0n1                                       ONLINE       0     0     0

errors: No known data errors

As expected from the tunables, panic messages now became warning messages:

Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279691ea000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27691830000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=276917fa000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279691bc000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d90833000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=276917fb000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d907b0000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=197224c4000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=291de47e000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28a52005000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2c13ab29000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2c118da9000 size=2000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27ecdfbe000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27341d0b000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2915ddb2000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28d4ea04000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=196d5824000 size=1000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2529b350000 size=f000)
Jul 02 21:01:47 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=279350b0000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=27ac475e000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2824d6d6000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=2824d072000 size=1000)
Jul 03 01:00:59 k3s06 kernel: WARNING: zfs: adding existent segment to range tree (offset=28afcbcf000 size=1000)

This seem to run clean as well:

sudo zdb -AAA -b k3s06

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 1, metaslab 464 of 465 ...
8.91T completed (10118MB/s) estimated time remaining: 0hr 00min 00sec        
        No leaks (block sum matches space maps exactly)

        bp count:              68746233
        ganged count:                 0
        bp logical:       9955783857664      avg: 144819
        bp physical:      9789308441088      avg: 142397     compression:   1.02
        bp allocated:     9798470672384      avg: 142531     compression:   1.02
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:     9798686769152     used: 61.46%
        Embedded log class         192512     used:  0.00%

        additional, non-pointer bps of type 0:      10841
        Dittoed blocks on same vdev: 77922

This time I was able to get a clean shutdown. Rebooted without the tunables.

System is now running 3+ hours under normal load, no panic messages have been seen since yet. Will keep monitoring.

reefland commented 1 year ago

Got about 12 hours from it, before it got wonky and panic again.

First few entries seem to be related to ext4, the system does have one zvol with ext4.

Jul 04 00:44:50 k3s06 kernel: INFO: task python3:584178 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task node:615723 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:44:50 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:44:50 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:44:50 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:44:50 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 120 seconds.
Jul 04 00:44:50 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:44:50 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:44:50 k3s06 kernel: task:WTCheck.tThread state:D stack:    0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:44:50 k3s06 kernel: Call Trace:
Jul 04 00:44:50 k3s06 kernel:  <TASK>
Jul 04 00:44:50 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:44:50 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:44:50 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:44:50 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:44:50 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:44:50 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:44:50 k3s06 kernel:  ? wbc_detach_inode+0x142/0x210
Jul 04 00:44:50 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:44:50 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:44:50 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:44:50 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:44:50 k3s06 kernel:  ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:44:50 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:44:50 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:44:50 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:44:50 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:44:50 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:44:50 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:44:50 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:44:50 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task python3:584178 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task node:615723 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:46:51 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:46:51 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:46:51 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:46:51 k3s06 kernel: INFO: task WTCheck.tThread:616605 blocked for more than 241 seconds.
Jul 04 00:46:51 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:46:51 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:46:51 k3s06 kernel: task:WTCheck.tThread state:D stack:    0 pid:616605 ppid:616510 flags:0x00000000
Jul 04 00:46:51 k3s06 kernel: Call Trace:
Jul 04 00:46:51 k3s06 kernel:  <TASK>
Jul 04 00:46:51 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:46:51 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:46:51 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:46:51 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:46:51 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:46:51 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:46:51 k3s06 kernel:  ? wbc_detach_inode+0x142/0x210
Jul 04 00:46:51 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:46:51 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:46:51 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:46:51 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:46:51 k3s06 kernel:  ? sysvec_apic_timer_interrupt+0x4e/0x90
Jul 04 00:46:51 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:46:51 k3s06 kernel: RIP: 0033:0x7fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RSP: 002b:00007fcd4357a7b0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Jul 04 00:46:51 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000093 RCX: 00007fcd4954ffc7
Jul 04 00:46:51 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000093 RDI: 0000000000000093
Jul 04 00:46:51 k3s06 kernel: RBP: 00007fcd4357a810 R08: 0000000000000020 R09: 0000000000000000
Jul 04 00:46:51 k3s06 kernel: R10: 0000000000000017 R11: 0000000000000293 R12: 000000000000000a
Jul 04 00:46:51 k3s06 kernel: R13: 0000000000000001 R14: 00005630ccb039c0 R15: 00005630c5c46269
Jul 04 00:46:51 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task jbd2/rbd7-8:571525 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:jbd2/rbd7-8     state:D stack:    0 pid:571525 ppid:     2 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  jbd2_journal_commit_transaction+0x307/0x17a0
Jul 04 00:48:52 k3s06 kernel:  ? __update_idle_core+0x93/0x120
Jul 04 00:48:52 k3s06 kernel:  ? psi_task_switch+0xc6/0x220
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  kjournald2+0xa9/0x280
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  ? load_superblock.part.0+0xc0/0xc0
Jul 04 00:48:52 k3s06 kernel:  kthread+0x127/0x150
Jul 04 00:48:52 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task python3:584178 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:python3         state:D stack:    0 pid:584178 ppid:579848 flags:0x00004000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel:  ? dentry_lru_isolate+0xcf/0x140
Jul 04 00:48:52 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_fdatasync+0x48/0x90
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f039defe4a8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f03aa537fac
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f039deffb38 R08: 0000000000000000 R09: 0000000000000000
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f02ec89a008
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000000 R14: 0000000000000408 R15: 0000000000000000
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task node:615723 blocked for more than 362 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:node            state:D stack:    0 pid:615723 ppid:615555 flags:0x00000004
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  io_schedule+0x46/0x80
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit_common+0x10c/0x3d0
Jul 04 00:48:52 k3s06 kernel:  ? filemap_invalidate_unlock_two+0x50/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_bit+0x3f/0x50
Jul 04 00:48:52 k3s06 kernel:  wait_on_page_writeback+0x26/0x80
Jul 04 00:48:52 k3s06 kernel:  __filemap_fdatawait_range+0x97/0x120
Jul 04 00:48:52 k3s06 kernel:  ? __filemap_fdatawait_range+0x3a/0x120
Jul 04 00:48:52 k3s06 kernel:  file_write_and_wait_range+0xd3/0xf0
Jul 04 00:48:52 k3s06 kernel:  ext4_sync_file+0x7b/0x330
Jul 04 00:48:52 k3s06 kernel:  vfs_fsync_range+0x46/0x90
Jul 04 00:48:52 k3s06 kernel:  ? __fget_light+0x39/0x90
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_fsync+0x38/0x70
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x37/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_lseek+0x18/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? __x64_sys_write+0x19/0x20
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007f8a82b19890 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f8a92ba2a97
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000013
Jul 04 00:48:52 k3s06 kernel: RBP: 00007f8a82b198e0 R08: 0000000000000000 R09: 0000000000000001
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007f8a82b19960
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000000002 R14: 00007f8a6413cb50 R15: 00007f8a64016238
Jul 04 00:48:52 k3s06 kernel:  </TASK>
Jul 04 00:48:52 k3s06 kernel: INFO: task WTJourn.Flusher:616604 blocked for more than 120 seconds.
Jul 04 00:48:52 k3s06 kernel:       Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 00:48:52 k3s06 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 04 00:48:52 k3s06 kernel: task:WTJourn.Flusher state:D stack:    0 pid:616604 ppid:616510 flags:0x00000000
Jul 04 00:48:52 k3s06 kernel: Call Trace:
Jul 04 00:48:52 k3s06 kernel:  <TASK>
Jul 04 00:48:52 k3s06 kernel:  __schedule+0x24e/0x590
Jul 04 00:48:52 k3s06 kernel:  ? update_sd_lb_stats.constprop.0+0xf2/0x390
Jul 04 00:48:52 k3s06 kernel:  schedule+0x69/0x110
Jul 04 00:48:52 k3s06 kernel:  wait_transaction_locked+0x8e/0xd0
Jul 04 00:48:52 k3s06 kernel:  ? wait_woken+0x70/0x70
Jul 04 00:48:52 k3s06 kernel:  add_transaction_credits+0xdb/0x2d0
Jul 04 00:48:52 k3s06 kernel:  start_this_handle+0xfd/0x510
Jul 04 00:48:52 k3s06 kernel:  ? kmem_cache_alloc+0x1ab/0x2f0
Jul 04 00:48:52 k3s06 kernel:  ? jbd2__journal_start+0x95/0x1f0
Jul 04 00:48:52 k3s06 kernel:  jbd2__journal_start+0x103/0x1f0
Jul 04 00:48:52 k3s06 kernel:  __ext4_journal_start_sb+0x10b/0x130
Jul 04 00:48:52 k3s06 kernel:  ext4_dirty_inode+0x3a/0x80
Jul 04 00:48:52 k3s06 kernel:  __mark_inode_dirty+0x5b/0x330
Jul 04 00:48:52 k3s06 kernel:  generic_update_time+0x6c/0xd0
Jul 04 00:48:52 k3s06 kernel:  file_update_time+0x127/0x140
Jul 04 00:48:52 k3s06 kernel:  file_modified+0x27/0x40
Jul 04 00:48:52 k3s06 kernel:  ext4_buffered_write_iter+0x5f/0x180
Jul 04 00:48:52 k3s06 kernel:  ext4_file_write_iter+0x43/0x60
Jul 04 00:48:52 k3s06 kernel:  new_sync_write+0x111/0x1a0
Jul 04 00:48:52 k3s06 kernel:  vfs_write+0x1d5/0x270
Jul 04 00:48:52 k3s06 kernel:  __x64_sys_pwrite64+0x96/0xc0
Jul 04 00:48:52 k3s06 kernel:  do_syscall_64+0x59/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? restore_fpregs_from_fpstate+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? switch_fpu_return+0x4e/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  ? exit_to_user_mode_prepare+0x96/0xb0
Jul 04 00:48:52 k3s06 kernel:  ? syscall_exit_to_user_mode+0x27/0x50
Jul 04 00:48:52 k3s06 kernel:  ? do_syscall_64+0x69/0xc0
Jul 04 00:48:52 k3s06 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 04 00:48:52 k3s06 kernel: RIP: 0033:0x7fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RSP: 002b:00007fcd43d7b570 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Jul 04 00:48:52 k3s06 kernel: RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007fcd4983c13f
Jul 04 00:48:52 k3s06 kernel: RDX: 0000000000000500 RSI: 00005630cbdec000 RDI: 0000000000000010
Jul 04 00:48:52 k3s06 kernel: RBP: 00005630cbdec000 R08: 0000000000000000 R09: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R10: 0000000000b6d300 R11: 0000000000000293 R12: 0000000000000500
Jul 04 00:48:52 k3s06 kernel: R13: 0000000000b6d300 R14: 00005630c8968b40 R15: 0000000000b6d300
Jul 04 00:48:52 k3s06 kernel:  </TASK>

The ZFS panic does not show up until a bit later:

Jul 04 01:20:35 k3s06 kernel: PANIC: zfs: adding existent segment to range tree (offset=2941f81f000 size=1000)
Jul 04 01:20:35 k3s06 kernel: Showing stack for process 1111
Jul 04 01:20:35 k3s06 kernel: CPU: 18 PID: 1111 Comm: txg_sync Tainted: P           O      5.15.0-76-generic #83-Ubuntu
Jul 04 01:20:35 k3s06 kernel: Hardware name: ASRock B660M Steel Legend/B660M Steel Legend, BIOS 10.02 02/10/2023
Jul 04 01:20:35 k3s06 kernel: Call Trace:
Jul 04 01:20:35 k3s06 kernel:  <TASK>
Jul 04 01:20:35 k3s06 kernel:  show_stack+0x52/0x5c
Jul 04 01:20:35 k3s06 kernel:  dump_stack_lvl+0x4a/0x63
Jul 04 01:20:35 k3s06 kernel:  dump_stack+0x10/0x16
Jul 04 01:20:35 k3s06 kernel:  spl_dumpstack+0x29/0x2f [spl]
Jul 04 01:20:35 k3s06 kernel:  vcmn_err.cold+0x60/0x78 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? bcpy+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_insert_leaf_impl+0x3f/0x50 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_insert_into_leaf+0x21f/0x2b0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? percpu_counter_add+0xf/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? zfs_btree_find_in_buf+0x5a/0xa0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  zfs_panic_recover+0x6d/0x90 [zfs]
Jul 04 01:20:35 k3s06 kernel:  range_tree_add_impl+0x183/0x610 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? __raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel:  range_tree_add+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_concrete+0x146/0x270 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_impl+0xb3/0xf0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free_dva+0x61/0x80 [zfs]
Jul 04 01:20:35 k3s06 kernel:  metaslab_free+0x114/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  zio_free_sync+0xf1/0x110 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_scan_free_block_cb+0x6e/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_dsl_scan_free_block_cb+0x11/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate_blkptrs+0xf6/0x380 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? list_head+0xd/0x30 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate_impl+0x23b/0x390 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? dsl_scan_free_block_cb+0x1d0/0x1d0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  bpobj_iterate+0x17/0x20 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_process_async_destroys+0x2d5/0x580 [zfs]
Jul 04 01:20:35 k3s06 kernel:  dsl_scan_sync+0x1ec/0x910 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? ddt_sync+0xa8/0xd0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  spa_sync_iterate_to_convergence+0x124/0x1f0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  spa_sync+0x2dc/0x5b0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  txg_sync_thread+0x266/0x2f0 [zfs]
Jul 04 01:20:35 k3s06 kernel:  ? txg_dispatch_callbacks+0x100/0x100 [zfs]
Jul 04 01:20:35 k3s06 kernel:  thread_generic_wrapper+0x61/0x80 [spl]
Jul 04 01:20:35 k3s06 kernel:  ? __thread_exit+0x20/0x20 [spl]
Jul 04 01:20:35 k3s06 kernel:  kthread+0x127/0x150
Jul 04 01:20:35 k3s06 kernel:  ? set_kthread_struct+0x50/0x50
Jul 04 01:20:35 k3s06 kernel:  ret_from_fork+0x1f/0x30
Jul 04 01:20:35 k3s06 kernel:  </TASK>
AdamLantos commented 1 year ago

I was having similar symptoms on a TrueNAS Scale system. My backup pool of 2 disks got corrupted somehow. Scrub is clean, but any time I'm pushing a particular dataset to it, it panics. I was able to reproduce the panic by manually deleting one snapshot from the dataset. Export and reimport works, but deleting snapshots immediately panics the system and needs hard reset since normal shutdown never completes after zfs panic.

Since the backup pool is somewhat important for me, I enabled zfs_recover parameter and deleted the offending dataset and replicated from scratch. For now the system appears to behave, but I'm not sure if I have confidence in the pool any more. I will keep the system running with zfs_recover enabled for a few days, and will wipe/recreate the pool when this reoccurs.

reefland commented 1 year ago

I left it running for about a week with zil_replay_disable and zfs_recover set. All that time, did not have a panic or warning message. Rebooted and completed a full scrub without a panic message.

I've had no more zfs_panic_recover or WARNING: zfs: adding existent segment to range tree since then... close to 2 weeks now.

I assume this is from zfs constantly rewriting spacemaps in condensing operation. Running the pool in recovery mode for a while eventually removed the duplicated range descriptions during the rewrite operations.... or something like that.

satmandu commented 1 year ago

I'm also seeing this on a pool using kernel 6.5-rc6. Here is my zdb output: https://dpaste.com/DL9KS6SQ5

I've set boot variables for vfs.zfs.zil.replay_disable=1 and vfs.zfs.zfs_recover=1 for now, so hopefully that helps?

satmandu commented 1 year ago

@reefland Is there a way to force spacemap condensing to happen faster to try to resolve this?

reefland commented 1 year ago

@satmandu - sorry, no idea. No clue how to monitor the spacemap condensing operation - trial and error. Hopefully someone more knowledgeable can answer.

satmandu commented 1 year ago

Thanks. My pool has clean scrubs, but my system still panics if I mount the pool without the special zfs options... but I also haven't had the pool mounted with the special options for more than 16 hours...

rincebrain commented 1 year ago

If rewriting the spacemaps removes the assertion, that might suggest you still have lurking excitement eventually, since in theory that should be equally upset by duplicate range insertion, no?

My wild guess would be when it threw out the log after you imported the pool and unlocked the datasets with replaying disabled, it no longer had a pending insertion that was invalid. But that's just a wild guess.

satmandu commented 1 year ago

I hit this again... and had to set zfs.zil.replay_disable=1 and zfs.zfs_recover=1 as kernel cmdline arguments to get boot to succeed.

reefland commented 1 year ago

I've had this multiple times on different systems (it has not duplicated on the same system). It's just a process I follow now to recover the system. At least its a workaround. I'm no closer to finding a cause / pattern.

satmandu commented 1 year ago

I destroyed a zfs subvolume (that may have had this issue because that's where intense writes from docker were happening) and still have the issue when importing the parent volume.

I will copy all the data to a fresh volume and recreate the volume.

If I delete all the data and the volume still has these issues, could someone look at an image of the disk to help figure out how to repair this issue, even if figuring out how this damage is being created isn't proximally resolved?

Would that be useful at all?

reefland commented 1 year ago

@satmandu - I can't comment on your disk image question.

But I've had to destroy a number of ZVOLs (extract data, delete ZVOL, recreate, restore, etc) related to this as well. I've since switched all ZVOLs to forced sync writes. Yeah, performance impact, but I haven't had any ZVOLs get corrupted since.

I was frequently getting this after a ZFS panic:

errors: Permanent errors have been detected in the following files:
        rpool/rancher:<0x1> 

Multiple scrubs were not able to fix it. I'm using CRI / K3s instead of Docker, but we're probably using ZVOLs for similar reasons.

satmandu commented 1 year ago

I wonder if we can create a test case of amplifying read/writes to replicate the issue.

Or maybe if there is an existing test case we should be trying on our hardware?

This sounds like it should be replicable in a test environment where one is reaching the I/O limits of a disk subsystem or a disk.

kaitallaoua commented 1 month ago

I've also encounters this when moving a nvme to another system, I don't know what I did wrong. Only thing I can think of it inserting into the new system while PSU was plugged in but pc was off. Forgot to unplug it. I also noticed getting i/o errors, I don't think the drive is faulty but have not tested it yet. Computer passed memtest

dodexahedron commented 3 weeks ago

I've been encountering this on 2.3.0-rc2 after deleting large files (4+GB each) from a pool with dedup active.

PANIC: zfs: adding existent segment to range tree (offset=78cca32000 size=6000)
Showing stack for process 11146
CPU: 0 UID: 0 PID: 11146 Comm: txg_sync Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Supermicro AS -2014TP-HTR/H12SST-PS, BIOS 2.8 01/19/2024
Call Trace:
 <TASK>
 dump_stack_lvl+0x60/0x80
 vcmn_err+0xca/0x110 [spl]
 ? bt_grow_leaf+0xd0/0x160 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? arc_access+0x295/0x4a0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? arc_buf_access+0xf8/0x180 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? mutex_lock+0xe/0x30
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 zfs_panic_recover+0x6d/0x90 [zfs]
 range_tree_add_impl+0x244/0xe30 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add+0x6e/0x170 [zfs]
 metaslab_sync+0x277/0x950 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dmu_tx_create_dd+0x7c/0xc0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 vdev_sync+0x6b/0x1e0 [zfs]
 spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
 spa_sync+0x4a3/0x980 [zfs]
 txg_sync_thread+0x206/0x3a0 [zfs]
 ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 thread_generic_wrapper+0x59/0x70 [spl]
 kthread+0xce/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

After it occurs, zpool scrub hangs and ends up causing these:

INFO: task txg_sync:11146 blocked for more than 122 seconds.
      Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:txg_sync        state:D stack:0     pid:11146 tgid:11146 ppid:2      flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x225/0x5b0
 schedule+0x23/0xa0
 vcmn_err+0xd7/0x110 [spl]
 ? bt_grow_leaf+0xd0/0x160 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? arc_access+0x295/0x4a0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? arc_buf_access+0xf8/0x180 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? mutex_lock+0xe/0x30
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 zfs_panic_recover+0x6d/0x90 [zfs]
 range_tree_add_impl+0x244/0xe30 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dbuf_read+0x287/0x640 [zfs]
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add_segment+0x513/0x610 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 range_tree_remove_xor_add+0x6e/0x170 [zfs]
 metaslab_sync+0x277/0x950 [zfs]
 ? srso_return_thunk+0x5/0x5f
 ? dmu_tx_create_dd+0x7c/0xc0 [zfs]
 ? srso_return_thunk+0x5/0x5f
 vdev_sync+0x6b/0x1e0 [zfs]
 spa_sync_iterate_to_convergence+0x185/0x2f0 [zfs]
 spa_sync+0x4a3/0x980 [zfs]
 txg_sync_thread+0x206/0x3a0 [zfs]
 ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
 ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
 thread_generic_wrapper+0x59/0x70 [spl]
 kthread+0xce/0x100
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

...

INFO: task zpool:876884 blocked for more than 614 seconds.
      Tainted: P           OE      6.11.5-1.el8.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:zpool           state:D stack:0     pid:876884 tgid:876884 ppid:875951 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x225/0x5b0
 schedule+0x23/0xa0
 io_schedule+0x42/0x70
 cv_wait_common+0xab/0x130 [spl]
 ? __pfx_autoremove_wake_function+0x10/0x10
 txg_wait_synced_impl+0xfc/0x140 [zfs]
 txg_wait_synced+0xc/0x40 [zfs]
 dsl_sync_task_common+0x1bf/0x2a0 [zfs]
 ? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_check+0x10/0x10 [zfs]
 ? __pfx_dsl_scan_setup_sync+0x10/0x10 [zfs]
 dsl_sync_task+0x16/0x20 [zfs]
 dsl_scan+0xaa/0x220 [zfs]
 zfs_ioc_pool_scrub+0x9c/0x160 [zfs]
 zfsdev_ioctl_common+0x297/0x760 [zfs]
 ? srso_return_thunk+0x5/0x5f
 zfsdev_ioctl+0x4f/0xd0 [zfs]
 __x64_sys_ioctl+0x93/0xd0
 do_syscall_64+0x60/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fd909c3922b
RSP: 002b:00007ffe74bfbde8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000005a57 RCX: 00007fd909c3922b
RDX: 00007ffe74bfbe10 RSI: 0000000000005a57 RDI: 0000000000000004
RBP: 00007ffe74bff3f0 R08: 0000000000000000 R09: 0000000016fb9880
R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe74bfbe10
R13: 0000000000005a57 R14: 0000000016f91c00 R15: 0000000000000000
 </TASK>

These continue indefinitely and are only resolved by a reboot, after which a scrub works as normal and doesn't find any problems.

Seeing as how a txg sync never happened, though, the pool may be consistent, but that suggests data loss, no?

Other pool activity appears to continue while those hung tasks remain hung forever.

If it helps at all, there are SHA512 and BLAKE3 ZAPs on that pool, and they're quite big for the pool, so I'm trying to get rid of the old SHA512 data so a big chunk can go away. It was twice as large before I got rid of the SHA256 ZAP that was also there from even earlier.

Here's a zdb -DD on that pool:

DDT-sha512-zap-duplicate: 5950988 entries, size 14914019328 on disk, 4815867904 in core
DDT-sha512-zap-unique: 39829815 entries, size 93688590336 on disk, 30253473792 in core
DDT-blake3-zap-duplicate: 903049 entries, size 413208576 on disk, 268533760 in core
DDT-blake3-zap-unique: 11498977 entries, size 4799926272 on disk, 3097395200 in core
DDT-log-blake3-0: 18590 log entries
DDT-log-blake3-1: 87105 log entries
dodexahedron commented 3 weeks ago

Confirmed that the delete of two CentOS ISOs (11GB each) that caused the most recent instance of this did not actually get committed to the pool, so must have been in that hung txg.

Upon further testing, it happens when trying to remove one of those files, in particular.

Reboot after the hang and the files are back.

Removed the "safe" one and that was fine. Ran a zpool sync. No problem.

Remove the other one, and it immediately panicked exactly as above, and that txg never got committed.

Reboot after that, and the problem file is back again. Deleted it, and the same thing happened again. So I can recreate this panic on demand, with next to no load on the system at all.

Anything y'all want me to look at specifically?

dodexahedron commented 3 weeks ago

Also note that the file appears to be gone after deleting it, even though that txg is uncommitted. It only reappears after reboot and import of the pool.