openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Importing corrupted pool causes PANIC: zfs: adding existent segment to range tree #13483

Open · raron opened this issue 2 years ago

raron commented 2 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version bullseye (11)
Kernel Version 5.10.0-14
Architecture amd64
OpenZFS Version zfs-2.0.3-9 zfs-kmod-2.0.3-9

Tried on another system

Type Version/Name
Distribution Name hrmpf rescue system / Void Linux
Distribution Version 20211227
Kernel Version 5.15.11_1
Architecture amd64
OpenZFS Version zfs-2.1.2-1 zfs-kmod-2.1.2-1

Describe the problem you're observing

I have a pool (no redundancy) which has been corrupted by a hardware failure. Trying to import it causes a PANIC and the zpool import process hangs in "D" (uninterruptible sleep, usually IO) state:

PANIC: zfs: adding existent segment to range tree

Issue #13445 may be related; it has a similar backtrace.

The pool can be imported with zpool import -o readonly=true -f rpool or with zpool import -f -T 2676127 rpool.
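
Spelled out as commands, the two workarounds (pool name rpool as above):

# zpool import -o readonly=true -f rpool
# zpool import -f -T 2676127 rpool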

Describe how to reproduce the problem

I have a dump / image of the pool in the corrupted state on which I can repeatedly reproduce this with both systems / ZFS versions.
I can (and am willing to) try out possible solutions, too. (The original pool has been recovered with the -T txg method.)

Unfortunately I cannot share the whole image as it contains personal information, but sharing a short hexdump may be possible.

Include any warning/errors/backtraces from the system logs

The results below were reproduced running on qemu/kvm (version 5.2.0) using the image as a virtual disk; the original hypervisor was ESXi.

The original system (Debian, zfs-2.0.3-9)

[   65.022435] PANIC: zfs: adding existent segment to range tree (offset=76bab1000 size=12000) 
[   65.024094] Showing stack for process 208 
[   65.024915] CPU: 0 PID: 208 Comm: z_wr_iss Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1 
[   65.026795] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 
[   65.028896] Call Trace: 
[   65.028896]  dump_stack+0x6b/0x83 
[   65.028896]  vcmn_err.cold+0x58/0x80 [spl] 
[   65.028896]  ? metaslab_rangesize64_compare+0x40/0x40 [zfs] 
[   65.028896]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs] 
[   65.028896]  ? zfs_btree_add_idx+0xd1/0x210 [zfs] 
[   65.028896]  ? zfs_btree_find+0x175/0x300 [zfs] 
[   65.028896]  zfs_panic_recover+0x6d/0x90 [zfs] 
[   65.028896]  range_tree_add_impl+0x305/0xe40 [zfs] 
[   65.028896]  ? range_tree_remove_impl+0xf10/0xf10 [zfs] 
[   65.028896]  range_tree_walk+0xad/0x1e0 [zfs] 
[   65.028896]  metaslab_load+0x359/0x8b0 [zfs] 
[   65.028896]  metaslab_activate+0x4c/0x220 [zfs] 
[   65.028896]  ? metaslab_set_selected_txg+0x7f/0xc0 [zfs] 
[   65.028896]  metaslab_alloc_dva+0x134/0x1210 [zfs] 
[   65.028896]  metaslab_alloc+0xbe/0x250 [zfs] 
[   65.028896]  zio_dva_allocate+0xd4/0x800 [zfs] 
[   65.028896]  ? _cond_resched+0x16/0x40 
[   65.028896]  ? mutex_lock+0xe/0x30 
[   65.028896]  ? metaslab_class_throttle_reserve+0xc3/0xe0 [zfs] 
[   65.028896]  ? zio_io_to_allocate+0x60/0x80 [zfs] 
[   65.028896]  zio_execute+0x81/0x120 [zfs] 
[   65.028896]  taskq_thread+0x2da/0x520 [spl] 
[   65.028896]  ? wake_up_q+0xa0/0xa0 
[   65.028896]  ? zio_destroy+0xf0/0xf0 [zfs] 
[   65.028896]  ? taskq_thread_spawn+0x50/0x50 [spl] 
[   65.028896]  kthread+0x11b/0x140 
[   65.028896]  ? __kthread_bind_mask+0x60/0x60 
[   65.028896]  ret_from_fork+0x22/0x30 

[  242.724547] INFO: task zpool:148 blocked for more than 120 seconds. 
[  242.727882]       Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1 
[  242.732106] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  242.736450] task:zpool           state:D stack:    0 pid:  148 ppid:   131 flags:0x00004002 
[  242.741604] Call Trace: 
[  242.742809]  __schedule+0x282/0x870 
[  242.744808]  schedule+0x46/0xb0 
[  242.745984]  io_schedule+0x42/0x70 
[  242.747203]  cv_wait_common+0xac/0x130 [spl] 
[  242.748743]  ? add_wait_queue_exclusive+0x70/0x70 
[  242.750497]  txg_wait_synced_impl+0xc9/0x110 [zfs] 
[  242.752330]  txg_wait_synced+0xc/0x40 [zfs] 
[  242.754256]  spa_config_update+0x3f/0x170 [zfs] 
[  242.755667]  spa_import+0x5e0/0x840 [zfs] 
[  242.757574]  zfs_ioc_pool_import+0x12f/0x150 [zfs] 
[  242.759000]  zfsdev_ioctl_common+0x697/0x870 [zfs] 
[  242.760111]  ? _copy_from_user+0x28/0x60 
[  242.761065]  zfsdev_ioctl+0x53/0xe0 [zfs] 
[  242.761988]  __x64_sys_ioctl+0x83/0xb0 
[  242.762849]  do_syscall_64+0x33/0x80 
[  242.763672]  entry_SYSCALL_64_after_hwframe+0x44/0xa9 
[  242.764840] RIP: 0033:0x7f6863364cc7 
[  242.765652] RSP: 002b:00007ffe22a5aa28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 
[  242.767341] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6863364cc7 
[  242.769220] RDX: 00007ffe22a5aaa0 RSI: 0000000000005a02 RDI: 0000000000000003 
[  242.771302] RBP: 00007ffe22a5e990 R08: 0000000000000000 R09: 00007f686342ebe0 
[  242.775328] R10: 0000000010000000 R11: 0000000000000246 R12: 0000555e380bf320 
[  242.778226] R13: 00007ffe22a5aaa0 R14: 00007f685c001970 R15: 0000000000000000 
[  242.782019] INFO: task z_wr_iss:208 blocked for more than 120 seconds. 
[  242.785910]       Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1 
[  242.790663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  242.794223] task:z_wr_iss        state:D stack:    0 pid:  208 ppid:     2 flags:0x00004000 
[  242.798323] Call Trace: 
[  242.799648]  __schedule+0x282/0x870 
[  242.801901]  schedule+0x46/0xb0 
[  242.803602]  vcmn_err.cold+0x7e/0x80 [spl] 
[  242.805956]  ? metaslab_rangesize64_compare+0x40/0x40 [zfs] 
[  242.809675]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs] 
[  242.812331]  ? zfs_btree_add_idx+0xd1/0x210 [zfs] 
[  242.814956]  ? zfs_btree_find+0x175/0x300 [zfs] 
[  242.818007]  zfs_panic_recover+0x6d/0x90 [zfs] 
[  242.820627]  range_tree_add_impl+0x305/0xe40 [zfs] 
[  242.824200]  ? range_tree_remove_impl+0xf10/0xf10 [zfs] 
[  242.826735]  range_tree_walk+0xad/0x1e0 [zfs] 
[  242.828848]  metaslab_load+0x359/0x8b0 [zfs] 
[  242.830868]  metaslab_activate+0x4c/0x220 [zfs] 
[  242.832916]  ? metaslab_set_selected_txg+0x7f/0xc0 [zfs] 
[  242.835249]  metaslab_alloc_dva+0x134/0x1210 [zfs] 
[  242.837181]  metaslab_alloc+0xbe/0x250 [zfs] 
[  242.838851]  zio_dva_allocate+0xd4/0x800 [zfs] 
[  242.841201]  ? _cond_resched+0x16/0x40 
[  242.842367]  ? mutex_lock+0xe/0x30 
[  242.843481]  ? metaslab_class_throttle_reserve+0xc3/0xe0 [zfs] 
[  242.845247]  ? zio_io_to_allocate+0x60/0x80 [zfs] 
[  242.846659]  zio_execute+0x81/0x120 [zfs] 
[  242.847864]  taskq_thread+0x2da/0x520 [spl] 
[  242.849115]  ? wake_up_q+0xa0/0xa0 
[  242.850367]  ? zio_destroy+0xf0/0xf0 [zfs] 
[  242.851759]  ? taskq_thread_spawn+0x50/0x50 [spl] 
[  242.852801]  kthread+0x11b/0x140 
[  242.853513]  ? __kthread_bind_mask+0x60/0x60 
[  242.854447]  ret_from_fork+0x22/0x30 
[  242.855235] INFO: task txg_sync:283 blocked for more than 120 seconds. 
[  242.857674]       Tainted: P           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1 
[  242.859281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  242.860949] task:txg_sync        state:D stack:    0 pid:  283 ppid:     2 flags:0x00004000 
[  242.862637] Call Trace: 
[  242.863105]  __schedule+0x282/0x870 
[  242.863826]  schedule+0x46/0xb0 
[  242.864409]  schedule_timeout+0x8b/0x140 
[  242.865538]  ? __next_timer_interrupt+0x110/0x110 
[  242.866431]  io_schedule_timeout+0x4c/0x80 
[  242.867213]  __cv_timedwait_common+0x12b/0x160 [spl] 
[  242.868192]  ? add_wait_queue_exclusive+0x70/0x70 
[  242.869155]  __cv_timedwait_io+0x15/0x20 [spl] 
[  242.870039]  zio_wait+0x129/0x2b0 [zfs] 
[  242.870799]  dsl_pool_sync+0x461/0x4f0 [zfs] 
[  242.871655]  spa_sync+0x575/0xfa0 [zfs] 
[  242.872430]  ? mutex_lock+0xe/0x30 
[  242.873927]  ? spa_txg_history_init_io+0x101/0x110 [zfs] 
[  242.875014]  txg_sync_thread+0x2e0/0x4a0 [zfs] 
[  242.875964]  ? txg_fini+0x240/0x240 [zfs] 
[  242.876809]  thread_generic_wrapper+0x6f/0x80 [spl] 
[  242.877766]  ? __thread_exit+0x20/0x20 [spl] 
[  242.878724]  kthread+0x11b/0x140 
[  242.879398]  ? __kthread_bind_mask+0x60/0x60 
[  242.880266]  ret_from_fork+0x22/0x30 
[  260.101414] random: crng init done 

The hrmpf rescue BootCD (zfs-2.1.2-1)

[  103.775614] PANIC: zfs: adding existent segment to range tree (offset=76bab1000 size=12000) 
[  103.778085] Showing stack for process 1201 
[  103.779296] CPU: 0 PID: 1201 Comm: z_wr_iss Tainted: P           O      5.15.11_1 #1 
[  103.780288] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 
[  103.780288] Call Trace: 
[  103.780288]  <TASK> 
[  103.780288]  dump_stack_lvl+0x46/0x5a 
[  103.780288]  vcmn_err.cold+0x50/0x68 [spl] 
[  103.780288]  ? kmem_cache_alloc+0x280/0x3c0 
[  103.780288]  ? metaslab_rangesize64_compare+0x40/0x40 [zfs] 
[  103.780288]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs] 
[  103.780288]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs] 
[  103.780288]  ? zfs_btree_add_idx+0xb0/0x220 [zfs] 
[  103.780288]  ? zfs_btree_find+0x175/0x300 [zfs] 
[  103.780288]  zfs_panic_recover+0x6d/0x90 [zfs] 
[  103.780288]  range_tree_add_impl+0x305/0xe40 [zfs] 
[  103.780288]  ? __schedule+0x1195/0x1480 
[  103.780288]  ? range_tree_remove_impl+0xf00/0xf00 [zfs] 
[  103.780288]  range_tree_walk+0xad/0x1e0 [zfs] 
[  103.780288]  metaslab_load+0x34c/0x8a0 [zfs] 
[  103.780288]  ? range_tree_add_impl+0x754/0xe40 [zfs] 
[  103.780288]  metaslab_activate+0x4c/0x280 [zfs] 
[  103.780288]  ? metaslab_set_selected_txg+0x7f/0xc0 [zfs] 
[  103.780288]  metaslab_alloc_dva+0x2b6/0x1490 [zfs] 
[  103.780288]  metaslab_alloc+0xcf/0x280 [zfs] 
[  103.780288]  zio_dva_allocate+0xd4/0x8d0 [zfs] 
[  103.780288]  ? __kmalloc_node+0x397/0x480 
[  103.780288]  ? spl_kmem_alloc_impl+0xae/0xf0 [spl] 
[  103.780288]  ? zio_io_to_allocate+0x63/0x80 [zfs] 
[  103.780288]  zio_execute+0x81/0x120 [zfs] 
[  103.780288]  taskq_thread+0x2cb/0x500 [spl] 
[  103.780288]  ? wake_up_q+0x90/0x90 
[  103.780288]  ? zio_gang_tree_free+0x60/0x60 [zfs] 
[  103.780288]  ? taskq_thread_spawn+0x50/0x50 [spl] 
[  103.780288]  kthread+0x127/0x150 
[  103.780288]  ? set_kthread_struct+0x40/0x40 
[  103.780288]  ret_from_fork+0x22/0x30 
[  103.780288]  </TASK> 

Thank you!

rincebrain commented 2 years ago

I'd probably try using the tunable zil_replay_disable to throw out the ZIL, since I'm assuming it's panicking because it's trying to add an element on replay (I don't think it should have been able to persistently add it in the first place without tripping the same condition).

raron commented 2 years ago

Thanks, I tried it with the system from the hrmpf rescue CD, but unfortunately got the same error.

# echo 1 > /sys/module/zfs/parameters/zil_replay_disable
# zpool import -f rpool
[  125.388332] PANIC: zfs: adding existent segment to range tree (offset=76bab1000 size=12000)
[  125.390581] Showing stack for process 1197
[  125.391404] CPU: 0 PID: 1197 Comm: z_wr_iss Tainted: P           O      5.15.11_1 #1
[  125.392388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  125.392388] Call Trace:
[  125.392388]  <TASK>
[  125.392388]  dump_stack_lvl+0x46/0x5a
[  125.392388]  vcmn_err.cold+0x50/0x68 [spl]
[  125.392388]  ? kmem_cache_alloc+0x280/0x3c0
[  125.392388]  ? metaslab_rangesize64_compare+0x40/0x40 [zfs]
[  125.392388]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs]
[  125.392388]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs]
[  125.392388]  ? zfs_btree_add_idx+0xb0/0x220 [zfs]
[  125.392388]  ? zfs_btree_find+0x175/0x300 [zfs]
[  125.392388]  zfs_panic_recover+0x6d/0x90 [zfs]
[  125.392388]  range_tree_add_impl+0x305/0xe40 [zfs]
[  125.392388]  ? range_tree_remove_impl+0xf00/0xf00 [zfs]
[  125.392388]  range_tree_walk+0xad/0x1e0 [zfs]
[  125.392388]  metaslab_load+0x34c/0x8a0 [zfs]
[  125.392388]  ? range_tree_add_impl+0x754/0xe40 [zfs]
[  125.392388]  metaslab_activate+0x4c/0x280 [zfs]
[  125.392388]  ? metaslab_set_selected_txg+0x7f/0xc0 [zfs]
[  125.392388]  metaslab_alloc_dva+0x2b6/0x1490 [zfs]
[  125.392388]  metaslab_alloc+0xcf/0x280 [zfs]
[  125.392388]  zio_dva_allocate+0xd4/0x8d0 [zfs]
[  125.392388]  ? __kmalloc_node+0x397/0x480
[  125.392388]  ? spl_kmem_alloc_impl+0xae/0xf0 [spl]
[  125.392388]  ? zio_io_to_allocate+0x63/0x80 [zfs]
[  125.392388]  zio_execute+0x81/0x120 [zfs]
[  125.392388]  taskq_thread+0x2cb/0x500 [spl]
[  125.392388]  ? wake_up_q+0x90/0x90
[  125.392388]  ? zio_gang_tree_free+0x60/0x60 [zfs]
[  125.392388]  ? taskq_thread_spawn+0x50/0x50 [spl]
[  125.392388]  kthread+0x127/0x150
[  125.392388]  ? set_kthread_struct+0x40/0x40
[  125.392388]  ret_from_fork+0x22/0x30
[  125.392388]  </TASK>

Spoons commented 2 years ago

I'm having this issue as well on Arch Linux with 5.18.9 kernel with zfs built from git @ 74230a5bc1be6e5e84a5f41b26f6f65a155078f0.

I was able to mount the disk as read-write with:

echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover

Scrubbing now.

bitwise0perator commented 2 years ago

Just to keep this thread fresh: I have encountered a problem generally matching this description on FreeBSD 12.3-RELEASE-p5.

The issue appears to have begun with a kernel panic which occurred in the middle of my standard everyday work on the system (it serves as an NFS host for my workstation's home directory). That occurred at approximately 2022-08-02 14:30, but it took me a while to figure out what was going on. This is the first in a series of kernel panics documented by my system and related to the issue:

Aug  2 18:19:31 kernel: Fatal trap 12: page fault while in kernel mode
Aug  2 18:19:31 kernel: cpuid = 3; apic id = 03
Aug  2 18:19:31 kernel: fault virtual address      = 0x38
Aug  2 18:19:31 kernel: fault code         = supervisor write data, page not present
Aug  2 18:19:31 kernel: instruction pointer        = 0x20:0xffffffff827c0ba6
Aug  2 18:19:31 kernel: stack pointer              = 0x28:0xfffffe004edde840
Aug  2 18:19:31 kernel: frame pointer              = 0x28:0xfffffe004edde870
Aug  2 18:19:31 kernel: code segment               = base 0x0, limit 0xfffff, type 0x1b
Aug  2 18:19:31 kernel:                    = DPL 0, pres 1, long 1, def32 0, gran 1
Aug  2 18:19:31 kernel: processor eflags   = interrupt enabled, resume, IOPL = 0
Aug  2 18:19:31 kernel: current process            = 1011 (txg_thread_enter)
Aug  2 18:19:31 kernel: trap number                = 12
Aug  2 18:19:31 kernel: panic: page fault
Aug  2 18:19:31 kernel: cpuid = 3
Aug  2 18:19:31 kernel: time = 1659464287
Aug  2 18:19:31 kernel: KDB: stack backtrace:
Aug  2 18:19:31 kernel: #0 0xffffffff80c2d4a5 at kdb_backtrace+0x65
Aug  2 18:19:31 kernel: #1 0xffffffff80be163b at vpanic+0x17b
Aug  2 18:19:31 kernel: #2 0xffffffff80be14b3 at panic+0x43
Aug  2 18:19:31 kernel: #3 0xffffffff810fd961 at trap_fatal+0x391
Aug  2 18:19:31 kernel: #4 0xffffffff810fd9bf at trap_pfault+0x4f
Aug  2 18:19:31 kernel: #5 0xffffffff810fd006 at trap+0x286
Aug  2 18:19:31 kernel: #6 0xffffffff810d51a8 at calltrap+0x8
Aug  2 18:19:31 kernel: #7 0xffffffff8276e39c at dmu_tx_commit+0x13c
Aug  2 18:19:31 kernel: #8 0xffffffff827b2692 at spa_sync+0x1092
Aug  2 18:19:31 kernel: #9 0xffffffff827c063b at txg_sync_thread+0x34b
Aug  2 18:19:31 kernel: #10 0xffffffff80ba2a0e at fork_exit+0x7e
Aug  2 18:19:31 kernel: #11 0xffffffff810d61de at fork_trampoline+0xe
Aug  2 18:19:31 kernel: Uptime: 21d3h10m50s

Subsequently, any attempt to import the zpool would result in the same sort of panic described in this issue:

Aug  2 18:32:57 kernel: panic: Solaris(panic): zfs: allocating allocated segment(offset=14905978232832 size=75264) of (offset=14905978232832 size=93696)
Aug  2 18:32:57 kernel: 
Aug  2 18:32:57 kernel: cpuid = 3
Aug  2 18:32:57 kernel: time = 1659465130
Aug  2 18:32:57 kernel: KDB: stack backtrace:
Aug  2 18:32:57 kernel: #0 0xffffffff80c2d4a5 at kdb_backtrace+0x65
Aug  2 18:32:57 kernel: #1 0xffffffff80be163b at vpanic+0x17b
Aug  2 18:32:57 kernel: #2 0xffffffff80be14b3 at panic+0x43
Aug  2 18:32:57 kernel: #3 0xffffffff8296617d at vcmn_err+0xcd
Aug  2 18:32:57 kernel: #4 0xffffffff827bce49 at zfs_panic_recover+0x59
Aug  2 18:32:57 kernel: #5 0xffffffff827a06c8 at range_tree_add_impl+0x1d8
Aug  2 18:32:57 kernel: #6 0xffffffff827a10a8 at range_tree_vacate+0x98
Aug  2 18:32:57 kernel: #7 0xffffffff827999bd at metaslab_sync_done+0x26d
Aug  2 18:32:57 kernel: #8 0xffffffff827c830b at vdev_sync_done+0x4b
Aug  2 18:32:57 kernel: #9 0xffffffff827b283b at spa_sync+0x123b
Aug  2 18:32:57 kernel: #10 0xffffffff827c063b at txg_sync_thread+0x34b
Aug  2 18:32:57 kernel: #11 0xffffffff80ba2a0e at fork_exit+0x7e
Aug  2 18:32:57 kernel: #12 0xffffffff810d61de at fork_trampoline+0xe
Aug  2 18:32:57 kernel: Uptime: 1m15s

This is from the next attempt to import; I'm providing it since the stacktrace path seems perhaps sufficiently different to assist in identifying the cause:

Aug  2 18:35:41 kernel: panic: Solaris(panic): zfs: allocating allocated segment(offset=14905978189824 size=38400) of (offset=14905978189824 size=38400)
Aug  2 18:35:41 kernel: 
Aug  2 18:35:41 kernel: cpuid = 1
Aug  2 18:35:41 kernel: time = 1659465228
Aug  2 18:35:41 kernel: KDB: stack backtrace:
Aug  2 18:35:41 kernel: #0 0xffffffff80c2d4a5 at kdb_backtrace+0x65
Aug  2 18:35:41 kernel: #1 0xffffffff80be163b at vpanic+0x17b
Aug  2 18:35:41 kernel: #2 0xffffffff80be14b3 at panic+0x43
Aug  2 18:35:41 kernel: #3 0xffffffff8296617d at vcmn_err+0xcd
Aug  2 18:35:41 kernel: #4 0xffffffff827bce49 at zfs_panic_recover+0x59
Aug  2 18:35:41 kernel: #5 0xffffffff827a06c8 at range_tree_add_impl+0x1d8
Aug  2 18:35:41 kernel: #6 0xffffffff827be994 at space_map_load_callback+0x64
Aug  2 18:35:41 kernel: #7 0xffffffff827be3a1 at space_map_iterate+0x2c1
Aug  2 18:35:41 kernel: #8 0xffffffff827be904 at space_map_load_length+0x84
Aug  2 18:35:41 kernel: #9 0xffffffff82798ca4 at metaslab_load+0xa4
Aug  2 18:35:41 kernel: #10 0xffffffff8279e180 at metaslab_activate+0x30
Aug  2 18:35:41 kernel: #11 0xffffffff8279be8f at metaslab_alloc_dva+0x90f
Aug  2 18:35:41 kernel: #12 0xffffffff8279d823 at metaslab_alloc+0xc3
Aug  2 18:35:41 kernel: #13 0xffffffff827fca5d at zio_dva_allocate+0xbd
Aug  2 18:35:41 kernel: #14 0xffffffff827f9a2c at zio_execute+0xac
Aug  2 18:35:41 kernel: #15 0xffffffff80c3fdf4 at taskqueue_run_locked+0x144
Aug  2 18:35:41 kernel: #16 0xffffffff80c411e6 at taskqueue_thread_loop+0xb6
Aug  2 18:35:41 kernel: #17 0xffffffff80ba2a0e at fork_exit+0x7e
Aug  2 18:35:41 kernel: Uptime: 57s

After the above attempts, I determined that I was able to successfully import the zpool in readonly mode. I sequestered away the important changes I had made to the data since the latest backup operation the previous night and I exported the zpool.

I then attempted to import using a previous TXG as the original author of this issue mentioned above, but while that did keep the import operation running for about 8 hours (rather than panicking immediately), it ultimately resulted in the same kernel panic. Sadly, I don't have any stack trace output for that one, just this:

Panic String: Solaris(panic): zfs: adding existent segment to range tree (offset=d8e9188f800 size=9600)

I then attempted upgrading from FreeBSD 12.2-RELEASE to 13.1-RELEASE, but sadly, that did not resolve the issue.

Finally, I used the following tunables to successfully import my zpool in read/write mode:

sysctl vfs.zfs.spa.load_verify_data=0
sysctl vfs.zfs.spa.load_verify_metadata=0
sysctl vfs.zfs.recover=1
sysctl vfs.zfs.zil.replay_disable=1
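
If these need to survive a reboot (so they are already in effect before the import), the same settings can presumably be added to /etc/sysctl.conf as plain name=value lines; that is an assumption on my part, not something I verified:

vfs.zfs.spa.load_verify_data=0
vfs.zfs.spa.load_verify_metadata=0
vfs.zfs.recover=1
vfs.zfs.zil.replay_disable=1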

Perhaps interestingly: without the zil.replay_disable tunable being set (but the other three tunables being set as shown), the import operation continues to fail with a panic featuring the same "adding existent segment to range tree" issue. If the zil.replay_disable tunable is set, I am able to import the pool although I receive the following warnings:

Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2b0c11c200 size=6c00)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2dc59ff000 size=3c00)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2e0b8d6800 size=a800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c437970ca00 size=1800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c437970fa00 size=5400)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c5210f89600 size=3c00)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c53423d3a00 size=a800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc039b68200 size=3000)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc11a791400 size=600)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc11a792c00 size=1800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc212225a00 size=2400)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d034d0d4200 size=1800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d051620fa00 size=6600)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d05bd0a6000 size=1800)
Aug  4 06:59:49 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2b0c11c200 size=6c00)
Aug  4 06:59:49 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2dc59ff000 size=3c00)
Aug  4 06:59:49 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2e0b8d6800 size=a800)
Aug  4 06:59:50 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c437970ca00 size=1800)
Aug  4 06:59:50 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c437970fa00 size=5400)
Aug  4 06:59:50 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c5210f89600 size=3c00)
Aug  4 06:59:50 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c53423d3a00 size=a800)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc039b68200 size=3000)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc11a791400 size=600)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc11a792c00 size=1800)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=cc212225a00 size=2400)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d034d0d4200 size=1800)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d051620fa00 size=6600)
Aug  4 06:59:52 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d05bd0a6000 size=1800)

Once I got my pool imported in read/write mode, I executed a full scrub which found zero data errors and repaired 0B. I thought this might put me in the clear, but while the zpool export operation goes smoothly, the next zpool import operation will result in the same panic (albeit with new offset values, it would seem). Again, if I put in place the previously-described tunable configuration, I am able to import the pool with the warnings shown above and make use of the pool as I normally would (at least, so far..).

I would appreciate any advice anyone might have on this front. I have ordered a third external backup drive to which I am going to make a third replica of my pool before I try something as painful and potentially messy as a full restoration from backup.

Oh, incidentally, I did also observe that I could cause ZFS to throw those warnings if I listed snapshots for a particular dataset within my pool (the one in which I was active at the time of the initial kernel panic). I thought I was clever and tried to remove just the latest snapshot for that dataset, but it did not resolve the problem.

bitwise0perator commented 2 years ago

It seems that my current situation is: I have some incorrect extent allocations in my metaslabs. Given the successful scrub, it appears that, aside from the problematic (duplicate?) entries in the metaslabs, the non-problematic entries in the metaslabs correctly correspond to the data on the disk. This may be instructive:

# zdb -AAA -b tank

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 0, metaslab 97 of 130 ...
WARNING: zfs: removing nonexistent segment from range tree (offset=c2b0c11c200 size=6c00)
WARNING: zfs: removing nonexistent segment from range tree (offset=c2dc59ff000 size=3c00)
WARNING: zfs: removing nonexistent segment from range tree (offset=c2e0b8d6800 size=a800)
loading concrete vdev 0, metaslab 102 of 130 ...
WARNING: zfs: removing nonexistent segment from range tree (offset=cc039b68200 size=3000)
WARNING: zfs: removing nonexistent segment from range tree (offset=cc11a791400 size=600)
WARNING: zfs: removing nonexistent segment from range tree (offset=cc11a792c00 size=1800)
WARNING: zfs: removing nonexistent segment from range tree (offset=cc212225a00 size=2400)
loading concrete vdev 0, metaslab 104 of 130 ...
WARNING: zfs: removing nonexistent segment from range tree (offset=d034d0d4200 size=1800)
WARNING: zfs: removing nonexistent segment from range tree (offset=d051620fa00 size=6600)
WARNING: zfs: removing nonexistent segment from range tree (offset=d05bd0a6000 size=1800)

It's unclear to me at the moment why the zdb command issues warnings to the effect that it is "removing nonexistent segment"s from the range tree, whereas the import declares seemingly the opposite problem, but I am sure that makes perfect sense when one better understands ZFS internals. The offset and size values described here certainly seem to match up nicely with those reported as problems during the import operation, so I'm pretty sure I'm on the right path.

I have been searching for any utilities that I might invoke to correct metaslab entries, but have come up empty-handed thus far.
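
Side note for anyone digging in the same way: the per-record ALLOC/FREE spacemap listings in the next comment can be dumped with zdb's metaslab flags (repeating -m raises the verbosity; -mmmm should print every spacemap record). A sketch, not necessarily the exact invocation used:

# zdb -mmmm tank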

bitwise0perator commented 2 years ago

It looks like these are principally double-free issues from zdb output:

Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2b0c11c200 size=6c00)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2dc59ff000 size=3c00)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=c2e0b8d6800 size=a800)

From the metaslab output from zdb:

            [323914] FREE: txg 21134970 pass 2
            [323915]    F  range: c2b0c11c200-c2b0c122e00  size: 006c00 vdev: 16777216 words: 1            <-- Freed the first time
            [323916]    F  range: c2dc59ff000-c2dc5a02c00  size: 003c00 vdev: 16777216 words: 1
            [323917]    F  range: c2e0b8d6800-c2e0b8e1000  size: 00a800 vdev: 16777216 words: 1
            [323918] ALLOC: txg 21134970 pass 3
            [323919]    A  range: c2dc5a1f400-c2dc5a20000  size: 000c00 vdev: 16777216 words: 1
            [323920] ALLOC: txg 21134971 pass 1
            [323921]    A  range: c2000000000-c2000000c00  size: 000c00 vdev: 16777216 words: 1
            [323922]    A  range: c2000034e00-c2000036600  size: 001800 vdev: 16777216 words: 1
            [323923]    A  range: c2000042600-c2000046e00  size: 004800 vdev: 16777216 words: 1
            [323924]    A  range: c200004f200-c2000051c00  size: 002a00 vdev: 16777216 words: 1
            [323925]    A  range: c2000056a00-c2000058200  size: 001800 vdev: 16777216 words: 1
            [323926]    A  range: c2000059a00-c200005fa00  size: 006000 vdev: 16777216 words: 1
            [323927]    A  range: c200006e400-c2000071a00  size: 003600 vdev: 16777216 words: 1
            [323928] FREE: txg 21134971 pass 1
            [323929]    F  range: c2000060600-c2000060c00  size: 000600 vdev: 16777216 words: 1
            [323930]    F  range: c2000085e00-c2000086400  size: 000600 vdev: 16777216 words: 1
            [323931]    F  range: c2af3232400-c2af3233600  size: 001200 vdev: 16777216 words: 1
            [323932]    F  range: c2afd7a2c00-c2afd7a3e00  size: 001200 vdev: 16777216 words: 1
            [323933]    F  range: c2b0c0c4600-c2b0c0c5800  size: 001200 vdev: 16777216 words: 1
            [323934]    F  range: c2b0c12b200-c2b0c12c400  size: 001200 vdev: 16777216 words: 1
            [323935]    F  range: c2de2b67400-c2de2b68c00  size: 001800 vdev: 16777216 words: 1
            [323936]    F  range: c2de2be2400-c2de2be3c00  size: 001800 vdev: 16777216 words: 1
            [323937]    F  range: c2de2d65400-c2de2d66c00  size: 001800 vdev: 16777216 words: 1
            [323938]    F  range: c2de2de4c00-c2de2de6400  size: 001800 vdev: 16777216 words: 1
            [323939]    F  range: c2de2de9400-c2de2deac00  size: 001800 vdev: 16777216 words: 1
            [323940]    F  range: c2de2e3d400-c2de2e3ec00  size: 001800 vdev: 16777216 words: 1
            [323941]    F  range: c2de2ea3400-c2de2ea6400  size: 003000 vdev: 16777216 words: 1
            [323942]    F  range: c2e0b8f1800-c2e0b8f4800  size: 003000 vdev: 16777216 words: 1
            [323943] ALLOC: txg 21134971 pass 2
            [323944]    A  range: c200001bc00-c200001c800  size: 000c00 vdev: 16777216 words: 1
            [323945]    A  range: c2000046e00-c2000047a00  size: 000c00 vdev: 16777216 words: 1
            [323946]    A  range: c2000054c00-c2000055800  size: 000c00 vdev: 16777216 words: 1
            [323947]    A  range: c2000071a00-c2000075c00  size: 004200 vdev: 16777216 words: 1
            [323948]    A  range: c2000078600-c200007ec00  size: 006600 vdev: 16777216 words: 1
            [323949]    A  range: c2000082200-c2000083a00  size: 001800 vdev: 16777216 words: 1
            [323950]    A  range: c2000086400-c2000087c00  size: 001800 vdev: 16777216 words: 1
            [323951]    A  range: c200009ba00-c200009d200  size: 001800 vdev: 16777216 words: 1
            [323952]    A  range: c20000a0800-c20000a2000  size: 001800 vdev: 16777216 words: 1
            [323953]    A  range: c20000a2c00-c20000a4400  size: 001800 vdev: 16777216 words: 1
            [323954]    A  range: c20000a7a00-c20000a9e00  size: 002400 vdev: 16777216 words: 1
            [323955] FREE: txg 21134971 pass 2
            [323956]    F  range: c2b0c11c200-c2b0c122e00  size: 006c00 vdev: 16777216 words: 1            <-- Double freed, it appears         
            [323957]    F  range: c2dc59ff000-c2dc5a02c00  size: 003c00 vdev: 16777216 words: 1
            [323958]    F  range: c2e0b8d6800-c2e0b8e1000  size: 00a800 vdev: 16777216 words: 1

Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d034d0d4200 size=1800)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d051620fa00 size=6600)
Aug  4 06:59:38 kernel: Solaris: WARNING: zfs: adding existent segment to range tree (offset=d05bd0a6000 size=1800)

            [203560] FREE: txg 21134970 pass 2
            [203561]    F  range: d034d0d4200-d034d0d5a00  size: 001800 vdev: 16777216 words: 1            <-- Freed the first time
            [203562]    F  range: d051620fa00-d0516216000  size: 006600 vdev: 16777216 words: 1
            [203563]    F  range: d05bd0a6000-d05bd0a7800  size: 001800 vdev: 16777216 words: 1
            [203564] ALLOC: txg 21134971 pass 1
            [203565]    A  range: d0000006000-d0000007200  size: 001200 vdev: 16777216 words: 1
            [203566] FREE: txg 21134971 pass 1
            [203567]    F  range: d0516224400-d0516224a00  size: 000600 vdev: 16777216 words: 1
            [203568] ALLOC: txg 21134971 pass 2
            [203569]    A  range: d0000007200-d0000009000  size: 001e00 vdev: 16777216 words: 1
            [203570]    A  range: d0000028200-d000002fa00  size: 007800 vdev: 16777216 words: 1
            [203571] FREE: txg 21134971 pass 2
            [203572]    F  range: d034d0d4200-d034d0d5a00  size: 001800 vdev: 16777216 words: 1            <-- Double freed, it appears
            [203573]    F  range: d051620fa00-d0516216000  size: 006600 vdev: 16777216 words: 1
            [203574]    F  range: d05bd0a6000-d05bd0a7800  size: 001800 vdev: 16777216 words: 1

Looks like txg 21134971 is the culprit.

bitwise0perator commented 2 years ago

As an update:

After reading about metaslabs, spacemaps, and their maintenance, I came to believe that the following things are true:

  1. The errors I was experiencing combined with my zdb output showed that my zpool was experiencing duplicated free range allocations through some sort of zfs software error
  2. Enabling recovery mode for the zpool permitted it to be operated in spite of the otherwise-fatal errors that resulted from encountering those duplicated free range allocations while the import read the on-disk spacemaps into memory
  3. I expect the warning events (formerly errors, but reduced through the vfs.zfs.recover tunable) are handled in the zfs software by ignoring the duplicated free range allocation
  4. If all of the preceding are correct, then my pool was not at any obvious risk of data loss; my in-memory spacemaps would be accurate representations of the free space on disk, and I would simply need to have the on-disk spacemaps rewritten without those duplicate entries
  5. Because zfs is constantly rewriting spacemaps in condensing operations, I was hopeful that operating my pool in recovery mode for a day or so would result in the eventual removal of the duplicated range descriptions during those rewrite operations

Lo and behold, that appears to have occurred. At the end of the day, I executed the zdb command to view free range allocations and found that all of the duplicated segments had been subsumed into consolidated descriptions. Expecting this to mean my problem had disappeared (the fact that the zdb command threw no errors when executed seemed to confirm it), I exported the pool, restarted my machine, and imported it without any tunable adjustments (as described above).

And it imported perfectly. A subsequent scrub shows zero bytes repaired and zero data errors.

Hopefully this helps someone else if he or she should encounter the same issue. Please, by all means, someone correct me if I have erred in my assessment here.

power-max commented 2 years ago

I am having this exact same issue on TrueNAS Scale. I am able to import the pool using readonly=on, however I get nearly the exact same panic if I remove the option: truenas PANIC: zfs: adding existent segment to range tree (offset=8971f33000 size=1000)

Here is what dmesg shows:

[54738.442599] PANIC: zfs: adding existent segment to range tree (offset=8971f33000 size=1000)
[54738.443228] Showing stack for process 349075
[54738.443541] CPU: 7 PID: 349075 Comm: txg_sync Tainted: P           OE     5.10.81+truenas #1
[54738.444054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[54738.444885] Call Trace:
[54738.445059]  dump_stack+0x6b/0x83
[54738.445314]  vcmn_err.cold+0x58/0x80 [spl]
[54738.445640]  ? zfs_btree_remove_idx+0xb4/0x9d0 [zfs]
[54738.446021]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs]
[54738.446515]  zfs_panic_recover+0x6d/0x90 [zfs]
[54738.446826]  range_tree_add_impl+0x305/0xe40 [zfs]
[54738.447133]  ? lock_timer_base+0x61/0x80
[54738.447387]  ? _cond_resched+0x16/0x40
[54738.447668]  metaslab_free_concrete+0x11d/0x250 [zfs]
[54738.448001]  metaslab_free_impl+0xa9/0xe0 [zfs]
[54738.448438]  metaslab_free+0x168/0x190 [zfs]
[54738.448822]  zio_free_sync+0xda/0xf0 [zfs]
[54738.449153]  dsl_scan_free_block_cb+0x65/0x1a0 [zfs]
[54738.449886]  bpobj_iterate_blkptrs+0xfe/0x360 [zfs]
[54738.450503]  ? dsl_scan_free_block_cb+0x1a0/0x1a0 [zfs]
[54738.451143]  bpobj_iterate_impl+0x29a/0x550 [zfs]
[54738.451701]  ? dsl_scan_free_block_cb+0x1a0/0x1a0 [zfs]
[54738.452266]  dsl_scan_sync+0x552/0x1350 [zfs]
[54738.452867]  ? kfree+0xba/0x480
[54738.453399]  ? bplist_iterate+0x115/0x130 [zfs]
[54738.454001]  spa_sync+0x5b3/0xfa0 [zfs]
[54738.454573]  ? mutex_lock+0xe/0x30
[54738.455020]  ? spa_txg_history_init_io+0x101/0x110 [zfs]
[54738.455673]  txg_sync_thread+0x2e0/0x4a0 [zfs]
[54738.456209]  ? txg_fini+0x250/0x250 [zfs]
[54738.457114]  thread_generic_wrapper+0x6f/0x80 [spl]
[54738.457777]  ? __thread_exit+0x20/0x20 [spl]
[54738.458288]  kthread+0x11b/0x140
[54738.458848]  ? __kthread_bind_mask+0x60/0x60
[54738.459358]  ret_from_fork+0x22/0x30
[54858.855012] INFO: task zpool:348705 blocked for more than 120 seconds.
[54858.856178]       Tainted: P           OE     5.10.81+truenas #1
[54858.857156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54858.858067] task:zpool           state:D stack:    0 pid:348705 ppid:348341 flags:0x00004002
[54858.859158] Call Trace:
[54858.859740]  __schedule+0x282/0x870
[54858.860242]  schedule+0x46/0xb0
[54858.860793]  io_schedule+0x42/0x70
[54858.861279]  cv_wait_common+0xac/0x130 [spl]
[54858.861851]  ? add_wait_queue_exclusive+0x70/0x70
[54858.862493]  txg_wait_synced_impl+0xc9/0x110 [zfs]
[54858.863180]  txg_wait_synced+0xc/0x40 [zfs]
[54858.863957]  spa_config_update+0x3f/0x170 [zfs]
[54858.864510]  spa_import+0x5e0/0x840 [zfs]
[54858.865067]  zfs_ioc_pool_import+0x12f/0x150 [zfs]
[54858.865622]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54858.866202]  ? __kmalloc_node+0x22d/0x2b0
[54858.866731]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54858.867323]  __x64_sys_ioctl+0x83/0xb0
[54858.867930]  do_syscall_64+0x33/0x80
[54858.868502]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54858.869099] RIP: 0033:0x7f12650bfcc7
[54858.869634] RSP: 002b:00007ffdbf5e4008 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54858.870388] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f12650bfcc7
[54858.871162] RDX: 00007ffdbf5e4080 RSI: 0000000000005a02 RDI: 0000000000000003
[54858.872034] RBP: 00007ffdbf5e7f70 R08: 0000000000000002 R09: 00007f1265189be0
[54858.873303] R10: 00000000000348c0 R11: 0000000000000246 R12: 0000561ee2941e60
[54858.874391] R13: 00007ffdbf5e4080 R14: 00007f12440010e8 R15: 0000000000000000
[54979.687028] INFO: task middlewared (wo:1896 blocked for more than 120 seconds.
[54979.688418]       Tainted: P           OE     5.10.81+truenas #1
[54979.689592] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.690876] task:middlewared (wo state:D stack:    0 pid: 1896 ppid:   801 flags:0x00000000
[54979.692270] Call Trace:
[54979.693161]  __schedule+0x282/0x870
[54979.694013]  ? __kmalloc_node+0x141/0x2b0
[54979.694969]  schedule+0x46/0xb0
[54979.695829]  schedule_preempt_disabled+0xa/0x10
[54979.696751]  __mutex_lock.constprop.0+0x133/0x460
[54979.697726]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
[54979.698781]  spa_all_configs+0x41/0x120 [zfs]
[54979.699831]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
[54979.700814]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.701845]  ? __kmalloc_node+0x22d/0x2b0
[54979.702751]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.703732]  __x64_sys_ioctl+0x83/0xb0
[54979.704594]  do_syscall_64+0x33/0x80
[54979.705288]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.705832] RIP: 0033:0x7f96ca7a7cc7
[54979.706268] RSP: 002b:00007ffc402edc98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.706962] RAX: ffffffffffffffda RBX: 0000000003a91ed0 RCX: 00007f96ca7a7cc7
[54979.707673] RDX: 00007ffc402edcc0 RSI: 0000000000005a04 RDI: 0000000000000017
[54979.708329] RBP: 00007ffc402f12b0 R08: 0000000003ae97d0 R09: 00007f96ca871be0
[54979.709025] R10: 0000000000000040 R11: 0000000000000246 R12: 0000000003a91ed0
[54979.709695] R13: 0000000000000000 R14: 00007ffc402edcc0 R15: 00007f96c9783ca0
[54979.710335] INFO: task middlewared (wo:2556 blocked for more than 120 seconds.
[54979.711035]       Tainted: P           OE     5.10.81+truenas #1
[54979.711697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.712406] task:middlewared (wo state:D stack:    0 pid: 2556 ppid:   801 flags:0x00000000
[54979.713186] Call Trace:
[54979.713604]  __schedule+0x282/0x870
[54979.714055]  ? __kmalloc_node+0x141/0x2b0
[54979.714576]  schedule+0x46/0xb0
[54979.715060]  schedule_preempt_disabled+0xa/0x10
[54979.715643]  __mutex_lock.constprop.0+0x133/0x460
[54979.716181]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
[54979.716779]  spa_all_configs+0x41/0x120 [zfs]
[54979.717334]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
[54979.717893]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.718490]  ? __kmalloc_node+0x22d/0x2b0
[54979.719173]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.719734]  __x64_sys_ioctl+0x83/0xb0
[54979.720208]  do_syscall_64+0x33/0x80
[54979.720702]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.721258] RIP: 0033:0x7fa58b178cc7
[54979.721724] RSP: 002b:00007fff3dc8df48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.722403] RAX: ffffffffffffffda RBX: 0000000003862f20 RCX: 00007fa58b178cc7
[54979.723110] RDX: 00007fff3dc8df70 RSI: 0000000000005a04 RDI: 0000000000000017
[54979.723761] RBP: 00007fff3dc91560 R08: 00000000038d21d0 R09: 00007fa58b242be0
[54979.724434] R10: 0000000000040000 R11: 0000000000000246 R12: 0000000003862f20
[54979.725124] R13: 0000000000000000 R14: 00007fff3dc8df70 R15: 00007fa58a154ca0
[54979.725777] INFO: task zpool:348705 blocked for more than 241 seconds.
[54979.726384]       Tainted: P           OE     5.10.81+truenas #1
[54979.727050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.727820] task:zpool           state:D stack:    0 pid:348705 ppid:348341 flags:0x00004002
[54979.728582] Call Trace:
[54979.729076]  __schedule+0x282/0x870
[54979.729606]  schedule+0x46/0xb0
[54979.730054]  io_schedule+0x42/0x70
[54979.730513]  cv_wait_common+0xac/0x130 [spl]
[54979.731194]  ? add_wait_queue_exclusive+0x70/0x70
[54979.731904]  txg_wait_synced_impl+0xc9/0x110 [zfs]
[54979.732634]  txg_wait_synced+0xc/0x40 [zfs]
[54979.733317]  spa_config_update+0x3f/0x170 [zfs]
[54979.733961]  spa_import+0x5e0/0x840 [zfs]
[54979.734511]  zfs_ioc_pool_import+0x12f/0x150 [zfs]
[54979.735219]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.735940]  ? __kmalloc_node+0x22d/0x2b0
[54979.736545]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.737126]  __x64_sys_ioctl+0x83/0xb0
[54979.737813]  do_syscall_64+0x33/0x80
[54979.738349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.738869] RIP: 0033:0x7f12650bfcc7
[54979.739430] RSP: 002b:00007ffdbf5e4008 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.740197] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f12650bfcc7
[54979.740926] RDX: 00007ffdbf5e4080 RSI: 0000000000005a02 RDI: 0000000000000003
[54979.741653] RBP: 00007ffdbf5e7f70 R08: 0000000000000002 R09: 00007f1265189be0
[54979.742571] R10: 00000000000348c0 R11: 0000000000000246 R12: 0000561ee2941e60
[54979.743326] R13: 00007ffdbf5e4080 R14: 00007f12440010e8 R15: 0000000000000000
[54979.744155] INFO: task txg_sync:349075 blocked for more than 120 seconds.
[54979.745029]       Tainted: P           OE     5.10.81+truenas #1
[54979.745767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.746561] task:txg_sync        state:D stack:    0 pid:349075 ppid:     2 flags:0x00004000
[54979.747442] Call Trace:
[54979.748012]  __schedule+0x282/0x870
[54979.748588]  schedule+0x46/0xb0
[54979.749284]  vcmn_err.cold+0x7e/0x80 [spl]
[54979.750392]  ? zfs_btree_remove_idx+0xb4/0x9d0 [zfs]
[54979.751492]  ? zfs_btree_insert_into_leaf+0x233/0x2a0 [zfs]
[54979.752628]  zfs_panic_recover+0x6d/0x90 [zfs]
[54979.753620]  range_tree_add_impl+0x305/0xe40 [zfs]
[54979.754658]  ? lock_timer_base+0x61/0x80
[54979.755677]  ? _cond_resched+0x16/0x40
[54979.756656]  metaslab_free_concrete+0x11d/0x250 [zfs]
[54979.757766]  metaslab_free_impl+0xa9/0xe0 [zfs]
[54979.758822]  metaslab_free+0x168/0x190 [zfs]
[54979.759921]  zio_free_sync+0xda/0xf0 [zfs]
[54979.761031]  dsl_scan_free_block_cb+0x65/0x1a0 [zfs]
[54979.762024]  bpobj_iterate_blkptrs+0xfe/0x360 [zfs]
[54979.763212]  ? dsl_scan_free_block_cb+0x1a0/0x1a0 [zfs]
[54979.764270]  bpobj_iterate_impl+0x29a/0x550 [zfs]
[54979.765400]  ? dsl_scan_free_block_cb+0x1a0/0x1a0 [zfs]
[54979.766502]  dsl_scan_sync+0x552/0x1350 [zfs]
[54979.767706]  ? kfree+0xba/0x480
[54979.768621]  ? bplist_iterate+0x115/0x130 [zfs]
[54979.769564]  spa_sync+0x5b3/0xfa0 [zfs]
[54979.770501]  ? mutex_lock+0xe/0x30
[54979.771550]  ? spa_txg_history_init_io+0x101/0x110 [zfs]
[54979.772690]  txg_sync_thread+0x2e0/0x4a0 [zfs]
[54979.773727]  ? txg_fini+0x250/0x250 [zfs]
[54979.774672]  thread_generic_wrapper+0x6f/0x80 [spl]
[54979.775668]  ? __thread_exit+0x20/0x20 [spl]
[54979.776605]  kthread+0x11b/0x140
[54979.777497]  ? __kthread_bind_mask+0x60/0x60
[54979.778404]  ret_from_fork+0x22/0x30
[54979.779248] INFO: task middlewared (wo:349208 blocked for more than 120 seconds.
[54979.780467]       Tainted: P           OE     5.10.81+truenas #1
[54979.781578] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.782852] task:middlewared (wo state:D stack:    0 pid:349208 ppid:   801 flags:0x00000000
[54979.784360] Call Trace:
[54979.785398]  __schedule+0x282/0x870
[54979.786451]  schedule+0x46/0xb0
[54979.787826]  schedule_preempt_disabled+0xa/0x10
[54979.788827]  __mutex_lock.constprop.0+0x133/0x460
[54979.790056]  spa_open_common+0x5e/0x4d0 [zfs]
[54979.791416]  spa_get_stats+0x54/0x530 [zfs]
[54979.792552]  ? __alloc_pages_nodemask+0x18f/0x340
[54979.793716]  ? __kmalloc_node+0x141/0x2b0
[54979.794732]  ? spl_kmem_alloc_impl+0xae/0xf0 [spl]
[54979.795847]  zfs_ioc_pool_stats+0x34/0x80 [zfs]
[54979.797032]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.797817]  ? __kmalloc_node+0x22d/0x2b0
[54979.798645]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.799367]  __x64_sys_ioctl+0x83/0xb0
[54979.799928]  do_syscall_64+0x33/0x80
[54979.800777]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.801421] RIP: 0033:0x7fd85bc11cc7
[54979.802125] RSP: 002b:00007fffc1160c98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.802903] RAX: ffffffffffffffda RBX: 0000000002378860 RCX: 00007fd85bc11cc7
[54979.803771] RDX: 00007fffc1160cc0 RSI: 0000000000005a05 RDI: 0000000000000016
[54979.804735] RBP: 00007fffc11642b0 R08: 0000000003b9c670 R09: 00007fd85bcdbbe0
[54979.805499] R10: 000000000000007e R11: 0000000000000246 R12: 00007fffc1160cc0
[54979.806241] R13: 0000000003b9b1f0 R14: 0000000000000000 R15: 00007fffc11642c4
[54979.807148] INFO: task middlewared (wo:349309 blocked for more than 120 seconds.
[54979.807994]       Tainted: P           OE     5.10.81+truenas #1
[54979.808703] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.809518] task:middlewared (wo state:D stack:    0 pid:349309 ppid:   801 flags:0x00000000
[54979.810366] Call Trace:
[54979.810895]  __schedule+0x282/0x870
[54979.811542]  ? __kmalloc_node+0x141/0x2b0
[54979.812141]  schedule+0x46/0xb0
[54979.812676]  schedule_preempt_disabled+0xa/0x10
[54979.813263]  __mutex_lock.constprop.0+0x133/0x460
[54979.813843]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
[54979.814495]  spa_all_configs+0x41/0x120 [zfs]
[54979.815111]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
[54979.815914]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.816695]  ? __kmalloc_node+0x22d/0x2b0
[54979.817267]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.817785]  __x64_sys_ioctl+0x83/0xb0
[54979.818287]  do_syscall_64+0x33/0x80
[54979.818779]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.819833] RIP: 0033:0x7f2d10bc8cc7
[54979.820424] RSP: 002b:00007ffeffd601a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.821226] RAX: ffffffffffffffda RBX: 000000000362ef70 RCX: 00007f2d10bc8cc7
[54979.821922] RDX: 00007ffeffd601d0 RSI: 0000000000005a04 RDI: 0000000000000016
[54979.822655] RBP: 00007ffeffd637c0 R08: 00000000036303f0 R09: 00007f2d10c92be0
[54979.823351] R10: 0000000000040030 R11: 0000000000000246 R12: 000000000362ef70
[54979.824092] R13: 0000000000000000 R14: 00007ffeffd601d0 R15: 00007f2d0fbb9ca0
[54979.825089] INFO: task middlewared (wo:349470 blocked for more than 120 seconds.
[54979.826045]       Tainted: P           OE     5.10.81+truenas #1
[54979.826954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[54979.828097] task:middlewared (wo state:D stack:    0 pid:349470 ppid:   801 flags:0x00000000
[54979.829364] Call Trace:
[54979.830092]  __schedule+0x282/0x870
[54979.830965]  ? __kmalloc_node+0x141/0x2b0
[54979.831794]  schedule+0x46/0xb0
[54979.832249]  schedule_preempt_disabled+0xa/0x10
[54979.832855]  __mutex_lock.constprop.0+0x133/0x460
[54979.833425]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
[54979.834170]  spa_all_configs+0x41/0x120 [zfs]
[54979.834803]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
[54979.835456]  zfsdev_ioctl_common+0x6bc/0x8e0 [zfs]
[54979.836056]  ? __kmalloc_node+0x22d/0x2b0
[54979.836665]  zfsdev_ioctl+0x53/0xe0 [zfs]
[54979.837177]  __x64_sys_ioctl+0x83/0xb0
[54979.837637]  do_syscall_64+0x33/0x80
[54979.838081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54979.838618] RIP: 0033:0x7f7bf2b01cc7
[54979.839118] RSP: 002b:00007ffcc70da488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[54979.839858] RAX: ffffffffffffffda RBX: 0000000003e30b70 RCX: 00007f7bf2b01cc7
[54979.840530] RDX: 00007ffcc70da4b0 RSI: 0000000000005a04 RDI: 0000000000000016
[54979.841205] RBP: 00007ffcc70ddaa0 R08: 0000000003e31ff0 R09: 00007f7bf2bcbbe0
[54979.841878] R10: 0000000000040030 R11: 0000000000000246 R12: 0000000003e30b70
[54979.842543] R13: 0000000000000000 R14: 00007ffcc70da4b0 R15: 00007f7bf1af2ca0

da-anda commented 1 year ago

Same issue as @power-max on TrueNAS Scale. No idea how to get the system up and running again now since it craps out on startup :(

CaCTuCaTu4ECKuu commented 1 year ago

@da-anda did you try disconnecting the drives before boot? After boot, set the variables (vfs.zfs.recover=1 vfs.zfs.zil.replay_disable=1) and then import the pool.

You may export the pool before connecting the drives and then import it, or go for an import while it says the pool is offline.
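
On TrueNAS SCALE / Linux, the equivalent of those variables would be the module parameters mentioned earlier in this thread, assuming the standard OpenZFS parameter paths:

echo 1 > /sys/module/zfs/parameters/zfs_recover
echo 1 > /sys/module/zfs/parameters/zil_replay_disable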

da-anda commented 1 year ago

@CaCTuCaTu4ECKuu thanks, I have recovered it in the meantime by booting into a live distro (Ubuntu IIRC) and fixing the pool there. Disconnecting the affected drives before boot did not work since it was the boot pool (TrueNAS also uses ZFS for the boot drive for some reason).

CaCTuCaTu4ECKuu commented 1 year ago

Now that you mention it, I should probably make a config backup and schedule it somehow for the boot-pool; I didn't even think the same error could happen to the boot pool. It would be easier and faster to just reinstall, I guess. Maybe this happens mostly to hard drives, because I have a mirrored USB boot pool and it's fine so far, but logically, given the nature of the problem, it shouldn't matter whether the pool is on hard drives or solid state drives.

Delitants commented 1 year ago

same problem on Ubuntu 22

Delitants commented 1 year ago

I'm having this issue as well on Arch Linux with 5.18.9 kernel with zfs built from git @ 74230a5.

I was able to mount the disk as read-write with:

echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover

Scrubbing now.

And how will that scrubbing help you? Once you reboot, that panic comes back again.

CaCTuCaTu4ECKuu commented 1 year ago

Once you reboot, that panic comes back again.

You have to leave it running for several days with those options enabled. Then export, disable the parameters and restart; if the system starts up and seems fine, good, otherwise repeat and leave it be longer.

nbari commented 1 year ago

I am having a similar problem on FreeBSD 13.2:

panic: Solaris(panic): zfs: attempting to increase fill beyond max; probable double add in segment [0:787f7000]
Dump header from device: /dev/nvd0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 4681056256
  Blocksize: 512
  Compression: none
  Dumptime: 2023-06-11 11:44:47 +0000
  Hostname: home
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC
  Panic String: Solaris(panic): zfs: attempting to increase fill beyond max; probable double add in segment [0:787f7000]
  Dump Parity: 2582495081
  Bounds: 0
  Dump Status: good

cyberpower678 commented 1 year ago

I have this same problem now, and none of the tunables are working for me.

cyberpower678 commented 1 year ago

So I had to take an extra step. One of my encrypted datasets got badly corrupted from a send/receive. Since I already suspected the dataset to be badly damaged, I tried to export and re-import the pool with the tunables. I got my pool imported successfully without the keys, nuked the suspicious dataset, and unlocked the remaining ones. I'm back online with my pool. Now I just need to clean it up. Am I right to assume that running zdb -AAA -b tank will help do so?

rincebrain commented 1 year ago

zdb is readonly, it will never fix anything, and if it does, it's a bug.

schmitzkr commented 1 year ago

I also had this error on booting. I followed @cyberpower678's steps and was able to import the pool. In my case I had attempted to delete a dataset that had an (unlocked) encrypted child dataset via the TrueNAS web GUI; this process ran for over 8 hours without completing, at which point I rebooted the system. It is not clear what sort of issues this may have caused, so I am cautiously looking for any detail about unlocking and deleting existing datasets in this pool.

cyberpower678 commented 1 year ago

zdb is readonly, it will never fix anything, and if it does, it's a bug.

Ok, thanks. Is there something I can run that actually does some basic scrub-like repairing?

rincebrain commented 1 year ago

zpool scrub is the best I've got other than someone writing something bespoke, and for this error, I'm not sure what you'd be able to do other than just a readonly import, since if you're trying to insert somewhere already inserted to, then at some point someone is going to have the wrong expectations about that data...
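
For completeness, the basic sequence would be something like the following, assuming a pool named tank; the scrub runs in the background and zpool status reports its progress and any errors it finds:

# zpool scrub tank
# zpool status -v tank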

aiohdfgaiuhg commented 1 year ago

Just wanted to also say that I have been experiencing this with my backup ZFS server, which is the target of many zfs send/recv operations. I believe different datasets have caused different corruption, and I would get this panic. I tried importing from a previous uberblock but no bueno, and I have had to nuke the system twice now and restart the 100 TiB transfer over WAN.

cyberpower678 commented 1 year ago

Just wanted to also say that I have been experiencing this with my backup zfs server that is the target of many zfs send recv and I believe different datasets have caused different corruption and I would get this panic. I had tried importing from a previous uber block but no bueno and I have had to nuke the system twice now and restarting the 100TiB transfer over WAN

Have you tried importing while setting the 4 tunables mentioned further up? If you have an encrypted dataset, have you tried importing without unlocking the datasets while the tunables are set?

jmartinbhiberus commented 10 months ago

@CaCTuCaTu4ECKuu thanks, I have recovered it in the meantime by booting into a live distro (Ubuntu IIRC) and fixed the pool there. And disconnecting affected drives and then boot did not work since it was the boot pool (TrueNAS is also using ZFS for the boot drive for some reason)

Hi @da-anda! I am having the exact same issue that you were having on TrueNAS Scale with the boot-pool. Could you please guide me on how you fixed it from the live CD? Just a zpool scrub command?

da-anda commented 10 months ago

Hi @da-anda! I am having the exact same issue that you were having in Truenas Scale with the boot-pool. Could you please guide me on how did you fix it from Live CD? Just zpool scrub command?

sorry, don't recall which commands I used. Probably simply export and import followed by a scrub.
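
From a live distro that would be roughly the following, assuming the pool in question is the TrueNAS boot-pool; -f forces the import since the pool was last used by another system, and -R keeps its datasets from mounting over the live system's own root:

# zpool import -f -R /mnt boot-pool
# zpool scrub boot-pool
# zpool export boot-pool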

Anticept commented 10 months ago

I can add to this bug. I was running TrueNAS Scale Cobia 23.10.0.1

Hardware is an i5-12400 with 64 GB of PC-3200 RAM (no ECC) on an ASRock Z690 Pro RS, using firmware from after June of this year (don't remember the exact number, but it was recent); I updated again to 17.03 and still had the issue.

Hard drives are 3x 4TB Ironwolf Pros in Z1 config, lz4 compression, no encryption, configured to act as a windows file server. A ZIL/SLOG was also set up using a single 32gb optane m2 nvme drive. Everything was set to sync writes.

I do not know what lead to this issue, but it was absolutely tied to a zvol hosting a windows VM disk.

When I updated from truenas 22.13.3 to 23.10.0.1, I had upgraded the pool feature flags that the zvol was part of (I put it under a "VM_disks" dataset). A few days later, these issues started occuring, and it was very random, and left NO loggable trace. A scrub failed with multiple checksum errors (but it didn't seem tied to any specific files that I could find), but after I cleared the warning and reran the scrub, it came up with no issues.

Today, while I was in the VM and updating it, it would crash. I then ran filesystem checking tools within the VM, and it would crash.

I then changed the compression from lz4 to no compression as a test on the zvol. I then tried an in place reinstallation of the windows VM, and halfway through, it again crashed... and then the host system was stuck in a boot loop of kernel panics displaying issues like this:

PXL_20231113_185141151

The only way I could escape this issue was to pull the drives and mount them as read only so I could start copying data off. I am in the process of blowing away and rebuilding the vdev and are going to put the VM zvols on a different vdev.

da-anda commented 10 months ago

@Anticept I am not sure your TrueNAS crash issue is related to the zvol or ZFS. I am experiencing the very same constant crashes with a Win10 VM running on TrueNAS 23.10, while I had no issues at all on 22.x, and in my case the VM has an NVMe passed through, so it is not running off of any zvol. In case you would like to add additional info to my bug report over at iXsystems, here is the link: https://ixsystems.atlassian.net/browse/NAS-124949. Since I am not running ECC memory (like you), they basically refused to look into the issue. But if more people experience sudden crashes with VMs on 23.10, maybe they will investigate.

Anticept commented 10 months ago

I am not running ECC memory either. But yes, after upgrading to Cobia and the new feature flags, the crashes started.

cyberpower678 commented 10 months ago

I seem to observe that this tends to happen when upgrading ZFS to a newer version and importing datasets created with a significantly older version of ZFS. Something doesn't seem to migrate properly, but it doesn't initially break the current version of ZFS until ZFS gets a few more updates.

Have any of you tried booting with my solution in mind? Are your pools encrypted? Can you try booting with the disks physically disconnected, so you can export the pool and re-import it?

da-anda commented 10 months ago

if you are referring to my recent crashes after updating to TrueNAS 23.10 along with the new ZFS feature flags, then no, my pools are not encrypted. The only "special" thing I have enabled is zstd compression (no dedupe, etc.). I could boot up the system with both of my pools disconnected, though, and test the VM then, if that is what you meant.

edit: but if the crashes were related to ZFS, they should also happen when no VM is running, and that is not the case for me. With no VM and no apps, the system seems to run just fine. But ZFS access is of course also lower when no VM/app is running.

cyberpower678 commented 10 months ago

if you are referring to my recent crashes after updating to TrueNAS 23.10 along with the new ZFS feature flags, then no, my pools are not encrypted. The only "special" thing I have enabled is zstd compression (no dedupe, etc.). I could boot up the system with both of my pools disconnected, though, and test the VM then, if that is what you meant.

edit: but if the crashes were related to ZFS, they should also happen when no VM is running, and that is not the case for me. With no VM and no apps, the system seems to run just fine. But ZFS access is of course also lower when no VM/app is running.

Yes, though I think you might just have a bad dataset that, if not imported, should be fine. In my case, since I have an encrypted pool, I could simply cherry-pick the datasets not to import, and quite easily nuke the bad one and recreate it. I don't know if the range-tree issue happens on import or on mount; TrueNAS does it all in one step anyway. Since you mentioned an issue with the VM running, is the VM perhaps dependent on a particular dataset?
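(For reference, the cherry-picking approach described above maps onto standard commands roughly like this; pool and dataset names are made up for illustration:)

# Import the pool without loading any encryption keys, so no dataset gets unlocked
$ zpool import -f tank
# Unlock and mount only the datasets you trust
$ zfs load-key tank/good
$ zfs mount tank/good
# Once nothing depends on it, remove and recreate the suspect dataset
$ zfs destroy -r tank/bad
$ zfs create tank/bad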

Are you able to set the ZFS recovery values to get past the panic?

[screenshot]
da-anda commented 10 months ago

No, my VM has an NVMe passthrough as its boot drive and does not depend on any dataset, zvol, or anything ZFS-related. I don't even know why it crashes, as there is nothing in the logs (at least I couldn't spot anything). But just to rule out a ZFS issue, I will boot with all drives/pools detached and see what happens.

KodeToad commented 9 months ago

For what it's worth, I had this same problem on my striped NVMe rpool, which began when I was stressing the system while trying out a new 14 TB USB zpool for backups. I worked my way through it thanks mainly to @bitwise0perator. For the record, this is what I did: in the middle of heavy operations I was verifying two backups, one of which was on the new ZFS USB drive. When I returned, some systems had become unusably sluggish, while containers not using the rpool were still fine. I issued a system shutdown, and an hour later I reset the machine, but it would no longer boot, panicking over a duplicate segment in the range tree.

When I finally found this thread I was able to boot a Proxmox installer, switch to a console with <ctrl>-<alt>-<F1>, hit <ctrl>-c, and run:

$ echo 1 > /sys/module/zfs/parameters/zil_replay_disable
$ echo 1 > /sys/module/zfs/parameters/zfs_recover
$ zpool import -f rpool

Phew, I hadn't lost everything, but after another day of working out that it wouldn't repair anything in that state, and with some zvols showing permanent errors, I worked out how to set those parameters as boot options to start the machine with the rpool: boot to the Linux boot manager, press e to edit the command line, and add the following options:

zfs.zil_replay_disable=1 zfs.zfs_recovery=1

I was in! Yay. Now I had to clean up those permanent errors, and I had forgotten to export rpool:

$ zpool import -f rpool
$ zpool scrub rpool
$ zpool status -v rpool

One of those corrupt zvols was the main drive of my regular VM. I reconnected the old USB backup drive I was using and recovered that specific virtual disk from the latest good backup:

$ zfs get volsize rpool/data/vm-101-disk-0
NAME                      PROPERTY  VALUE  SOURCE
rpool/data/vm-101-disk-0  volsize   512G   local
$ zfs destroy rpool/data/vm-101-disk-0
$ zfs create -V 512G rpool/data/vm-101-disk-0
$ proxmox-backup-client restore "vm/101/2023-09-12T02:00:02Z" drive-scsi0.img - --repository localhost:PBS-C-WD14TB | dd of=/dev/zvol/rpool/data/vm-101-disk-0 status=progress

Rebooted that VM and I was stoked that it was away. Now I had to clean up the rest of the permanent errors. New errors kept appearing on that rpool, but they seemed to be contained to new snapshots, so I first stopped all processes which were creating snapshots and large files on that pool, then:

$ zfs destroy rpool/ROOT/pve-1@zfs-auto-snap_frequent-2023-11-15-0910
$ zpool scrub rpool
$ zpool status -v rpool

and repeated until all the new errors had gone, and I repaired another VM with non-critical data. Finally:

Nov 15 20:46:06 pmhost kern.emerg kernel: - [21936.857628] PANIC: zfs: adding existent segment to range tree (offset=9ee8d6d000 size=1540000)

Then that scrub finished and zpool status showed no more errors.

$ zpool clear rpool
$ zpool status -x
all pools are healthy

There were some other adventures I had from mistakes, but I won't get into those. Some of the other seemingly relevant comments I read were:

  1. It must be USB
  2. Linux ZFS isn't production ready

I hope this helps someone.

da-anda commented 9 months ago

FWIW, my issue persists, even without imported zpools. So I don't think my current issue is related to ZFS, unless the boot-pool were corrupted, but it's a clean install of TrueNAS 23.10 on a brand new SSD. I'll try downgrading TrueNAS, but that likely won't work due to the applied ZFS feature flags (albeit those features are not in use at the moment).

cyberpower678 commented 9 months ago

FWIW, my issue persists, even without imported zpools. So I don't think my current issue is related to ZFS, unless the boot-pool were corrupted, but it's a clean install of TrueNAS 23.10 on a brand new SSD. I'll try downgrading TrueNAS, but that likely won't work due to the applied ZFS feature flags (albeit those features are not in use at the moment).

So you've disconnected all of the disks that hold your data pools, leaving only the disk(s) with the boot-pool visible to the system. Sounds like your boot-pool may be effed. Try reinstalling TrueNAS from scratch and reloading your configuration from your config backup. I'd say only your boot-pool may be the issue, so you might just be able to reconnect your disks and let the configuration restore process deal with the rest.

Anticept commented 9 months ago

I can add to this bug. I was running TrueNAS Scale Cobia 23.10.0.1

Hardware is an i5-12400 with 64 GB of PC-3200 RAM (no ECC) on an ASRock Z690 Pro RS, using firmware from after June of this year (I don't remember the exact number, but it was recent); I updated again to 17.03 and still had the issue.

Hard drives are 3x 4 TB IronWolf Pros in a RAIDZ1 config, lz4 compression, no encryption, configured to act as a Windows file server. A ZIL/SLOG was also set up using a single 32 GB Optane M.2 NVMe drive. Everything was set to sync writes.

I do not know what led to this issue, but it was absolutely tied to a zvol hosting a Windows VM disk.

When I updated from TrueNAS 22.13.3 to 23.10.0.1, I had upgraded the feature flags of the pool the zvol was part of (I put it under a "VM_disks" dataset). A few days later, these issues started occurring; they were very random and left no loggable trace. A scrub failed with multiple checksum errors (but they didn't seem tied to any specific files that I could find), yet after I cleared the warning and reran the scrub, it came up with no issues.

Today, while I was in the VM and updating it, it would crash. I then ran filesystem checking tools within the VM, and it would crash.

I then changed the compression on the zvol from lz4 to no compression as a test. I then tried an in-place reinstallation of the Windows VM, and halfway through it crashed again... and then the host system was stuck in a boot loop of kernel panics displaying issues like this:

[photo of the kernel panic screen: PXL_20231113_185141151]

The only way I could escape this issue was to pull the drives and mount them read-only so I could start copying data off. I am in the process of blowing away and rebuilding the vdev and am going to put the VM zvols on a different vdev.

Update on this.

My attempt at using a FRESH install didn't solve the issue; it gradually got worse until it was crashing every half hour.

I also caught messages like this: [screenshot]

I then ran the tools from the manufacturers, and none of these disks had logged any ECC events at all.

I don't think it's ZFS; I think something is bugged in the kernel included with TrueNAS Cobia that doesn't like my particular hardware.

I again moved everything off the drives, rolled back to Bluefin, and everything has been stable again for over a week.

bsiara commented 9 months ago

Today I tried to move our GitLab instance to a ZFS filesystem and I got the same error:

Nov 24 09:44:49 ec1-tools-gitlab zed: eid=40 class=config_sync pool='data1'
Nov 24 09:47:41 ec1-tools-gitlab zed: eid=41 class=data pool='data1' priority=0 err=52 flags=0x808081 bookmark=388:4371443:0:0
Nov 24 09:47:41 ec1-tools-gitlab zed: eid=42 class=checksum pool='data1' vdev=nvme-nvme.1d0f-766f6c3037323832376466636230333133636336-416d617a6f6e20456c617374696320426c6f636b2053746f7265-000
00001-part1 algorithm=fletcher4 size=12288 offset=1444227571712 priority=0 err=52 flags=0x180080 bookmark=388:4371443:0:0
Nov 24 09:47:41 ec1-tools-gitlab zed: eid=43 class=data pool='data1' priority=0 err=52 flags=0x808081 bookmark=388:4371443:0:0
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.932658] PANIC: zfs: adding existent segment to range tree (offset=150426be000 size=3000)
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935648] Showing stack for process 434
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935651] CPU: 5 PID: 434 Comm: txg_sync Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935655] Hardware name: Amazon EC2 c5a.2xlarge/, BIOS 1.0 10/16/2017
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935658] Call Trace:
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935661]  <TASK>
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935665]  dump_stack_lvl+0x48/0x70
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935673]  dump_stack+0x10/0x20
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935678]  vcmn_err+0xd0/0x130 [spl]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935697]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935703]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935874]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.935877]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936029]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936178]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936181]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936329]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936332]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936480]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936483]  ? __slab_free+0xbc/0x340
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936489]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936663]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936833]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936845]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936848]  ? __kmalloc_node+0x54/0x140
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936854]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936857]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936868]  ? srso_return_thunk+0x5/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.936871]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937035]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937195]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937356]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937520]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937684]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.937852]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938015]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938202]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938216]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938228]  kthread+0xcd/0xf0
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938232]  ? __pfx_kthread+0x10/0x10
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938236]  ret_from_fork+0x2c/0x50
Nov 24 09:48:55 ec1-tools-gitlab kernel: [79009.938242]  </TASK>
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.023007] INFO: task txg_sync:434 blocked for more than 120 seconds.
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.025590]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.028197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030902] task:txg_sync        state:D stack:0     pid:434   ppid:2      flags:0x00004000
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030908] Call Trace:
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030911]  <TASK>
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030916]  __schedule+0x2b4/0x5e0
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030926]  schedule+0x5d/0x100
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030955]  vcmn_err+0xdd/0x130 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030969]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.030974]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031141]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031145]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031301]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031451]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031455]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031602]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031606]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031754]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031757]  ? __slab_free+0xbc/0x340
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031764]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.031938]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032107]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032119]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032123]  ? __kmalloc_node+0x54/0x140
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032128]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032132]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032143]  ? srso_return_thunk+0x5/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032147]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032315]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032474]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032634]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.032826]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033035]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033254]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033497]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033733]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033751]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033767]  kthread+0xcd/0xf0
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033773]  ? __pfx_kthread+0x10/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033777]  ret_from_fork+0x2c/0x50
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033785]  </TASK>
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.033789] INFO: task vdev_autotrim:450 blocked for more than 120 seconds.
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.036379]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.039092] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041851] task:vdev_autotrim   state:D stack:0     pid:450   ppid:2      flags:0x00004000
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041858] Call Trace:
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041861]  <TASK>
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041865]  __schedule+0x2b4/0x5e0
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041874]  schedule+0x5d/0x100
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041879]  cv_wait_common+0x107/0x140 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041891]  ? __pfx_autoremove_wake_function+0x10/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041898]  __cv_wait+0x15/0x30 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.041909]  vdev_autotrim_thread+0x640/0x920 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042080]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042235]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042248]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042260]  kthread+0xcd/0xf0
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042264]  ? __pfx_kthread+0x10/0x10
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042268]  ret_from_fork+0x2c/0x50
Nov 24 09:51:09 ec1-tools-gitlab kernel: [79144.042275]  </TASK>
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.850711] INFO: task txg_sync:434 blocked for more than 241 seconds.
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.853054]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.855696] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858451] task:txg_sync        state:D stack:0     pid:434   ppid:2      flags:0x00004000
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858458] Call Trace:
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858462]  <TASK>
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858466]  __schedule+0x2b4/0x5e0
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858477]  schedule+0x5d/0x100
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858482]  vcmn_err+0xdd/0x130 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858497]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858501]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858724]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858728]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.858885]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859035]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859038]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859186]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859189]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859337]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859340]  ? __slab_free+0xbc/0x340
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859347]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859520]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859690]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859702]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859706]  ? __kmalloc_node+0x54/0x140
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859711]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859714]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859725]  ? srso_return_thunk+0x5/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859729]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.859893]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860054]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860216]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860381]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860545]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860714]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.860883]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861056]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861071]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861085]  kthread+0xcd/0xf0
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861089]  ? __pfx_kthread+0x10/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861093]  ret_from_fork+0x2c/0x50
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861101]  </TASK>
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.861105] INFO: task vdev_autotrim:450 blocked for more than 241 seconds.
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.863586]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.869433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875416] task:vdev_autotrim   state:D stack:0     pid:450   ppid:2      flags:0x00004000
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875422] Call Trace:
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875425]  <TASK>
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875428]  __schedule+0x2b4/0x5e0
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875435]  schedule+0x5d/0x100
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875440]  cv_wait_common+0x107/0x140 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875451]  ? __pfx_autoremove_wake_function+0x10/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875457]  __cv_wait+0x15/0x30 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875469]  vdev_autotrim_thread+0x640/0x920 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875632]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875787]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875800]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875812]  kthread+0xcd/0xf0
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875815]  ? __pfx_kthread+0x10/0x10
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875819]  ret_from_fork+0x2c/0x50
Nov 24 09:53:10 ec1-tools-gitlab kernel: [79264.875825]  </TASK>
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.678479] INFO: task txg_sync:434 blocked for more than 362 seconds.
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.682639]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.688756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695142] task:txg_sync        state:D stack:0     pid:434   ppid:2      flags:0x00004000
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695152] Call Trace:
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695156]  <TASK>
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695162]  __schedule+0x2b4/0x5e0
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695174]  schedule+0x5d/0x100
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695181]  vcmn_err+0xdd/0x130 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695198]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695204]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695431]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695436]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695633]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695811]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.695816]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696010]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696015]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696197]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696202]  ? __slab_free+0xbc/0x340
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696210]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696422]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696625]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696640]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696644]  ? __kmalloc_node+0x54/0x140
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696651]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696655]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696669]  ? srso_return_thunk+0x5/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696674]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.696879]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.697087]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.697284]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.697482]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.697719]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.697926]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698143]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698401]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698417]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698432]  kthread+0xcd/0xf0
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698438]  ? __pfx_kthread+0x10/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698443]  ret_from_fork+0x2c/0x50
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698451]  </TASK>
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.698456] INFO: task vdev_autotrim:450 blocked for more than 362 seconds.
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.702796]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.709008] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715421] task:vdev_autotrim   state:D stack:0     pid:450   ppid:2      flags:0x00004000
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715430] Call Trace:
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715434]  <TASK>
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715439]  __schedule+0x2b4/0x5e0
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715450]  schedule+0x5d/0x100
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715457]  cv_wait_common+0x107/0x140 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715471]  ? __pfx_autoremove_wake_function+0x10/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715478]  __cv_wait+0x15/0x30 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715491]  vdev_autotrim_thread+0x640/0x920 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715691]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715882]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715898]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715912]  kthread+0xcd/0xf0
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715917]  ? __pfx_kthread+0x10/0x10
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715922]  ret_from_fork+0x2c/0x50
Nov 24 09:55:11 ec1-tools-gitlab kernel: [79385.715929]  </TASK>
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.510149] INFO: task txg_sync:434 blocked for more than 483 seconds.
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.514163]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.520031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526126] task:txg_sync        state:D stack:0     pid:434   ppid:2      flags:0x00004000
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526134] Call Trace:
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526137]  <TASK>
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526143]  __schedule+0x2b4/0x5e0
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526152]  schedule+0x5d/0x100
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526158]  vcmn_err+0xdd/0x130 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526172]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526176]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526343]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526346]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526498]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526656]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526660]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526835]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.526839]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527021]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527025]  ? __slab_free+0xbc/0x340
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527031]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527214]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527386]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527398]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527402]  ? __kmalloc_node+0x54/0x140
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527407]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527410]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527421]  ? srso_return_thunk+0x5/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527425]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527591]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527761]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.527998]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528218]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528412]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528583]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528757]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528950]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528963]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528975]  kthread+0xcd/0xf0
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528980]  ? __pfx_kthread+0x10/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528983]  ret_from_fork+0x2c/0x50
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528990]  </TASK>
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.528993] INFO: task vdev_autotrim:450 blocked for more than 483 seconds.
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.533139]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.539000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.544990] task:vdev_autotrim   state:D stack:0     pid:450   ppid:2      flags:0x00004000
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.544996] Call Trace:
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.544998]  <TASK>
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545002]  __schedule+0x2b4/0x5e0
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545009]  schedule+0x5d/0x100
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545013]  cv_wait_common+0x107/0x140 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545025]  ? __pfx_autoremove_wake_function+0x10/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545031]  __cv_wait+0x15/0x30 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545042]  vdev_autotrim_thread+0x640/0x920 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545206]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545361]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545374]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545385]  kthread+0xcd/0xf0
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545389]  ? __pfx_kthread+0x10/0x10
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545392]  ret_from_fork+0x2c/0x50
Nov 24 09:57:12 ec1-tools-gitlab kernel: [79506.545398]  </TASK>
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.333876] INFO: task txg_sync:434 blocked for more than 604 seconds.
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.337881]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.343751] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349769] task:txg_sync        state:D stack:0     pid:434   ppid:2      flags:0x00004000
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349776] Call Trace:
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349779]  <TASK>
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349784]  __schedule+0x2b4/0x5e0
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349794]  schedule+0x5d/0x100
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349800]  vcmn_err+0xdd/0x130 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349815]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.349819]  ? bmov+0x17/0x30 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350019]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350022]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350173]  ? bt_grow_leaf+0x17a/0x190 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350322]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350326]  ? bcpy+0x17/0x30 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350474]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350477]  ? zfs_btree_insert_into_leaf+0x26a/0x360 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350632]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350642]  ? __slab_free+0xbc/0x340
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350648]  zfs_panic_recover+0x6d/0xa0 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.350840]  range_tree_add_impl+0x261/0x1090 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351020]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351032]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351035]  ? __kmalloc_node+0x54/0x140
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351041]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351044]  ? spl_kvmalloc+0x9e/0xd0 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351055]  ? srso_return_thunk+0x5/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351059]  ? __pfx_range_tree_add+0x10/0x10 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351222]  range_tree_add+0x11/0x20 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351382]  range_tree_vacate+0x114/0x2b0 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351556]  metaslab_sync_done+0x491/0x550 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351720]  vdev_sync_done+0x3b/0xa0 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.351900]  spa_sync+0x864/0x1070 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352068]  txg_sync_thread+0x219/0x3a0 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352232]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352389]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352402]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352414]  kthread+0xcd/0xf0
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352418]  ? __pfx_kthread+0x10/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352422]  ret_from_fork+0x2c/0x50
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352429]  </TASK>
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.352432] INFO: task vdev_autotrim:450 blocked for more than 604 seconds.
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.356540]       Tainted: P           OE      6.2.0-1016-aws #16~22.04.1-Ubuntu
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.362440] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368426] task:vdev_autotrim   state:D stack:0     pid:450   ppid:2      flags:0x00004000
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368432] Call Trace:
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368434]  <TASK>
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368437]  __schedule+0x2b4/0x5e0
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368444]  schedule+0x5d/0x100
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368449]  cv_wait_common+0x107/0x140 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368460]  ? __pfx_autoremove_wake_function+0x10/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368467]  __cv_wait+0x15/0x30 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368478]  vdev_autotrim_thread+0x640/0x920 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368641]  ? __pfx_vdev_autotrim_thread+0x10/0x10 [zfs]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368806]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368821]  thread_generic_wrapper+0x64/0x80 [spl]
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368837]  kthread+0xcd/0xf0
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368841]  ? __pfx_kthread+0x10/0x10
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368845]  ret_from_fork+0x2c/0x50
Nov 24 09:59:12 ec1-tools-gitlab kernel: [79627.368851]  </TASK>

Ubuntu 22.04 on AWS EC2 c5.2xlarge
kernel: Linux ec1-tools-gitlab 6.2.0-1016-aws #16~22.04.1-Ubuntu SMP Sun Nov 5 20:08:16 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
OpenZFS: 2.2.1, compiled from source:

./configure --bindir=/usr/bin --sbindir=/sbin --libdir=/lib/x86_64-linux-gnu --with-udevdir=/lib/udev --with-zfsexecdir=/usr/lib/zfs-linux --enable-systemd -enable-pyzfs --with-python=python3 --with-pammoduledir=/lib/x86_64-linux-gnu/security --with-pkgconfigdir=/usr/lib/x86_64-linux-gnu/pkgconfig --with-systemdunitdir=/lib/systemd/system --with-systemdpresetdir=/lib/systemd/system-preset --with-systemdgeneratordir=/lib/systemd/system-generators
make -j1 deb-utils deb-kmod
apt-get install --fix-missing ./*.deb

zpool and dataset info:

     zpool_list:
        - name: data1
          disk: [nvme1n1]
          mode:
          zpoptions: ["ashift=12", "autotrim=on"]
          zpfsoptions: ["acltype=posixacl", "xattr=sa"]
      zfsd_list:
        - name: gitlab
          pool: data1
          zdoptions: ["recordsize=32k", "compress=lz4", "atime=off", "mountpoint=/opt/gitlab"]
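(For reference, that config roughly corresponds to creation commands like the following, inferred from the options listed above; this is a sketch, not the exact tooling that was used:)

$ zpool create -o ashift=12 -o autotrim=on \
      -O acltype=posixacl -O xattr=sa \
      data1 nvme1n1
$ zfs create -o recordsize=32k -o compression=lz4 -o atime=off \
      -o mountpoint=/opt/gitlab data1/gitlab
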
amotin commented 9 months ago

@Anticept With your system having non-ECC RAM, check that it is not memory corruption; run a good, long memory test before it scrambles your pool beyond recovery. It already looks like you have space map corruption, and your only hope is likely to import the pool read-only and evacuate the data while you can.
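(A sketch of that evacuation path; pool, snapshot, and destination names below are placeholders:)

# Import read-only so nothing further is written to the damaged pool
$ zpool import -o readonly=on -R /mnt/rescue -f mypool
# Replicate an existing snapshot to a healthy pool (new snapshots cannot be
# taken while the pool is read-only)...
$ zfs send -R mypool/data@last-good | zfs receive -u backuppool/data
# ...or simply copy the files out of the read-only mount
$ rsync -aHAX /mnt/rescue/data/ /backup/data/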

Anticept commented 9 months ago

@amotin It's not a memory issue either. Already been down that road.

PS: I also use a 32 GB Optane NVMe as a SLOG device, in case that's relevant.

I had no issues for over a year, then as soon as I grabbed TrueNAS Cobia, everything went haywire. Lost a VM image (inconsequential, it just had a jump server on it), got stuck in a boot loop, etc. Ran tests on the hardware, grabbed manufacturer tools to examine the disks, etc. I mounted it read-only and exported the data.

It was really strange, because after upgrading to Cobia I ran tests and it all seemed stable at first.

Before moving back to Bluefin, I created a fresh Cobia install, wiped the array and made it fresh, and copied everything back to it, and it remained stable even though I was pounding it with file copy operations to fill the disks back up as fast as it would go. But once the next day came around and we had 10+ people hitting it through the SMB service, that's when it got really unstable, crashing as often as every half hour.

I said I'm done with this and went back to Bluefin 2.4.2. It's been 100% issue-free.

Got another system with ECC RAM coming; we're going to run two file servers, one as a standby, as we can't afford to have it going offline like that.

oukb commented 9 months ago

After updating to 2.2.1 I tried importing with zfs_dmu_offset_next_sync=0, but the import still failed with this error:

WARNING: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)
ZFS: Unloaded module v2.2.0-1_g459c99ff2
ZFS: Loaded module v2.2.1-1, ZFS pool version 5000, ZFS filesystem version 5
PANIC: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)
Showing stack for process 4052008
CPU: 27 PID: 4052008 Comm: z_wr_iss Tainted: P           OE      6.5.11-300.fc39.x86_64 #1
Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.6a 09/27/2023
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x60
vcmn_err+0xdf/0x120 [spl]
zfs_panic_recover+0x79/0xa0 [zfs]
range_tree_add_impl+0x28f/0xea0 [zfs]
? range_tree_remove_impl+0x55d/0xf50 [zfs]
space_map_load_callback+0x59/0x90 [zfs]
space_map_iterate+0x195/0x410 [zfs]
? __pfx_space_map_load_callback+0x10/0x10 [zfs]
space_map_load_length+0x7f/0x100 [zfs]
metaslab_load+0x184/0x9d0 [zfs]
? __kmem_cache_alloc_node+0x19d/0x340
? spl_kmem_alloc+0x116/0x130 [spl]
metaslab_activate+0x3b/0x280 [zfs]
? metaslab_set_selected_txg+0x94/0xd0 [zfs]
metaslab_alloc_dva+0x64f/0x12d0 [zfs]
metaslab_alloc+0xe2/0x290 [zfs]
zio_dva_allocate+0xc4/0x8d0 [zfs]
? kmem_cache_free+0x22/0x3a0
? zio_io_to_allocate+0x63/0x90 [zfs]
zio_execute+0x84/0x120 [zfs]
taskq_thread+0x2c0/0x4e0 [spl]
? __pfx_default_wake_function+0x10/0x10
? __pfx_zio_execute+0x10/0x10 [zfs]
? __pfx_taskq_thread+0x10/0x10 [spl]
kthread+0xe5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

https://dpaste.com/CYKF9JA8A

kernel: PANIC: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)
kernel: Showing stack for process 8806
kernel: CPU: 17 PID: 8806 Comm: z_metaslab Tainted: P           OE      6.6.2-201.fc39.x86_64 #1
kernel: Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.6a 09/27/2023
kernel: Call Trace:
kernel: <TASK>
kernel: dump_stack_lvl+0x47/0x60
kernel: vcmn_err+0xdf/0x120 [spl]
kernel: zfs_panic_recover+0x79/0xa0 [zfs]
kernel: range_tree_add_impl+0x28f/0xea0 [zfs]
kernel: ? range_tree_remove_impl+0x55d/0xf50 [zfs]
kernel: space_map_load_callback+0x59/0x90 [zfs]
kernel: space_map_iterate+0x195/0x410 [zfs]
kernel: ? __pfx_space_map_load_callback+0x10/0x10 [zfs]
kernel: space_map_load_length+0x7f/0x100 [zfs]
kernel: metaslab_load+0x184/0x9d0 [zfs]
kernel: ? spl_kmem_alloc+0x116/0x130 [spl]
kernel: metaslab_preload+0x50/0xa0 [zfs]
kernel: taskq_thread+0x2c0/0x4e0 [spl]
kernel: ? __pfx_default_wake_function+0x10/0x10
kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
kernel: kthread+0xe5/0x120
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x31/0x50
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1b/0x30
kernel: </TASK>

kernel:PANIC: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)
kernel: PANIC: zfs: adding existent segment to range tree (offset=a18422c6000 size=4a000)
kernel: Showing stack for process 105541
kernel: CPU: 14 PID: 105541 Comm: z_wr_iss Tainted: P           OE      6.5.12-300.fc39.x86_64 #1
kernel: Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.6a 09/27/2023
kernel: Call Trace:
kernel: <TASK>
kernel: dump_stack_lvl+0x47/0x60
kernel: vcmn_err+0xdf/0x120 [spl]
kernel: zfs_panic_recover+0x79/0xa0 [zfs]
kernel: range_tree_add_impl+0x28f/0xea0 [zfs]
kernel: ? range_tree_remove_impl+0x55d/0xf50 [zfs]
kernel: space_map_load_callback+0x59/0x90 [zfs]
kernel: space_map_iterate+0x195/0x410 [zfs]
kernel: ? __pfx_space_map_load_callback+0x10/0x10 [zfs]
kernel: space_map_load_length+0x7f/0x100 [zfs]
kernel: metaslab_load+0x184/0x9d0 [zfs]
kernel: ? __kmem_cache_alloc_node+0x19d/0x340
kernel: ? spl_kmem_alloc+0x116/0x130 [spl]
kernel: metaslab_activate+0x3b/0x280 [zfs]
kernel: ? metaslab_set_selected_txg+0x94/0xd0 [zfs]
kernel: metaslab_alloc_dva+0x64f/0x12d0 [zfs]
kernel: metaslab_alloc+0xe2/0x290 [zfs]
kernel: zio_dva_allocate+0xc4/0x8d0 [zfs]
kernel: ? spl_kmem_alloc+0x116/0x130 [spl]
kernel: ? __kmalloc_node+0x50/0x150
kernel: ? zio_io_to_allocate+0x63/0x90 [zfs]
kernel: zio_execute+0x84/0x120 [zfs]
kernel: taskq_thread+0x2c0/0x4e0 [spl]
kernel: ? __pfx_default_wake_function+0x10/0x10
kernel: ? __pfx_zio_execute+0x10/0x10 [zfs]
kernel: ? __pfx_taskq_thread+0x10/0x10 [spl]
kernel: kthread+0xe5/0x120
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x31/0x50
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1b/0x30
kernel: </TASK>

A scrub doesn't help.
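(For completeness, that tunable is set like any other ZFS module parameter; an illustrative sketch only, not a recommendation for this particular corruption:)

# Runtime, before importing the pool
$ echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
# Or persistently, so it survives a reboot
$ echo "options zfs zfs_dmu_offset_next_sync=0" >> /etc/modprobe.d/zfs.conf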

ronisbr commented 8 months ago

Hi!

Just for the record.

Yesterday I had the same problem on FreeBSD. A pool import causes a kernel panic with the message:

panic: Solaris(panic) zfs: adding existent segment to range tree (offset=19c06284000 size=1e000)

It started after a power outage. Here is the screenshot I took of the backtrace:

[screenshot of the backtrace, taken 2023-12-23 11:13]

I could log into the system using single-user mode (which seems not to import any pool besides zroot). In that state, following the suggestions here, I managed to mount the pool read-write and ran a scrub. It finished without errors, but the problem was not solved.
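(On FreeBSD, the Linux module parameters mentioned earlier in this thread map to loader tunables; the names below are my understanding for OpenZFS 2.x on FreeBSD and are worth double-checking with sysctl -d before relying on them, and the pool name is a placeholder:)

# In /boot/loader.conf (or set at the loader prompt) before importing:
vfs.zfs.recover="1"
vfs.zfs.zil.replay_disable="1"

# Then, from single-user mode, try a read-only import first
$ zpool import -o readonly=on -f mypool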

Since that was a mission-critical machine, I just deleted the pool, created a new one, and restored the backups. This pool only contains jails and bhyve virtual machines.

Does anyone have any idea what is happening? There are some reports on the internet about a similar problem.