openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Unable to import pool: PANIC at dsl_deadlist.c:308:dsl_deadlist_open() #16352

Open doug-last opened 1 month ago

doug-last commented 1 month ago

System information

Running system: Proxmox Virtual Environment 8.2.4
Pool created on: TrueNAS

uname -a: Linux proxmox 6.8.8-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-2 (2024-06-24T09:00Z) x86_64 GNU/Linux
OpenZFS version: zfs-2.2.4-pve1 zfs-kmod-2.2.4-pve1

Describe the problem you're observing

I'm unable to import a pool with zpool import: on TrueNAS the import results in a reboot; on Proxmox it panics and freezes the ZFS tools.

Describe how to reproduce the problem


root@proxmox:~# zpool import 
   pool: pool 1
     id: 13738970627573752325
  state: ONLINE
status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and
    the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
 config:

    pool 1                                    ONLINE
      raidz1-0                                ONLINE
        1c3ea2d4-0ce6-421c-a5a3-b7b913129025  ONLINE
        2cb01bf1-8dd1-4239-99cb-7d9cebbe3194  ONLINE
        f14cdf51-ef82-44ed-9c84-fb852ff4a9dc  ONLINE
root@proxmox:~# zpool import "pool 1"
cannot import 'pool 1': pool was previously in use from another system.
Last accessed by truenas (hostid=29dcd2cc) at Thu Jul  4 12:05:51 2024
The pool can be imported, use 'zpool import -f' to import the pool.

Then, using 'zpool import -f "pool 1"' as instructed above, this warning appears on the bare-metal monitor (not on the frozen SSH console):

VERIFY0(dmu_bonus_hold(os, object, dl, &dl->dl_dbuf)) failed (0 == 52)
PANIC at dsl_deadlist.c:308:dsl_deadlist_open()

on dmesg:

[   99.968789] VERIFY0(dmu_bonus_hold(os, object, dl, &dl->dl_dbuf)) failed (0 == 52)
[   99.968827] PANIC at dsl_deadlist.c:308:dsl_deadlist_open()
[   99.968843] Showing stack for process 1769
[   99.968846] CPU: 0 PID: 1769 Comm: dmu_objset_find Tainted: P           O       6.8.8-2-pve #1
[   99.968849] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A55M-HVS, BIOS P1.20 11/02/2011
[   99.968852] Call Trace:
[   99.968855]  <TASK>
[   99.968859]  dump_stack_lvl+0x76/0xa0
[   99.968868]  dump_stack+0x10/0x20
[   99.968871]  spl_dumpstack+0x29/0x40 [spl]
[   99.968895]  spl_panic+0xfc/0x120 [spl]
[   99.968916]  ? dnode_hold+0x1b/0x30 [zfs]
[   99.969307]  dsl_deadlist_open+0x168/0x180 [zfs]
[   99.969632]  dsl_dataset_hold_obj+0x686/0xb50 [zfs]
[   99.969959]  dmu_objset_find_dp_impl+0x145/0x400 [zfs]
[   99.970280]  dmu_objset_find_dp_cb+0x2a/0x50 [zfs]
[   99.970600]  taskq_thread+0x282/0x4c0 [spl]
[   99.970623]  ? finish_task_switch.isra.0+0x8c/0x310
[   99.970629]  ? __pfx_default_wake_function+0x10/0x10
[   99.970634]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[   99.970653]  kthread+0xf2/0x120
[   99.970656]  ? __pfx_kthread+0x10/0x10
[   99.970659]  ret_from_fork+0x47/0x70
[   99.970663]  ? __pfx_kthread+0x10/0x10
[   99.970665]  ret_from_fork_asm+0x1b/0x30
[   99.970669]  </TASK>
[  100.040282] WARNING: can't open objset 2011, error 5
[  100.040503] WARNING: can't open objset 2199, error 5
[  247.082554] INFO: task zpool:1641 blocked for more than 122 seconds.
[  247.082567]       Tainted: P           O       6.8.8-2-pve #1
[  247.082570] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.082571] task:zpool           state:D stack:0     pid:1641  tgid:1641  ppid:1420   flags:0x00004002
[  247.082578] Call Trace:
[  247.082581]  <TASK>
[  247.082586]  __schedule+0x401/0x15e0
[  247.082595]  schedule+0x33/0x110
[  247.082600]  taskq_wait+0xb8/0x100 [spl]
[  247.082625]  ? __pfx_autoremove_wake_function+0x10/0x10
[  247.082631]  dmu_objset_find_dp+0x17a/0x250 [zfs]
[  247.083020]  ? __pfx_zil_check_log_chain+0x10/0x10 [zfs]
[  247.083334]  spa_load+0x161d/0x1a30 [zfs]
[  247.083684]  ? __dprintf+0x12e/0x1d0 [zfs]
[  247.084041]  spa_load_best+0x57/0x2c0 [zfs]
[  247.084372]  spa_import+0x234/0x6d0 [zfs]
[  247.084712]  zfs_ioc_pool_import+0x163/0x180 [zfs]
[  247.085062]  zfsdev_ioctl_common+0x8a1/0x9f0 [zfs]
[  247.085379]  ? __check_object_size+0x9d/0x300
[  247.085384]  zfsdev_ioctl+0x57/0xf0 [zfs]
[  247.085712]  __x64_sys_ioctl+0xa3/0xf0
[  247.085718]  x64_sys_call+0xa68/0x24b0
[  247.085722]  do_syscall_64+0x81/0x170
[  247.085726]  ? task_work_add+0x8b/0xc0
[  247.085730]  ? filp_flush+0x57/0x90
[  247.085733]  ? fput+0x4f/0x130
[  247.085740]  ? vsnprintf+0x2c1/0x530
[  247.085743]  ? __mod_memcg_state+0x71/0x130
[  247.085748]  ? refill_stock+0x2a/0x50
[  247.085751]  ? obj_cgroup_uncharge_pages+0x71/0xf0
[  247.085756]  ? __memcg_slab_free_hook+0x115/0x180
[  247.085759]  ? __fput+0x15e/0x2e0
[  247.085762]  ? kmem_cache_free+0x36c/0x3f0
[  247.085766]  ? __fput+0x15e/0x2e0
[  247.085770]  ? syscall_exit_to_user_mode+0x89/0x260
[  247.085774]  ? do_syscall_64+0x8d/0x170
[  247.085777]  ? __rseq_handle_notify_resume+0xa5/0x4d0
[  247.085781]  ? syscall_exit_to_user_mode+0x89/0x260
[  247.085784]  ? do_syscall_64+0x8d/0x170
[  247.085787]  ? irqentry_exit+0x43/0x50
[  247.085790]  ? exc_page_fault+0x94/0x1b0
[  247.085793]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[  247.085797] RIP: 0033:0x7dc26c037c5b
[  247.085816] RSP: 002b:00007ffd73069d60 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  247.085819] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007dc26c037c5b
[  247.085821] RDX: 00007ffd73069e20 RSI: 0000000000005a02 RDI: 0000000000000003
[  247.085823] RBP: 00007ffd7306dd10 R08: 00007dc26c10d460 R09: 00007dc26c10d460
[  247.085825] R10: 0000000000000000 R11: 0000000000000246 R12: 00006056e643f950
[  247.085826] R13: 00007ffd73069e20 R14: 00006056e6449660 R15: 00006056e6529bf0
[  247.085829]  </TASK>
[  247.085844] INFO: task dmu_objset_find:1769 blocked for more than 122 seconds.
[  247.085848]       Tainted: P           O       6.8.8-2-pve #1
[  247.085849] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.085851] task:dmu_objset_find state:D stack:0     pid:1769  tgid:1769  ppid:2      flags:0x00004000
[  247.085855] Call Trace:
[  247.085857]  <TASK>
[  247.085860]  __schedule+0x401/0x15e0
[  247.085864]  schedule+0x33/0x110
[  247.085867]  spl_panic+0x112/0x120 [spl]
[  247.085892]  ? dnode_hold+0x1b/0x30 [zfs]
[  247.086273]  dsl_deadlist_open+0x168/0x180 [zfs]
[  247.086614]  dsl_dataset_hold_obj+0x686/0xb50 [zfs]
[  247.086965]  dmu_objset_find_dp_impl+0x145/0x400 [zfs]
[  247.087299]  dmu_objset_find_dp_cb+0x2a/0x50 [zfs]
[  247.087636]  taskq_thread+0x282/0x4c0 [spl]
[  247.087658]  ? finish_task_switch.isra.0+0x8c/0x310
[  247.087665]  ? __pfx_default_wake_function+0x10/0x10
[  247.087671]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  247.087689]  kthread+0xf2/0x120
[  247.087693]  ? __pfx_kthread+0x10/0x10
[  247.087696]  ret_from_fork+0x47/0x70
[  247.087700]  ? __pfx_kthread+0x10/0x10
[  247.087703]  ret_from_fork_asm+0x1b/0x30
[  247.087707]  </TASK>
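
For what it's worth, the error numbers are a hint: in OpenZFS on Linux, error 52 appears to map to ECKSUM (EBADE), so the deadlist object that dmu_bonus_hold() reads seems to fail checksum verification, while the later "error 5" lines are plain EIO. Below is a rough sketch of non-destructive checks that can be attempted before forcing a read-write import (standard zpool/zdb options; whether they get past the panic depends on where the corruption sits):

# Read-only import: no log replay, nothing is written to the pool
zpool import -f -o readonly=on 'pool 1'

# Dry-run rewind: reports whether discarding the last few txgs would make the pool importable
zpool import -f -F -n 'pool 1'

# List the datasets/objsets of the exported pool without importing it
zdb -e -d 'pool 1'
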
giucam commented 1 month ago

I'm also encountering this problem after a power failure. @doug-last, did you have any luck importing your pool?

doug-last commented 1 month ago

I'm also encountering this problem after a power failure. @doug-last, did you have any luck importing your pool?

No luck. I also tested with a few random live-boot images, with no luck either. I'm waiting and hoping for this to be fixed in an update, since this was my backup and storage system and I have no backup of the backup (nowhere else to back up to). I expected to at least be able to read the contents in case of a failure.

giucam commented 1 month ago

@doug-last FYI, even though I cannot import my pool, I'm having luck recovering the data with this script: https://gist.github.com/giucam/d2a43b2dd6ca918fcfcaf07a45485df7

recover.sh dir/in/pool recovered/
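
As a manual alternative to the script, individual files can be pulled out of an exported (not imported) pool with zdb alone; a minimal sketch, assuming the -O/-r options present in OpenZFS 2.x zdb, with dataset and path names below as placeholders:

# Look up / inspect a path inside a dataset of the exported pool
zdb -e -O 'pool 1/some-dataset' 'dir/in/pool'

# Copy a single file from the exported pool to a local destination
zdb -e -r 'pool 1/some-dataset' 'dir/in/pool/file.bin' /recovered/file.bin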

doug-last commented 3 weeks ago

@doug-last FYI, even though I cannot import my pool, I'm having luck recovering the data with this script: https://gist.github.com/giucam/d2a43b2dd6ca918fcfcaf07a45485df7

recover.sh dir/in/pool recovered/

Didn't work for me; even with zdb I'm getting the same error:

root@proxmox:/# zdb -e -O 'pool 1' '/'
dmu_bonus_hold(os, object, dl, &dl->dl_dbuf) == 0 (0x34 == 0)
ASSERT at module/zfs/dsl_deadlist.c:308:dsl_deadlist_open()
Aborted

giucam commented 2 weeks ago

@doug-last Right, I had also made some changes to the ZFS code to avoid the crashes: https://gist.github.com/giucam/d2a43b2dd6ca918fcfcaf07a45485df7#file-0001-make-it-not-crash-patch You can apply that patch on top of the zfs 2.2.4 tag. You don't need to replace the kernel module, just the userspace binaries.
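
For reference, roughly the build steps (the patch filename is assumed from the gist's file name, and the configure option builds only the userspace tools so the running kernel module stays untouched):

git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.2.4
git apply 0001-make-it-not-crash.patch   # downloaded from the gist above
sh autogen.sh
./configure --with-config=user           # userspace tools only, no kernel module
make -j"$(nproc)"
sudo make install                        # installs to /usr/local by default; adjust --prefix if needed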

doug-last commented 1 day ago

@doug-last Right, I had also made some changes to the ZFS code to avoid the crashes: https://gist.github.com/giucam/d2a43b2dd6ca918fcfcaf07a45485df7#file-0001-make-it-not-crash-patch You can apply that patch on top of the zfs 2.2.4 tag. You don't need to replace the kernel module, just the userspace binaries.

Applied the patch and compiled the whole thing. zdb:

sudo zdb -e -O 'pool 1' '/' -v
failed to lookup dataset=pool 1 path=/: No such file or directory

(I also tried a few folders that I could remember, same result.) zpool import:

sudo zpool import -f 'pool 1'
cannot import 'pool 1': I/O error
        Destroy and re-create the pool from
        a backup source.

on dmesg:

[set11 15:03] WARNING: can't open objset 2011, error 5
[  +0,000103] WARNING: can't open objset 2199, error 5 
[  +0,441889] WARNING: can't open objset 94, error 5 
...repeats...