Yeah, see where it says
Jan 24 15:04:50 atuin kernel: [ 3740.917885] VERIFY3(range_tree_space(smla->smla_rt) + sme->sme_run <= smla->smla_sm->sm_size) failed (17179877376 <= 17179869184)
Jan 24 15:04:50 atuin kernel: [ 3740.919910] PANIC at space_map.c:405:space_map_load_callback()
?
At that point, Progress Is Done - the kernel thread has died, it will not be coming back, and all locks that it was holding are forever taken.
Consequently, it's unsurprising that large swathes of things are not responsive once it happens.
(FWIW, I've never found -F/-X useful - I'm not sure why an automated emergency rewind should have much worse success than doing the rewind manually, but it does, and I've not looked into why. So you might want to experiment with grabbing a list of txgs still visible with e.g. zdb -lu and then trying manually specifying -T. You probably also want to adjust spa_load_verify_metadata if you don't want it to scan every single block for sanity checking before importing. Note that, as with -F and -X, this can be quite destructive, so, you know, last resort, make sure all the data you could get off is off or you have whole-disk backups, etc.
It may still panic on a number of the possible -T values, so reboots between attempts may be necessary. I'd start at the most recent and work backward. Normally I'd also suggest doing it with -o readonly=on, but it sounds like readonly import already works with the latest version, so importing readonly at a past txg wouldn't be a strict improvement... you could still try importing readonly before readwrite for each value, though.)
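Roughly, the sequence I'm describing might look something like this - the device path and pool name are placeholders, and as above, treat -T as a last resort:

zdb -lu /dev/disk/by-id/ata-EXAMPLE-part1 | grep txg    # show the txg values recorded in the labels and uberblocks
zpool import -o readonly=on -T <txg> -d /dev/disk/by-id poolname    # try a specific txg, readonly first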
Edit: Huh. #11923 looks like a similar case, on a much older Proxmox ZFS version, which got little attention. In #11691, they may have had hardware issues? Story unclear. #10942 was a reproducible bug with, IIUC, inconsistency between mirror legs being handled incorrectly, but it seems to have been fixed, in theory, and the fix is in 2.1.0 and newer. (There are others, but that'll do as a starting point...)
Hi @rincebrain and thanks for the quick answer.
I am in the process of doing a readonly import with -T on the latest txg given by zdb -lu. It's been almost an hour now (for a 2 TB mirror with around a quarter actually used), but still no kernel panic at the moment, and I see movement on the disks with iostat.
I guess the fact that it's taking so long is a consequence of not tinkering with spa_load_verify_metadata? I don't know what you're referring to, so I preferred not to touch it. I still have the following messages in syslog, but no PANIC message seen. Should I consider it hung, or should I let it run?
Jan 24 20:52:52 atuin kernel: [ 4834.285614] INFO: task zpool:23312 blocked for more than 1087 seconds.
Jan 24 20:52:52 atuin kernel: [ 4834.286069] Tainted: P IO 5.13.19-1-pve #1
Jan 24 20:52:52 atuin kernel: [ 4834.286507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 24 20:52:52 atuin kernel: [ 4834.286954] task:zpool state:D stack: 0 pid:23312 ppid: 11546 flags:0x00000004
Jan 24 20:52:52 atuin kernel: [ 4834.287416] Call Trace:
Jan 24 20:52:52 atuin kernel: [ 4834.287872] __schedule+0x2fa/0x910
Jan 24 20:52:52 atuin kernel: [ 4834.288334] schedule+0x4f/0xc0
Jan 24 20:52:52 atuin kernel: [ 4834.288788] schedule_preempt_disabled+0xe/0x10
Jan 24 20:52:52 atuin kernel: [ 4834.289242] __mutex_lock.constprop.0+0x305/0x4d0
Jan 24 20:52:52 atuin kernel: [ 4834.289769] __mutex_lock_slowpath+0x13/0x20
Jan 24 20:52:52 atuin kernel: [ 4834.290278] mutex_lock+0x34/0x40
Jan 24 20:52:52 atuin kernel: [ 4834.290714] spa_all_configs+0x4a/0x120 [zfs]
Jan 24 20:52:52 atuin kernel: [ 4834.291247] zfs_ioc_pool_configs+0x1c/0x70 [zfs]
Jan 24 20:52:52 atuin kernel: [ 4834.291781] zfsdev_ioctl_common+0x752/0x9b0 [zfs]
Jan 24 20:52:52 atuin kernel: [ 4834.292307] ? __kmalloc_node+0x276/0x300
Jan 24 20:52:52 atuin kernel: [ 4834.292749] ? _copy_from_user+0x2e/0x60
Jan 24 20:52:52 atuin kernel: [ 4834.293184] zfsdev_ioctl+0x57/0xe0 [zfs]
Jan 24 20:52:52 atuin kernel: [ 4834.293738] __x64_sys_ioctl+0x91/0xc0
Jan 24 20:52:52 atuin kernel: [ 4834.294368] do_syscall_64+0x61/0xb0
Jan 24 20:52:52 atuin kernel: [ 4834.294779] ? handle_mm_fault+0xda/0x2c0
Jan 24 20:52:52 atuin kernel: [ 4834.295189] ? exit_to_user_mode_prepare+0x37/0x1b0
Jan 24 20:52:52 atuin kernel: [ 4834.295610] ? irqentry_exit_to_user_mode+0x9/0x20
Jan 24 20:52:52 atuin kernel: [ 4834.296050] ? irqentry_exit+0x19/0x30
Jan 24 20:52:52 atuin kernel: [ 4834.296484] ? exc_page_fault+0x8f/0x170
Jan 24 20:52:52 atuin kernel: [ 4834.296918] ? asm_exc_page_fault+0x8/0x30
Jan 24 20:52:52 atuin kernel: [ 4834.297353] entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 24 20:52:52 atuin kernel: [ 4834.297823] RIP: 0033:0x7fc77c8d7cc7
Jan 24 20:52:52 atuin kernel: [ 4834.298279] RSP: 002b:00007ffc19a69608 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 24 20:52:52 atuin kernel: [ 4834.298734] RAX: ffffffffffffffda RBX: 000055720bbca570 RCX: 00007fc77c8d7cc7
Jan 24 20:52:52 atuin kernel: [ 4834.299170] RDX: 00007ffc19a69630 RSI: 0000000000005a04 RDI: 0000000000000003
Jan 24 20:52:52 atuin kernel: [ 4834.299609] RBP: 00007ffc19a6cc20 R08: 00007fc77c168010 R09: 0000000000000000
Jan 24 20:52:52 atuin kernel: [ 4834.300045] R10: 0000000000000022 R11: 0000000000000246 R12: 000055720bbca570
Jan 24 20:52:52 atuin kernel: [ 4834.300484] R13: 0000000000000000 R14: 00007ffc19a69630 R15: 0000000000000000
Edit: a few minutes after crying, it eventually mounted the zpool readonly! I followed with a zpool export and tried to reimport with write access, but it panicked right after. I'm wondering whether I should have added the -T argument again, or whether the rollback should already have been done once I accessed it readonly?
Without adjusting that parameter, it will literally iterate over every metadata and data block referenced in that txg before importing it, which is why I suggested you might not want to wait on it doing that.
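If you want to skip that traversal, the knobs are module parameters; something like this before the import (echoing 1 turns the checks back on afterwards) - a sketch, not a recommendation to leave them off permanently:

echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata    # skip the metadata traversal during import
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data        # skip the data block traversal as well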
I read some examples of people using those parameters (spa_load_verify_data & spa_load_verify_metadata) and tried them, but so far the result is the same: no problem read-only, kernel panic with write access. I am currently running the fourth attempt, going back in txg time.
Something I can't explain is that when importing readonly, I see between 5 and 10 MB read/s, but as soon as I do an import without readonly=on, the speed drops to under 1 MB/s. I don't know if that's expected, but it seems very low for SATA 3 disks.
I have new disks arriving tomorrow in case the issue is a hardware one; could you suggest a procedure to transfer the data into a new pool? Can I just mount it readonly and then do a zfs send to a pool created on the other disks?
MB/s read according to what, and while it's doing what? I'd suggest looking at iostat -x and seeing what it's reporting utilization as on the disks while it's doing anything at all (e.g. not PANIC'd).
It's mostly going to be random IO. SATA 3 can push plenty of bytes, but spinning rust seeking around is going to be your bottleneck there.
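Something like the following, sampled every few seconds, shows per-device utilization and average wait (device names are placeholders for your two mirror legs):

iostat -x sda sdb 5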
Your options are the normal ones for copying files around - cp, rsync, etc, plus you could use zfs send/recv to send from your old pool to the new one, complicated by the fact that you can't take an actual snapshot on the source pool, so you'd have to rely on the ability to do zfs send from readonly things, and the limitations (e.g. no resumable send/recv, I think no -R...) that come with it.
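As a non-authoritative sketch of the plain-file route (mountpoints are placeholders; -aHAX preserves hardlinks, ACLs and xattrs):

rsync -aHAX --info=progress2 /oldpool/data/ /newpool/data/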
I've got no special advice for avoiding this problem going forward, since the cases that are documented for why it might arise are theoretically fixed...
After adding two new disks to the server, I managed to create a new zpool. I was afterwards able to zfs send from the readonly pool to the new zpool and access my files (yipee). However, I now have a bit of a conundrum: should I just format the disks from the old zpool and use them for something else / a new zpool? Or do you think I can send some more logs or debug information that could help understand what's going on?
My off the cuff answer would be to think of that pool like the Fifth Elephant - only good for mining resources from the body, now. :P
Less pithily, I can't speak for anyone else, but I personally have no brilliant ideas for digging into what went wrong, so if you really wanted to do that, I'd probably take the sparsest disk image I could of one of the drives and only keep that. (Admittedly, using something like zpool initialize or zpool trim for this probably won't fly because I bet that won't run on read-only imports...)
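For example, something along these lines could capture one mirror leg as a sparse image to keep around (paths are placeholders, and the pool shouldn't be imported while you do it):

dd if=/dev/disk/by-id/ata-OLD-DISK of=/newpool/old-disk.img bs=1M conv=sparse status=progress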
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I have this issue with v2.1.12
System information
Describe the problem you're observing
Hello all, I am running a Proxmox server on a HP Proliant Microserver Gen8 with the latest kernel 5.13.19-3-pve. For two days now, I have been unable to import my discworld pool without crashing the server! It is a simple two-disk mirrored configuration. I was able to import the pool readonly (zpool import -o readonly=on -d /dev/sdxx) and make a copy of all the data; I don't however have snapshots available. When trying to import the pool with write access, it hangs all zfs-related commands (zpool status, for example). I also tried zpool import -F -X -a -d /dev/disk/by-id, without much success either. The zpool import command is unresponsive and doesn't terminate on SIGTERM or SIGKILL. Looking at the disk activity with iostat doesn't show tremendous activity on the pool disks (but they are not dead, as I have full read-only access). I'll include the trace seen with dmesg below. Any help understanding why I cannot import this pool will be much appreciated. Don't hesitate to ask for more detailed information, as I don't know which command outputs would help.
Thanks in advance
Describe how to reproduce the problem
Not sure it is reproducible in an environment other than mine. On this server, trying to import the pool with write access is enough to reproduce the issue.
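Concretely, on this machine the reproducer is just the normal read-write import (pool name as reported above):

zpool import -d /dev/disk/by-id discworld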
Include any warning/errors/backtraces from the system logs