openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs-program: zfs.sync.snapshot can create snapshots with empty name, causes zfs to lock-up #13548

Open lhuedepohl opened 2 years ago

lhuedepohl commented 2 years ago

System information

Type                   Version/Name
Distribution Name      Debian (proxmox)
Distribution Version   11 (proxmox 7.2)
Kernel Version         5.15.35-2-pve
Architecture           x86-64
OpenZFS Version        zfs-2.1.4-pve1, zfs-kmod-2.1.4-pve1

Describe the problem you're observing

By erroneously supplying an empty snapshot name to a zfs-program Lua script, I discovered that it is apparently possible to create snapshots with an empty name this way, e.g. by effectively doing zfs.sync.snapshot("rpool@").

This then causes any ZFS operation to lock up. The following error message is printed:

Message from syslogd@pve at Jun 11 13:22:01 ...
 kernel:[  116.820844] VERIFY3(0 == dsl_dataset_get_snapname(ds)) failed (0 == 2)

Message from syslogd@pve at Jun 11 13:22:01 ...
 kernel:[  116.820985] PANIC at dsl_dataset.c:923:dsl_dataset_name()
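
For illustration, here is a minimal sketch of how such an empty name can slip in: if the snapshot suffix comes from a script argument that happens to be empty, the resulting name is just "rpool@", which zfs.sync.snapshot accepts instead of rejecting. The variable names and argument handling below are illustrative, not taken from my actual script.

-- Illustrative channel program: build the snapshot name from a
-- caller-supplied suffix.  With an empty suffix this effectively
-- calls zfs.sync.snapshot("rpool@"), which the affected versions accept.
local args = ...                            -- channel programs receive their arguments as a table
local argv = (args and args["argv"]) or {}
local suffix = argv[1] or ""                -- empty when no suffix argument is passed
zfs.sync.snapshot("rpool@" .. suffix)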

Describe how to reproduce the problem

root@pve:~# cat empty_snapshot.lua
zfs.sync.snapshot("rpool@")
root@pve:~# zfs program -- rpool ./empty_snapshot.lua
Channel program fully executed and did not produce output.
root@pve:~# zfs list

Message from syslogd@pve at Jun 11 13:22:01 ...
 kernel:[  116.820844] VERIFY3(0 == dsl_dataset_get_snapname(ds)) failed (0 == 2)

Message from syslogd@pve at Jun 11 13:22:01 ...
 kernel:[  116.820985] PANIC at dsl_dataset.c:923:dsl_dataset_name()
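
Until zfs.sync.snapshot itself rejects such names, a channel program can guard against this on the Lua side. A minimal sketch of such a guard (safe_snapshot is a hypothetical helper, not part of the zfs.* API):

-- Hypothetical guard: reject names whose part after '@' is missing or empty
-- before handing them to zfs.sync.snapshot.
local function safe_snapshot(name)
    local at = string.find(name, "@", 1, true)
    if at == nil or at == string.len(name) then
        error("invalid snapshot name: " .. name)
    end
    return zfs.sync.snapshot(name)
end

safe_snapshot("rpool@")   -- raises a Lua error instead of creating "rpool@"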

Log messages

A stack trace is shown in dmesg as soon as the zfs list command is issued (I think):

[  116.820844] VERIFY3(0 == dsl_dataset_get_snapname(ds)) failed (0 == 2)
[  116.820985] PANIC at dsl_dataset.c:923:dsl_dataset_name()
[  116.821059] Showing stack for process 2976
[  116.821117] CPU: 0 PID: 2976 Comm: zfs Tainted: P           O      5.15.35-2-pve #1
[  116.821208] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  116.821304] Call Trace:
[  116.821350]  <TASK>
[  116.821415]  dump_stack_lvl+0x4a/0x5f
[  116.821471]  dump_stack+0x10/0x12
[  116.821515]  spl_dumpstack+0x29/0x2b [spl]
[  116.821605]  spl_panic+0xd1/0xe9 [spl]
[  116.821668]  ? kfree+0x1f3/0x250
[  116.821715]  ? spl_kmem_free+0x2a/0x30 [spl]
[  116.821783]  ? zap_value_search+0xad/0xf0 [zfs]
[  116.822875]  ? dmu_buf_rele+0x3d/0x50 [zfs]
[  116.823126]  dsl_dataset_name.part.0+0x114/0x1c0 [zfs]
[  116.823421]  dsl_dataset_stats+0x2c1/0x430 [zfs]
[  116.823713]  ? dmu_buf_rele+0x3d/0x50 [zfs]
[  116.823980]  ? dsl_dir_rele+0x30/0x40 [zfs]
[  116.824271]  ? __cond_resched+0x1a/0x50
[  116.824321]  ? mutex_lock+0x13/0x40
[  116.824366]  ? dmu_objset_from_ds+0x7f/0x170 [zfs]
[  116.824646]  dmu_objset_stats+0x1f/0x60 [zfs]
[  116.824907]  zfs_ioc_objset_stats_impl.part.0+0x69/0xe0 [zfs]
[  116.825288]  zfs_ioc_objset_stats+0x94/0xa0 [zfs]
[  116.825656]  zfsdev_ioctl_common+0x763/0x9e0 [zfs]
[  116.826027]  ? _copy_from_user+0x2e/0x60
[  116.826080]  zfsdev_ioctl+0x57/0xe0 [zfs]
[  116.826427]  __x64_sys_ioctl+0x91/0xc0
[  116.826477]  do_syscall_64+0x5c/0xc0
[  116.826524]  ? handle_mm_fault+0xd8/0x2c0
[  116.826578]  ? exit_to_user_mode_prepare+0x37/0x1b0
[  116.826640]  ? irqentry_exit_to_user_mode+0x9/0x20
[  116.826700]  ? irqentry_exit+0x19/0x30
[  116.826748]  ? exc_page_fault+0x89/0x160
[  116.826798]  ? asm_exc_page_fault+0x8/0x30
[  116.826852]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  116.828795] RIP: 0033:0x7f8e83939cc7
[  116.830728] Code: 00 00 00 48 8b 05 c9 91 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 99 91 0c 00 f7 d8 64 89 01 48
[  116.834783] RSP: 002b:00007ffe90760b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  116.836822] RAX: ffffffffffffffda RBX: 00007ffe90760bc0 RCX: 00007f8e83939cc7
[  116.838836] RDX: 00007ffe90760bc0 RSI: 0000000000005a12 RDI: 0000000000000003
[  116.841088] RBP: 00007ffe90760bb0 R08: 000055dc68098e20 R09: 00007f8e83a03be0
[  116.842518] R10: 0000000000040030 R11: 0000000000000246 R12: 000055dc680915f0
[  116.843959] R13: 00007ffe90760bc0 R14: 000055dc680915f0 R15: 000055dc68094cb0
[  116.845362]  </TASK>

After that, only messages like this are repeated:

[  242.055925] INFO: task zfs:2976 blocked for more than 120 seconds.
[  242.058015]       Tainted: P           O      5.15.35-2-pve #1
[  242.060100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.062107] task:zfs             state:D stack:    0 pid: 2976 ppid:  1894 flags:0x00000000
[  242.064519] Call Trace:
[  242.066593]  <TASK>
[  242.070728]  __schedule+0x33d/0x1750
[  242.072845]  ? arch_local_irq_enable+0xb/0xd
[  242.074866]  schedule+0x4e/0xb0
[  242.077139]  spl_panic+0xe7/0xe9 [spl]
[  242.079149]  ? kfree+0x1f3/0x250
[  242.081244]  ? spl_kmem_free+0x2a/0x30 [spl]
[  242.083305]  ? zap_value_search+0xad/0xf0 [zfs]
[  242.085685]  ? dmu_buf_rele+0x3d/0x50 [zfs]
[  242.088058]  dsl_dataset_name.part.0+0x114/0x1c0 [zfs]
[  242.090314]  dsl_dataset_stats+0x2c1/0x430 [zfs]
[  242.092602]  ? dmu_buf_rele+0x3d/0x50 [zfs]
[  242.094798]  ? dsl_dir_rele+0x30/0x40 [zfs]
[  242.097336]  ? __cond_resched+0x1a/0x50
[  242.099462]  ? mutex_lock+0x13/0x40
[  242.101489]  ? dmu_objset_from_ds+0x7f/0x170 [zfs]
[  242.103935]  dmu_objset_stats+0x1f/0x60 [zfs]
[  242.106166]  zfs_ioc_objset_stats_impl.part.0+0x69/0xe0 [zfs]
[  242.108567]  zfs_ioc_objset_stats+0x94/0xa0 [zfs]
[  242.110875]  zfsdev_ioctl_common+0x763/0x9e0 [zfs]
[  242.113245]  ? _copy_from_user+0x2e/0x60
[  242.115562]  zfsdev_ioctl+0x57/0xe0 [zfs]
[  242.117962]  __x64_sys_ioctl+0x91/0xc0
[  242.120095]  do_syscall_64+0x5c/0xc0
[  242.122081]  ? handle_mm_fault+0xd8/0x2c0
[  242.124153]  ? exit_to_user_mode_prepare+0x37/0x1b0
[  242.126162]  ? irqentry_exit_to_user_mode+0x9/0x20
[  242.128217]  ? irqentry_exit+0x19/0x30
[  242.130237]  ? exc_page_fault+0x89/0x160
[  242.132393]  ? asm_exc_page_fault+0x8/0x30
[  242.134424]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  242.136549] RIP: 0033:0x7f8e83939cc7
[  242.138563] RSP: 002b:00007ffe90760b98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  242.140685] RAX: ffffffffffffffda RBX: 00007ffe90760bc0 RCX: 00007f8e83939cc7
[  242.142153] RDX: 00007ffe90760bc0 RSI: 0000000000005a12 RDI: 0000000000000003
[  242.143548] RBP: 00007ffe90760bb0 R08: 000055dc68098e20 R09: 00007f8e83a03be0
[  242.144913] R10: 0000000000040030 R11: 0000000000000246 R12: 000055dc680915f0
[  242.146244] R13: 00007ffe90760bc0 R14: 000055dc680915f0 R15: 000055dc68094cb0
[  242.147654]  </TASK>


lhuedepohl commented 2 years ago

FYI: With the work-around mentioned in the old spl issue linked above, the pool becomes usable again. I will now also save the last few changes made since the last backup was sent - as I can even create new snapshots and send increments - and then re-create the pool from scratch.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.