openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Kernel oops (null pointer dereference) when renaming a zvol, if snapdev=visible AND the zvol has at least one snapshot #16274

Closed: jgottula closed this issue 3 weeks ago

jgottula commented 2 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Arch Linux |
| Distribution Version | Rolling release |
| Kernel Version | 6.9.2-arch1-1, 6.9.5-arch1-1 |
| Architecture | x86_64 |
| OpenZFS Version | 2.2.4 (stable) and c98295e (git master) |

Describe the problem you're observing

Doing zfs rename on a zvol causes a kernel oops, which kills the z_zvol kthread and makes zvols unusable until the system is rebooted.

The kernel oops is a null pointer dereference at address 0x120 in dataset_kstats_rename. It happens on the second of the two CPU instructions that implement dk->dk_kstats->ks_data, where ks_data is read by dereferencing dk_kstats. dk->dk_kstats is NULL, and offsetof(kstat_t, ks_data) == 0x120 (in this particular build, anyway), so the taskq thread ends up trying to read the qword at address 0x120.

```c
void
dataset_kstats_rename(dataset_kstats_t *dk, const char *name)
{
        dataset_kstat_values_t *dkv = dk->dk_kstats->ks_data; // <-- dk->dk_kstats is NULL
        char *ds_name;

        ds_name = KSTAT_NAMED_STR_PTR(&dkv->dkv_ds_name);
        ASSERT3S(ds_name, !=, NULL);
        (void) strlcpy(ds_name, name,
            KSTAT_NAMED_STR_BUFLEN(&dkv->dkv_ds_name));
}
```

```
00000000000eb840 <dataset_kstats_rename>:
   eb840:       f3 0f 1e fa             endbr64
   eb844:       e8 00 00 00 00          call   eb849 <dataset_kstats_rename+0x9>        eb845: R_X86_64_PLT32   __fentry__-0x4
   eb849:       48 8b 87 98 03 00 00    mov    rax,QWORD PTR [rdi+0x398]  <-- rax (aka: dk->dk_kstats) is zero after this instruction
   eb850:       48 8b 80 20 01 00 00    mov    rax,QWORD PTR [rax+0x120]  <-- this attempts to read qword ptr [0+0x120] and fails
   eb857:       8b 90 08 01 00 00       mov    edx,DWORD PTR [rax+0x108]
   eb85d:       48 8b b8 00 01 00 00    mov    rdi,QWORD PTR [rax+0x100]
   eb864:       e9 00 00 00 00          jmp    eb869 <dataset_kstats_rename+0x29>       eb865: R_X86_64_PLT32   sized_strscpy-0x4
```
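
To make the faulting address concrete: loading a field through a NULL struct pointer faults at exactly that field's offset. Here is a tiny userspace illustration of the arithmetic (the struct is a made-up stand-in, since the real kstat_t layout is build-dependent):

```c
#include <stddef.h>
#include <stdio.h>

/* Made-up stand-in for kstat_t; in the build above, ks_data happens to
 * sit at offset 0x120. */
struct kstat_like {
	char pad[0x120];
	void *ks_data;
};

int
main(void)
{
	/* For a NULL pointer k, &k->ks_data would be
	 * 0 + offsetof(struct kstat_like, ks_data) == 0x120; the load
	 * through that address is what faults, matching CR2 in the
	 * oops logs below. */
	printf("ks_data offset: %#zx\n", offsetof(struct kstat_like, ks_data));
	return (0);
}
```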

Describe how to reproduce the problem

EDIT: see the revised steps in a comment below; the problem only occurs if the zvol has snapdev=visible AND one or more snapshots.

  1. Use OpenZFS 2.2.4 on Linux
  2. Create a pool
  3. Create a zvol
  4. Do zfs rename pool/path/to/zvol pool/path/to/newname
  5. Check dmesg and see if things went wrong

I updated OpenZFS on this system to 2.2.4 after being on an older version for quite some time, in preparation for creating an entirely new pool, which I've been using pretty extensively for the past couple of weeks. During that time, I've run into this particular issue twice, and I think it has actually happened every time I've specifically done zfs rename on a zvol. (In fact, I just checked zpool history for the relevant pool, and the two kernel oopses coincided with the only two times I have done zfs rename on a zvol.)

I'm not yet sure whether the problem has more preconditions than literally just "use 2.2.4 and do zfs rename on a zvol". A reasonable amount of other, non-zvol I/O was happening on the pool both times the problem occurred.

I haven't yet had an opportunity to do a bunch of test-reboot cycles on this; and rebooting is necessary because, as soon as the problem occurs, the z_zvol kernel thread is killed (while holding one or more locks, I believe, or else leaving some other thread of execution waiting forever on the task's completion). From that point on, although most other ZFS operations continue to work, basically anything zvol-related simply gets stuck and makes no forward progress.

### Kernel Oops 1

```
2024-06-11T23:42:54-07:00 helium kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
2024-06-11T23:42:54-07:00 helium kernel: #PF: supervisor read access in kernel mode
2024-06-11T23:42:54-07:00 helium kernel: #PF: error_code(0x0000) - not-present page
2024-06-11T23:42:54-07:00 helium kernel: PGD 0 P4D 0
2024-06-11T23:42:54-07:00 helium kernel: Oops: 0000 [#1] PREEMPT SMP PTI
2024-06-11T23:42:54-07:00 helium kernel: CPU: 0 PID: 974 Comm: z_zvol Tainted: P OE 6.9.2-arch1-1 #1 0c9f583c4dce67c6fc47fa44804d20ac179c7f14
2024-06-11T23:42:54-07:00 helium kernel: Hardware name: Gigabyte Technology Co., Ltd. EP35-DS3R/EP35-DS3R, BIOS F4 06/19/2009
2024-06-11T23:42:54-07:00 helium kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-11T23:42:54-07:00 helium kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-11T23:42:54-07:00 helium kernel: RSP: 0018:ffffb1ee83fcfd38 EFLAGS: 00010292
2024-06-11T23:42:54-07:00 helium kernel: RAX: 0000000000000000 RBX: ffff99ea9ce1c000 RCX: 00000000ab9b9c00
2024-06-11T23:42:54-07:00 helium kernel: RDX: 00000000ab9b9a00 RSI: ffff99ea8d93e000 RDI: ffff99ea89167998
2024-06-11T23:42:54-07:00 helium kernel: RBP: ffff99ea89161800 R08: 0000000000000000 R09: 0000000000000000
2024-06-11T23:42:54-07:00 helium kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff99ea89167d60
2024-06-11T23:42:54-07:00 helium kernel: R13: ffff99ea9ce1c004 R14: ffff99ea55ae3180 R15: ffff99ea89167800
2024-06-11T23:42:54-07:00 helium kernel: FS: 0000000000000000(0000) GS:ffff99eb41c00000(0000) knlGS:0000000000000000
2024-06-11T23:42:54-07:00 helium kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-11T23:42:54-07:00 helium kernel: CR2: 0000000000000120 CR3: 0000000003b88000 CR4: 00000000000406f0
2024-06-11T23:42:54-07:00 helium kernel: Call Trace:
2024-06-11T23:42:54-07:00 helium kernel: <TASK>
2024-06-11T23:42:54-07:00 helium kernel: ? __die_body.cold+0x19/0x27
2024-06-11T23:42:54-07:00 helium kernel: ? page_fault_oops+0x15a/0x2b0
2024-06-11T23:42:54-07:00 helium kernel: ? search_bpf_extables+0x5f/0x80
2024-06-11T23:42:54-07:00 helium kernel: ? exc_page_fault+0x81/0x190
2024-06-11T23:42:54-07:00 helium kernel: ? asm_exc_page_fault+0x26/0x30
2024-06-11T23:42:54-07:00 helium kernel: ? dataset_kstats_rename+0x10/0x30 [zfs 2d360922e0aeba07e78067b282e3eedb08db1dc5]
2024-06-11T23:42:54-07:00 helium kernel: zvol_task_cb+0x37f/0x450 [zfs 2d360922e0aeba07e78067b282e3eedb08db1dc5]
2024-06-11T23:42:54-07:00 helium kernel: taskq_thread+0x2e1/0x550 [spl 8d5c82e0079bc7d3bed9092345840f4851690968]
2024-06-11T23:42:54-07:00 helium kernel: ? __pfx_default_wake_function+0x10/0x10
2024-06-11T23:42:54-07:00 helium kernel: ? __pfx_taskq_thread+0x10/0x10 [spl 8d5c82e0079bc7d3bed9092345840f4851690968]
2024-06-11T23:42:54-07:00 helium kernel: kthread+0xd2/0x100
2024-06-11T23:42:54-07:00 helium kernel: ? __pfx_kthread+0x10/0x10
2024-06-11T23:42:54-07:00 helium kernel: ret_from_fork+0x34/0x50
2024-06-11T23:42:54-07:00 helium kernel: ? __pfx_kthread+0x10/0x10
2024-06-11T23:42:54-07:00 helium kernel: ret_from_fork_asm+0x1a/0x30
2024-06-11T23:42:54-07:00 helium kernel: </TASK>
2024-06-11T23:42:54-07:00 helium kernel: Modules linked in: tls rpcrdma rdma_cm iw_cm ib_cm ib_core cls_u32 sch_htb ifb wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc cfg80211 rfkill ip6table_nat ip6table_mangle ip6t_REJECT nf_reject_ipv6 xt_nat xt_REDIRECT xt_MASQUERADE iptable_nat xt_connmark xt_TCPMSS iptable_mangle xt_multiport xt_helper ipt_REJECT nf_reject_ipv4 xt_NFLOG nfnetlink_log xt_limit xt_conntrack xt_tcpudp nf_nat_irc nf_conntrack_irc nf_nat_sip nf_conntrack_sip nf_nat_ftp nf_conntrack_ftp nf_nat_pptp nf_conntrack_pptp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lz4hc lz4hc_compress ip6table_filter ip6_tables iptable_filter coretemp kvm_intel iTCO_wdt intel_pmc_bxt ppdev gpio_ich iTCO_vendor_support zram kvm i2c_i801 sha512_ssse3 acpi_cpufreq sha1_ssse3 i2c_smbus lpc_ich intel_agp parport_pc intel_gtt parport joydev mac_hid zfs(POE) spl(OE) vhost_scsi target_core_mod vhost_net tun vhost vhost_iotlb tap usbip_host usbip_core
2024-06-11T23:42:54-07:00 helium kernel: pcspkr it87(OE) nfsd hwmon_vid frandom(OE) sg auth_rpcgss crypto_user loop nfs_acl lockd grace sunrpc nfnetlink ip_tables x_tables hid_logitech_hidpp vfat fat hid_logitech_dj btrfs blake2b_generic ata_generic pata_acpi hid_generic xor raid6_pq usbhid sha256_ssse3 ext4 pata_jmicron crc16 mbcache jbd2 dm_cache dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c crc32c_generic dm_snapshot dm_bufio dm_mod radeon drm_ttm_helper ttm video wmi i2c_algo_bit drm_suballoc_helper drm_display_helper cec lz4 lz4_compress netconsole r8169 realtek mdio_devres libphy r8168(OE)
2024-06-11T23:42:54-07:00 helium kernel: CR2: 0000000000000120
2024-06-11T23:42:54-07:00 helium kernel: ---[ end trace 0000000000000000 ]---
2024-06-11T23:42:54-07:00 helium kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-11T23:42:54-07:00 helium kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-11T23:42:54-07:00 helium kernel: RSP: 0018:ffffb1ee83fcfd38 EFLAGS: 00010292
2024-06-11T23:42:54-07:00 helium kernel: RAX: 0000000000000000 RBX: ffff99ea9ce1c000 RCX: 00000000ab9b9c00
2024-06-11T23:42:54-07:00 helium kernel: RDX: 00000000ab9b9a00 RSI: ffff99ea8d93e000 RDI: ffff99ea89167998
2024-06-11T23:42:54-07:00 helium kernel: RBP: ffff99ea89161800 R08: 0000000000000000 R09: 0000000000000000
2024-06-11T23:42:54-07:00 helium kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff99ea89167d60
2024-06-11T23:42:54-07:00 helium kernel: R13: ffff99ea9ce1c004 R14: ffff99ea55ae3180 R15: ffff99ea89167800
2024-06-11T23:42:54-07:00 helium kernel: FS: 0000000000000000(0000) GS:ffff99eb41c00000(0000) knlGS:0000000000000000
2024-06-11T23:42:54-07:00 helium kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-11T23:42:54-07:00 helium kernel: CR2: 0000000000000120 CR3: 0000000003b88000 CR4: 00000000000406f0
2024-06-11T23:42:54-07:00 helium kernel: note: z_zvol[974] exited with irqs disabled
```

### Kernel Oops 2

```
2024-06-16T16:09:44-07:00 helium kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
2024-06-16T16:09:44-07:00 helium kernel: #PF: supervisor read access in kernel mode
2024-06-16T16:09:44-07:00 helium kernel: #PF: error_code(0x0000) - not-present page
2024-06-16T16:09:44-07:00 helium kernel: PGD 0 P4D 0
2024-06-16T16:09:44-07:00 helium kernel: Oops: 0000 [#1] PREEMPT SMP PTI
2024-06-16T16:09:44-07:00 helium kernel: CPU: 1 PID: 982 Comm: z_zvol Tainted: P OE 6.9.2-arch1-1 #1 0c9f583c4dce67c6fc47fa44804d20ac179c7f14
2024-06-16T16:09:44-07:00 helium kernel: Hardware name: Gigabyte Technology Co., Ltd. EP35-DS3R/EP35-DS3R, BIOS F4 06/19/2009
2024-06-16T16:09:44-07:00 helium kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-16T16:09:44-07:00 helium kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-16T16:09:44-07:00 helium kernel: RSP: 0018:ffffa06e886a3d38 EFLAGS: 00010292
2024-06-16T16:09:44-07:00 helium kernel: RAX: 0000000000000000 RBX: ffff88edacc7f000 RCX: 00000001861b8801
2024-06-16T16:09:44-07:00 helium kernel: RDX: 00000001861b8601 RSI: ffff88ee4072b100 RDI: ffff88ee212a7998
2024-06-16T16:09:44-07:00 helium kernel: RBP: ffff88edefc05000 R08: 0000000000000000 R09: 0000000000000000
2024-06-16T16:09:44-07:00 helium kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88ee212a7d60
2024-06-16T16:09:44-07:00 helium kernel: R13: ffff88edacc7f004 R14: ffff88ed96915280 R15: ffff88ee212a7800
2024-06-16T16:09:44-07:00 helium kernel: FS: 0000000000000000(0000) GS:ffff88ee81c80000(0000) knlGS:0000000000000000
2024-06-16T16:09:44-07:00 helium kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-16T16:09:44-07:00 helium kernel: CR2: 0000000000000120 CR3: 000000004f110000 CR4: 00000000000406f0
2024-06-16T16:09:44-07:00 helium kernel: Call Trace:
2024-06-16T16:09:44-07:00 helium kernel: <TASK>
2024-06-16T16:09:44-07:00 helium kernel: ? __die_body.cold+0x19/0x27
2024-06-16T16:09:44-07:00 helium kernel: ? page_fault_oops+0x15a/0x2b0
2024-06-16T16:09:44-07:00 helium kernel: ? search_bpf_extables+0x5f/0x80
2024-06-16T16:09:44-07:00 helium kernel: ? exc_page_fault+0x81/0x190
2024-06-16T16:09:44-07:00 helium kernel: ? asm_exc_page_fault+0x26/0x30
2024-06-16T16:09:44-07:00 helium kernel: ? dataset_kstats_rename+0x10/0x30 [zfs 2d360922e0aeba07e78067b282e3eedb08db1dc5]
2024-06-16T16:09:44-07:00 helium kernel: zvol_task_cb+0x37f/0x450 [zfs 2d360922e0aeba07e78067b282e3eedb08db1dc5]
2024-06-16T16:09:44-07:00 helium kernel: taskq_thread+0x2e1/0x550 [spl 8d5c82e0079bc7d3bed9092345840f4851690968]
2024-06-16T16:09:44-07:00 helium kernel: ? __pfx_default_wake_function+0x10/0x10
2024-06-16T16:09:44-07:00 helium kernel: ? __pfx_taskq_thread+0x10/0x10 [spl 8d5c82e0079bc7d3bed9092345840f4851690968]
2024-06-16T16:09:44-07:00 helium kernel: kthread+0xd2/0x100
2024-06-16T16:09:44-07:00 helium kernel: ? __pfx_kthread+0x10/0x10
2024-06-16T16:09:44-07:00 helium kernel: ret_from_fork+0x34/0x50
2024-06-16T16:09:44-07:00 helium kernel: ? __pfx_kthread+0x10/0x10
2024-06-16T16:09:44-07:00 helium kernel: ret_from_fork_asm+0x1a/0x30
2024-06-16T16:09:44-07:00 helium kernel: </TASK>
2024-06-16T16:09:44-07:00 helium kernel: Modules linked in: tls mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcrdma rdma_cm iw_cm ib_cm ib_core cls_u32 sch_htb ifb wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc cfg80211 rfkill ip6table_nat ip6table_mangle ip6t_REJECT nf_reject_ipv6 xt_nat xt_REDIRECT xt_MASQUERADE iptable_nat xt_connmark xt_TCPMSS iptable_mangle xt_multiport xt_helper ipt_REJECT nf_reject_ipv4 xt_NFLOG nfnetlink_log xt_limit xt_conntrack xt_tcpudp nf_nat_irc nf_conntrack_irc nf_nat_sip nf_conntrack_sip nf_nat_ftp nf_conntrack_ftp nf_nat_pptp nf_conntrack_pptp nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lz4hc lz4hc_compress ip6table_filter ip6_tables iptable_filter joydev coretemp kvm_intel iTCO_wdt intel_pmc_bxt gpio_ich iTCO_vendor_support ppdev kvm zram sha512_ssse3 i2c_i801 sha1_ssse3 i2c_smbus acpi_cpufreq lpc_ich intel_agp intel_gtt parport_pc parport
2024-06-16T16:09:44-07:00 helium kernel: hid_logitech_hidpp mac_hid zfs(POE) spl(OE) vhost_scsi target_core_mod vhost_net tun vhost vhost_iotlb tap usbip_host usbip_core pcspkr it87(OE) hwmon_vid frandom(OE) nfsd sg crypto_user loop auth_rpcgss nfs_acl lockd grace sunrpc nfnetlink ip_tables x_tables vfat fat hid_logitech_dj btrfs ata_generic hid_generic pata_acpi blake2b_generic xor raid6_pq sha256_ssse3 usbhid pata_jmicron ext4 crc16 mbcache jbd2 dm_cache dm_thin_pool dm_persistent_data dm_bio_prison libcrc32c crc32c_generic dm_snapshot dm_bufio dm_mod radeon drm_ttm_helper ttm video wmi i2c_algo_bit drm_suballoc_helper drm_display_helper cec lz4 lz4_compress netconsole r8169 realtek mdio_devres libphy r8168(OE)
2024-06-16T16:09:44-07:00 helium kernel: CR2: 0000000000000120
2024-06-16T16:09:44-07:00 helium kernel: ---[ end trace 0000000000000000 ]---
2024-06-16T16:09:44-07:00 helium kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-16T16:09:44-07:00 helium kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-16T16:09:44-07:00 helium kernel: RSP: 0018:ffffa06e886a3d38 EFLAGS: 00010292
2024-06-16T16:09:44-07:00 helium kernel: RAX: 0000000000000000 RBX: ffff88edacc7f000 RCX: 00000001861b8801
2024-06-16T16:09:44-07:00 helium kernel: RDX: 00000001861b8601 RSI: ffff88ee4072b100 RDI: ffff88ee212a7998
2024-06-16T16:09:44-07:00 helium kernel: RBP: ffff88edefc05000 R08: 0000000000000000 R09: 0000000000000000
2024-06-16T16:09:44-07:00 helium kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88ee212a7d60
2024-06-16T16:09:44-07:00 helium kernel: R13: ffff88edacc7f004 R14: ffff88ed96915280 R15: ffff88ee212a7800
2024-06-16T16:09:44-07:00 helium kernel: FS: 0000000000000000(0000) GS:ffff88ee81c80000(0000) knlGS:0000000000000000
2024-06-16T16:09:44-07:00 helium kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-16T16:09:44-07:00 helium kernel: CR2: 0000000000000120 CR3: 000000004f110000 CR4: 00000000000406f0
2024-06-16T16:09:44-07:00 helium kernel: note: z_zvol[982] exited with irqs disabled
```

jgottula commented 2 months ago

This problem appears to be a regression caused—I'm not sure if in whole, or maybe just in part—by #15486, which added the dataset_kstats_rename function at issue here.


From perusing the surrounding source code a bit (I'm not familiar with it at all), it appears that dataset_kstats_t.dk_kstats holding a valid pointer is not guaranteed but merely possible, depending on the specific timing: whether dataset_kstats_create has not actually been called yet, or dataset_kstats_destroy has already been called, at the relevant moment.

The cross-thread call chain for this kernel oops looks, I believe, something like this:

zfs_ioc_rename
dsl_dir_rename
dsl_dir_rename_sync
zvol_rename_minors
[dispatched as an ASYNCHRONOUS TASK to an entirely different thread (as ZVOL_ASYNC_RENAME_MINORS)]
zvol_task_cb
zvol_rename_minors_impl
zvol_os_rename_minor
dataset_kstats_rename

So we have userspace using an ioctl to ask for the zvol to be renamed; but the actual renaming work happens asynchronously, over in a ZFS taskq kernel thread.
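
Roughly, the dispatch side of that pattern looks like the following (a simplified sketch using the SPL taskq API; the real code lives in zvol.c and uses a shared zvol_task_t, so the names here are illustrative only):

```c
/*
 * Simplified sketch, not the actual zvol.c code: the sync/ioctl path
 * queues the work and returns, so the rename itself (and therefore
 * the oops) happens later, on the z_zvol taskq thread.
 */
typedef struct rename_task {
	char rt_oldname[MAXNAMELEN];
	char rt_newname[MAXNAMELEN];
} rename_task_t;

static void
rename_task_cb(void *arg)
{
	rename_task_t *rt = arg;
	/* ...walk the zvol minors and rename them; a NULL dereference
	 * here kills the taskq thread, leaving later zvol tasks queued
	 * forever... */
	kmem_free(rt, sizeof (*rt));
}

static void
rename_minors_async(const char *oldname, const char *newname)
{
	rename_task_t *rt = kmem_alloc(sizeof (*rt), KM_SLEEP);
	(void) strlcpy(rt->rt_oldname, oldname, sizeof (rt->rt_oldname));
	(void) strlcpy(rt->rt_newname, newname, sizeof (rt->rt_newname));
	(void) taskq_dispatch(system_taskq, rename_task_cb, rt, TQ_SLEEP);
}
```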

And the other relatively-similar functions right next to dataset_kstats_rename in the same code file (dataset_kstats_update_write_kstats, _read_kstats, _nunlinks_kstat, _nunlinked_kstat) all begin with:

```c
if (dk->dk_kstats == NULL)
    return;
```

So it seems like dataset_kstats_rename should also probably be doing that same check and bailing out if dk_kstats is, in fact, a null pointer.

Now, I don't really know enough about the original intent from #15482 to know if that would be a proper fix, or if it might defeat some of the purpose of what was being done in #15486. (If dk_kstats is null, does that mean that the kstat dataset name doesn't need to be changed, because it wouldn't make any sense to? Or is that still something which would need to be done to accomplish what was intended there, just maybe with more appropriate timing or something? I'm not familiar enough with the kstat stuff that was the subject of the original issue to say. @asomers may have a better idea about this though.)

In any case, it seems like just adding that null pointer check to dataset_kstats_rename ought to work perfectly fine as a workaround that should prevent the kernel oops and at least prevent zvols from falling over and ceasing to work. (And I'm going to go ahead and do that as a local patch on top of 2.2.4, for the time being, for my own purposes.)

asomers commented 2 months ago

If this happens 100% of the time that you rename a zvol, it doesn't sound like a race to me. I think some more analysis is in order, like when does dataset_kstats_create get called? Could it be that it never gets called in your setup for some reason?

jgottula commented 2 months ago

Okay, I have some additional information to provide.

I encountered this bug on a local system that I'm using to do Real Work. And so on that system, I've actually already gone ahead and applied a local patch to just put a simple "return if null" check at the top of dataset_kstats_rename. (That's made the problem go away, which makes sense.)

I was a little reluctant to do a bunch of test/reboot/test/reboot stuff with that particular system. But, I do happen to have a VPS just sitting around, where I can fiddle around with things and not worry if kthreads get killed or if I have to reboot the whole system or whatever.

So, on this VPS, I happen to have the exact same ZFS installation (2.2.4 stable, from the very same archzfs binary package), and I have no module tunables set to non-default values, nor much else in the way of customization.


Based on my initial experiences with this bug, I was reluctant to jump to any conclusions. It seemed like, if the problem is as simple as "rename a zvol", and is as easily repeatable as it appears to be (at least based on the couple of times that I've had it happen), then surely someone else would have noticed it by now. But I hadn't been able to find any other issues that appeared to show the same problem related to renaming zvols. And that felt odd.

So I decided to check for anything that might make my particular use case different from what other users might typically be doing, and therefore maybe explain why it is that I've encountered this when seemingly no one else has. (So some of the most obvious things to look at would be any non-default zfs props or module tunables.)

Also, I've been browsing the codebase to get a general sense of the call paths that end up at dataset_kstats_rename, dataset_kstats_create, and dataset_kstats_destroy. And as I was doing this for dataset_kstats_create (which, for zvols, happens via zvol_os_create_minor, via a few different functions involved in creating the zvol device nodes), I noticed that the snapdev property was a relevant concern: if snapdev=visible, then upon creation of the device node for the zvol itself, the code also has to create an additional device node for each snapshot of that zvol.

And when looking for notable non-default props or tunables, one of the very few things I did identify from the two cases where I'd encountered the problem is that I was dealing with a zvol that both (a) had snapshots and (b) did have snapdev=visible set.


So I went ahead on my VPS system and created an empty pool (no other pools on the system) and created a zvol, initially with snapdev=hidden and no snapshots. I did a few zfs rename operations; no problems.

Then I created a single snapshot and set snapdev=visible. I did the exact same zfs rename command from before, and now... immediate kernel oops, just the same as on my local system.

### Kernel Oops

```
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: #PF: supervisor read access in kernel mode
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: #PF: error_code(0x0000) - not-present page
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: PGD 0 P4D 0
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CPU: 0 PID: 428 Comm: z_zvol Tainted: P OE 6.9.1-arch1-2 #1 06928436e5a6b4805e171d14d8efa397d7db9ad0
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Hardware name: Linode Compute Instance, BIOS Not Specified
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RSP: 0018:ffffa1d0819bbd38 EFLAGS: 00010292
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RAX: 0000000000000000 RBX: ffff89e4dfffc800 RCX: 00000036ce455e00
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RDX: 00000036ce455c00 RSI: ffff89e4ca958980 RDI: ffff89e4c4fe4198
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RBP: ffff89e4c7fbf000 R08: 0000000000000000 R09: 0000000000000000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff89e4c4fe4560
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: R13: ffff89e4dfffc804 R14: ffff89e4ca523540 R15: ffff89e4c4fe4000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: FS: 0000000000000000(0000) GS:ffff89e4fec00000(0000) knlGS:0000000000000000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CR2: 0000000000000120 CR3: 000000000998c002 CR4: 0000000000770ef0
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: PKRU: 55555554
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Call Trace:
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: <TASK>
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? __die_body.cold+0x19/0x27
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? page_fault_oops+0x15a/0x2b0
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? exc_page_fault+0x81/0x190
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? asm_exc_page_fault+0x26/0x30
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? dataset_kstats_rename+0x10/0x30 [zfs 8a589784d96acce9ef1e5edb55eb79da822f1417]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: zvol_task_cb+0x37f/0x450 [zfs 8a589784d96acce9ef1e5edb55eb79da822f1417]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: taskq_thread+0x2e1/0x550 [spl d6983f38407ffcfd8bfcf22bd3d60496da893801]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? __pfx_default_wake_function+0x10/0x10
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? __pfx_taskq_thread+0x10/0x10 [spl d6983f38407ffcfd8bfcf22bd3d60496da893801]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: kthread+0xd2/0x100
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? __pfx_kthread+0x10/0x10
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ret_from_fork+0x34/0x50
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ? __pfx_kthread+0x10/0x10
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ret_from_fork_asm+0x1a/0x30
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: </TASK>
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nfnetlink_acct wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel cfg80211 rfkill intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 bochs iTCO_wdt aesni_intel joydev drm_vram_helper intel_pmc_bxt i2c_i801 iTCO_vendor_support mousedev drm_ttm_helper intel_agp crypto_simd cryptd psmouse pcspkr lpc_ich i2c_smbus ttm intel_gtt mac_hid zfs(POE) spl(OE) loop dm_mod nfnetlink qemu_fw_cfg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 virtio_net net_failover failover virtio_scsi serio_raw atkbd libps2 vivaldi_fmap virtio_pci crc32c_intel virtio_pci_legacy_dev i8042 virtio_pci_modern_dev serio ata_piix ata_generic
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CR2: 0000000000000120
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: ---[ end trace 0000000000000000 ]---
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 67
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RSP: 0018:ffffa1d0819bbd38 EFLAGS: 00010292
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RAX: 0000000000000000 RBX: ffff89e4dfffc800 RCX: 00000036ce455e00
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RDX: 00000036ce455c00 RSI: ffff89e4ca958980 RDI: ffff89e4c4fe4198
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: RBP: ffff89e4c7fbf000 R08: 0000000000000000 R09: 0000000000000000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff89e4c4fe4560
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: R13: ffff89e4dfffc804 R14: ffff89e4ca523540 R15: ffff89e4c4fe4000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: FS: 0000000000000000(0000) GS:ffff89e4fec00000(0000) knlGS:0000000000000000
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: CR2: 0000000000000120 CR3: 000000000998c002 CR4: 0000000000770ef0
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: PKRU: 55555554
2024-06-17T15:56:14-07:00 linode-cloud1 kernel: note: z_zvol[428] exited with irqs disabled
```

So this was a great finding. The problem is reproducible on a different system with a different pool; it seems unlikely to be a race or other timing-sensitive thing; and it appears pretty definitively to be connected to snapdev=visible.

(Also, my local system is a 2-CPU machine, whereas this happens to be a single-CPU VPS instance; which is oddly handy, in certain limited respects, for answering "is this a race?" type questions.)


Revised steps to reproduce

  1. Use OpenZFS (stable 2.2.4 or git master) on Linux
  2. zpool create pool; zfs create -V 1G pool/vol1
  3. TEST: snapdev=hidden, no snapshots: OK zfs rename pool/vol1 pool/vol2; zfs rename pool/vol2 pool/vol3
  4. zfs snap pool/vol3@snap1
  5. TEST: snapdev=hidden, has snapshot(s): OK zfs rename pool/vol3 pool/vol4; zfs rename pool/vol4 pool/vol5
  6. zfs destroy pool/vol5@snap1; zfs set snapdev=visible pool/vol5
  7. TEST: snapdev=visible, no snapshots: OK zfs rename pool/vol5 pool/vol6; zfs rename pool/vol6 pool/vol7
  8. zfs snap pool/vol7@snap1
  9. TEST: snapdev=visible, has snapshot(s): FAIL zfs rename pool/vol7 pool/vol8; zfs rename pool/vol8 pool/vol9
     - The zfs rename commands do not hang, and the zvol is actually renamed (briefly to vol8, and ultimately to vol9)
     - Upon the first zfs rename command, dmesg shows a kernel null pointer dereference
     - /dev/zvol/pool/vol7 and /dev/zvol/pool/vol7@snap1 remain forever, and udev workers block indefinitely

I'll continue messing around with this and see what else I can determine.

asomers commented 2 months ago

Hey, that's great news! Once you've answered those questions, do you think you'll be able to add a test case?

jgottula commented 2 months ago

> Hey, that's great news! Once you've answered those questions, do you think you'll be able to add a test case?

Yeah, that's a good idea. I'll look into doing this.

jgottula commented 2 months ago

Okay, reproduced successfully once again.

I've confirmed that snapdev=visible must be set on the zvol and that the zvol must have at least one snapshot. (One snapshot is entirely sufficient. But if there are no snapshots, no bug.)

jgottula commented 2 months ago

Reproduced once again, this time using a git master branch (c98295eed2) build of the zfs kernel modules and userspace utilities. Exact same behavior and kernel bug manifestation.

```
# zfs version
zfs-2.2.99-534_gc98295eed2
zfs-kmod-2.2.99-534_gc98295eed2
```

### Kernel Oops

```
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000120
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: #PF: supervisor read access in kernel mode
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: #PF: error_code(0x0000) - not-present page
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: PGD 0 P4D 0
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CPU: 0 PID: 1947 Comm: z_zvol Tainted: P OE 6.9.5-arch1-1 #1 b9e5462a84a73f67b5c7c6b73f88d2a6349ae768
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Hardware name: Linode Compute Instance, BIOS Not Specified
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 a7
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RSP: 0018:ffffc1c583d5bd38 EFLAGS: 00010292
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RAX: 0000000000000000 RBX: ffff9fabca768400 RCX: 000000000357a800
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RDX: 000000000357a600 RSI: ffff9fabc7b29f30 RDI: ffff9fabe6bec998
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RBP: ffff9fabc7be8800 R08: 0000000000000000 R09: 0000000000000000
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: R10: 000000000010000f R11: 0000000000000000 R12: ffff9fabe6becd60
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: R13: ffff9fabca768404 R14: ffff9fabe6ae3540 R15: ffff9fabe6bec800
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: FS: 0000000000000000(0000) GS:ffff9fabfec00000(0000) knlGS:0000000000000000
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CR2: 0000000000000120 CR3: 0000000028524005 CR4: 0000000000770ef0
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: PKRU: 55555554
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Call Trace:
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: <TASK>
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? __die_body.cold+0x19/0x27
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? page_fault_oops+0x15a/0x2b0
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? exc_page_fault+0x81/0x190
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? asm_exc_page_fault+0x26/0x30
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? dataset_kstats_rename+0x10/0x30 [zfs 76b81818ed9932b301e206d867899ff00e428e9b]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: zvol_task_cb+0x37d/0x450 [zfs 76b81818ed9932b301e206d867899ff00e428e9b]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: taskq_thread+0x2e1/0x550 [spl a38ed7e5f6dac8ba1f8d40d07ff19af02b5c66ae]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? __pfx_default_wake_function+0x10/0x10
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? __pfx_taskq_thread+0x10/0x10 [spl a38ed7e5f6dac8ba1f8d40d07ff19af02b5c66ae]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: kthread+0xd2/0x100
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? __pfx_kthread+0x10/0x10
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ret_from_fork+0x34/0x50
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ? __pfx_kthread+0x10/0x10
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ret_from_fork_asm+0x1a/0x30
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: </TASK>
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nfnetlink_acct cfg80211 rfkill intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 joydev sha256_ssse3 sha1_ssse3 aesni_intel iTCO_wdt bochs mousedev intel_pmc_bxt crypto_simd iTCO_vendor_support drm_vram_helper cryptd drm_ttm_helper psmouse i2c_i801 pcspkr intel_agp lpc_ich ttm i2c_smbus intel_gtt mac_hid zfs(POE) spl(OE) loop dm_mod nfnetlink qemu_fw_cfg ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 virtio_net serio_raw atkbd virtio_scsi net_failover failover libps2 vivaldi_fmap crc32c_intel virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev i8042 serio ata_piix ata_generic
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CR2: 0000000000000120
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: ---[ end trace 0000000000000000 ]---
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RIP: 0010:dataset_kstats_rename+0x10/0x30 [zfs]
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: Code: 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 98 03 00 00 <48> 8b 80 20 01 00 00 8b 90 08 01 00 00 48 8b b8 00 01 00 00 e9 a7
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RSP: 0018:ffffc1c583d5bd38 EFLAGS: 00010292
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RAX: 0000000000000000 RBX: ffff9fabca768400 RCX: 000000000357a800
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RDX: 000000000357a600 RSI: ffff9fabc7b29f30 RDI: ffff9fabe6bec998
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: RBP: ffff9fabc7be8800 R08: 0000000000000000 R09: 0000000000000000
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: R10: 000000000010000f R11: 0000000000000000 R12: ffff9fabe6becd60
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: R13: ffff9fabca768404 R14: ffff9fabe6ae3540 R15: ffff9fabe6bec800
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: FS: 0000000000000000(0000) GS:ffff9fabfec00000(0000) knlGS:0000000000000000
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: CR2: 0000000000000120 CR3: 0000000028524005 CR4: 0000000000770ef0
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: PKRU: 55555554
2024-06-17T19:16:21-07:00 linode-cloud1 kernel: note: z_zvol[1947] exited with irqs disabled
```

robn commented 2 months ago

@jgottula looks like we just missed one check to ensure the kstat object exists before we use it. As you saw, the other dataset_kstats_ functions in dataset_kstats.c do this check.

I'll do a more thorough audit tonight, but for now, if you like, give this patch a try.

```diff
diff --git module/zfs/dataset_kstats.c module/zfs/dataset_kstats.c
index 2ac058fd2..5abee1243 100644
--- module/zfs/dataset_kstats.c
+++ module/zfs/dataset_kstats.c
@@ -201,6 +201,9 @@ dataset_kstats_destroy(dataset_kstats_t *dk)
 void
 dataset_kstats_rename(dataset_kstats_t *dk, const char *name)
 {
+   if (dk->dk_kstats == NULL)
+       return;
+
    dataset_kstat_values_t *dkv = dk->dk_kstats->ks_data;
    char *ds_name;
```

I think it probably is the whole fix for this specific case. For better or worse, we just proceed if kstat objects can't be created, assuming that it doesn't matter because they only exist to make stats visible to the OS. So anything trying to use them should always be doing a NULL check.

jgottula commented 2 months ago

> I'll do a more thorough audit tonight, but for now, if you like, give this patch a try.

Yeah so actually I've already done a local build of zfs 2.2.4 with effectively that exact patch applied, and I've confirmed that it fixes the kernel null pointer dereference issue.

> For better or worse, we just proceed if kstat objects can't be created, assuming that it doesn't matter because they only exist to make stats visible to the OS.

I just now found some time to take a proper look back at #15482, and I think I better understand the intent now.

And actually... on my system using the zfs 2.2.4 build with the null pointer check patched in... it seems like, despite the null pointer check clearly stopping at least one call to dataset_kstats_rename from doing anything, things nevertheless seem to work as intended, because the dataset_name field does get updated when the zvol is renamed:

```
# zfs create -V 1M -o snapdev=visible pool2/TEST1
# zfs snap pool2/TEST1@snap1
# zfs get -pH -o value objsetid pool2/TEST1
29215
# printf '0x%x\n' 29215
0x721f
# grep '^dataset_name' /proc/spl/kstat/zfs/pool2/objset-0x721f
dataset_name                    7    pool2/TEST1
# zfs rename pool2/TEST1 pool2/TEST2
# grep '^dataset_name' /proc/spl/kstat/zfs/pool2/objset-0x721f
dataset_name                    7    pool2/TEST2
```

So (at least on Linux), dataset_name does appear to get updated correctly upon a rename—irrespective of, or in spite of, the null pointer check being there. (And as a bonus, the system no longer breaks if you happen to rename a zvol that has snapshots as well as snapdev=visible.)

So basically it seems like we aren't actually losing any intended functionality by having the null pointer check in there.

However, I don't personally have the ability to easily test whether any of the things I just said above are also true for FreeBSD. (And for that matter, I don't even know for sure whether FreeBSD is affected by the null pointer dereference bug.) So someone else would need to check on that.

Given all of the above, my best guess at what's going on is something like this: the code paths upstream of dataset_kstats_rename call that function for EVERY device node associated with a zvol when that zvol is renamed, regardless of whether a given node represents the zvol proper or merely one of the zvol's snapshots; the snapshot cases have no value for dk->dk_kstats; and so what the null pointer check actually does is filter out the not-really-valid calls for the zvol's snapshots, while still letting through the call for the zvol itself.

That's my best guess, anyway. Maybe I'll patch in some printk's and see if I'm right about that.
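
Something along these lines, maybe (a hypothetical debug-only hack, not anything intended for merging):

```c
/* Hypothetical instrumentation: log every call, so dmesg shows whether
 * the NULL-dk_kstats calls correspond to the snapshot device nodes, as
 * guessed above. */
void
dataset_kstats_rename(dataset_kstats_t *dk, const char *name)
{
	printk(KERN_DEBUG "dataset_kstats_rename: name=%s dk_kstats=%px\n",
	    name, dk->dk_kstats);

	if (dk->dk_kstats == NULL)
		return;

	/* ...rest unchanged... */
}
```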

(And if I am right about that, then probably it would make sense to not only have the null pointer check in there, but to also ensure that the callers of this function aren't calling it for snapshots on a more explicit basis.)

robn commented 2 months ago

Ahh, sorry, I missed that you'd already done a patch. Good good!

Hah, this is cool, and matches your analysis I think:

```c
int
dataset_kstats_create(dataset_kstats_t *dk, objset_t *objset)
{
        /*
         * There should not be anything wrong with having kstats for
         * snapshots. Since we are not sure how useful they would be
         * though nor how much their memory overhead would matter in
         * a filesystem with many snapshots, we skip them for now.
         */
        if (dmu_objset_is_snapshot(objset))
                return (0);
```

So it's kind of deliberate I guess.

More philosophically, I softly agree it would be nice to not make calls for things that it's nonsensical to call them on, but it's often a function of how much it would complicate the calling code, and whether the knowledge of what to do is available inside/outside, and who should be deciding. Probably no one-size-fits-all, and in this case the comment makes it sound like the caller should just be able to call it and get on with it. But maybe it speaks to a leaky abstraction as well. We've got lots of all these things :)

(not that I'm an arbiter of taste; I'm kinda just rambling in my lunch break heh).

jgottula commented 2 months ago

> Hey, that's great news! Once you've answered those questions, do you think you'll be able to add a test case?

I've also been thinking about how I'd go about adding a new test to the zfs test suite, to ensure that renaming a zvol which has snapdev=visible and one or more snapshots works as intended.

And... to be honest, I'm not really sure how I would go about doing this, since the non-intended/test-failing behavior is... well... a null pointer dereference in kernel space, after which the kernel kills off the z_zvol taskq kthread (and various zvol-related actions attempted after that point then hang indefinitely, and so on). But I think causing mayhem in the kernel is kind of out of scope for the test suite...?

Maybe there's a different way to go about doing effectively the same thing?

Or perhaps I've just done a really poor job of interpreting what was meant here... probably this is it. 😄

jgottula commented 2 months ago

> Hah, this is cool, and matches your analysis I think:

```c
int
dataset_kstats_create(dataset_kstats_t *dk, objset_t *objset)
{
        /*
         * There should not be anything wrong with having kstats for
         * snapshots. Since we are not sure how useful they would be
         * though nor how much their memory overhead would matter in
         * a filesystem with many snapshots, we skip them for now.
         */
        if (dmu_objset_is_snapshot(objset))
                return (0);
```

> So it's kind of deliberate I guess.

Aha! Okay, yeah, so that's definitely a piece of this.

And then the other, corresponding piece, the one actually responsible for the calls to dataset_kstats_rename not just for the zvol itself but also for its snapshots, is this: zvol_rename_minors_impl calls zvol_os_rename_minor for the zvol, as well as for each of its "children [not actually a thing...?] and snapshots"; and that function in turn calls dataset_kstats_rename unconditionally for each of them:

```c
/*
 * Rename minors for specified dataset including children and snapshots.
 */
static void
zvol_rename_minors_impl(const char *oldname, const char *newname)
{
    zvol_state_t *zv, *zv_next;
    int oldnamelen;

    if (zvol_inhibit_dev)
        return;

    oldnamelen = strlen(oldname);

    rw_enter(&zvol_state_lock, RW_READER);

    for (zv = list_head(&zvol_state_list); zv != NULL; zv = zv_next) {
        zv_next = list_next(&zvol_state_list, zv);

        mutex_enter(&zv->zv_state_lock);

        if (strcmp(zv->zv_name, oldname) == 0) {
            zvol_os_rename_minor(zv, newname);
        } else if (strncmp(zv->zv_name, oldname, oldnamelen) == 0 &&
            (zv->zv_name[oldnamelen] == '/' ||
            zv->zv_name[oldnamelen] == '@')) {
            char *name = kmem_asprintf("%s%c%s", newname,
                zv->zv_name[oldnamelen],
                zv->zv_name + oldnamelen + 1);
            zvol_os_rename_minor(zv, name);
            kmem_strfree(name);
        }

        mutex_exit(&zv->zv_state_lock);
    }

    rw_exit(&zvol_state_lock);
}
```

> More philosophically, I softly agree it would be nice to not make calls for things that it's nonsensical to call them on, but it's often a function of how much it would complicate the calling code, and whether the knowledge of what to do is available inside/outside, and who should be deciding. Probably no one-size-fits-all, and in this case the comment makes it sound like the caller should just be able to call it and get on with it.

Yeah, now that I see that code snippet (where it's clearly both deliberate to skip kstats for snapshots, and something that could change later), I'm inclined to agree that the null pointer check is enough; and that making zvol_os_rename_minor explicitly check whether the dataset is a snapshot, and explicitly not call into dataset_kstats_rename, is probably not the best approach. (Because what if, at some future point, snapshots do end up getting kstats after all, but nobody remembers to remove that separate check; then you'd still end up breaking things, just in a different way.) There's a sketch of that rejected alternative below.
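
For concreteness, the alternative being argued against would look roughly like this at the call site (a sketch only; zv_objset and zv_kstat are real zvol_state_t fields, but this exact check is illustrative, not proposed code):

```c
/* Rejected alternative: have the caller skip snapshots explicitly
 * instead of relying on dataset_kstats_rename's own NULL check. */
if (!dmu_objset_is_snapshot(zv->zv_objset))
	dataset_kstats_rename(&zv->zv_kstat, newname);
```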

asomers commented 2 months ago

@jgottula I reproduced the issue on FreeBSD 15.0-CURRENT using your same steps, and verified that your solution works.

jgottula commented 2 months ago

Okay, somehow it took me about two weeks to get around to submitting a PR for this tiny patch, but I've finally done it now.