Open lordrasmus opened 8 months ago
I think I'm seeing the same problem on one of my systems. The conditions for the oops are roughly the same: I'm trying to remove a vdev from a pool that contains an indirect vdev (two of them, technically), and upon zpool remove, this happens:
[ 1518.859606] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 1518.944079] #PF: supervisor read access in kernel mode
[ 1519.006652] #PF: error_code(0x0000) - not-present page
[ 1519.069215] PGD 0 P4D 0
[ 1519.100558] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1519.151659] CPU: 1 PID: 9607 Comm: zpool Tainted: P O 6.5.13-5-pve #1
[ 1519.246438] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.18.2 10/18/2023
[ 1519.338099] RIP: 0010:vdev_passivate+0x113/0x1a0 [zfs]
[ 1519.401100] Code: 00 00 00 31 d2 eb 09 48 83 c2 01 48 39 d7 74 3a 49 8b 0c d0 49 39 ce 74 ee 48 81 79 60 c0 17 b0 c0 74 e4 48 8b b1 98 2b 00 00 <48> 3b 86 88 00 00 00 75 d4 48 83 b9 d0 2c 00 00 00 0f 84 15 ff ff
[ 1519.628019] RSP: 0018:ffffc328cb37fc78 EFLAGS: 00010202
[ 1519.691612] RAX: ffff9ead402e7c00 RBX: ffff9ead558f4000 RCX: ffff9ead62f8c000
[ 1519.778085] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000005
[ 1519.864555] RBP: ffffc328cb37fca0 R08: ffff9eb156f24f00 R09: 0000000000000000
[ 1519.951014] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc328cb37fce8
[ 1520.037464] R13: ffff9ead63b84800 R14: ffff9ead62f7c000 R15: ffff9ead62f74000
[ 1520.123923] FS: 0000770e9a2c9800(0000) GS:ffff9ecc3fc40000(0000) knlGS:0000000000000000
[ 1520.221813] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1520.291603] CR2: 0000000000000088 CR3: 0000000bc9d20006 CR4: 00000000001706e0
[ 1520.378031] Call Trace:
[ 1520.408277] <TASK>
[ 1520.434348] ? show_regs+0x6d/0x80
[ 1520.476010] ? __die+0x24/0x80
[ 1520.513495] ? page_fault_oops+0x176/0x500
[ 1520.563455] ? vdev_passivate+0x113/0x1a0 [zfs]
[ 1520.619061] ? kernelmode_fixup_or_oops+0xb2/0x140
[ 1520.677315] ? __bad_area_nosemaphore+0x1a5/0x280
[ 1520.734521] ? bad_area_nosemaphore+0x16/0x30
[ 1520.787547] ? do_user_addr_fault+0x2c4/0x6a0
[ 1520.840567] ? exc_page_fault+0x83/0x1b0
[ 1520.888367] ? asm_exc_page_fault+0x27/0x30
[ 1520.939283] ? vdev_passivate+0x113/0x1a0 [zfs]
[ 1520.994773] ? vdev_passivate+0x32/0x1a0 [zfs]
[ 1521.049202] spa_vdev_remove+0x7f9/0x9b0 [zfs]
[ 1521.103609] ? spa_open_common+0x27f/0x450 [zfs]
[ 1521.160116] zfs_ioc_vdev_remove+0x5e/0xb0 [zfs]
[ 1521.216574] zfsdev_ioctl_common+0x8e1/0xa20 [zfs]
[ 1521.275153] ? __check_object_size+0x9d/0x300
[ 1521.328100] zfsdev_ioctl+0x57/0xf0 [zfs]
[ 1521.377275] __x64_sys_ioctl+0xa3/0xf0
[ 1521.422909] do_syscall_64+0x5b/0x90
[ 1521.466447] ? exit_to_user_mode_prepare+0x39/0x190
[ 1521.525576] ? syscall_exit_to_user_mode+0x37/0x60
[ 1521.583660] ? do_syscall_64+0x67/0x90
[ 1521.629240] ? exc_page_fault+0x94/0x1b0
[ 1521.676887] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 1521.738044] RIP: 0033:0x770e9aa33c5b
[ 1521.781533] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 1522.007770] RSP: 002b:00007ffe41dfd270 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1522.099122] RAX: ffffffffffffffda RBX: 00005b3088957830 RCX: 0000770e9aa33c5b
[ 1522.185273] RDX: 00007ffe41dfd2e0 RSI: 0000000000005a0c RDI: 0000000000000003
[ 1522.271427] RBP: 00007ffe41e00cd0 R08: 0000000000000000 R09: 0000000000000000
[ 1522.357574] R10: 0000000000000000 R11: 0000000000000246 R12: 00005b3088951298
[ 1522.443720] R13: 00007ffe41dfd2e0 R14: 00007ffe41e00890 R15: 00005b308894c4e0
[ 1522.529864] </TASK>
[ 1522.556688] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel 8021q garp mrp softdog msr sunrpc binfmt_misc nfnetlink_log bonding tls dm_crypt intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel dell_wmi sha256_ssse3 dell_smbios sha1_ssse3 aesni_intel ipmi_ssif dell_wmi_descriptor crypto_simd cryptd ledtrig_audio sparse_keymap mgag200 video drm_shmem_helper ipmi_si rapl drm_kms_helper dcdbas input_leds joydev mei_me ipmi_devintf pcspkr i2c_algo_bit intel_cstate mei mxm_wmi mac_hid ipmi_msghandler acpi_power_meter nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nft_redir vhost_net vhost vhost_iotlb tap nft_chain_nat nf_nat nf_conntrack nbd nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink efi_pstore drm dmi_sysfs ip_tables x_tables autofs4 zfs(PO)
[ 1522.556842] spl(O) raid10 raid456 hid_generic async_raid6_recov async_memcpy async_pq usbkbd usbmouse async_xor async_tx usbhid xor hid raid6_pq libcrc32c raid0 multipath linear simplefb raid1 xhci_pci xhci_pci_renesas ehci_pci tg3 lpc_ich crc32_pclmul xhci_hcd ehci_hcd ahci libahci megaraid_sas wmi
[ 1523.960768] CR2: 0000000000000088
[ 1524.001324] ---[ end trace 0000000000000000 ]---
[ 1524.133180] RIP: 0010:vdev_passivate+0x113/0x1a0 [zfs]
[ 1524.196050] Code: 00 00 00 31 d2 eb 09 48 83 c2 01 48 39 d7 74 3a 49 8b 0c d0 49 39 ce 74 ee 48 81 79 60 c0 17 b0 c0 74 e4 48 8b b1 98 2b 00 00 <48> 3b 86 88 00 00 00 75 d4 48 83 b9 d0 2c 00 00 00 0f 84 15 ff ff
[ 1524.422737] RSP: 0018:ffffc328cb37fc78 EFLAGS: 00010202
[ 1524.486243] RAX: ffff9ead402e7c00 RBX: ffff9ead558f4000 RCX: ffff9ead62f8c000
[ 1524.572667] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000005
[ 1524.659104] RBP: ffffc328cb37fca0 R08: ffff9eb156f24f00 R09: 0000000000000000
[ 1524.745499] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc328cb37fce8
[ 1524.831830] R13: ffff9ead63b84800 R14: ffff9ead62f7c000 R15: ffff9ead62f74000
[ 1524.918282] FS: 0000770e9a2c9800(0000) GS:ffff9ecc3fc40000(0000) knlGS:0000000000000000
[ 1525.016178] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1525.085972] CR2: 0000000000000088 CR3: 0000000bc9d20006 CR4: 00000000001706e0
My system is similar to OP's: Proxmox 8.1.10, kernel 6.5.13-5-pve, OpenZFS 2.2.3.
I'm guessing my only option besides waiting for a fix is to recreate the pool entirely to get rid of the indirect vdevs?
My issue https://github.com/openzfs/zfs/issues/16786 seems to be relevant/the same.
System information
Describe the problem you're observing
Removing a device leads to a kernel oops:
BUG: kernel NULL pointer dereference, address: 0000000000000088
Describe how to reproduce the problem
zpool remove zfs-pool wwn-0x50014ee6052e6cf1
Include any warning/errors/backtraces from the system logs
I added a printk to find out which pointer is NULL:
module/zfs/vdev_removal.c -> vdev_passivate()
And here is the output:
[ 44.868315] rvd->vdev_child[0] ffff98197f054000
[ 44.868325] cvd->vdev_mg 0000000000000000
So this is the line that crashes:
metaslab_class_t *mc = cvd->vdev_mg->mg_class;
I guess it's a bug that is triggered when indirect-0 exists in the pool?