openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.65k stars 1.75k forks source link

kernel NULL pointer dereference `spl_kmem_cache_alloc+0x2c` #16109

Open arthurfabre opened 7 months ago

arthurfabre commented 7 months ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version Sid / Unstable
Kernel Version 6.6.13
Architecture amd64
OpenZFS Version 2.2.3-2

Describe the problem you're observing

After upgrading from OpenZFS 2.2.2-4 to OpenZFS 2.2.3-2, zpool import $poolname hangs forever, and kernel logs show a NULL pointer dereference.

Downgrading to OpenZFS 2.2.2-4 (by manually installing packages from Debian's archive) resolves the issue.

Describe how to reproduce the problem

I can consistently reproduce this with zpool import $poolname (after masking zfs.target on boot with systemd.mask=zfs.target to be able to boot).

Include any warning/errors/backtraces from the system logs

[   61.732041] BUG: kernel NULL pointer dereference, address: 0000000000000049
[   61.732062] #PF: supervisor read access in kernel mode
[   61.732071] #PF: error_code(0x0000) - not-present page
[   61.732080] PGD 0 P4D 0
[   61.732086] Oops: 0000 [#1] PREEMPT SMP PTI
[   61.732094] CPU: 4 PID: 1519 Comm: zpool Tainted: P           OE      6.6.13-amd64 #1  Debian 6.6.13-1
[   61.732107] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 04/04/2019
[   61.732117] RIP: 0010:spl_kmem_cache_alloc+0x2c/0x5c0 [spl]
[   61.732141] Code: 1e fa 0f 1f 44 00 00 41 57 41 56 49 89 fe 41 55 41 54 55 53 89 f3 48 83 ec 58 65 48 8b 04 25 28 00 00 00 48 89 44 24 50 31 c0 <f6> 47 49 01 74 77 41 89 f4 48 8b 6f 40 41 83 e4 01 0f 84 c9 00 00
[   61.732165] RSP: 0018:ffffb618d10a38c8 EFLAGS: 00010246
[   61.732173] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   61.732183] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[   61.732192] RBP: ffff9ebfa03f9000 R08: 0000000000000000 R09: 0000000000000001
[   61.732202] R10: 2000000000000000 R11: 0000008000000000 R12: ffffffffc12fe1e0
[   61.732211] R13: ffffb618d10a3a60 R14: 0000000000000000 R15: 00000000ffffffff
[   61.732222] FS:  00007f230bffc6c0(0000) GS:ffff9ec27ab00000(0000) knlGS:0000000000000000
[   61.732232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   61.732241] CR2: 0000000000000049 CR3: 00000001082ee001 CR4: 00000000001706e0
[   61.732251] Call Trace:
[   61.732257]  <TASK>
[   61.732263]  ? __die+0x23/0x70
[   61.732273]  ? page_fault_oops+0x171/0x4e0
[   61.732283]  ? exc_page_fault+0x7f/0x180
[   61.732293]  ? asm_exc_page_fault+0x26/0x30
[   61.732302]  ? spl_kmem_cache_alloc+0x2c/0x5c0 [spl]
[   61.732321]  ? select_task_rq_fair+0x6e1/0x1820
[   61.732331]  zfs_inode_alloc+0x1b/0x30 [zfs]
[   61.732535]  zpl_inode_alloc+0x2d/0x90 [zfs]
[   61.732692]  alloc_inode+0x1e/0xc0
[   61.732701]  new_inode+0x16/0xc0
[   61.732709]  zfs_znode_alloc+0xe2/0x7c0 [zfs]
[   61.732870]  zfs_zget+0x251/0x290 [zfs]
[   61.733027]  zfs_domount+0x3f9/0x600 [zfs]
[   61.733185]  ? dbuf_rele_and_unlock+0x1ea/0x4f0 [zfs]
[   61.733349]  zpl_mount+0x285/0x300 [zfs]
[   61.733503]  legacy_get_tree+0x2b/0x50
[   61.733513]  vfs_get_tree+0x29/0xf0
[   61.733520]  path_mount+0x4a3/0xae0
[   61.733527]  __x64_sys_mount+0x11a/0x150
[   61.733535]  do_syscall_64+0x60/0xc0
[   61.733766]  ? handle_mm_fault+0xa2/0x360
[   61.734020]  ? do_user_addr_fault+0x30f/0x660
[   61.734242]  ? exit_to_user_mode_prepare+0x40/0x1e0
[   61.734464]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   61.734691] RIP: 0033:0x7f230e465daa
[   61.734915] Code: 48 8b 0d 71 90 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3e 90 0c 00 f7 d8 64 89 01 48
[   61.735395] RSP: 002b:00007f230bff1748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[   61.735645] RAX: ffffffffffffffda RBX: 00007f230bff4ae0 RCX: 00007f230e465daa
[   61.735881] RDX: 00007f230e7466b0 RSI: 00007f230bff5b40 RDI: 0000563c372b7e70
[   61.736117] RBP: 00007f230bff3810 R08: 00007f230bff27c0 R09: 0000000000000001
[   61.736354] R10: 0000000000200000 R11: 0000000000000246 R12: 0000563c372b7e60
[   61.736588] R13: 00007f230bff5b40 R14: 0000563c372b7e70 R15: 00007f230bff27c0
[   61.736819]  </TASK>
[   61.737047] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables libcrc32c br_netfilter bridge stp llc overlay ipmi_ssif zfs(POE) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 spl(OE) aesni_intel mgag200 crypto_simd cryptd drm_shmem_helper rapl iTCO_wdt intel_pmc_bxt hpwdt drm_kms_helper acpi_ipmi intel_cstate joydev iTCO_vendor_support watchdog ipmi_si i2c_algo_bit hpilo intel_uncore sg ie31200_edac evdev ipmi_devintf acpi_power_meter ipmi_msghandler serio_raw pcspkr button binfmt_misc drm dm_mod loop configfs efi_pstore nfnetlink ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi crc64_rocksoft hid_generic crc64 crc_t10dif crct10dif_generic usbhid ata_generic hid mpt3sas xhci_pci ata_piix xhci_hcd libata raid_class crct10dif_pclmul uhci_hcd ehci_pci crct10dif_common
[   61.737094]  scsi_transport_sas ehci_hcd tg3 crc32_pclmul usbcore scsi_mod crc32c_intel psmouse libphy lpc_ich usb_common scsi_common
[   61.740022] CR2: 0000000000000049
[   61.740323] ---[ end trace 0000000000000000 ]---

Disassembly of spl.ko with:

xzcat (sudo modinfo -n spl) > /tmp/spl.ko
objdump -d /tmp/spl.ko | grep -A 150 '<spl_kmem_cache_alloc>:'
0000000000003240 <spl_kmem_cache_alloc>:
    3240:   f3 0f 1e fa             endbr64
    3244:   e8 00 00 00 00          call   3249 <spl_kmem_cache_alloc+0x9>
    3249:   41 57                   push   %r15
    324b:   41 56                   push   %r14
    324d:   49 89 fe                mov    %rdi,%r14
    3250:   41 55                   push   %r13
    3252:   41 54                   push   %r12
    3254:   55                      push   %rbp
    3255:   53                      push   %rbx
    3256:   89 f3                   mov    %esi,%ebx
    3258:   48 83 ec 58             sub    $0x58,%rsp
    325c:   65 48 8b 04 25 28 00    mov    %gs:0x28,%rax
    3263:   00 00
    3265:   48 89 44 24 50          mov    %rax,0x50(%rsp)
    326a:   31 c0                   xor    %eax,%eax
    326c:   f6 47 49 01             testb  $0x1,0x49(%rdi)
    3270:   74 77                   je     32e9 <spl_kmem_cache_alloc+0xa9>
    3272:   41 89 f4                mov    %esi,%r12d
    3275:   48 8b 6f 40             mov    0x40(%rdi),%rbp
    3279:   41 83 e4 01             and    $0x1,%r12d
    327d:   0f 84 c9 00 00 00       je     334c <spl_kmem_cache_alloc+0x10c>
    3283:   45 89 e5                mov    %r12d,%r13d
    3286:   be 20 28 05 00          mov    $0x52820,%esi
    328b:   89 f0                   mov    %esi,%eax
    328d:   48 89 ef                mov    %rbp,%rdi
    3290:   83 c8 20                or     $0x20,%eax
    3293:   f6 c3 04                test   $0x4,%bl
    3296:   0f 45 f0                cmovne %eax,%esi
    3299:   89 f0                   mov    %esi,%eax
    329b:   80 cc 01                or     $0x1,%ah
    329e:   f6 c7 10                test   $0x10,%bh
    32a1:   0f 45 f0                cmovne %eax,%esi
    32a4:   e8 00 00 00 00          call   32a9 <spl_kmem_cache_alloc+0x69>
    32a9:   49 89 c7                mov    %rax,%r15
    32ac:   48 85 c0                test   %rax,%rax
    32af:   0f 84 c6 04 00 00       je     377b <spl_kmem_cache_alloc+0x53b>
    32b5:   8b 15 00 00 00 00       mov    0x0(%rip),%edx        # 32bb <spl_kmem_cache_alloc+0x7b>
    32bb:   49 8d be 08 01 00 00    lea    0x108(%r14),%rdi
    32c2:   be 01 00 00 00          mov    $0x1,%esi
    32c7:   e8 00 00 00 00          call   32cc <spl_kmem_cache_alloc+0x8c>
    32cc:   49 8b 46 20             mov    0x20(%r14),%rax
    32d0:   48 85 c0                test   %rax,%rax
    32d3:   0f 84 9d 00 00 00       je     3376 <spl_kmem_cache_alloc+0x136>
    32d9:   49 8b 76 30             mov    0x30(%r14),%rsi
    32dd:   89 da                   mov    %ebx,%edx
    32df:   4c 89 ff                mov    %r15,%rdi
    32e2:   e8 00 00 00 00          call   32e7 <spl_kmem_cache_alloc+0xa7>
    32e7:   eb 39                   jmp    3322 <spl_kmem_cache_alloc+0xe2>
    32e9:   ff 15 00 00 00 00       call   *0x0(%rip)        # 32ef <spl_kmem_cache_alloc+0xaf>
    32ef:   49 8b 56 10             mov    0x10(%r14),%rdx
    32f3:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # 32fa <spl_kmem_cache_alloc+0xba>
    32fa:   48 98                   cltq
    32fc:   4c 8b 2c c2             mov    (%rdx,%rax,8),%r13
    3300:   41 8b 45 04             mov    0x4(%r13),%eax
    3304:   85 c0                   test   %eax,%eax
    3306:   74 74                   je     337c <spl_kmem_cache_alloc+0x13c>
    3308:   83 e8 01                sub    $0x1,%eax
    330b:   41 89 45 04             mov    %eax,0x4(%r13)
    330f:   4d 8b 7c c5 20          mov    0x20(%r13,%rax,8),%r15
    3314:   ff 15 00 00 00 00       call   *0x0(%rip)        # 331a <spl_kmem_cache_alloc+0xda>
    331a:   4d 85 ff                test   %r15,%r15
    331d:   75 ad                   jne    32cc <spl_kmem_cache_alloc+0x8c>
    331f:   45 31 ff                xor    %r15d,%r15d
    3322:   48 8b 44 24 50          mov    0x50(%rsp),%rax
    3327:   65 48 2b 04 25 28 00    sub    %gs:0x28,%rax
    332e:   00 00
    3330:   0f 85 c1 03 00 00       jne    36f7 <spl_kmem_cache_alloc+0x4b7>
    3336:   48 83 c4 58             add    $0x58,%rsp
    333a:   4c 89 f8                mov    %r15,%rax
    333d:   5b                      pop    %rbx
    333e:   5d                      pop    %rbp
    333f:   41 5c                   pop    %r12
    3341:   41 5d                   pop    %r13
    3343:   41 5e                   pop    %r14
    3345:   41 5f                   pop    %r15
    3347:   e9 00 00 00 00          jmp    334c <spl_kmem_cache_alloc+0x10c>
    334c:   45 31 ed                xor    %r13d,%r13d
    334f:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
    3356:   00 00
    3358:   8b 40 2c                mov    0x2c(%rax),%eax
    335b:   25 00 00 08 00          and    $0x80000,%eax
    3360:   83 f8 01                cmp    $0x1,%eax
    3363:   19 f6                   sbb    %esi,%esi
    3365:   81 e6 c0 00 00 00       and    $0xc0,%esi
    336b:   81 c6 00 2c 04 00       add    $0x42c00,%esi
    3371:   e9 15 ff ff ff          jmp    328b <spl_kmem_cache_alloc+0x4b>
    3376:   41 0f 18 0f             prefetcht0 (%r15)
    337a:   eb a6                   jmp    3322 <spl_kmem_cache_alloc+0xe2>
    337c:   41 8b 45 0c             mov    0xc(%r13),%eax
    3380:   45 8b 5d 08             mov    0x8(%r13),%r11d
    3384:   48 c7 44 24 20 00 00    movq   $0x0,0x20(%rsp)
    338b:   00 00
    338d:   45 2b 5d 04             sub    0x4(%r13),%r11d
    3391:   41 39 c3                cmp    %eax,%r11d
    3394:   41 0f 46 c3             cmovbe %r11d,%eax
    3398:   41 89 c4                mov    %eax,%r12d
    339b:   49 8d 86 a8 00 00 00    lea    0xa8(%r14),%rax
    33a2:   48 89 c7                mov    %rax,%rdi
    33a5:   48 89 44 24 18          mov    %rax,0x18(%rsp)
    33aa:   e8 00 00 00 00          call   33af <spl_kmem_cache_alloc+0x16f>
    33af:   89 d8                   mov    %ebx,%eax
    33b1:   83 e0 01                and    $0x1,%eax
    33b4:   c7 44 24 10 00 00 00    movl   $0x0,0x10(%rsp)
    33bb:   00
    33bc:   89 44 24 14             mov    %eax,0x14(%rsp)
    33c0:   49 8d 86 90 00 00 00    lea    0x90(%r14),%rax
    33c7:   48 89 44 24 08          mov    %rax,0x8(%rsp)
    33cc:   45 85 e4                test   %r12d,%r12d
    33cf:   0f 8e 06 03 00 00       jle    36db <spl_kmem_cache_alloc+0x49b>
    33d5:   49 8b 86 90 00 00 00    mov    0x90(%r14),%rax
    33dc:   48 8b 4c 24 08          mov    0x8(%rsp),%rcx
    33e1:   48 39 c8                cmp    %rcx,%rax
    33e4:   0f 84 dd 00 00 00       je     34c7 <spl_kmem_cache_alloc+0x287>
    33ea:   49 8b ae 90 00 00 00    mov    0x90(%r14),%rbp
    33f1:   8b 45 28                mov    0x28(%rbp),%eax
    33f4:   8b 55 f4                mov    -0xc(%rbp),%edx
    33f7:   39 d0                   cmp    %edx,%eax
    33f9:   0f 83 eb 02 00 00       jae    36ea <spl_kmem_cache_alloc+0x4aa>
    33ff:   41 83 ec 01             sub    $0x1,%r12d
    3403:   eb 1f                   jmp    3424 <spl_kmem_cache_alloc+0x1e4>
    3405:   49 8b 47 f0             mov    -0x10(%r15),%rax
    3409:   49 89 44 d5 20          mov    %rax,0x20(%r13,%rdx,8)
    340e:   8b 45 f4                mov    -0xc(%rbp),%eax
    3411:   39 45 28                cmp    %eax,0x28(%rbp)
    3414:   0f 83 10 02 00 00       jae    362a <spl_kmem_cache_alloc+0x3ea>
    341a:   41 83 ec 01             sub    $0x1,%r12d
    341e:   0f 82 b7 02 00 00       jb     36db <spl_kmem_cache_alloc+0x49b>
    3424:   83 44 24 10 01          addl   $0x1,0x10(%rsp)
    3429:   74 a1                   je     33cc <spl_kmem_cache_alloc+0x18c>
    342b:   41 8b 55 04             mov    0x4(%r13),%edx
    342f:   8d 42 01                lea    0x1(%rdx),%eax
    3432:   89 14 24                mov    %edx,(%rsp)
    3435:   41 89 45 04             mov    %eax,0x4(%r13)
    3439:   4c 8b 7d 10             mov    0x10(%rbp),%r15
    343d:   4c 89 ff                mov    %r15,%rdi
    3440:   e8 00 00 00 00          call   3445 <spl_kmem_cache_alloc+0x205>
    3445:   8b 14 24                mov    (%rsp),%edx
    3448:   84 c0                   test   %al,%al
    344a:   74 0e                   je     345a <spl_kmem_cache_alloc+0x21a>
    344c:   49 8b 37                mov    (%r15),%rsi
    344f:   49 8b 47 08             mov    0x8(%r15),%rax
    3453:   48 89 46 08             mov    %rax,0x8(%rsi)
    3457:   48 89 30                mov    %rsi,(%rax)
    345a:   4d 89 3f                mov    %r15,(%r15)
    345d:   4d 89 7f 08             mov    %r15,0x8(%r15)
    3461:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 3468 <spl_kmem_cache_alloc+0x228>
    3468:   83 45 28 01             addl   $0x1,0x28(%rbp)
    346c:   48 89 45 20             mov    %rax,0x20(%rbp)
    3470:   49 8b 86 00 01 00 00    mov    0x100(%r14),%rax
    3477:   48 83 c0 01             add    $0x1,%rax
rincebrain commented 7 months ago

That trace points next to no distance at all into the function, and that function hasn't churned at all since 2020, so I'm pretty sure your pointer maps to 1254 in:

1249            /*
1250             * Allocate directly from a Linux slab.  All optimizations are left
1251             * to the underlying cache we only need to guarantee that KM_SLEEP
1252             * callers will never fail.
1253             */
1254            if (skc->skc_flags & KMC_SLAB) {
1255                    struct kmem_cache *slc = skc->skc_linux_cache;
1256                    do {
1257                            obj = kmem_cache_alloc(slc, kmem_flags_convert(flags));
1258                    } while ((obj == NULL) && !(flags & KM_NOSLEEP));

which would require that the cache object passed in is NULL to break that way.

And if we look up the line in zfs_znode_alloc:

zp = kmem_cache_alloc(znode_cache, KM_SLEEP);

And znode_cache should be initialized in zfs_znode_init long before anyone ever wants zfs_znode_alloc, so either someone's calling that in a strange order, or the request to create that cache is failing, which I don't think should happen.

My blind guess would be, from that stack trace, doing the import is causing something to decide the pool is imported and trigger traditional mount -t zfs style calls, and it's racing the pool import and losing. (Other option would be that zpool import itself is doing the mounts, but I'd be somewhat surprised if that's breaking reliably and nobody else noticed it being broken yet.)

You could try doing zpool import -N [pool] and see if that fails the same way (since -N should preclude zpool import itself triggering any mounts...)

e: The last couple dozen lines of /proc/spl/kstat/zfs/dbgmsg would also potentially be informative for what it was doing.

rincebrain commented 7 months ago

Also, any module parameters set to non-default values? (I'm trying to speculate why this might be breaking for you specifically and not anyone else so far...)

arthurfabre commented 7 months ago

Thanks for looking into this!

any module parameters set to non-default values?

Winner winner chicken dinner! I completely forgot I had:

spl_kmem_cache_slab_limit=0

Removing it resolves the issue. I think setting it to 0 broke recently with 29ea6fa:

  1. Caller specifies KMC_KVMEM for spl_kmem_cache_create() (I think).
  2. size > spl_kmem_cache_slab_limit will always be true.
  3. spl_kmem_cache_create returns NULL.
  4. spl_kmem_cache_alloc is called with a NULL spl_kmem_cache_t *.

That seems a bit foot-gunny, I think if spl_kmem_cache_slab_limit is set lower than what any KMC_KVMEM cache tries to allocate this issue will happen.

But I don't have an actual reason to set it to 0, I think it's leftover debugging from #11574. Happy to close this from my end.

Edit: removed my completely wrong earlier interpretation.