openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Intel x86 32bit is broken with OpenZFS master #13597

Closed mcmilk closed 1 year ago

mcmilk commented 2 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version 11
Kernel Version 5.10.0-15-686-pae
Architecture i386
OpenZFS Version master (bisect)

Describe the problem you're observing

Loading the zfs module causes the following panic on my Debian 11 x32 box:

[  439.359994] flushbuffers-cr (912): drop_caches: 3
[  488.955771] spl: loading out-of-tree module taints kernel.
[  488.956030] spl: module verification failed: signature and/or required key missing - tainting kernel
[  489.139294] zfs: module license 'CDDL' taints kernel.
[  489.139307] Disabling lock debugging due to kernel taint
[  489.178938] VERIFY(zio_buf_cache[c] != NULL) failed
[  489.179026] PANIC at zio.c:225:zio_init()
[  489.179091] Showing stack for process 962
[  489.179097] CPU: 0 PID: 962 Comm: insmod Tainted: P           OE     5.10.0-15-686-pae #1 Debian 5.10.120-1
[  489.179097] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[  489.179103] Call Trace:
[  489.179121]  dump_stack+0x54/0x68
[  489.179125]  spl_dumpstack+0x23/0x27 [spl]
[  489.179127]  spl_panic.cold+0x5/0x43 [spl]
[  489.179137]  ? pcpu_block_update_hint_alloc+0x213/0x250
[  489.179139]  ? pcpu_alloc_area+0x185/0x250
[  489.179146]  ? find_next_bit+0xf/0x20
[  489.179147]  ? pcpu_free_area+0x1bf/0x2e0
[  489.179154]  ? __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
[  489.179157]  ? kfree+0x82/0x3c0
[  489.179160]  ? spl_kmem_cache_create+0x2f0/0x400 [spl]
[  489.179162]  ? spl_kmem_cache_create+0x2f0/0x400 [spl]
[  489.179258]  zio_init+0x27e/0x290 [zfs]
[  489.179312]  ? zstd_init+0x64/0x64 [zfs]
[  489.179389]  spa_init+0x101/0x160 [zfs]
[  489.179481]  zfs_kmod_init+0x2a/0xe0 [zfs]
[  489.179559]  openzfs_init_os+0xe/0x67 [zfs]
[  489.179610]  openzfs_init+0x2f/0xe28 [zfs]
[  489.179615]  do_one_initcall+0x41/0x1a0
[  489.179617]  ? kmem_cache_alloc_trace+0x119/0x250
[  489.179622]  ? do_init_module+0x21/0x230
[  489.179624]  do_init_module+0x43/0x230
[  489.179626]  load_module+0x2180/0x23b0
[  489.179628]  __ia32_sys_finit_module+0x99/0xf0
[  489.179632]  __do_fast_syscall_32+0x45/0x80
[  489.179634]  do_fast_syscall_32+0x29/0x60
[  489.179635]  do_SYSENTER_32+0x15/0x20
[  489.179638]  entry_SYSENTER_32+0x9f/0xf2
[  489.179640] EIP: 0xb7f4d559
[  489.179642] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 c
[  489.179643] EAX: ffffffda EBX: 00000003 ECX: 00451214 EDX: 00000000
[  489.179644] ESI: 01c9f240 EDI: 01c9f170 EBP: bfb9a514 ESP: bfb9a46c
[  489.179650] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292

Describe how to reproduce the problem

Just load the module...

I was able to bisect the problem to commit f2330bd1568489ae1fb16d975a5a9bcfe12ed219 by @rincebrain

Include any warning/errors/backtraces from the system logs

rincebrain commented 2 years ago

FWIW, if you revert the commit, at least in my experience, 32bit x86 is incapable of passing a ZTS run without a panic anyway, and has been for years.

I also remarked at the time that I thought this would break, but since it was already broken and the bugs I reported about it had been ignored, it seemed academic.

eta: that panic, if memory serves, is from it trying to allocate a kmem_cache for the zio caches, failing the allocation (because, surprise, contiguous 2/4/8/16M allocations are hard on 32-bit), and then tripping an assert on the NULL. Handling that would involve convincing the code that finding a NULL for a zio cache is fine, which is probably not that hard since it worked with the old recordsize... but I think 32-bit x86 will still be deeply broken.

mcmilk commented 2 years ago

ZFS v2.1.4 and v2.1.5 seem fine. I have no idea why x32 panics... I just did the bisect to help point in (maybe) the right direction.

As suggested, I reverted your patch, and now a BLAKE3-related oops shows up in dmesg instead:

[ 3764.681155] ZFS: Unloaded module v2.1.99-1125_g63b18e409 (DEBUG mode)
[ 3764.816262] BUG: kernel NULL pointer dereference, address: 00000000
[ 3764.816279] #PF: supervisor read access in kernel mode
[ 3764.816288] #PF: error_code(0x0000) - not-present page
[ 3764.816296] *pdpt = 0000000008024001 *pde = 0000000000000000 
[ 3764.816306] Oops: 0000 [#1] SMP NOPTI
[ 3764.816314] CPU: 3 PID: 6417 Comm: insmod Tainted: P           OE     5.10.0-15-686-pae #1 Debian 5.10.120-1
[ 3764.816329] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 3764.816346] EIP: spl_kmem_cache_alloc+0x22/0x270 [spl]
[ 3764.816355] Code: ff e8 62 06 af cc 66 90 0f 1f 44 00 00 55 89 e5 57 89 d7 56 89 d6 53 83 e6 fa 89 c3 89 f0 99 83 ec 08 09 d6 0f 85 ee 00 00 00 <81> 3b 2c 2c 2c 2c 0f 85 12 01 00 00 8d 53 2c b8 11 00 00 00 e8 d5
[ 3764.816380] EAX: 00000000 EBX: 00000000 ECX: fd5f54dc EDX: 00000000
[ 3764.816390] ESI: 00000000 EDI: 00000004 EBP: c904dd58 ESP: c904dd44
[ 3764.816399] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 3764.816412] CR0: 80050033 CR2: 00000000 CR3: 01310b40 CR4: 00350ef0
[ 3764.816421] Call Trace:
[ 3764.816479]  zio_data_buf_alloc+0x29/0x60 [zfs]
[ 3764.816529]  abd_alloc_linear+0x62/0xa0 [zfs]
[ 3764.816584]  chksum_benchit+0x24/0x110 [zfs]
[ 3764.816634]  chksum_benchmark+0x6f/0x1a0 [zfs]
[ 3764.816676]  ? blake3_per_cpu_ctx_init+0x51/0x70 [zfs]
[ 3764.816714]  ? zstd_init+0x64/0x64 [zfs]
[ 3764.816762]  chksum_init+0x12/0x80 [zfs]
[ 3764.816810]  spa_init+0x129/0x160 [zfs]
[ 3764.816858]  zfs_kmod_init+0x2a/0xe0 [zfs]
[ 3764.816913]  openzfs_init_os+0xe/0x67 [zfs]
[ 3764.816953]  openzfs_init+0x2f/0xe28 [zfs]
[ 3764.816962]  do_one_initcall+0x41/0x1a0
[ 3764.816971]  ? kmem_cache_alloc_trace+0x119/0x250
[ 3764.816980]  ? do_init_module+0x21/0x230
[ 3764.816987]  do_init_module+0x43/0x230
[ 3764.816994]  load_module+0x2180/0x23b0
[ 3764.817007]  __ia32_sys_finit_module+0x99/0xf0
[ 3764.817018]  __do_fast_syscall_32+0x45/0x80
[ 3764.817025]  do_fast_syscall_32+0x29/0x60
[ 3764.817032]  do_SYSENTER_32+0x15/0x20
[ 3764.817039]  entry_SYSENTER_32+0x9f/0xf2
[ 3764.817047] EIP: 0xb7fb9559
[ 3764.817052] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
[ 3764.817078] EAX: ffffffda EBX: 00000003 ECX: 00480214 EDX: 00000000
[ 3764.817088] ESI: 015fb240 EDI: 015fb170 EBP: bfaec804 ESP: bfaec75c
[ 3764.817097] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 3764.817108] Modules linked in: zfs(POE+) spl(OE) kvm_amd kvm irqbypass aesni_intel libaes crypto_simd cryptd bochs_drm drm_vram_helper drm_ttm_helper ttm evdev drm_kms_helper serio_raw pcspkr joydev virtio_balloon cec sg qemu_fw_cfg button nfsd auth_rpcgss nfs_acl lockd grace drm sunrpc fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid sd_mod t10_pi crc_t10dif crct10dif_generic crct10dif_common sr_mod cdrom virtio_net net_failover virtio_scsi failover ata_generic uhci_hcd ata_piix crc32_pclmul ehci_hcd crc32c_intel psmouse usbcore libata virtio_pci virtio_ring virtio i2c_piix4 usb_common scsi_mod floppy [last unloaded: spl]
[ 3764.819115] CR2: 0000000000000000
[ 3764.819442] ---[ end trace 8a3cb18f9f6ce383 ]---
[ 3764.819780] EIP: spl_kmem_cache_alloc+0x22/0x270 [spl]
[ 3764.820103] Code: ff e8 62 06 af cc 66 90 0f 1f 44 00 00 55 89 e5 57 89 d7 56 89 d6 53 83 e6 fa 89 c3 89 f0 99 83 ec 08 09 d6 0f 85 ee 00 00 00 <81> 3b 2c 2c 2c 2c 0f 85 12 01 00 00 8d 53 2c b8 11 00 00 00 e8 d5
[ 3764.820756] EAX: 00000000 EBX: 00000000 ECX: fd5f54dc EDX: 00000000
[ 3764.821085] ESI: 00000000 EDI: 00000004 EBP: c904dd58 ESP: c904dd44
[ 3764.821402] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 3764.821718] CR0: 80050033 CR2: 00000000 CR3: 01310b40 CR4: 00350ef0
rincebrain commented 2 years ago

I'll look later, but I would guess it's trying unconditionally to allocate from a zio cache that doesn't exist because it only initialized them up to the recordsize at startup.

mcmilk commented 2 years ago

The blake3 stuff could be fixed with this:

diff --git a/module/zfs/zfs_chksum.c b/module/zfs/zfs_chksum.c
index 3ebe08541..07e0ee28e 100644
--- a/module/zfs/zfs_chksum.c
+++ b/module/zfs/zfs_chksum.c
@@ -178,7 +178,7 @@ chksum_benchit(chksum_stat_t *cs)
        void *salt = &cs->salt.zcs_bytes;

        /* allocate test memory via default abd interface */
-       abd = abd_alloc_linear(1<<22, B_FALSE);
+       abd = abd_alloc_linear(1<<22, B_TRUE);
        memset(salt, 0, sizeof (cs->salt.zcs_bytes));
        if (cs->init) {
                ctx = cs->init(&cs->salt);

So that it uses zio_buf_alloc() instead of zio_data_buf_alloc()... But then it panics on some other thing... this would need some days of debugging to catch all of them :(

rincebrain commented 2 years ago

I mean, I'm happy to help fix this one, but I think it'll still be deeply broken after.

behlendorf commented 2 years ago

I'll look later, but I would guess it's trying unconditionally to allocate from a zio cache that doesn't exist because it only initialized them up to the recordsize at startup.

So that was the behavior prior to https://github.com/openzfs/zfs/commit/f2330bd1568489ae1fb16d975a5a9bcfe12ed219: we'd only initialize caches up to the maximum record size and then crash if we ever needed to allocate something larger. That was broken, but at least it shouldn't happen under normal operation unless creating or receiving a filesystem with blocks >1M.

My guess is we're now trying, and failing, to allocate those larger caches in zio_init(), so we trip that ASSERT immediately. Which is even more deeply broken. One possibly workable option would be to not have kmem_cache_create() try to immediately populate the cache magazines. At least that way we should be able to create the caches, even if allocating from them later will be slow.

rincebrain commented 2 years ago

Personally, I'd suggest just EOLing Linux/i686.

If it's not being tested by the CI, it's going to be broken eventually - and it has been. For a long time. I spent a day once trying to get my 32-bit x86 box to pass a ZTS run without panicking, to then test early abort on it, and did not succeed.

It might be workable on FBSD/i686, I understand they make very different tradeoffs there, or on kernels with a 2/2 split instead of the 3/1 split common distros ship. But I don't think it's going to fly reliably without a lot of work, and I don't know that anyone is motivated to do so.

mcmilk commented 2 years ago

But 2.1.4 and 2.1.5 seem to work... so it's really something broken on the master branch. I would give it a try... but don't expect me to find all the bugs ;-)

rincebrain commented 2 years ago

Sure, I agree that master is worse than 2.1.

I still would bet a good amount of money that 2.1 won't pass a test run. I'll let you know in a few hours after I run it.

amotin commented 2 years ago

@rincebrain FreeBSD i386 has used the 4/4 split model for several years now, so it is not as starved for KVA as it was with 3/1 before, so it may work, but I haven't heard any feedback, positive or negative, about it for a while. Curious to try myself just to know. ;)

behlendorf commented 2 years ago

Getting it 100% stable on Linux/i686 I'm sure would be a ton of work. However, fixing this one bug would probably get us back to the status quo of it being 95% functional on Linux/i686 which seems like more than enough to be useful. I'm pretty sure it won't pass a full test run either, but I'd also be curious to know how far it gets.

rincebrain commented 2 years ago

This one is probably not that hard to fix, and more correctness is generally always good, but I couldn't make a zfs recv not die last I tried - no large blocks, no 16M recordsize change, no zstd at all, just a single compressed recv with fletcher4. (Uncompressed also failed.)

I'm not just arguing this because it can't pass test runs on some edge cases - I'm arguing this because it can't do simple things that I would say are not remotely optional.

rincebrain commented 2 years ago

The answer to "how far will 2.1.5 get", by the way, on -r sanity is zfs_destroy_dev_removal.ksh.

So not very far.

behlendorf commented 2 years ago

Interesting, well at least it failed where we would have expected it to.

nabijaczleweli commented 2 years ago

(In re: https://github.com/openzfs/zfs/issues/13597#issuecomment-1167760898: x32 has worked fine ever since my port; this appears to be fake news, since the log is i686 PAE?)

mcmilk commented 1 year ago

We may really want to EOL OpenZFS for i386 (zfs 2.1.x and 2.2.x). Whenever I think I've fixed something... another thing seems to break :(