openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

general protection fault in `nvt_lookup_name_type`, NULL pointer dereference in `abd_iterate_func` #16045

Closed lnicola closed 7 months ago

lnicola commented 7 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Arch Linux |
| Distribution Version | N/A |
| Kernel Version | 6.8.2-arch2-1 |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.2.99-398_g39be46f43f |

Describe the problem you're observing

An error message appears in the logs every two minutes. The previous build, 2024.02.26.r9034.g8f2f6cd2ac_6.8.1.arch1.1-1, was fine.

Describe how to reproduce the problem

Don't know, I just upgraded and rebooted.

Include any warning/errors/backtraces from the system logs

[Mar31 14:21] general protection fault, probably for non-canonical address 0x380023ce000085d9: 0000 [#519] PREEMPT SMP NOPTI
[  +0.000022] CPU: 2 PID: 17603 Comm: SDMAIN Tainted: P      D    OE      6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
[  +0.000013] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q1900DC-ITX, BIOS P1.70 03/01/2018
[  +0.000008] RIP: 0010:strcmp+0x10/0x30
[  +0.000012] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14 07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c0 c3
[  +0.000013] RSP: 0018:ffffc0a2e0947a90 EFLAGS: 00010246
[  +0.000008] RAX: 0000000000000000 RBX: ffffffffc0834265 RCX: 000000006e229ac3
[  +0.000007] RDX: 000000000e229ac3 RSI: ffffffffc0834265 RDI: 380023ce000085d9
[  +0.000007] RBP: 000000000000000a R08: 0000000000000000 R09: ffffffffab122ae0
[  +0.000006] R10: 0000000000000001 R11: 0000000000000002 R12: 380023ce000085b1
[  +0.000007] R13: ffffffffc0834265 R14: ffff9fc81c0b6fc0 R15: ffff9fc81c0b6fe0
[  +0.000007] FS:  00007f59716ae5c0(0000) GS:ffff9fcb2fd00000(0000) knlGS:0000000000000000
[  +0.000008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000006] CR2: 00007f596d58f240 CR3: 0000000184fba000 CR4: 00000000001006f0
[  +0.000008] Call Trace:
[  +0.000006]  <TASK>
[  +0.000008]  ? die_addr+0x36/0x90
[  +0.000011]  ? exc_general_protection+0x1dd/0x450
[  +0.000013]  ? asm_exc_general_protection+0x26/0x30
[  +0.000015]  ? strcmp+0x10/0x30
[  +0.000008]  nvt_lookup_name_type.isra.0+0x6d/0xb0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000484]  nvlist_lookup_byte_array+0x41/0xb0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000467]  zpl_xattr_get_sa+0xb1/0x150 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000464]  zpl_xattr_get+0x15f/0x1f0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000473]  zpl_get_acl_impl+0x3b/0xf0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000463]  __get_acl.part.0+0xc9/0x150
[  +0.000012]  generic_permission+0x1a9/0x220
[  +0.000010]  ? inode_permission+0x3d/0x190
[  +0.000010]  inode_permission+0x3d/0x190
[  +0.000009]  may_open+0x7b/0x140
[  +0.000010]  path_openat+0x9ba/0x1190
[  +0.000012]  do_filp_open+0xb3/0x160
[  +0.000014]  do_sys_openat2+0xab/0xe0
[  +0.000011]  __x64_sys_openat+0x57/0xa0
[  +0.000009]  do_syscall_64+0x86/0x170
[  +0.000010]  ? syscall_exit_to_user_mode+0x80/0x230
[  +0.000009]  ? do_syscall_64+0x96/0x170
[  +0.000008]  ? do_syscall_64+0x96/0x170
[  +0.000007]  ? syscall_exit_to_user_mode+0x80/0x230
[  +0.000009]  ? do_syscall_64+0x96/0x170
[  +0.000007]  ? exc_page_fault+0x7f/0x180
[  +0.000008]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  +0.000010] RIP: 0033:0x7f5971a56dc0
[  +0.000041] Code: 48 89 44 24 20 75 94 44 89 54 24 0c e8 d9 c9 f8 ff 44 8b 54 24 0c 89 da 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 89 44 24 0c e8 2c ca f8 ff 8b 44
[  +0.000014] RSP: 002b:00007ffdf93934b0 EFLAGS: 00000297 ORIG_RAX: 0000000000000101
[  +0.000010] RAX: ffffffffffffffda RBX: 0000000000080800 RCX: 00007f5971a56dc0
[  +0.000007] RDX: 0000000000080800 RSI: 000064f742bcbb18 RDI: 00000000ffffff9c
[  +0.000007] RBP: 000064f742bcbb18 R08: 0000000000000001 R09: 00007f5971b34ad0
[  +0.000007] R10: 0000000000000000 R11: 0000000000000297 R12: 00000000ffffffff
[  +0.000006] R13: 00007ffdf9393900 R14: 00007ffdf93935a0 R15: 000064f742bb0f90
[  +0.000012]  </TASK>
[  +0.000005] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_ct xt_tcpudp iptable_filter iptable_nat nf_nat nf_conntrack nf_tables nf_defrag_ipv6 nf_defrag_ipv4 nct6775 libcrc32c nct6775_core crc32c_generic hwmon_vid intel_rapl_msr intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul iwlmvm crc32c_intel i915 spi_nor mac80211 polyval_generic mtd gf128mul libarc4 ptp ghash_clmulni_intel pps_core cryptd at24 drm_buddy sha512_ssse3 mei_pxp iTCO_wdt i2c_algo_bit mei_hdcp r8169 sha256_ssse3 hci_uart ppdev realtek btusb ttm spi_intel_platform btqca iwlwifi intel_pmc_bxt btrtl spi_intel btintel iTCO_vendor_support vfat drm_display_helper sha1_ssse3 fat i2c_i801 mdio_devres mei_txe btbcm cec btmtk intel_cstate hid_generic cdc_acm cfg80211 pcspkr libphy lpc_ich i2c_smbus intel_gtt parport_pc mei bluetooth video ecdh_generic parport rfkill i2c_hid_acpi wmi
[  +0.000135]  crc16 i2c_hid pwm_lpss_platform pwm_lpss mac_hid crypto_user fuse loop dm_mod nfnetlink ip_tables x_tables usbhid zfs(POE) spl(OE) xhci_pci xhci_pci_renesas
[  +0.000118] ---[ end trace 0000000000000000 ]---
[  +0.000012] RIP: 0010:strcmp+0x10/0x30
[  +0.000010] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 31 c0 eb 08 48 83 c0 01 84 d2 74 13 <0f> b6 14 07 3a 14 06 74 ef 19 c0 83 c8 01 c3 cc cc cc cc 31 c0 c3
[  +0.000013] RSP: 0018:ffffc0a2c8cbbb00 EFLAGS: 00010246
[  +0.000007] RAX: 0000000000000000 RBX: ffffffffc0834265 RCX: 000000006e229ac3
[  +0.000007] RDX: 000000000e229ac3 RSI: ffffffffc0834265 RDI: 380023ce000085d9
[  +0.000006] RBP: 000000000000000a R08: 0000000000000000 R09: ffffffffab122ae0
[  +0.000007] R10: 0000000000000001 R11: 0000000000000002 R12: 380023ce000085b1
[  +0.000006] R13: ffffffffc0834265 R14: ffff9fc81c0b6fc0 R15: ffff9fc81c0b6fe0
[  +0.000007] FS:  00007f59716ae5c0(0000) GS:ffff9fcb2fd00000(0000) knlGS:0000000000000000
[  +0.000009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000006] CR2: 00007f596d58f240 CR3: 0000000184fba000 CR4: 00000000001006f0

robn commented 7 months ago

@lnicola thanks for the report.

What action were you taking when this happened? From the log, the process is SDMAIN, and it's opening a file. Do you know anything more about the file in question? And if you do, does it have any extended attributes?

When you say it was working before (at 8f2f6cd2ac), do you know if this same action was being taken then?


Initial analysis:

strcmp will dereference %rdi:

0f b6 14 07            movzbl (%rdi,%rax,1),%edx

So yes, it's probably junk, as the crash suggests.
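
To make the "non-canonical" part concrete, here is a minimal C sketch (not kernel code) that evaluates the operand address of that movzbl and checks whether it is canonical. It assumes 4-level paging with 48-bit virtual addresses; the helper names effective_addr and is_canonical48 are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Effective address of an x86 memory operand written (base,index,scale),
 * as in "movzbl (%rdi,%rax,1),%edx". Hypothetical helper for illustration.
 */
static uint64_t
effective_addr(uint64_t base, uint64_t index, unsigned scale)
{
	/* x86 only encodes these four scale factors. */
	assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
	return (base + index * scale);
}

/*
 * Assumption: 48-bit virtual addresses (4-level paging). An address is
 * canonical only if bits 63..47 are all zero or all one.
 */
static bool
is_canonical48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 high bits */

	return (top == 0 || top == 0x1ffff);
}
```

With the values from the dump (RDI = 0x380023ce000085d9, RAX = 0), effective_addr returns RDI unchanged and is_canonical48 rejects it, while the module address in RSI (0xffffffffc0834265) passes.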

nvt_lookup_name_type loads %rdi like this:

49 8d 7c 24 28           lea    0x28(%r12),%rdi

That's the NVP_NAME lookup. i_nvp_t +0x10 is the start of the embedded nvpair_t; its nvp_name field is +0x18 beyond that.

        for (i_nvp_t *e = entry; e != NULL; e = e->nvi_hashtable_next) {
                if (strcmp(NVP_NAME(&e->nvi_nvp), name) == 0 &&

%r12 also looks similarly un-pointer-ish in the crash dump (very close to %rdi), so likely that's e at fault. It's probably the first pass through the loop, unless the chain of nvi_hashtable_next pointers is corrupt, but that feels less likely to me.
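
The relationship between the two registers can be checked with plain arithmetic. This is a small C sketch, assuming only the 0x28 displacement from the lea above (0x10 + 0x18); the names NAME_DISPLACEMENT, name_addr, entry_from_name_addr and is_aligned8 are hypothetical and not part of the nvpair code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Combined displacement in "lea 0x28(%r12),%rdi": 0x10 + 0x18. */
#define	NAME_DISPLACEMENT	(0x10 + 0x18)

/* Address the lea computes for the name, given the entry pointer e. */
static uint64_t
name_addr(uint64_t e)
{
	return (e + NAME_DISPLACEMENT);
}

/* Recover the entry pointer from the address handed to strcmp(). */
static uint64_t
entry_from_name_addr(uint64_t name)
{
	assert(name >= NAME_DISPLACEMENT);
	return (name - NAME_DISPLACEMENT);
}

/* Heuristic only: structures that hold pointers are normally 8-byte aligned. */
static bool
is_aligned8(uint64_t addr)
{
	return ((addr & 7) == 0);
}
```

Feeding in R12 = 0x380023ce000085b1 reproduces RDI = 0x380023ce000085d9. The value is also odd, which a pointer to a structure holding pointers normally would not be, though that is only a heuristic and not proof of anything.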

Regardless, entry comes straight out of the nvlist hashtable.

That nvlist is zp->z_xattr_cached, loaded straight up out of the SA area in zpl_xattr_get_sa(). So that suggests there's garbage on disk, or we made a mess of reading it?


"It worked fine" suggests a recent break. Nothing in this list of commits is obviously implicated, unless it's something coincidental in the myriad page reading changes. I think that's unlikely though if this is a very reliable failure.

39be46f43 evansr    2024-03-29  Linux 5.18+ compat: Detect filemap_range_has_page
2553f94c4 evansr    2024-03-29  Fix buffer underflow if sysfs file is empty
cfb96c772 rob.nor.. 2024-03-29  vdev_disk: clean up spa/bdev mode conversion
c0aab8b8f f.gruen.. 2024-03-29  zvols: prevent overflow of minor device numbers
b1e46f869 george... 2024-03-29  Add ashift validation when adding devices to a pool
e39e20b6d evansr    2024-03-27  ZTS: fix flakiness in cp_files_002_pos
0c8eb974f mav       2024-03-27  BRT: Check pool clone stats in more tests
b40342762 mav       2024-03-27  BRT: Fix tests to work on non-empty pools
a89d209bb mav       2024-03-27  BRT: Fix holes cloning.
8cd8ccca5 mav       2024-03-25  BRT: Skip getting length in brt_entry_lookup()
c6be6ce17 rob.nor.. 2024-03-25  abd_iter_page: don't use compound heads on Linux <4.5
72fd834c4 rob.nor.. 2024-03-25  vdev_disk: use bio_chain() to submit multiple BIOs
df2169d14 rob.nor.. 2024-03-25  vdev_disk: add module parameter to select BIO submission method
06a196020 rob.nor.. 2024-03-25  vdev_disk: rewrite BIO filling machinery to avoid split pages
c4a13ba48 rob.nor.. 2024-03-25  vdev_disk: make read/write IO function configurable
867178ae1 rob.nor.. 2024-03-25  vdev_disk: reorganise vdev_disk_io_start
f3b85d706 rob.nor.. 2024-03-25  vdev_disk: rename existing functions to vdev_classic_*
390b44872 rob.nor.. 2024-03-25  abd: add page iterator
df04efe32 rob.nor.. 2024-03-25  linux 5.4 compat: page_size()
f68bde723 mav       2024-03-25  BRT: Make BRT block sizes configurable
493fcce9b george... 2024-03-25  Provide macros for setting and getting blkptr birth times
4616b96a6 mav       2024-03-25  BRT: Relax brt_pending_apply() locking
80cc51629 mav       2024-03-25  ZAP: Massively switch to _by_dnode() interfaces
bf8f72359 mav       2024-03-25  BRT: Skip duplicate BRT prefetches
102b468b5 rrevans   2024-03-25  Fix corruption caused by mmap flushing problems
c28f94f32 mav       2024-03-21  ZAP: Some cleanups/micro-optimizations
f1b368359 f.gruen.. 2024-03-21  udev: correctly handle partition #16 and later
2c01cae8b mav       2024-03-21  BRT: Change brt_pending_tree sorting order
5c4a4f82c rob.nor.. 2024-03-21  zio: update ZIO type x stage documentation
c9d8f6c59 harr1     2024-03-21  Fix option string, adding -e and fixing order
45e23abed mav       2024-03-20  Update resume token at object receive.
ef08a4d40 robn      2024-03-20  Linux 6.8 compat: use splice_copy_file_range() for fallback
90ff73235 robn      2024-03-20  freebsd: fix missing headers in distribution tarball

However, just before that was 7a2e54b7d, which changed the test for the inode .permission member. I can't fathom how it could actually be related, but I see two words the same and I wonder.

That's all I have for now.

lnicola commented 7 months ago

> From the log, the process is SDMAIN, and it's opening a file. Do you know any more about the file in question?

That's a good question. I don't know what that process is, and the PID changes every time. I tried execsnoop, but it's no longer crashing.

I do have a couple of these though:

Mar 31 2024 15:57:15.781944214 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xe264c66920e00801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x53773a861fbf8297
                vdev = 0x247c7a23b0318d42
        (end detector)
        pool = "smart"
        pool_guid = 0x53773a861fbf8297
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        vdev_guid = 0x247c7a23b0318d42
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2"
        vdev_ashift = 0xd
        vdev_complete_ts = 0x3e44013f9a06
        vdev_delta_ts = 0x2d3204
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        vdev_delays = 0x0
        parent_guid = 0x53773a861fbf8297
        parent_type = "root"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x0
        zio_flags = 0x180080
        zio_stage = 0x1000000
        zio_pipeline = 0x3e00000
        zio_delay = 0x3e8b0
        zio_timestamp = 0x3ddddb3fdfce
        zio_delta = 0x3ee44
        zio_priority = 0x0
        zio_offset = 0x1c35b88000
        zio_size = 0x6000
        zio_objset = 0x86
        zio_object = 0x5eb5c
        zio_level = 0x0
        zio_blkid = 0x0
        time = 0x66095dab 0x2e9b8596 
        eid = 0x45

Mar 31 2024 15:57:29.435268269 ereport.fs.zfs.deadman
        class = "ereport.fs.zfs.deadman"
        ena = 0xe297a3849ba00801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x53773a861fbf8297
                vdev = 0x247c7a23b0318d42
        (end detector)
        pool = "smart"
        pool_guid = 0x53773a861fbf8297
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "continue"
        vdev_guid = 0x247c7a23b0318d42
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-id/ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2"
        vdev_ashift = 0xd
        vdev_complete_ts = 0x3e44013f9a06
        vdev_delta_ts = 0x2d3204
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        vdev_delays = 0x0
        parent_guid = 0x53773a861fbf8297
        parent_type = "root"
        vdev_spare_paths = 
        vdev_spare_guids = 
        zio_err = 0x0
        zio_flags = 0x180080
        zio_stage = 0x2000000
        zio_pipeline = 0x3e00000
        zio_delay = 0x45dd4
        zio_timestamp = 0x3de2bd5aa68a
        zio_delta = 0x468cc
        zio_priority = 0x0
        zio_offset = 0x1889e68000
        zio_size = 0x4000
        zio_objset = 0x86
        zio_object = 0x39d7a
        zio_level = 0x0
        zio_blkid = 0x0
        time = 0x66095db9 0x19f1aaad 
        eid = 0x46

Looks like they might be caused by the crashes I noticed in the kernel log.
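
For anyone reading the hex fields in those reports, here is a minimal C sketch of how a few of them decode. It assumes that vdev_ashift is the log2 of the sector size and that the two words of the time field are seconds and nanoseconds since the epoch; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sector size implied by an ashift value (assumed: log2 of the size). */
static uint64_t
ashift_to_bytes(unsigned ashift)
{
	assert(ashift < 64);
	return ((uint64_t)1 << ashift);
}

/*
 * Combine the two words of the "time" field into nanoseconds since the
 * epoch (assumed layout: seconds first, then nanoseconds).
 */
static uint64_t
ereport_time_ns(uint64_t sec, uint64_t nsec)
{
	assert(nsec < 1000000000ULL);
	return (sec * 1000000000ULL + nsec);
}
```

With the first report's values, vdev_ashift = 0xd gives 8192-byte sectors, so zio_size = 0x6000 is three sectors. time = 0x66095dab 0x2e9b8596 is 1711889835 seconds plus 781944214 nanoseconds, which is 12:57:15.781944214 UTC; that matches the 15:57:15.781944214 in the header if the machine's local time is UTC+3.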

lnicola commented 7 months ago

Another one after a forced reboot:

[  +0.620587] BUG: kernel NULL pointer dereference, address: 000000000000000c
[  +0.000022] #PF: supervisor read access in kernel mode
[  +0.000007] #PF: error_code(0x0000) - not-present page
[  +0.000006] PGD 0 P4D 0 
[  +0.000009] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.000007] CPU: 1 PID: 342 Comm: z_rd_int Tainted: P           OE      6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
[  +0.000012] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q1900DC-ITX, BIOS P1.70 03/01/2018
[  +0.000008] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[  +0.000566] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[  +0.000014] RSP: 0000:ffffa34d83a7f9d8 EFLAGS: 00010206
[  +0.000009] RAX: ffffa34d83a7fa00 RBX: 0000000000010000 RCX: ffff90809fc3e660
[  +0.000007] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[  +0.000007] RBP: ffffa34d83a7fa00 R08: 0000000000000000 R09: 000082731e4c7523
[  +0.000006] R10: 0000006a9d326139 R11: 0050999bdb7e1e7c R12: ffff90809fc3e660
[  +0.000007] R13: ffffffffc03688b0 R14: 0000000000002000 R15: 0000000000000000
[  +0.000007] FS:  0000000000000000(0000) GS:ffff9083afc80000(0000) knlGS:0000000000000000
[  +0.000008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000006] CR2: 000000000000000c CR3: 0000000121dc4000 CR4: 00000000001006f0
[  +0.000007] Call Trace:
[  +0.000010]  <TASK>
[  +0.000011]  ? __die+0x23/0x70
[  +0.000014]  ? page_fault_oops+0x171/0x4e0
[  +0.000012]  ? exc_page_fault+0x7f/0x180
[  +0.000010]  ? asm_exc_page_fault+0x26/0x30
[  +0.000009]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000530]  ? abd_iter_map+0x42/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000525]  abd_iterate_func+0xd5/0x1a0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000565]  abd_fletcher_4_native+0x7f/0xc0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000543]  zio_checksum_error_impl+0x17b/0x6c0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000539]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000507]  ? psi_task_switch+0x122/0x230
[  +0.000014]  ? vdev_queue_io_done+0x21a/0x260 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000511]  zio_checksum_error+0x68/0xd0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000486]  zio_checksum_verify+0x4b/0x150 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000514]  zio_execute+0x84/0x120 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000514]  taskq_thread+0x2b4/0x570 [spl 21e6da6a41c300ca9f75ec94bfccd26c3074bd7f]
[  +0.000039]  ? __pfx_default_wake_function+0x10/0x10
[  +0.000013]  ? __pfx_zio_execute+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[  +0.000541]  ? __pfx_taskq_thread+0x10/0x10 [spl 21e6da6a41c300ca9f75ec94bfccd26c3074bd7f]
[  +0.000037]  kthread+0xe5/0x120
[  +0.000013]  ? __pfx_kthread+0x10/0x10
[  +0.000010]  ret_from_fork+0x31/0x50
[  +0.000010]  ? __pfx_kthread+0x10/0x10
[  +0.000008]  ret_from_fork_asm+0x1b/0x30
[  +0.000012]  </TASK>
[  +0.000005] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel xt_tcpudp nft_ct iptable_filter iptable_nat nf_nat nf_conntrack nf_tables nf_defrag_ipv6 nf_defrag_ipv4 nct6775 libcrc32c nct6775_core crc32c_generic hwmon_vid intel_rapl_msr intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp spi_nor iwlmvm crct10dif_pclmul mtd crc32_pclmul crc32c_intel mac80211 polyval_generic iTCO_wdt at24 mei_pxp mei_hdcp spi_intel_platform gf128mul libarc4 ptp i915 spi_intel intel_pmc_bxt pps_core iTCO_vendor_support ghash_clmulni_intel cryptd ppdev sha512_ssse3 r8169 sha256_ssse3 iwlwifi hci_uart btqca realtek sha1_ssse3 drm_buddy btusb btrtl i2c_algo_bit hid_generic mei_txe ttm vfat i2c_i801 mdio_devres btintel fat cfg80211 cdc_acm btbcm intel_cstate lpc_ich btmtk i2c_smbus mei libphy pcspkr drm_display_helper bluetooth cec intel_gtt i2c_hid_acpi parport_pc ecdh_generic video parport rfkill wmi
[  +0.000137]  i2c_hid pwm_lpss_platform crc16 pwm_lpss mac_hid crypto_user loop fuse dm_mod nfnetlink ip_tables x_tables usbhid zfs(POE) spl(OE) xhci_pci xhci_pci_renesas
[  +0.000070] CR2: 000000000000000c
[  +0.000007] ---[ end trace 0000000000000000 ]---
[  +0.000006] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[  +0.000550] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[  +0.000014] RSP: 0000:ffffa34d83a7f9d8 EFLAGS: 00010206
[  +0.000008] RAX: ffffa34d83a7fa00 RBX: 0000000000010000 RCX: ffff90809fc3e660
[  +0.000006] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[  +0.000007] RBP: ffffa34d83a7fa00 R08: 0000000000000000 R09: 000082731e4c7523
[  +0.000006] R10: 0000006a9d326139 R11: 0050999bdb7e1e7c R12: ffff90809fc3e660
[  +0.000007] R13: ffffffffc03688b0 R14: 0000000000002000 R15: 0000000000000000
[  +0.000007] FS:  0000000000000000(0000) GS:ffff9083afc80000(0000) knlGS:0000000000000000
[  +0.000008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000006] CR2: 000000000000000c CR3: 0000000121dc4000 CR4: 00000000001006f0
[  +0.000007] note: z_rd_int[342] exited with irqs disabled

But no errors in zpool status:

  pool: smart
 state: ONLINE
  scan: scrub repaired 0B in 00:02:36 with 0 errors on Mon Mar 25 00:36:03 2024
config:

    NAME                                                   STATE     READ WRITE CKSUM
    smart                                                  ONLINE       0     0     0
      ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 03:01:03 with 0 errors on Mon Mar 25 03:47:37 2024
  scan: resilvered (mirror-0) 1.02T in 02:44:52 with 0 errors on Mon Feb  7 23:42:58 2022
config:

    NAME                                                   STATE     READ WRITE CKSUM
    tank                                                   ONLINE       0     0     0
      mirror-0                                             ONLINE       0     0     0
        ata-WDC_WD40EFZX-68AWUN0_WD-WXA2DA1C9SU5           ONLINE       0     0     0
        ata-WDC_WD20EFAX-68FB5N0_WD-WXQ1AC8E8D2F           ONLINE       0     0     0
    logs    
      ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part3  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part4  ONLINE       0     0     0

errors: No known data errors

Feels like I should be taking a backup.

EDIT: SDMAIN is netdata reading the systemd journal.

lnicola commented 7 months ago

Could this be some kind of #15911?

Another one:

[ 4028.235480] BUG: kernel NULL pointer dereference, address: 000000000000000c
[ 4028.235507] #PF: supervisor read access in kernel mode
[ 4028.235520] #PF: error_code(0x0000) - not-present page
[ 4028.235531] PGD 0 P4D 0 
[ 4028.235545] Oops: 0000 [#3] PREEMPT SMP NOPTI
[ 4028.235558] CPU: 0 PID: 3054 Comm: less Tainted: P      D W  OE      6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
[ 4028.235578] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q1900DC-ITX, BIOS P1.70 03/01/2018
[ 4028.235592] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ 4028.236404] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ 4028.236426] RSP: 0000:ffffb4f7893ab6f0 EFLAGS: 00010206
[ 4028.236441] RAX: ffffb4f7893ab718 RBX: 0000000000010000 RCX: ffff8a0edd826420
[ 4028.236454] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000004000
[ 4028.236466] RBP: ffffb4f7893ab718 R08: 0000000000000000 R09: 0000000000000000
[ 4028.236477] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0edd826420
[ 4028.236489] R13: ffffffffc044aaa0 R14: 0000000000004000 R15: 0000000000000000
[ 4028.236501] FS:  000076e754d2a740(0000) GS:ffff8a11efc00000(0000) knlGS:0000000000000000
[ 4028.236515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4028.236527] CR2: 000000000000000c CR3: 000000017ea62000 CR4: 00000000001006f0
[ 4028.236539] Call Trace:
[ 4028.236552]  <TASK>
[ 4028.236567]  ? __die+0x23/0x70
[ 4028.236587]  ? page_fault_oops+0x171/0x4e0
[ 4028.236607]  ? exc_page_fault+0x7f/0x180
[ 4028.236623]  ? asm_exc_page_fault+0x26/0x30
[ 4028.236637]  ? __pfx_abd_copy_to_buf_off_cb+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.237482]  ? abd_iter_map+0x42/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.238280]  abd_iterate_func+0xd5/0x1a0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.239103]  abd_borrow_buf_copy+0x7b/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.239929]  zio_decompress_data+0x35/0x80 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.240742]  arc_buf_fill+0x10c/0xc80 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.241571]  ? arc_buf_alloc_impl.isra.0+0x155/0x310 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.242398]  arc_read+0x14c5/0x16e0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.243243]  ? __pfx_dbuf_read_done+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.244080]  ? dbuf_read_impl.constprop.0+0x50e/0x870 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.244919]  dbuf_read_impl.constprop.0+0x50e/0x870 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.245768]  dbuf_read+0xf5/0x5f0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.246604]  dmu_buf_hold_array_by_dnode+0x11d/0x690 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.247454]  dmu_read_impl+0xb9/0x1d0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.248305]  dmu_read+0x63/0xa0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.249152]  zfs_fillpage+0x89/0x240 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.249925]  zfs_getpage+0x56/0x110 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.250699]  ? __pfx_zpl_read_folio+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.251464]  zpl_read_folio+0x39/0x60 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.252234]  ? __pfx_zpl_read_folio+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.253002]  filemap_read_folio+0x3e/0xd0
[ 4028.253022]  filemap_fault+0x68d/0xb90
[ 4028.253043]  __do_fault+0x32/0x120
[ 4028.253059]  do_fault+0x271/0x490
[ 4028.253074]  __handle_mm_fault+0x81e/0xe40
[ 4028.253088]  ? vfs_write+0x29b/0x470
[ 4028.253107]  handle_mm_fault+0x17f/0x360
[ 4028.253121]  do_user_addr_fault+0x15b/0x670
[ 4028.253139]  exc_page_fault+0x7f/0x180
[ 4028.253155]  asm_exc_page_fault+0x26/0x30
[ 4028.253169] RIP: 0033:0x76e754f60b00
[ 4028.253227] Code: Unable to access opcode bytes at 0x76e754f60ad6.
[ 4028.253241] RSP: 002b:00007ffe4ea9c118 EFLAGS: 00010246
[ 4028.253255] RAX: 0000000000000000 RBX: 000060f821dd41a0 RCX: 0000000000000088
[ 4028.253267] RDX: 0000000000000000 RSI: 000060f821dd4228 RDI: 000060f821dd4363
[ 4028.253278] RBP: 000060f821dd4448 R08: 000060f821dd4290 R09: 00007ffe4ea9c250
[ 4028.253290] R10: 0000000000000000 R11: 000060f821dd41a0 R12: 000060f821dd4363
[ 4028.253301] R13: 000060f821dd4364 R14: 000060f821dd4360 R15: 00000000000000e4
[ 4028.253321]  </TASK>
[ 4028.253329] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_ct xt_tcpudp iptable_filter iptable_nat nf_nat nf_tables nf_conntrack nf_defrag_ipv6 nct6775 nf_defrag_ipv4 nct6775_core libcrc32c hwmon_vid crc32c_generic intel_rapl_msr iwlmvm intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp mac80211 coretemp crct10dif_pclmul crc32_pclmul libarc4 crc32c_intel ptp pps_core i915 iwlwifi spi_nor polyval_generic gf128mul mtd ghash_clmulni_intel hci_uart btusb drm_buddy vfat btqca cryptd btrtl fat mei_pxp iTCO_wdt mei_hdcp i2c_algo_bit sha512_ssse3 r8169 ppdev btintel cfg80211 intel_pmc_bxt at24 iTCO_vendor_support btmtk spi_intel_platform hid_generic spi_intel btbcm ttm sha256_ssse3 realtek sha1_ssse3 intel_cstate cdc_acm mdio_devres bluetooth libphy i2c_i801 pcspkr mei_txe lpc_ich drm_display_helper mei i2c_smbus ecdh_generic cec intel_gtt video rfkill parport_pc wmi i2c_hid_acpi crc16
[ 4028.253552]  parport pwm_lpss_platform i2c_hid pwm_lpss mac_hid crypto_user loop fuse dm_mod nfnetlink ip_tables x_tables usbhid zfs(POE) spl(OE) xhci_pci xhci_pci_renesas
[ 4028.253675] CR2: 000000000000000c
[ 4028.253757] ---[ end trace 0000000000000000 ]---
[ 4028.253775] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ 4028.254632] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ 4028.254657] RSP: 0000:ffffb4f7813af9d8 EFLAGS: 00010206
[ 4028.254672] RAX: ffffb4f7813afa00 RBX: 0000000000010000 RCX: ffff8a0ed4cd5d80
[ 4028.254685] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[ 4028.254696] RBP: ffffb4f7813afa00 R08: 0000000000000000 R09: 00005acaaf30f97e
[ 4028.254708] R10: 00000042607bc70d R11: 004a9cb6947bd87b R12: ffff8a0ed4cd5d80
[ 4028.254720] R13: ffffffffc03bf8b0 R14: 0000000000002000 R15: 0000000000000000
[ 4028.254733] FS:  000076e754d2a740(0000) GS:ffff8a11efc00000(0000) knlGS:0000000000000000
[ 4028.254747] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4028.254758] CR2: 000076e754f60ad6 CR3: 000000017ea62000 CR4: 00000000001006f0

rincebrain commented 7 months ago

This does indeed seem like your SIMD state is getting trashed, from some of those backtraces.

What're your kernel config options, and which compiler and version is it built with?

lnicola commented 7 months ago

https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config?ref_type=heads and GCC 13.2.1.

rincebrain commented 7 months ago

Your stacktraces and that kernel config don't suggest you're building with kCFI, at least.

Wonder if it's some weird erratum with the strange little CPU in there.

rincebrain commented 7 months ago

I don't see any obvious J1900 erratum that might be germane.

A kernel running kASAN might be the fastest way to try and figure out what's going wrong, assuming that e.g. memtest doesn't report a bunch of memory errors.

lnicola commented 7 months ago

I'll have to try an older version of ZFS and memtest. Haven't built my own kernel in a long time, not sure if I'll try KASAN.

I've had this J1900 since 2014, haven't run into anything terrible until now.

rincebrain commented 7 months ago

I'm a little impressed, given that I understand the J1900 suffers from similar caveats to the infamous C2xxx Atom platform.

This does seem like something in kernel memory is getting trashed, and it's not immediately obvious where. You could bisect between those two commits, if you like, assuming that produced consistent results.

lnicola commented 7 months ago

I booted into a live image with the stable version (2.2.3): no errors in dmesg, some unreadable files were back (not sure if corrupted or not), and the scrub came back clean. I downgraded to stable, and it works for now.

rincebrain commented 7 months ago

What are the properties on the dataset in question, out of curiosity?

lnicola commented 7 months ago

$ zfs get all -r smart -t filesystem | rg -v 'default|inherited'
NAME         PROPERTY              VALUE                     SOURCE
smart        type                  filesystem                -
smart        creation              Tue Jun 19 19:45 2018     -
smart        used                  49.7G                     -
smart        available             169G                      -
smart        referenced            272K                      -
smart        compressratio         2.83x                     -
smart        mounted               no                        -
smart        mountpoint            none                      local
smart        compression           on                        local
smart        atime                 off                       local
smart        createtxg             1                         -
smart        xattr                 sa                        local
smart        copies                2                         local
smart        version               5                         -
smart        utf8only              off                       -
smart        normalization         none                      -
smart        casesensitivity       sensitive                 -
smart        guid                  6799783423688149151       -
smart        usedbysnapshots       0B                        -
smart        usedbydataset         272K                      -
smart        usedbychildren        49.7G                     -
smart        usedbyrefreservation  0B                        -
smart        objsetid              54                        -
smart        dnodesize             auto                      local
smart        refcompressratio      1.00x                     -
smart        written               0                         -
smart        logicalused           64.4G                     -
smart        logicalreferenced     78K                       -
smart        snapshots_changed     Sun Mar 31  0:00:05 2024  local
smart/home   type                  filesystem                -
smart/home   creation              Tue Jun 19 19:56 2018     -
smart/home   used                  16.8G                     -
smart/home   available             169G                      -
smart/home   referenced            14.4G                     -
smart/home   compressratio         1.64x                     -
smart/home   mounted               yes                       -
smart/home   mountpoint            /home                     local
smart/home   createtxg             733                       -
smart/home   version               5                         -
smart/home   utf8only              off                       -
smart/home   normalization         none                      -
smart/home   casesensitivity       sensitive                 -
smart/home   guid                  2914329737695079349       -
smart/home   usedbysnapshots       2.38G                     -
smart/home   usedbydataset         14.4G                     -
smart/home   usedbychildren        0B                        -
smart/home   usedbyrefreservation  0B                        -
smart/home   objsetid              1307                      -
smart/home   refcompressratio      1.63x                     -
smart/home   written               6.29M                     -
smart/home   logicalused           12.3G                     -
smart/home   logicalreferenced     10.4G                     -
smart/home   snapshots_changed     Sun Mar 31  0:00:05 2024  local
smart/zroot  type                  filesystem                -
smart/zroot  creation              Tue Jun 19 19:46 2018     -
smart/zroot  used                  32.8G                     -
smart/zroot  available             169G                      -
smart/zroot  referenced            22.5G                     -
smart/zroot  compressratio         3.42x                     -
smart/zroot  mounted               yes                       -
smart/zroot  mountpoint            /                         local
smart/zroot  createtxg             23                        -
smart/zroot  version               5                         -
smart/zroot  utf8only              off                       -
smart/zroot  normalization         none                      -
smart/zroot  casesensitivity       sensitive                 -
smart/zroot  guid                  14710273763173895704      -
smart/zroot  usedbysnapshots       10.3G                     -
smart/zroot  usedbydataset         22.5G                     -
smart/zroot  usedbychildren        0B                        -
smart/zroot  usedbyrefreservation  0B                        -
smart/zroot  objsetid              134                       -
smart/zroot  refcompressratio      4.24x                     -
smart/zroot  written               773M                      -
smart/zroot  logicalused           52.1G                     -
smart/zroot  logicalreferenced     44.5G                     -
smart/zroot  acltype               posix                     local
smart/zroot  snapshots_changed     Sun Mar 31  0:00:05 2024  local

$ zpool get all smart | rg -v 'default|inherited'
NAME   PROPERTY                                      VALUE                                         SOURCE
smart  size                                          226G                                          -
smart  capacity                                      21%                                           -
smart  health                                        ONLINE                                        -
smart  guid                                          6014340175109259927                           -
smart  failmode                                      continue                                      local
smart  dedupratio                                    1.00x                                         -
smart  free                                          176G                                          -
smart  allocated                                     49.7G                                         -
smart  readonly                                      off                                           -
smart  ashift                                        13                                            local
smart  expandsize                                    -                                             -
smart  freeing                                       0                                             -
smart  fragmentation                                 43%                                           -
smart  leaked                                        0                                             -
smart  checkpoint                                    -                                             -
smart  load_guid                                     7630387544252668255                           -
smart  bcloneused                                    0                                             -
smart  bclonesaved                                   0                                             -
smart  bcloneratio                                   1.00x                                         -
smart  feature@async_destroy                         enabled                                       local
smart  feature@empty_bpobj                           active                                        local
smart  feature@lz4_compress                          active                                        local
smart  feature@multi_vdev_crash_dump                 enabled                                       local
smart  feature@spacemap_histogram                    active                                        local
smart  feature@enabled_txg                           active                                        local
smart  feature@hole_birth                            active                                        local
smart  feature@extensible_dataset                    active                                        local
smart  feature@embedded_data                         active                                        local
smart  feature@bookmarks                             enabled                                       local
smart  feature@filesystem_limits                     enabled                                       local
smart  feature@large_blocks                          enabled                                       local
smart  feature@large_dnode                           active                                        local
smart  feature@sha512                                enabled                                       local
smart  feature@skein                                 enabled                                       local
smart  feature@edonr                                 enabled                                       local
smart  feature@userobj_accounting                    active                                        local
smart  feature@encryption                            enabled                                       local
smart  feature@project_quota                         active                                        local
smart  feature@device_removal                        enabled                                       local
smart  feature@obsolete_counts                       enabled                                       local
smart  feature@zpool_checkpoint                      enabled                                       local
smart  feature@spacemap_v2                           active                                        local
smart  feature@allocation_classes                    enabled                                       local
smart  feature@resilver_defer                        enabled                                       local
smart  feature@bookmark_v2                           enabled                                       local
smart  feature@redaction_bookmarks                   enabled                                       local
smart  feature@redacted_datasets                     enabled                                       local
smart  feature@bookmark_written                      enabled                                       local
smart  feature@log_spacemap                          active                                        local
smart  feature@livelist                              enabled                                       local
smart  feature@device_rebuild                        enabled                                       local
smart  feature@zstd_compress                         enabled                                       local
smart  feature@draid                                 enabled                                       local
smart  feature@zilsaxattr                            active                                        local
smart  feature@head_errlog                           active                                        local
smart  feature@blake3                                enabled                                       local
smart  feature@block_cloning                         enabled                                       local
smart  feature@vdev_zaps_v2                          active                                        local
smart  unsupported@com.delphix:redaction_list_spill  inactive                                      local
smart  unsupported@org.openzfs:raidz_expansion       inactive                                      local
robn commented 7 months ago

@lnicola thanks for the info. I don't have any obvious ideas either, but if you're in a position to try, here are a couple of things to test.

One test without recompile: set zfs_vdev_disk_classic=1 at ZFS module load time. This will rule in/out a bunch of changes in how IO is handed off to the kernel.
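
A minimal sketch of one way to set this, assuming a modprobe.d-based setup. The file name `zfs.conf` is an assumption, and since the root filesystem here is on ZFS, the initramfs would probably need regenerating for the option to apply at boot:

```
# /etc/modprobe.d/zfs.conf (file name is an assumption; any *.conf file in modprobe.d should work)
options zfs zfs_vdev_disk_classic=1
```

Once the module is loaded, `/sys/module/zfs/parameters/zfs_vdev_disk_classic` should read `1` if the option took effect.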

Another thing to try if you can rebuild: revert 39be46f43, then rebuild (you'll need to rebuild from scratch, i.e. autogen -> configure -> make). That re-enables some long-dormant codepaths, which might be in play based on the most recent stack traces.

Of course, if you can rebuild, a full bisect between 8f2f6cd2a and 39be46f43 would really help narrow it down.
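
A rough sketch of how such a bisect could be driven, assuming a git checkout of the zfs source tree. The helper names `start_bisect` and `rebuild_zfs` are hypothetical, and the install, reboot, and log-watching steps between rounds are left out because they depend on the system:

```shell
# Hypothetical helpers for bisecting inside a zfs git checkout.

# Mark the known-good and known-bad commits; git then checks out a midpoint.
# Usage: start_bisect GOOD BAD
start_bisect() {
    git bisect start "$2" "$1"
}

# Rebuild from scratch at the commit git has checked out
# (autogen -> configure -> make, as described above).
# WARNING: git clean -xdf deletes ALL untracked files in the checkout.
rebuild_zfs() {
    git clean -xdfq &&
    sh autogen.sh &&
    ./configure &&
    make -s -j"$(nproc)"
}
```

Usage would look roughly like `start_bisect 8f2f6cd2a 39be46f43`, then `rebuild_zfs`, then installing and testing the build, and finally `git bisect good` or `git bisect bad` depending on whether the fault shows up. Repeat from `rebuild_zfs` until git names a single commit, and run `git bisect reset` when done.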

robn commented 7 months ago

Starting a full ZTS run against 6.8.2, just to see if anything shakes out.

lnicola commented 7 months ago

I made a VM with the "bad" version of ZFS and transferred the file systems to it, but I can't make it crash :disappointed:.

robn commented 7 months ago

Test run completed successfully. That's not to say there isn't still a problem, but it does exercise a few weird pressure points at least. Between that and you not seeing a problem on the different VM, we're definitely getting into either something specific to your system, or something that needs very particular timing to trigger.

As I see it, at this point probably only bisecting on your original hardware is going to get us anywhere. If we can pin down a specific commit, then we can go over it carefully, looking for concurrency bugs and trying to form a theory.

robn commented 7 months ago

#16049 may or may not fix this. If you can try it, please do!

lnicola commented 7 months ago

Tried a couple of tests:

I still have to get a live image running 6.8.2 so I can try the linked PR.

UPDATE:

Not sure if relevant, but I also get a bunch of:

  LD [M]  /root/zfs/module/spl.o
/root/zfs/module/spl.o: warning: objtool: spl_kmem_cache_create+0x611: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: spl_kmem_cache_destroy+0x22c: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_data_addr+0x77: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_show+0x137: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: __kstat_delete+0x7e: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: __kstat_create+0x2e7: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_start+0x369: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: taskq_destroy+0x1df: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: taskq_create_synced+0xcb: spl_panic() is missing a __noreturn annotation
  LD [M]  /root/zfs/module/zfs.o
/root/zfs/module/zfs.o: warning: objtool: luaD_throw() falls through to next function resume_error()
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_x64+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_ssse3+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_avx+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_avx2+0x1c: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_x64+0x20: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_avx+0x20: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_avx2+0x1c: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: _aesni_ctr32_ghash_6x+0x57a: return with modified stack frame
/root/zfs/module/zfs.o: warning: objtool: _aesni_ctr32_ghash_no_movbe_6x+0x59a: return with modified stack frame
/root/zfs/module/zfs.o: warning: objtool: aesni_gcm_decrypt+0x48: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: aesni_gcm_encrypt+0x52: unsupported stack pointer realignment

and

  INSTALL /lib/modules/6.8.2-arch2-1/extra/spl.ko
  SIGN    /lib/modules/6.8.2-arch2-1/extra/spl.ko
At main.c:167:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: crypto/bio/bss_file.c:75
sign-file: ./certs/signing_key.pem
  ZSTD    /lib/modules/6.8.2-arch2-1/extra/spl.ko.zst
  INSTALL /lib/modules/6.8.2-arch2-1/extra/zfs.ko
  SIGN    /lib/modules/6.8.2-arch2-1/extra/zfs.ko
At main.c:167:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: crypto/bio/bss_file.c:75
sign-file: ./certs/signing_key.pem
  ZSTD    /lib/modules/6.8.2-arch2-1/extra/zfs.ko.zst
  DEPMOD  /lib/modules/6.8.2-arch2-1

UPDATE 2:

robn commented 7 months ago

@lnicola this is promising. Thank you. I'm going to review the differences between 6.7 and 6.8 just to be sure, but I'm hopeful that this is actually the fix.

(Regarding the build warnings, they're unrelated and fine. Mostly it's saying "you could have been a little tidier here, human".)