@lnicola thanks for the report.
What action were you taking when this happened? From the log, the process is SDMAIN, and it's opening a file. Do you know any more about the file in question? And if you do, does it have any extended attributes?
When you say it was working before (at 8f2f6cd2ac), do you know if this same action was being taken then?
Initial analysis:
strcmp will dereference %rdi:
0f b6 14 07 movzbl (%rdi,%rax,1),%edx
So yes, it's probably junk, as the crash suggests.
nvt_lookup_name_type loads %rdi with this:
49 8d 7c 24 28 lea 0x28(%r12),%rdi
That's the NVP_NAME lookup. i_nvp_t +0x10 is the start of the embedded nvpair_t; its nvp_name field is +0x18 beyond that.
for (i_nvp_t *e = entry; e != NULL; e = e->nvi_hashtable_next) {
if (strcmp(NVP_NAME(&e->nvi_nvp), name) == 0 &&
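To spell the offsets out, here's a toy model only — the struct layouts below are assumptions reconstructed from the disassembly, not the real nvpair definitions:

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-ins for nvpair_t / i_nvp_t; the field sizes are guesses that
 * reproduce the offsets seen above, not the real definitions. */
typedef struct {
	char header[0x18];     /* nvp_size, nvp_name_sz, ... */
	char nvp_name[1];      /* what NVP_NAME() points at */
} fake_nvpair_t;

typedef struct {
	void *nvi_hashtable_next;
	void *nvi_pad;         /* fills 0x00..0x0f on a 64-bit build */
	fake_nvpair_t nvi_nvp; /* embedded nvpair_t at +0x10 */
} fake_i_nvp_t;

int main(void)
{
	/* 0x10 + 0x18 = 0x28, i.e. the "lea 0x28(%r12),%rdi" above,
	 * with %r12 holding e */
	printf("%#zx\n", offsetof(fake_i_nvp_t, nvi_nvp) +
	    offsetof(fake_nvpair_t, nvp_name));
	return 0;
}
```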
%r12 also looks similarly un-pointer-ish in the crash dump (very near %rdi), so likely that's e at fault. It's probably the first pass through the loop, unless the chain of nvi_hashtable_next pointers is corrupt, but that feels less likely to me.
Regardless, entry comes straight out of the nvlist hashtable. That nvlist is zp->z_xattr_cached, loaded straight up out of the SA area in zpl_xattr_get_sa(). So that suggests there's garbage on disk, or we made a mess of reading it?
"It worked fine" suggests a recent break. There's nothing obvious in this list of commits that are implicated, unless its something coincidental in the myriad page reading changes. I think that's unlikely though if this is a very reliable failure.
39be46f43 evansr 2024-03-29 Linux 5.18+ compat: Detect filemap_range_has_page
2553f94c4 evansr 2024-03-29 Fix buffer underflow if sysfs file is empty
cfb96c772 rob.nor.. 2024-03-29 vdev_disk: clean up spa/bdev mode conversion
c0aab8b8f f.gruen.. 2024-03-29 zvols: prevent overflow of minor device numbers
b1e46f869 george... 2024-03-29 Add ashift validation when adding devices to a pool
e39e20b6d evansr 2024-03-27 ZTS: fix flakiness in cp_files_002_pos
0c8eb974f mav 2024-03-27 BRT: Check pool clone stats in more tests
b40342762 mav 2024-03-27 BRT: Fix tests to work on non-empty pools
a89d209bb mav 2024-03-27 BRT: Fix holes cloning.
8cd8ccca5 mav 2024-03-25 BRT: Skip getting length in brt_entry_lookup()
c6be6ce17 rob.nor.. 2024-03-25 abd_iter_page: don't use compound heads on Linux <4.5
72fd834c4 rob.nor.. 2024-03-25 vdev_disk: use bio_chain() to submit multiple BIOs
df2169d14 rob.nor.. 2024-03-25 vdev_disk: add module parameter to select BIO submission method
06a196020 rob.nor.. 2024-03-25 vdev_disk: rewrite BIO filling machinery to avoid split pages
c4a13ba48 rob.nor.. 2024-03-25 vdev_disk: make read/write IO function configurable
867178ae1 rob.nor.. 2024-03-25 vdev_disk: reorganise vdev_disk_io_start
f3b85d706 rob.nor.. 2024-03-25 vdev_disk: rename existing functions to vdev_classic_*
390b44872 rob.nor.. 2024-03-25 abd: add page iterator
df04efe32 rob.nor.. 2024-03-25 linux 5.4 compat: page_size()
f68bde723 mav 2024-03-25 BRT: Make BRT block sizes configurable
493fcce9b george... 2024-03-25 Provide macros for setting and getting blkptr birth times
4616b96a6 mav 2024-03-25 BRT: Relax brt_pending_apply() locking
80cc51629 mav 2024-03-25 ZAP: Massively switch to _by_dnode() interfaces
bf8f72359 mav 2024-03-25 BRT: Skip duplicate BRT prefetches
102b468b5 rrevans 2024-03-25 Fix corruption caused by mmap flushing problems
c28f94f32 mav 2024-03-21 ZAP: Some cleanups/micro-optimizations
f1b368359 f.gruen.. 2024-03-21 udev: correctly handle partition #16 and later
2c01cae8b mav 2024-03-21 BRT: Change brt_pending_tree sorting order
5c4a4f82c rob.nor.. 2024-03-21 zio: update ZIO type x stage documentation
c9d8f6c59 harr1 2024-03-21 Fix option string, adding -e and fixing order
45e23abed mav 2024-03-20 Update resume token at object receive.
ef08a4d40 robn 2024-03-20 Linux 6.8 compat: use splice_copy_file_range() for fallback
90ff73235 robn 2024-03-20 freebsd: fix missing headers in distribution tarball
However, just before that was 7a2e54b7d, which changed the test for the inode .permission member. I can't fathom how it could actually be related, but I see two words the same and I wonder.
That's all I have for now.
From the log, the process is SDMAIN, and it's opening a file. Do you know any more about the file in question?
That's a good question. I don't know what that process is, and the PID changes every time. I tried execsnoop, but it's no longer crashing.
I do have a couple of these though:
Mar 31 2024 15:57:15.781944214 ereport.fs.zfs.deadman
class = "ereport.fs.zfs.deadman"
ena = 0xe264c66920e00801
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x53773a861fbf8297
vdev = 0x247c7a23b0318d42
(end detector)
pool = "smart"
pool_guid = 0x53773a861fbf8297
pool_state = 0x0
pool_context = 0x0
pool_failmode = "continue"
vdev_guid = 0x247c7a23b0318d42
vdev_type = "disk"
vdev_path = "/dev/disk/by-id/ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2"
vdev_ashift = 0xd
vdev_complete_ts = 0x3e44013f9a06
vdev_delta_ts = 0x2d3204
vdev_read_errors = 0x0
vdev_write_errors = 0x0
vdev_cksum_errors = 0x0
vdev_delays = 0x0
parent_guid = 0x53773a861fbf8297
parent_type = "root"
vdev_spare_paths =
vdev_spare_guids =
zio_err = 0x0
zio_flags = 0x180080
zio_stage = 0x1000000
zio_pipeline = 0x3e00000
zio_delay = 0x3e8b0
zio_timestamp = 0x3ddddb3fdfce
zio_delta = 0x3ee44
zio_priority = 0x0
zio_offset = 0x1c35b88000
zio_size = 0x6000
zio_objset = 0x86
zio_object = 0x5eb5c
zio_level = 0x0
zio_blkid = 0x0
time = 0x66095dab 0x2e9b8596
eid = 0x45
Mar 31 2024 15:57:29.435268269 ereport.fs.zfs.deadman
class = "ereport.fs.zfs.deadman"
ena = 0xe297a3849ba00801
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x53773a861fbf8297
vdev = 0x247c7a23b0318d42
(end detector)
pool = "smart"
pool_guid = 0x53773a861fbf8297
pool_state = 0x0
pool_context = 0x0
pool_failmode = "continue"
vdev_guid = 0x247c7a23b0318d42
vdev_type = "disk"
vdev_path = "/dev/disk/by-id/ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2"
vdev_ashift = 0xd
vdev_complete_ts = 0x3e44013f9a06
vdev_delta_ts = 0x2d3204
vdev_read_errors = 0x0
vdev_write_errors = 0x0
vdev_cksum_errors = 0x0
vdev_delays = 0x0
parent_guid = 0x53773a861fbf8297
parent_type = "root"
vdev_spare_paths =
vdev_spare_guids =
zio_err = 0x0
zio_flags = 0x180080
zio_stage = 0x2000000
zio_pipeline = 0x3e00000
zio_delay = 0x45dd4
zio_timestamp = 0x3de2bd5aa68a
zio_delta = 0x468cc
zio_priority = 0x0
zio_offset = 0x1889e68000
zio_size = 0x4000
zio_objset = 0x86
zio_object = 0x39d7a
zio_level = 0x0
zio_blkid = 0x0
time = 0x66095db9 0x19f1aaad
eid = 0x46
Looks like they might be caused by the crashes I noticed in the kernel log.
Another one after a forced reboot:
[ +0.620587] BUG: kernel NULL pointer dereference, address: 000000000000000c
[ +0.000022] #PF: supervisor read access in kernel mode
[ +0.000007] #PF: error_code(0x0000) - not-present page
[ +0.000006] PGD 0 P4D 0
[ +0.000009] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000007] CPU: 1 PID: 342 Comm: z_rd_int Tainted: P OE 6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
[ +0.000012] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q1900DC-ITX, BIOS P1.70 03/01/2018
[ +0.000008] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ +0.000566] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ +0.000014] RSP: 0000:ffffa34d83a7f9d8 EFLAGS: 00010206
[ +0.000009] RAX: ffffa34d83a7fa00 RBX: 0000000000010000 RCX: ffff90809fc3e660
[ +0.000007] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[ +0.000007] RBP: ffffa34d83a7fa00 R08: 0000000000000000 R09: 000082731e4c7523
[ +0.000006] R10: 0000006a9d326139 R11: 0050999bdb7e1e7c R12: ffff90809fc3e660
[ +0.000007] R13: ffffffffc03688b0 R14: 0000000000002000 R15: 0000000000000000
[ +0.000007] FS: 0000000000000000(0000) GS:ffff9083afc80000(0000) knlGS:0000000000000000
[ +0.000008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000006] CR2: 000000000000000c CR3: 0000000121dc4000 CR4: 00000000001006f0
[ +0.000007] Call Trace:
[ +0.000010] <TASK>
[ +0.000011] ? __die+0x23/0x70
[ +0.000014] ? page_fault_oops+0x171/0x4e0
[ +0.000012] ? exc_page_fault+0x7f/0x180
[ +0.000010] ? asm_exc_page_fault+0x26/0x30
[ +0.000009] ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000530] ? abd_iter_map+0x42/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000525] abd_iterate_func+0xd5/0x1a0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000565] abd_fletcher_4_native+0x7f/0xc0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000543] zio_checksum_error_impl+0x17b/0x6c0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000539] ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000507] ? psi_task_switch+0x122/0x230
[ +0.000014] ? vdev_queue_io_done+0x21a/0x260 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000511] zio_checksum_error+0x68/0xd0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000486] zio_checksum_verify+0x4b/0x150 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000514] zio_execute+0x84/0x120 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000514] taskq_thread+0x2b4/0x570 [spl 21e6da6a41c300ca9f75ec94bfccd26c3074bd7f]
[ +0.000039] ? __pfx_default_wake_function+0x10/0x10
[ +0.000013] ? __pfx_zio_execute+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ +0.000541] ? __pfx_taskq_thread+0x10/0x10 [spl 21e6da6a41c300ca9f75ec94bfccd26c3074bd7f]
[ +0.000037] kthread+0xe5/0x120
[ +0.000013] ? __pfx_kthread+0x10/0x10
[ +0.000010] ret_from_fork+0x31/0x50
[ +0.000010] ? __pfx_kthread+0x10/0x10
[ +0.000008] ret_from_fork_asm+0x1b/0x30
[ +0.000012] </TASK>
[ +0.000005] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel xt_tcpudp nft_ct iptable_filter iptable_nat nf_nat nf_conntrack nf_tables nf_defrag_ipv6 nf_defrag_ipv4 nct6775 libcrc32c nct6775_core crc32c_generic hwmon_vid intel_rapl_msr intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp coretemp spi_nor iwlmvm crct10dif_pclmul mtd crc32_pclmul crc32c_intel mac80211 polyval_generic iTCO_wdt at24 mei_pxp mei_hdcp spi_intel_platform gf128mul libarc4 ptp i915 spi_intel intel_pmc_bxt pps_core iTCO_vendor_support ghash_clmulni_intel cryptd ppdev sha512_ssse3 r8169 sha256_ssse3 iwlwifi hci_uart btqca realtek sha1_ssse3 drm_buddy btusb btrtl i2c_algo_bit hid_generic mei_txe ttm vfat i2c_i801 mdio_devres btintel fat cfg80211 cdc_acm btbcm intel_cstate lpc_ich btmtk i2c_smbus mei libphy pcspkr drm_display_helper bluetooth cec intel_gtt i2c_hid_acpi parport_pc ecdh_generic video parport rfkill wmi
[ +0.000137] i2c_hid pwm_lpss_platform crc16 pwm_lpss mac_hid crypto_user loop fuse dm_mod nfnetlink ip_tables x_tables usbhid zfs(POE) spl(OE) xhci_pci xhci_pci_renesas
[ +0.000070] CR2: 000000000000000c
[ +0.000007] ---[ end trace 0000000000000000 ]---
[ +0.000006] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ +0.000550] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ +0.000014] RSP: 0000:ffffa34d83a7f9d8 EFLAGS: 00010206
[ +0.000008] RAX: ffffa34d83a7fa00 RBX: 0000000000010000 RCX: ffff90809fc3e660
[ +0.000006] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[ +0.000007] RBP: ffffa34d83a7fa00 R08: 0000000000000000 R09: 000082731e4c7523
[ +0.000006] R10: 0000006a9d326139 R11: 0050999bdb7e1e7c R12: ffff90809fc3e660
[ +0.000007] R13: ffffffffc03688b0 R14: 0000000000002000 R15: 0000000000000000
[ +0.000007] FS: 0000000000000000(0000) GS:ffff9083afc80000(0000) knlGS:0000000000000000
[ +0.000008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000006] CR2: 000000000000000c CR3: 0000000121dc4000 CR4: 00000000001006f0
[ +0.000007] note: z_rd_int[342] exited with irqs disabled
But no errors in zpool status:
pool: smart
state: ONLINE
scan: scrub repaired 0B in 00:02:36 with 0 errors on Mon Mar 25 00:36:03 2024
config:
NAME STATE READ WRITE CKSUM
smart ONLINE 0 0 0
ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part2 ONLINE 0 0 0
errors: No known data errors
pool: tank
state: ONLINE
scan: scrub repaired 0B in 03:01:03 with 0 errors on Mon Mar 25 03:47:37 2024
scan: resilvered (mirror-0) 1.02T in 02:44:52 with 0 errors on Mon Feb 7 23:42:58 2022
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
ata-WDC_WD40EFZX-68AWUN0_WD-WXA2DA1C9SU5 ONLINE 0 0 0
ata-WDC_WD20EFAX-68FB5N0_WD-WXQ1AC8E8D2F ONLINE 0 0 0
logs
ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part3 ONLINE 0 0 0
cache
ata-Samsung_SSD_860_EVO_250GB_S3YJNB0K339667A-part4 ONLINE 0 0 0
errors: No known data errors
Feels like I should be taking a backup.
EDIT: SDMAIN is netdata reading the systemd journal.
Could this be some kind of #15911?
Another one:
[ 4028.235480] BUG: kernel NULL pointer dereference, address: 000000000000000c
[ 4028.235507] #PF: supervisor read access in kernel mode
[ 4028.235520] #PF: error_code(0x0000) - not-present page
[ 4028.235531] PGD 0 P4D 0
[ 4028.235545] Oops: 0000 [#3] PREEMPT SMP NOPTI
[ 4028.235558] CPU: 0 PID: 3054 Comm: less Tainted: P D W OE 6.8.2-arch2-1 #1 a430fb92f7ba43092b62bbe6bac995458d3d442d
[ 4028.235578] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Q1900DC-ITX, BIOS P1.70 03/01/2018
[ 4028.235592] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ 4028.236404] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ 4028.236426] RSP: 0000:ffffb4f7893ab6f0 EFLAGS: 00010206
[ 4028.236441] RAX: ffffb4f7893ab718 RBX: 0000000000010000 RCX: ffff8a0edd826420
[ 4028.236454] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000004000
[ 4028.236466] RBP: ffffb4f7893ab718 R08: 0000000000000000 R09: 0000000000000000
[ 4028.236477] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0edd826420
[ 4028.236489] R13: ffffffffc044aaa0 R14: 0000000000004000 R15: 0000000000000000
[ 4028.236501] FS: 000076e754d2a740(0000) GS:ffff8a11efc00000(0000) knlGS:0000000000000000
[ 4028.236515] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4028.236527] CR2: 000000000000000c CR3: 000000017ea62000 CR4: 00000000001006f0
[ 4028.236539] Call Trace:
[ 4028.236552] <TASK>
[ 4028.236567] ? __die+0x23/0x70
[ 4028.236587] ? page_fault_oops+0x171/0x4e0
[ 4028.236607] ? exc_page_fault+0x7f/0x180
[ 4028.236623] ? asm_exc_page_fault+0x26/0x30
[ 4028.236637] ? __pfx_abd_copy_to_buf_off_cb+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.237482] ? abd_iter_map+0x42/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.238280] abd_iterate_func+0xd5/0x1a0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.239103] abd_borrow_buf_copy+0x7b/0x90 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.239929] zio_decompress_data+0x35/0x80 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.240742] arc_buf_fill+0x10c/0xc80 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.241571] ? arc_buf_alloc_impl.isra.0+0x155/0x310 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.242398] arc_read+0x14c5/0x16e0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.243243] ? __pfx_dbuf_read_done+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.244080] ? dbuf_read_impl.constprop.0+0x50e/0x870 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.244919] dbuf_read_impl.constprop.0+0x50e/0x870 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.245768] dbuf_read+0xf5/0x5f0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.246604] dmu_buf_hold_array_by_dnode+0x11d/0x690 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.247454] dmu_read_impl+0xb9/0x1d0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.248305] dmu_read+0x63/0xa0 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.249152] zfs_fillpage+0x89/0x240 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.249925] zfs_getpage+0x56/0x110 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.250699] ? __pfx_zpl_read_folio+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.251464] zpl_read_folio+0x39/0x60 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.252234] ? __pfx_zpl_read_folio+0x10/0x10 [zfs 9599260deb35358072448d0a802ac475d909352c]
[ 4028.253002] filemap_read_folio+0x3e/0xd0
[ 4028.253022] filemap_fault+0x68d/0xb90
[ 4028.253043] __do_fault+0x32/0x120
[ 4028.253059] do_fault+0x271/0x490
[ 4028.253074] __handle_mm_fault+0x81e/0xe40
[ 4028.253088] ? vfs_write+0x29b/0x470
[ 4028.253107] handle_mm_fault+0x17f/0x360
[ 4028.253121] do_user_addr_fault+0x15b/0x670
[ 4028.253139] exc_page_fault+0x7f/0x180
[ 4028.253155] asm_exc_page_fault+0x26/0x30
[ 4028.253169] RIP: 0033:0x76e754f60b00
[ 4028.253227] Code: Unable to access opcode bytes at 0x76e754f60ad6.
[ 4028.253241] RSP: 002b:00007ffe4ea9c118 EFLAGS: 00010246
[ 4028.253255] RAX: 0000000000000000 RBX: 000060f821dd41a0 RCX: 0000000000000088
[ 4028.253267] RDX: 0000000000000000 RSI: 000060f821dd4228 RDI: 000060f821dd4363
[ 4028.253278] RBP: 000060f821dd4448 R08: 000060f821dd4290 R09: 00007ffe4ea9c250
[ 4028.253290] R10: 0000000000000000 R11: 000060f821dd41a0 R12: 000060f821dd4363
[ 4028.253301] R13: 000060f821dd4364 R14: 000060f821dd4360 R15: 00000000000000e4
[ 4028.253321] </TASK>
[ 4028.253329] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_ct xt_tcpudp iptable_filter iptable_nat nf_nat nf_tables nf_conntrack nf_defrag_ipv6 nct6775 nf_defrag_ipv4 nct6775_core libcrc32c hwmon_vid crc32c_generic intel_rapl_msr iwlmvm intel_rapl_common intel_soc_dts_thermal intel_soc_dts_iosf intel_powerclamp mac80211 coretemp crct10dif_pclmul crc32_pclmul libarc4 crc32c_intel ptp pps_core i915 iwlwifi spi_nor polyval_generic gf128mul mtd ghash_clmulni_intel hci_uart btusb drm_buddy vfat btqca cryptd btrtl fat mei_pxp iTCO_wdt mei_hdcp i2c_algo_bit sha512_ssse3 r8169 ppdev btintel cfg80211 intel_pmc_bxt at24 iTCO_vendor_support btmtk spi_intel_platform hid_generic spi_intel btbcm ttm sha256_ssse3 realtek sha1_ssse3 intel_cstate cdc_acm mdio_devres bluetooth libphy i2c_i801 pcspkr mei_txe lpc_ich drm_display_helper mei i2c_smbus ecdh_generic cec intel_gtt video rfkill parport_pc wmi i2c_hid_acpi crc16
[ 4028.253552] parport pwm_lpss_platform i2c_hid pwm_lpss mac_hid crypto_user loop fuse dm_mod nfnetlink ip_tables x_tables usbhid zfs(POE) spl(OE) xhci_pci xhci_pci_renesas
[ 4028.253675] CR2: 000000000000000c
[ 4028.253757] ---[ end trace 0000000000000000 ]---
[ 4028.253775] RIP: 0010:abd_iter_map+0x42/0x90 [zfs]
[ 4028.254632] Code: 39 d7 74 1a 48 8b 70 28 f6 01 01 74 16 48 29 f2 48 89 50 08 48 8b 51 48 48 01 f2 48 89 10 c3 cc cc cc cc 4c 8b 40 30 48 29 fa <41> 8b 48 0c 48 29 f1 48 39 d1 48 0f 47 ca 48 89 48 08 49 8b 10 65
[ 4028.254657] RSP: 0000:ffffb4f7813af9d8 EFLAGS: 00010206
[ 4028.254672] RAX: ffffb4f7813afa00 RBX: 0000000000010000 RCX: ffff8a0ed4cd5d80
[ 4028.254685] RDX: 0000000000010000 RSI: 0000000000000000 RDI: 0000000000002000
[ 4028.254696] RBP: ffffb4f7813afa00 R08: 0000000000000000 R09: 00005acaaf30f97e
[ 4028.254708] R10: 00000042607bc70d R11: 004a9cb6947bd87b R12: ffff8a0ed4cd5d80
[ 4028.254720] R13: ffffffffc03bf8b0 R14: 0000000000002000 R15: 0000000000000000
[ 4028.254733] FS: 000076e754d2a740(0000) GS:ffff8a11efc00000(0000) knlGS:0000000000000000
[ 4028.254747] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4028.254758] CR2: 000076e754f60ad6 CR3: 000000017ea62000 CR4: 00000000001006f0
This does indeed seem like your SIMD state is getting trashed, from some of those backtraces.
What're your kernel config options, and which compiler and version is it built with?
Your stacktraces and that kernel config don't suggest you're building with kCFI, at least.
Wonder if it's some weird erratum with the strange little CPU in there.
I don't see any obvious J1900 erratum that might be germane.
A kernel running kASAN might be the fastest way to try and figure out what's going wrong, assuming that e.g. memtest doesn't report a bunch of memory errors.
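If it helps, here's a minimal sketch of turning that on for a rebuild, assuming the usual in-tree scripts/config helper; exact option names can vary by kernel version:

```sh
# In the kernel source tree, starting from the distro .config:
scripts/config -e KASAN -e KASAN_GENERIC
make olddefconfig
make -j"$(nproc)"
```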
I'll have to try an older version of ZFS and memtest. Haven't built my own kernel in a long time, not sure if I'll try KASAN.
I've had this J1900 since 2014, haven't run into anything terrible until now.
I'm a little impressed, given that I understand the J1900 suffers from similar caveats to the infamous C2xxx Atom platform.
This does seem like something in kernel memory is getting trashed, and it's not immediately obvious where. You could bisect between those two commits, if you like, assuming that produced consistent results.
I booted into a live image with the stable version (2.2.3): no errors in dmesg, some unreadable files were back (not sure if corrupted or not), scrub came clean. I downgraded to stable, works for now.
What are the properties on the dataset in question, out of curiosity?
$ zfs get all -r smart -t filesystem | rg -v 'default|inherited'
NAME PROPERTY VALUE SOURCE
smart type filesystem -
smart creation Tue Jun 19 19:45 2018 -
smart used 49.7G -
smart available 169G -
smart referenced 272K -
smart compressratio 2.83x -
smart mounted no -
smart mountpoint none local
smart compression on local
smart atime off local
smart createtxg 1 -
smart xattr sa local
smart copies 2 local
smart version 5 -
smart utf8only off -
smart normalization none -
smart casesensitivity sensitive -
smart guid 6799783423688149151 -
smart usedbysnapshots 0B -
smart usedbydataset 272K -
smart usedbychildren 49.7G -
smart usedbyrefreservation 0B -
smart objsetid 54 -
smart dnodesize auto local
smart refcompressratio 1.00x -
smart written 0 -
smart logicalused 64.4G -
smart logicalreferenced 78K -
smart snapshots_changed Sun Mar 31 0:00:05 2024 local
smart/home type filesystem -
smart/home creation Tue Jun 19 19:56 2018 -
smart/home used 16.8G -
smart/home available 169G -
smart/home referenced 14.4G -
smart/home compressratio 1.64x -
smart/home mounted yes -
smart/home mountpoint /home local
smart/home createtxg 733 -
smart/home version 5 -
smart/home utf8only off -
smart/home normalization none -
smart/home casesensitivity sensitive -
smart/home guid 2914329737695079349 -
smart/home usedbysnapshots 2.38G -
smart/home usedbydataset 14.4G -
smart/home usedbychildren 0B -
smart/home usedbyrefreservation 0B -
smart/home objsetid 1307 -
smart/home refcompressratio 1.63x -
smart/home written 6.29M -
smart/home logicalused 12.3G -
smart/home logicalreferenced 10.4G -
smart/home snapshots_changed Sun Mar 31 0:00:05 2024 local
smart/zroot type filesystem -
smart/zroot creation Tue Jun 19 19:46 2018 -
smart/zroot used 32.8G -
smart/zroot available 169G -
smart/zroot referenced 22.5G -
smart/zroot compressratio 3.42x -
smart/zroot mounted yes -
smart/zroot mountpoint / local
smart/zroot createtxg 23 -
smart/zroot version 5 -
smart/zroot utf8only off -
smart/zroot normalization none -
smart/zroot casesensitivity sensitive -
smart/zroot guid 14710273763173895704 -
smart/zroot usedbysnapshots 10.3G -
smart/zroot usedbydataset 22.5G -
smart/zroot usedbychildren 0B -
smart/zroot usedbyrefreservation 0B -
smart/zroot objsetid 134 -
smart/zroot refcompressratio 4.24x -
smart/zroot written 773M -
smart/zroot logicalused 52.1G -
smart/zroot logicalreferenced 44.5G -
smart/zroot acltype posix local
smart/zroot snapshots_changed Sun Mar 31 0:00:05 2024 local
$ zpool get all smart | rg -v 'default|inherited'
NAME PROPERTY VALUE SOURCE
smart size 226G -
smart capacity 21% -
smart health ONLINE -
smart guid 6014340175109259927 -
smart failmode continue local
smart dedupratio 1.00x -
smart free 176G -
smart allocated 49.7G -
smart readonly off -
smart ashift 13 local
smart expandsize - -
smart freeing 0 -
smart fragmentation 43% -
smart leaked 0 -
smart checkpoint - -
smart load_guid 7630387544252668255 -
smart bcloneused 0 -
smart bclonesaved 0 -
smart bcloneratio 1.00x -
smart feature@async_destroy enabled local
smart feature@empty_bpobj active local
smart feature@lz4_compress active local
smart feature@multi_vdev_crash_dump enabled local
smart feature@spacemap_histogram active local
smart feature@enabled_txg active local
smart feature@hole_birth active local
smart feature@extensible_dataset active local
smart feature@embedded_data active local
smart feature@bookmarks enabled local
smart feature@filesystem_limits enabled local
smart feature@large_blocks enabled local
smart feature@large_dnode active local
smart feature@sha512 enabled local
smart feature@skein enabled local
smart feature@edonr enabled local
smart feature@userobj_accounting active local
smart feature@encryption enabled local
smart feature@project_quota active local
smart feature@device_removal enabled local
smart feature@obsolete_counts enabled local
smart feature@zpool_checkpoint enabled local
smart feature@spacemap_v2 active local
smart feature@allocation_classes enabled local
smart feature@resilver_defer enabled local
smart feature@bookmark_v2 enabled local
smart feature@redaction_bookmarks enabled local
smart feature@redacted_datasets enabled local
smart feature@bookmark_written enabled local
smart feature@log_spacemap active local
smart feature@livelist enabled local
smart feature@device_rebuild enabled local
smart feature@zstd_compress enabled local
smart feature@draid enabled local
smart feature@zilsaxattr active local
smart feature@head_errlog active local
smart feature@blake3 enabled local
smart feature@block_cloning enabled local
smart feature@vdev_zaps_v2 active local
smart unsupported@com.delphix:redaction_list_spill inactive local
smart unsupported@org.openzfs:raidz_expansion inactive local
@lnicola thanks for the info. I don't have any obvious ideas either, but if you're in a position to try, here's a couple of ideas.
One test without a recompile: set zfs_vdev_disk_classic=1 at ZFS module load time. This will rule in/out a bunch of changes in how IO is handed off to the kernel.
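For reference, a couple of standard ways to set a ZFS module parameter (paths here are the usual Linux defaults; adjust for your distro):

```sh
# Persistent: takes effect the next time the zfs module loads.
echo "options zfs zfs_vdev_disk_classic=1" | sudo tee /etc/modprobe.d/zfs.conf

# Or one-off, on the kernel command line:
#   zfs.zfs_vdev_disk_classic=1
```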
Another thing to try if you can rebuild: revert 39be46f43, then rebuild (you'll need to rebuild from scratch, i.e. autogen -> configure -> make). That re-enables some long-dormant codepaths, which might be in play based on the most recent stack traces.
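Roughly, a sketch of that flow, using the from-scratch steps named above:

```sh
git revert 39be46f43
./autogen.sh
./configure
make -j"$(nproc)"
```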
Of course, if you can rebuild, a full bisect between 8f2f6cd2a and 39be46f43 would really help narrow it down.
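Something like this, marking each build good/bad as you test it (each step needs the same from-scratch rebuild as above):

```sh
git bisect start
git bisect bad 39be46f43    # newest commit in the suspect range
git bisect good 8f2f6cd2a   # the build that was known to work
# rebuild + install + test, then repeat with:
#   git bisect good    # or: git bisect bad
```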
Starting a full ZTS run against 6.8.2, just to see if anything shakes out.
I made a VM with the "bad" version of ZFS and transferred the file systems to it, but I can't make it crash :disappointed:.
Test run completed successfully. That's not to say there still isn't a problem, but it does exercise a few weird pressure points at least. Between that and you not seeing a problem on the different VM, we're definitely getting into either something specific to your system, or something very timing-sensitive to trigger.
For mine, at this point probably only bisecting on your original hardware is going to get us anywhere. If we can pin down a specific commit, then we can go over it carefully looking for concurrency bugs and trying to make a theory.
Tried a couple of tests:

- zfs-linux-git on 6.8.2: got a wall of panics
- zfs_vdev_disk_classic=1: everything seems fine
- zfs-linux-git still runs

I still have to get a live image running 6.8.2 so I can try the linked PR.
UPDATE:

- there's a directory I can't read (ls says Input/output error, nothing in dmesg, nothing in zpool status)
- zfs_vdev_disk_classic=1 makes that directory readable
- zed, it doesn't say anything

Not sure if relevant, but I also get a bunch of:
LD [M] /root/zfs/module/spl.o
/root/zfs/module/spl.o: warning: objtool: spl_kmem_cache_create+0x611: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: spl_kmem_cache_destroy+0x22c: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_data_addr+0x77: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_show+0x137: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: __kstat_delete+0x7e: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: __kstat_create+0x2e7: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: kstat_seq_start+0x369: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: taskq_destroy+0x1df: spl_panic() is missing a __noreturn annotation
/root/zfs/module/spl.o: warning: objtool: taskq_create_synced+0xcb: spl_panic() is missing a __noreturn annotation
LD [M] /root/zfs/module/zfs.o
/root/zfs/module/zfs.o: warning: objtool: luaD_throw() falls through to next function resume_error()
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_x64+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_ssse3+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_avx+0x1d: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha256_transform_avx2+0x1c: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_x64+0x20: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_avx+0x20: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: zfs_sha512_transform_avx2+0x1c: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: _aesni_ctr32_ghash_6x+0x57a: return with modified stack frame
/root/zfs/module/zfs.o: warning: objtool: _aesni_ctr32_ghash_no_movbe_6x+0x59a: return with modified stack frame
/root/zfs/module/zfs.o: warning: objtool: aesni_gcm_decrypt+0x48: unsupported stack pointer realignment
/root/zfs/module/zfs.o: warning: objtool: aesni_gcm_encrypt+0x52: unsupported stack pointer realignment
and
INSTALL /lib/modules/6.8.2-arch2-1/extra/spl.ko
SIGN /lib/modules/6.8.2-arch2-1/extra/spl.ko
At main.c:167:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: crypto/bio/bss_file.c:75
sign-file: ./certs/signing_key.pem
ZSTD /lib/modules/6.8.2-arch2-1/extra/spl.ko.zst
INSTALL /lib/modules/6.8.2-arch2-1/extra/zfs.ko
SIGN /lib/modules/6.8.2-arch2-1/extra/zfs.ko
At main.c:167:
- SSL error:FFFFFFFF80000002:system library::No such file or directory: crypto/bio/bss_file.c:67
- SSL error:10000080:BIO routines::no such file: crypto/bio/bss_file.c:75
sign-file: ./certs/signing_key.pem
ZSTD /lib/modules/6.8.2-arch2-1/extra/zfs.ko.zst
DEPMOD /lib/modules/6.8.2-arch2-1
UPDATE 2: zfs_vdev_disk_classic=1 fixes the original panics.

@lnicola this is promising. Thank you. I'm going to review the differences between 6.7 and 6.8 just to be sure, but I'm hopeful that this is actually the fix.
(Regarding the build warnings: they're unrelated and fine - mostly it's saying "you could have been a little tidier here, human".)
System information
Describe the problem you're observing
Error message in the logs every two minutes.
2024.02.26.r9034.g8f2f6cd2ac_6.8.1.arch1.1-1 was fine.
Describe how to reproduce the problem
Don't know, I just upgraded and rebooted.
Include any warning/errors/backtraces from the system logs