Closed sirn closed 1 year ago
Wonder if there's something spicy about AVX512 save/restore that's not captured in our custom save/restore. @AttilaFueloep
I can't imagine sapphire rapids' save/restore context for avx512 is very different from any other avx512 CPU. The VM detail seems germaine, though. Maybe the VM is stomping on avx512 register state because it doesn't realize it's on an avx512 CPU?
Well, I guess it must have or it wouldn't have selected the avx512 Fletcher4 implementation. Something tells me the fault is somewhere in the hypervisor though.
Maybe the VM is stomping on avx512 register state because it doesn't realize it's on an avx512 CPU?
FWIW, the VM is running Windows 11 Pro with host-passthrough
(4 vCPUs). It was always the first VM that got booted up. I will see if a Linux VM can trigger this.
Oh wait no it sounds like the host OS is what's panicking, not the VM. So yeah my guess is that your VM is stomping all over your avx512 context because your hypervisor isn't presenting itself as an avx512 CPU.
This seems like it should break more things than just ZFS if the guest can clobber the host's SIMD state, though...
Yeah, seems likely. Any better luck with a Linux guest?
I've tried a few restarts, but couldn't trigger it on both Windows and Linux VMs. I believe it'll take me a while to try to reproduce this.
However, it does look like to be some sort of CPU bug where it got stuck in a buggy state, that for some reason avx512f + launching a VM triggers it. I'll update once I have more crashes.
On the other hand, I have not yet observed this happening with avx512bw.
This one is new (it crashed at gcm_mode_encrypt_contiguous_blocks
instead of the usual fletcher_4_avx512f_native
), and I believe at the time it's running avx512bw:
[ 1696.497577] BUG: unable to handle page fault for address: ffff9e7989271cff
[ 1696.497583] #PF: supervisor write access in kernel mode
[ 1696.497585] #PF: error_code(0x0003) - permissions violation
[ 1696.497586] PGD 807f601067 P4D 807f601067 PUD 10855b063 PMD 11c71a063 PTE 8000000109271061
[ 1696.497589] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 1696.497592] CPU: 1 PID: 67531 Comm: z_wr_iss Tainted: P O 6.2.15_2 #1
[ 1696.497594] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.497607] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.497608] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.497610] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.497611] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.497612] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.497613] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.497614] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.497615] FS: 0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.497616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.497617] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.497618] PKRU: 55555554
[ 1696.497619] Call Trace:
[ 1696.497621] <TASK>
[ 1696.497623] gcm_mode_encrypt_contiguous_blocks+0x552/0x7c0 [icp]
[ 1696.497631] ? __pfx_aes_encrypt_block+0x10/0x10 [icp]
[ 1696.497639] ? __pfx_aes_encrypt_contiguous_blocks+0x10/0x10 [icp]
[ 1696.497645] aes_encrypt_contiguous_blocks+0xac/0xc0 [icp]
[ 1696.497651] ? __pfx_aes_copy_block+0x10/0x10 [icp]
[ 1696.497657] ? __pfx_aes_xor_block+0x10/0x10 [icp]
[ 1696.497662] crypto_update_uio+0xe8/0x140 [icp]
[ 1696.497671] aes_encrypt_atomic+0x149/0x330 [icp]
[ 1696.497679] crypto_encrypt+0x281/0x350 [icp]
[ 1696.497686] ? zio_crypt_bp_do_hmac_updates+0x6b/0xa0 [zfs]
[ 1696.497744] ? i_mod_hash_find_nosync+0x5f/0xa0 [icp]
[ 1696.497752] ? preempt_count_add+0x47/0xa0
[ 1696.497755] ? up_read+0x37/0x70
[ 1696.497759] zio_do_crypt_uio+0x1dc/0x2d0 [zfs]
[ 1696.497800] zio_do_crypt_data+0x273/0x10a0 [zfs]
[ 1696.497835] ? preempt_count_add+0x47/0xa0
[ 1696.497837] ? up_read+0x37/0x70
[ 1696.497839] spa_do_crypt_abd+0x150/0x330 [zfs]
[ 1696.497897] zio_encrypt+0x300/0x730 [zfs]
[ 1696.497955] zio_execute+0x80/0x120 [zfs]
[ 1696.498007] taskq_thread+0x2e0/0x510 [spl]
[ 1696.498014] ? __pfx_default_wake_function+0x10/0x10
[ 1696.498016] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1696.498066] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1696.498071] kthread+0xe6/0x110
[ 1696.498074] ? __pfx_kthread+0x10/0x10
[ 1696.498076] ret_from_fork+0x29/0x50
[ 1696.498079] </TASK>
[ 1696.498080] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp hid_logitech_dj mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr intel_rapl_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common soundwire_cadence snd_sof_intel_hda snd_sof_pci i10nm_edac snd_sof_xtensa_dsp nfit snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_codec_realtek snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_generic snd_soc_acpi soundwire_bus ledtrig_audio snd_hda_codec_hdmi kvm_intel snd_soc_core snd_compress snd_pcm_dmaengine uvcvideo ac97_bus videobuf2_vmalloc
[ 1696.498116] videobuf2_memops videobuf2_v4l2 i2c_designware_platform snd_hda_intel kvm i2c_designware_core videodev snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio rndis_host iTCO_wdt snd_hda_codec pmt_telemetry videobuf2_common snd_usbmidi_lib cdc_ether intel_pmc_bxt irqbypass pmt_crashlog intel_sdsi pmt_class iTCO_vendor_support snd_hda_core snd_rawmidi acpi_ipmi mei_gsc rapl atlantic igb idxd intel_lpss_pci joydev input_leds mc aquacomputer_d5next usbnet pcspkr intel_vsec snd_hwdep mei_me idxd_bus intel_cstate ipmi_si intel_lpss dca macsec ipmi_devintf snd_pcm i2c_i801 idma64 evdev spi_intel_pci mei virt_dma spi_intel i2c_smbus tpm_crb mac_hid ipmi_msghandler tpm_tis tpm_tis_core tiny_power_button tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod
[ 1696.498165] uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbmouse usbkbd hid_generic usbhid hid zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video wmi zavl(PO) crct10dif_pclmul crc32_pclmul ast crc32c_intel icp(PO) polyval_clmulni drm_display_helper drm_shmem_helper polyval_generic i2c_algo_bit gf128mul ghash_clmulni_intel drm_kms_helper cec sha512_ssse3 ahci rc_core xhci_pci syscopyarea libahci aesni_intel zcommon(PO) sysfillrect xhci_pci_renesas ttm sysimgblt znvpair(PO) crypto_simd xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[ 1696.498201] CR2: ffff9e7989271cff
[ 1696.498203] ---[ end trace 0000000000000000 ]---
[ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.541852] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.541854] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.541855] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.541857] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.541858] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.541859] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.541860] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.541861] FS: 0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.541863] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.541864] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.541865] PKRU: 55555554
[ 1696.541866] note: z_wr_iss[67531] exited with irqs disabled
[ 1696.541885] note: z_wr_iss[67531] exited with preempt_count 1
I also had a Windows VM running at the time of the crash (I almost always have one running for controlling the watercooling loop), though this time it didn't crash when starting the VM, but some time after.
Since both of my original crash and this one seems to indicate RIP: 0010:kfpu_begin
, is it possible this is an issue with something in kfpu_begin
rather than AVX512? There's also some mention about XRSTORS
in SPR Errata (although this is for the server variant)...
Interestingly, I tried running a Windows KVM VM on my Ice Lake box with AVX512, running prime95, and scrubbing a pool using fletcher4 in a loop, and nothing has broken in a day.
(Oh, and kernel 6.2.16, just to be as close as I could get it. I may go try 6.2.15 to see if there's somehow a secret bug fixed in the point release, but I didn't see anything suspicious in the changelog...)
So maybe it really is somehow Sapphire Rapids specific...
Hi, this is just a wild guess, but maybe try to disable preemption with preempt=none
kernel cmdline parameter.
I do not have any specific reason for why this should work - just a hunch.
I did try running y-cruncher with all AVX512 tests turned on for 8 hours, 4 hours with VM running, and no crashes there either. It really seems to require some specific processor state to trigger this crash.
Today I updated the kernel to 6.3.6, and it took me around 7 reboots to trigger a crash, then two in rows. (Not sure if coincidence or the processor somehow stuck in that state across reboot, but after the second crash I did a power cycle, and four reboots later it still doesn't crash…)
Interestingly, with 6.3.6 my call trace now includes fpu_begin
up to kernel handlers:
[ 116.723673] BUG: unable to handle page fault for address: ffff9a0a21324cff
[ 116.723677] #PF: supervisor write access in kernel mode
[ 116.723679] #PF: error_code(0x0003) - permissions violation
[ 116.723680] PGD 807f601067 P4D 807f601067 PUD 1085b4063 PMD 108914063 PTE 8000000121324161
[ 116.723684] Oops: 0003 [#2] PREEMPT SMP NOPTI
[ 116.723686] CPU: 11 PID: 18934 Comm: z_wr_iss Tainted: P D O 6.3.6_1 #1
[ 116.723689] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[ 116.723690] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[ 116.723698] Code: 00 e8 05 49 3e d2 fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 93 73 e5 d2 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 116.723700] RSP: 0018:ffffa5c60372f988 EFLAGS: 00010082
[ 116.723701] RAX: 00000000ffffffff RBX: ffff9a0a047dec00 RCX: ffff9a0a21322000
[ 116.723702] RDX: 00000000ffffffff RSI: ffffffff94112eeb RDI: ffffffff94121573
[ 116.723703] RBP: ffff9a0b355c1000 R08: ffffa5c60372faa0 R09: ffff9a0a088c45c0
[ 116.723704] R10: ffff9a0a40d0c000 R11: 0000000000000cd7 R12: ffffa5c60372fac0
[ 116.723705] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000008
[ 116.723707] FS: 0000000000000000(0000) GS:ffff9a877a4c0000(0000) knlGS:0000000000000000
[ 116.723708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 116.723709] CR2: ffff9a0a21324cff CR3: 000000807e832001 CR4: 0000000000772ee0
[ 116.723711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 116.723712] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 116.723713] PKRU: 55555554
[ 116.723714] Call Trace:
[ 116.723716] <TASK>
[ 116.723717] ? __die+0x1f/0x70
[ 116.723722] ? page_fault_oops+0x159/0x460
[ 116.723726] ? fixup_exception+0x22/0x300
[ 116.723729] ? exc_page_fault+0xca/0x170
[ 116.723733] ? asm_exc_page_fault+0x22/0x30
[ 116.723737] ? kfpu_begin+0x2f/0x70 [zcommon]
[ 116.723742] ? kfpu_begin+0x1d/0x70 [zcommon]
[ 116.723747] fletcher_4_avx512f_native+0x18/0xa0 [zcommon]
[ 116.723751] ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[ 116.723758] ? abd_iterate_func+0xe8/0x1c0 [zfs]
[ 116.723837] ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[ 116.723843] ? abd_fletcher_4_native+0x7b/0xc0 [zfs]
[ 116.723906] ? spl_kmem_cache_alloc+0x66/0x760 [spl]
[ 116.723915] ? percpu_counter_add_batch+0x2e/0xa0
[ 116.723919] ? zio_checksum_compute+0xc3/0x3e0 [zfs]
[ 116.723970] ? __kmem_cache_alloc_node+0x151/0x2a0
[ 116.723974] ? spl_kmem_alloc+0xb6/0x110 [spl]
[ 116.723980] ? spl_kmem_alloc+0xb6/0x110 [spl]
[ 116.723985] ? __kmalloc_node+0x4c/0x150
[ 116.723988] ? spl_kmem_alloc+0xb6/0x110 [spl]
[ 116.723993] ? zio_write_compress+0x745/0x9e0 [zfs]
[ 116.724047] ? zio_checksum_generate+0x49/0x70 [zfs]
[ 116.724097] ? zio_execute+0x80/0x120 [zfs]
[ 116.724145] ? taskq_thread+0x2e0/0x510 [spl]
[ 116.724151] ? __pfx_default_wake_function+0x10/0x10
[ 116.724155] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 116.724204] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 116.724210] ? kthread+0xe6/0x110
[ 116.724212] ? __pfx_kthread+0x10/0x10
[ 116.724214] ? ret_from_fork+0x29/0x50
[ 116.724218] </TASK>
[ 116.724218] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel hid_logitech_hidpp soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp mei_hdcp mei_pxp ipmi_ssif intel_rapl_msr snd_sof intel_rapl_common intel_uncore_frequency snd_sof_utils intel_uncore_frequency_common snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match i10nm_edac snd_soc_acpi nfit soundwire_bus x86_pkg_temp_thermal intel_powerclamp snd_soc_core snd_hda_codec_realtek snd_compress snd_hda_codec_generic snd_hda_codec_hdmi ac97_bus ledtrig_audio snd_pcm_dmaengine kvm_intel snd_hda_intel snd_intel_dspcfg i2c_designware_platform i2c_designware...
[ 116.724254] snd_intel_sdw_acpi snd_usb_audio kvm snd_hda_codec snd_usbmidi_lib irqbypass pmt_telemetry iTCO_wdt intel_lpss_pci rndis_host intel_pmc_bxt mei_gsc pmt_class intel_sdsi acpi_ipmi snd_hda_core snd_rawmidi iTCO_vendor_support cdc_ether intel_lpss rapl intel_cstate joydev input_leds atlantic igb mc snd_hwdep idxd i2c_i801 mei_me ipmi_si hid_logitech_dj evdev aquacomputer_d5next usbnet idma64 spi_intel_pci pcspkr dax_hmem intel_vsec idxd_bus snd_pcm mac_hid dca macsec spi_intel i2c_smbus ipmi_devintf mei virt_dma tpm_crb ipmi_msghandler tpm_tis tiny_power_button tpm_tis_core tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd uhid dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbkbd usbmouse hid_generi...
[ 116.724308] zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy zlua(O) intel_gtt video zavl(PO) wmi ast crct10dif_pclmul crc32_pclmul crc32c_intel icp(PO) drm_shmem_helper polyval_clmulni drm_display_helper i2c_algo_bit ahci polyval_generic gf128mul libahci cec drm_kms_helper ghash_clmulni_intel xhci_pci zcommon(PO) rc_core xhci_pci_renesas syscopyarea sha512_ssse3 sysfillrect znvpair(PO) ttm sysimgblt xhci_hcd libata aesni_intel spl(O) crypto_simd drm cryptd usbcore scsi_mod agpgart scsi_common usb_common
[ 116.724337] CR2: ffff9a0a21324cff
[ 116.724339] ---[ end trace 0000000000000000 ]---
[ 116.727668] pstore: backend (erst) writing error (-28)
[ 116.727671] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[ 116.727678] Code: 00 e8 05 49 3e d2 fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 93 73 e5 d2 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 116.727680] RSP: 0018:ffffa5c3875c3908 EFLAGS: 00010086
[ 116.727682] RAX: 00000000ffffffff RBX: ffff9a0a047dec00 RCX: ffff9a0a2132d000
[ 116.727683] RDX: 00000000ffffffff RSI: ffffffff94112eeb RDI: ffffffff94121573
[ 116.727684] RBP: ffff9a0adf852000 R08: ffffa5c3875c3a20 R09: ffffffffc11ab760
[ 116.727685] R10: 0000000000000000 R11: 000000000000448e R12: ffffa5c3875c3a40
[ 116.727686] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000008
[ 116.727688] FS: 0000000000000000(0000) GS:ffff9a877a4c0000(0000) knlGS:0000000000000000
[ 116.727689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 116.727690] CR2: ffff9a0a21324cff CR3: 000000807e832001 CR4: 0000000000772ee0
[ 116.727692] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 116.727693] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 116.727694] PKRU: 55555554
Hi, this is just a wild guess, but maybe try to disable preemption with
preempt=none
kernel cmdline parameter. I do not have any specific reason for why this should work - just a hunch.
Thanks for the suggestion! I'll try this.
Another member in Level1Techs also suggested disabling AMX with clearcpuid=600
since it seems to cause some troubles with XSAVES in KVM. It'll probably take a while to test, though, since it seemingly happens at a very random pace…
Oh.
I think I might understand the bad place, here.
If XSAVES
will save XTILEDATA
with the bitmask we're using, I don't think Linux's fpregs_state
contains that, so we're going to be underestimating the buffer size we need and going bang.
So we might be well off rn setting the bit to not save/restore AMX, since we're not using them anywhere, and then refactoring how we allocate preservation state, because XTILEDATA is very big indeed, comparatively.
e: I could be wrong, I'm not an expert, but if we were undersizing the save/restore area, I think this might be what that explosion would look like...it could also be an errata, I'm not ruling that out. If it is the errata and we're mispreserving the AMX state, then assuming we're not touching the AMX state, masking off the AMX bits might still be a good plan.
e2: if you wanted a hacky experiment patch that doesn't do any of the checking for if we have the defines for this like would be needed if we actually merged it:
diff --git a/include/os/linux/kernel/linux/simd_x86.h b/include/os/linux/kernel/linux/simd_x86.h
index 660f0d42d..7e1f46e81 100644
--- a/include/os/linux/kernel/linux/simd_x86.h
+++ b/include/os/linux/kernel/linux/simd_x86.h
@@ -290,7 +290,7 @@ kfpu_begin(void)
*/
union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
if (static_cpu_has(X86_FEATURE_XSAVE)) {
- kfpu_save_xsave(&state->xsave, ~0);
+ kfpu_save_xsave(&state->xsave, ~(XFEATURE_MASK_XTILE));
} else if (static_cpu_has(X86_FEATURE_FXSR)) {
kfpu_save_fxsr(&state->fxsave);
} else {
@@ -319,18 +319,18 @@ kfpu_begin(void)
union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
#if defined(HAVE_XSAVES)
if (static_cpu_has(X86_FEATURE_XSAVES)) {
- kfpu_do_xsave("xsaves", &state->xsave, ~0);
+ kfpu_do_xsave("xsaves", &state->xsave, ~(XFEATURE_MASK_XTILE));
return;
}
#endif
#if defined(HAVE_XSAVEOPT)
if (static_cpu_has(X86_FEATURE_XSAVEOPT)) {
- kfpu_do_xsave("xsaveopt", &state->xsave, ~0);
+ kfpu_do_xsave("xsaveopt", &state->xsave, ~(XFEATURE_MASK_XTILE));
return;
}
#endif
if (static_cpu_has(X86_FEATURE_XSAVE)) {
- kfpu_do_xsave("xsave", &state->xsave, ~0);
+ kfpu_do_xsave("xsave", &state->xsave, ~(XFEATURE_MASK_XTILE));
} else if (static_cpu_has(X86_FEATURE_FXSR)) {
kfpu_save_fxsr(&state->fxsave);
} else {
@@ -396,7 +396,7 @@ kfpu_end(void)
union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
if (static_cpu_has(X86_FEATURE_XSAVE)) {
- kfpu_restore_xsave(&state->xsave, ~0);
+ kfpu_restore_xsave(&state->xsave, ~(XFEATURE_MASK_XTILE));
} else if (static_cpu_has(X86_FEATURE_FXSR)) {
kfpu_restore_fxsr(&state->fxsave);
} else {
@@ -415,12 +415,12 @@ kfpu_end(void)
union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
#if defined(HAVE_XSAVES)
if (static_cpu_has(X86_FEATURE_XSAVES)) {
- kfpu_do_xrstor("xrstors", &state->xsave, ~0);
+ kfpu_do_xrstor("xrstors", &state->xsave, ~(XFEATURE_MASK_XTILE));
goto out;
}
#endif
if (static_cpu_has(X86_FEATURE_XSAVE)) {
- kfpu_do_xrstor("xrstor", &state->xsave, ~0);
+ kfpu_do_xrstor("xrstor", &state->xsave, ~(XFEATURE_MASK_XTILE));
} else if (static_cpu_has(X86_FEATURE_FXSR)) {
kfpu_restore_fxsr(&state->fxsave);
} else {
Thanks, I'm testing this patch. Let's see for a few days if I'll have any crashes. For a reference, this is what my dmesg |grep x86/fpu
looks like:
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x400: 'PASID state'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: xstate_offset[5]: 832, xstate_sizes[5]: 64
[ 0.000000] x86/fpu: xstate_offset[6]: 896, xstate_sizes[6]: 512
[ 0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[ 0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]: 8
[ 0.000000] x86/fpu: xstate_offset[10]: 2440, xstate_sizes[10]: 8
[ 0.000000] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]: 64
[ 0.000000] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
[ 0.000000] x86/fpu: Enabled xstate features 0x606e7, context size is 10752 bytes, using 'compacted' format.
I've been trying to reproduce the crash with the above patch applied, with multiple reboots and multiple VM relaunches over the past 2 days. So far, not a single crash yet.
Never noticed this on Zen 4 (supports AVX512 but not AMX) so this one seems to be an Intel specific-issue. Nice finding!
Not just Intel-specific, either, it doesn't happen on my Ice Lake box (which has AVX512F/BW, but not AMX).
I have not had a single crash in over a week after applying @rincebrain's patch (even after countless reboots & VM start/shutdown). I think XTILEDATA might indeed be the cause of the issue.
The system (the host) will become unresponsive soon after this happens, and any attempt to access the filesystem will simply freeze.
Does this show as slight IO wait in the beginning (once the first VM starts using a ZFS volume) that then grows over time until the entire file system gets unresponsive (although at the same time iostat
shows no actual disk activity at all but zd0
as 100% busy)?
I stumbled over this issue while using ZFS with Proxmox on a new build of Intel Xeon Silver ~4116+~ 4416+ servers. File system getting unresponsive after a while with consistently growing IO wait—like nearly linearly growing in small discrete steps once the first VM is launched. There is nothing special about that VM (just a plain and empty Debian VM), and I can reproduce that every time I reinstall the entire host system.
Not sure if it‘s really the same issue at play, but the only error I found in the syslog is a similar page fault with RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
, and disabling AMX support in the kernel (via clearcpuid=600
) makes it go away entirely. So this Github issue seems to be the closest I have found so far.
Xeon Silver 4116 should be Skylake, not SPR, so not this issue, whatever's broken, I think.
I can't...imagine why clearing AMX would matter on those CPUs since I just double-checked, and they absolutely shouldn't have that.
(4416, on the other hand, if you misspoke, would indeed be SPR, and potentially blow up like this.)
Xeon Silver 4116 should be Skylake, not SPR, so not this issue, whatever's broken, I think.
Apologies for the typo, 4416+ (Sapphire Rapid). The 4116 would be quite a dated choice for a new build.
Just want to report that the spr nobiscuit patch works great, I have not had a single crash for over a month already :-)
Does this show as slight IO wait in the beginning (once the first VM starts using a ZFS volume) that then grows over time until the entire file system gets unresponsive (although at the same time
iostat
shows no actual disk activity at all butzd0
as 100% busy)?
I have never actually measured this, but some disk access was possible for a brief moment after the crash. I've also seen reports in Level1Techs forum running Xeon Gold 6438Y+ having this issue on Proxmox as well (with Linux guest), so at least this does seem to affect all SPRs.
Sigh. Got it by that one. The fix is in zfs 2.2, I'm told. Hasn't been backported to 2.1.x.
System information
Describe the problem you're observing
During some circumstances, ZFS may panic in
fletcher_4_avx512f_native
while starting a VM on Xeon 4th Gen (Sapphire Rapids). The system (the host) will become unresponsive soon after this happens, and any attempt to access the filesystem will simply freeze. (e.g., running a new program, but an already-running program continues to work as long as it doesn't do any filesystem access).The problem is usually goes away on the next boot after pressing a reset button. So far, I've been trying to find a way to reproduce this, but it seems to happen at random. During all crashes, I've observed the following:
The machine is running on Xeon w9-3495X with ECC memory (everything stock, no overclocks). IPMI shows no errors in the event logs.
I've also observed the crash with VM on zvol, and with VM as a disk image on a ZFS mount.
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
This is from the latest crash:
This is from the subsequent reboot:
zpool:
I've also posted a full dmesg of two occasions of a crash here (there are more, but I have not been able to capture): https://gist.github.com/sirn/0a9489444b4e9627ee5c2aa1bf60c242