Page faults on Sapphire Rapids

sirn commented 1 year ago

System information

Type	Version/Name
Distribution Name	Void Linux
Distribution Version	N/A (rolling)
Kernel Version	6.2.15
Architecture	x86_64
OpenZFS Version	2.1.12

Describe the problem you're observing

During some circumstances, ZFS may panic in fletcher_4_avx512f_native while starting a VM on Xeon 4th Gen (Sapphire Rapids). The system (the host) will become unresponsive soon after this happens, and any attempt to access the filesystem will simply freeze. (e.g., running a new program, but an already-running program continues to work as long as it doesn't do any filesystem access).

The problem is usually goes away on the next boot after pressing a reset button. So far, I've been trying to find a way to reproduce this, but it seems to happen at random. During all crashes, I've observed the following:

The panic only happens when avx512f was chosen
It only happens when I start a VM
If it doesn't happen the first time I start a VM in that boot (even when avx512 was chosen), then it won't ever happen until I reboot

The machine is running on Xeon w9-3495X with ECC memory (everything stock, no overclocks). IPMI shows no errors in the event logs.

I've also observed the crash with VM on zvol, and with VM as a disk image on a ZFS mount.

Describe how to reproduce the problem

Power-off the machine
Boot the OS
Start a VM
Crashes with the following backtrace
Subsequent reboot may work even when avx512f is selected

Include any warning/errors/backtraces from the system logs

This is from the latest crash:

[   70.390218] BUG: unable to handle page fault for address: ffff96efe92cdcff
[   70.390221] #PF: supervisor write access in kernel mode
[   70.390222] #PF: error_code(0x0003) - permissions violation
[   70.390223] PGD 807f601067 P4D 807f601067 PUD 1085d8063 PMD 132d13063 PTE 80000001292cd161
[   70.390226] Oops: 0003 [#1] PREEMPT SMP NOPTI
[   70.390228] CPU: 0 PID: 1671 Comm: z_rd_int_8 Tainted: P           O       6.2.15_2 #1
[   70.390230] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1 02/15/2023
[   70.390231] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[   70.390237] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[   70.390238] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[   70.390240] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[   70.390241] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[   70.390242] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[   70.390242] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[   70.390243] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[   70.390244] FS:  0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[   70.390245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.390246] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[   70.390247] PKRU: 55555554
[   70.390247] Call Trace:
[   70.390249]  <TASK>
[   70.390251]  fletcher_4_avx512f_native+0x18/0xa0 [zcommon]
[   70.390254]  ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[   70.390259]  ? abd_iterate_func+0xe8/0x1c0 [zfs]
[   70.390321]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[   70.390325]  ? abd_fletcher_4_native+0x7b/0xc0 [zfs]
[   70.390369]  ? preempt_count_add+0x47/0xa0
[   70.390373]  ? abd_iter_map+0x6c/0xb0 [zfs]
[   70.390411]  ? abd_copy_to_buf_off_cb+0x1b/0x30 [zfs]
[   70.390438]  ? abd_iter_advance+0x3f/0x70 [zfs]
[   70.390474]  ? abd_iterate_func+0x10d/0x1c0 [zfs]
[   70.390500]  ? zio_checksum_error_impl+0xfb/0x630 [zfs]
[   70.390536]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[   70.390570]  ? update_load_avg+0x7e/0x780
[   70.390573]  ? newidle_balance+0x2eb/0x420
[   70.390575]  ? vdev_queue_io_to_issue+0x3db/0xc40 [zfs]
[   70.390622]  ? zio_checksum_error+0x64/0xd0 [zfs]
[   70.390659]  ? zio_checksum_verify+0x36/0x140 [zfs]
[   70.390698]  ? preempt_count_add+0x6a/0xa0
[   70.390699]  ? _raw_spin_lock+0x13/0x40
[   70.390702]  ? _raw_spin_unlock+0x12/0x30
[   70.390703]  ? zio_wait_for_children+0x88/0xc0 [zfs]
[   70.390740]  ? zio_vdev_io_assess+0x20/0x2d0 [zfs]
[   70.390776]  ? zio_execute+0x80/0x120 [zfs]
[   70.390812]  ? taskq_thread+0x2e0/0x510 [spl]
[   70.390817]  ? __pfx_default_wake_function+0x10/0x10
[   70.390819]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[   70.390855]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[   70.390859]  ? kthread+0xe6/0x110
[   70.390861]  ? __pfx_kthread+0x10/0x10
[   70.390862]  ? ret_from_fork+0x29/0x50
[   70.390865]  </TASK>
[   70.390865] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp snd_usb_audio hid_logitech_dj snd_usbmidi_lib snd_rawmidi mc mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl intel_rapl_msr intel_rapl_common snd_sof_intel_hda_common intel_uncore_frequency intel_uncore_frequency_common soundwire_intel soundwire_generic_allocation i10nm_edac soundwire_cadence nfit snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp x86_pkg_temp_thermal intel_powerclamp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core kvm_intel snd_hda_codec_realtek snd_hda_codec_generic snd_compress i2c_designware_platform i2c_designware_core
[   70.390894]  snd_hda_codec_hdmi ledtrig_audio snd_pcm_dmaengine ac97_bus kvm snd_hda_intel snd_intel_dspcfg iTCO_wdt intel_pmc_bxt snd_intel_sdw_acpi snd_hda_codec mei_gsc irqbypass snd_hda_core intel_lpss_pci pmt_crashlog iTCO_vendor_support rndis_host pmt_telemetry mei_me rapl snd_hwdep intel_lpss acpi_ipmi i2c_i801 atlantic igb intel_sdsi idma64 cdc_ether spi_intel_pci pmt_class idxd intel_cstate pcspkr tpm_crb intel_vsec idxd_bus input_leds usbnet macsec snd_pcm aquacomputer_d5next dca virt_dma mei spi_intel i2c_smbus ipmi_si ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm tiny_power_button rng_core evdev joydev button mac_hid nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c
[   70.390936]  crc32c_generic cuse fuse usbkbd hid_generic usbmouse usbhid hid i915 zfs(PO) zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video zavl(PO) wmi crct10dif_pclmul ast icp(PO) crc32_pclmul drm_display_helper drm_shmem_helper crc32c_intel i2c_algo_bit polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel cec sha512_ssse3 drm_kms_helper xhci_pci ahci xhci_pci_renesas rc_core zcommon(PO) syscopyarea aesni_intel znvpair(PO) libahci sysfillrect ttm crypto_simd sysimgblt xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[   70.390960] CR2: ffff96efe92cdcff
[   70.390961] ---[ end trace 0000000000000000 ]---
[   70.435252] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[   70.435259] Code: 00 e8 f5 46 c2 db fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 c3 77 68 dc 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[   70.435261] RSP: 0018:ffffbbb848b5b908 EFLAGS: 00010086
[   70.435262] RAX: 00000000ffffffff RBX: ffff96efdae5ac00 RCX: ffff96efe92cb000
[   70.435264] RDX: 00000000ffffffff RSI: ffffffff9d10adb1 RDI: ffffffff9d1191e3
[   70.435265] RBP: ffff96f11a190000 R08: ffffbbb848b5ba20 R09: ffffffffc0e288a0
[   70.435266] R10: 0000000000000000 R11: 00000000c12c4cc9 R12: ffffbbb848b5ba40
[   70.435267] R13: 0000000000010000 R14: 0000000000000000 R15: 0000000000000008
[   70.435268] FS:  0000000000000000(0000) GS:ffff976d3a200000(0000) knlGS:0000000000000000
[   70.435269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.435270] CR2: ffff96efe92cdcff CR3: 000000807e810005 CR4: 0000000000772ef0
[   70.435272] PKRU: 55555554
[   70.435273] note: z_rd_int_8[1671] exited with irqs disabled
[   70.435280] note: z_rd_int_8[1671] exited with preempt_count 2

This is from the subsequent reboot:

$ cat /proc/spl/kstat/zfs/fletcher_4_bench           
0 0 0x01 -1 0 3841410212 23164682577818
implementation   native         byteswap       
scalar           6131494505     3682395754     
superscalar      6893107063     5785582061     
superscalar4     5963349527     5956917028     
sse2             9968202613     7153706558     
ssse3            10518859344    10243226016    
avx2             18925912723    13055030872    
avx512f          27811684645    9326706825     
avx512bw         25533654531    20591705983    
fastest          avx512f        avx512bw

zpool:

$ zpool status                                                                                                       
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:03:12 with 0 errors on Sat Jun 17 15:05:39 2023
config:

        NAME                                                     STATE     READ WRITE CKSUM
        zroot                                                    ONLINE       0     0     0
          mirror-0                                               ONLINE       0     0     0
            nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011201I6N-part2  ONLINE       0     0     0
            nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011101I0I-part2  ONLINE       0     0     0
          mirror-1                                               ONLINE       0     0     0
            nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011201I3R-part1  ONLINE       0     0     0
            nvme-SOLIDIGM_SSDPFKKW020X7_SSC1N514011301H6Y-part1  ONLINE       0     0     0

errors: No known data errors

I've also posted a full dmesg of two occasions of a crash here (there are more, but I have not been able to capture): https://gist.github.com/sirn/0a9489444b4e9627ee5c2aa1bf60c242

rincebrain commented 1 year ago

Wonder if there's something spicy about AVX512 save/restore that's not captured in our custom save/restore. @AttilaFueloep

KungFuJesus commented 1 year ago

I can't imagine sapphire rapids' save/restore context for avx512 is very different from any other avx512 CPU. The VM detail seems germaine, though. Maybe the VM is stomping on avx512 register state because it doesn't realize it's on an avx512 CPU?

Well, I guess it must have or it wouldn't have selected the avx512 Fletcher4 implementation. Something tells me the fault is somewhere in the hypervisor though.

sirn commented 1 year ago

Maybe the VM is stomping on avx512 register state because it doesn't realize it's on an avx512 CPU?

FWIW, the VM is running Windows 11 Pro with host-passthrough (4 vCPUs). It was always the first VM that got booted up. I will see if a Linux VM can trigger this.

KungFuJesus commented 1 year ago

Oh wait no it sounds like the host OS is what's panicking, not the VM. So yeah my guess is that your VM is stomping all over your avx512 context because your hypervisor isn't presenting itself as an avx512 CPU.

rincebrain commented 1 year ago

This seems like it should break more things than just ZFS if the guest can clobber the host's SIMD state, though...

KungFuJesus commented 1 year ago

Yeah, seems likely. Any better luck with a Linux guest?

sirn commented 1 year ago

I've tried a few restarts, but couldn't trigger it on both Windows and Linux VMs. I believe it'll take me a while to try to reproduce this.

However, it does look like to be some sort of CPU bug where it got stuck in a buggy state, that for some reason avx512f + launching a VM triggers it. I'll update once I have more crashes.

On the other hand, I have not yet observed this happening with avx512bw.

sirn commented 1 year ago

This one is new (it crashed at gcm_mode_encrypt_contiguous_blocks instead of the usual fletcher_4_avx512f_native), and I believe at the time it's running avx512bw:

[ 1696.497577] BUG: unable to handle page fault for address: ffff9e7989271cff
[ 1696.497583] #PF: supervisor write access in kernel mode
[ 1696.497585] #PF: error_code(0x0003) - permissions violation
[ 1696.497586] PGD 807f601067 P4D 807f601067 PUD 10855b063 PMD 11c71a063 PTE 8000000109271061
[ 1696.497589] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 1696.497592] CPU: 1 PID: 67531 Comm: z_wr_iss Tainted: P           O       6.2.15_2 #1
[ 1696.497594] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[ 1696.497595] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.497607] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.497608] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.497610] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.497611] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.497612] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.497613] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.497614] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.497615] FS:  0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.497616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.497617] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.497618] PKRU: 55555554
[ 1696.497619] Call Trace:
[ 1696.497621]  <TASK>
[ 1696.497623]  gcm_mode_encrypt_contiguous_blocks+0x552/0x7c0 [icp]
[ 1696.497631]  ? __pfx_aes_encrypt_block+0x10/0x10 [icp]
[ 1696.497639]  ? __pfx_aes_encrypt_contiguous_blocks+0x10/0x10 [icp]
[ 1696.497645]  aes_encrypt_contiguous_blocks+0xac/0xc0 [icp]
[ 1696.497651]  ? __pfx_aes_copy_block+0x10/0x10 [icp]
[ 1696.497657]  ? __pfx_aes_xor_block+0x10/0x10 [icp]
[ 1696.497662]  crypto_update_uio+0xe8/0x140 [icp]
[ 1696.497671]  aes_encrypt_atomic+0x149/0x330 [icp]
[ 1696.497679]  crypto_encrypt+0x281/0x350 [icp]
[ 1696.497686]  ? zio_crypt_bp_do_hmac_updates+0x6b/0xa0 [zfs]
[ 1696.497744]  ? i_mod_hash_find_nosync+0x5f/0xa0 [icp]
[ 1696.497752]  ? preempt_count_add+0x47/0xa0
[ 1696.497755]  ? up_read+0x37/0x70
[ 1696.497759]  zio_do_crypt_uio+0x1dc/0x2d0 [zfs]
[ 1696.497800]  zio_do_crypt_data+0x273/0x10a0 [zfs]
[ 1696.497835]  ? preempt_count_add+0x47/0xa0
[ 1696.497837]  ? up_read+0x37/0x70
[ 1696.497839]  spa_do_crypt_abd+0x150/0x330 [zfs]
[ 1696.497897]  zio_encrypt+0x300/0x730 [zfs]
[ 1696.497955]  zio_execute+0x80/0x120 [zfs]
[ 1696.498007]  taskq_thread+0x2e0/0x510 [spl]
[ 1696.498014]  ? __pfx_default_wake_function+0x10/0x10
[ 1696.498016]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 1696.498066]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 1696.498071]  kthread+0xe6/0x110
[ 1696.498074]  ? __pfx_kthread+0x10/0x10
[ 1696.498076]  ret_from_fork+0x29/0x50
[ 1696.498079]  </TASK>
[ 1696.498080] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat hid_logitech_hidpp hid_logitech_dj mei_hdcp mei_pxp ipmi_ssif snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr intel_rapl_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common soundwire_cadence snd_sof_intel_hda snd_sof_pci i10nm_edac snd_sof_xtensa_dsp nfit snd_sof x86_pkg_temp_thermal intel_powerclamp snd_sof_utils snd_soc_hdac_hda snd_hda_codec_realtek snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_generic snd_soc_acpi soundwire_bus ledtrig_audio snd_hda_codec_hdmi kvm_intel snd_soc_core snd_compress snd_pcm_dmaengine uvcvideo ac97_bus videobuf2_vmalloc
[ 1696.498116]  videobuf2_memops videobuf2_v4l2 i2c_designware_platform snd_hda_intel kvm i2c_designware_core videodev snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio rndis_host iTCO_wdt snd_hda_codec pmt_telemetry videobuf2_common snd_usbmidi_lib cdc_ether intel_pmc_bxt irqbypass pmt_crashlog intel_sdsi pmt_class iTCO_vendor_support snd_hda_core snd_rawmidi acpi_ipmi mei_gsc rapl atlantic igb idxd intel_lpss_pci joydev input_leds mc aquacomputer_d5next usbnet pcspkr intel_vsec snd_hwdep mei_me idxd_bus intel_cstate ipmi_si intel_lpss dca macsec ipmi_devintf snd_pcm i2c_i801 idma64 evdev spi_intel_pci mei virt_dma spi_intel i2c_smbus tpm_crb mac_hid ipmi_msghandler tpm_tis tpm_tis_core tiny_power_button tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap uhid hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd dm_mod
[ 1696.498165]  uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbmouse usbkbd hid_generic usbhid hid zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy intel_gtt zlua(O) video wmi zavl(PO) crct10dif_pclmul crc32_pclmul ast crc32c_intel icp(PO) polyval_clmulni drm_display_helper drm_shmem_helper polyval_generic i2c_algo_bit gf128mul ghash_clmulni_intel drm_kms_helper cec sha512_ssse3 ahci rc_core xhci_pci syscopyarea libahci aesni_intel zcommon(PO) sysfillrect xhci_pci_renesas ttm sysimgblt znvpair(PO) crypto_simd xhci_hcd spl(O) libata cryptd drm usbcore scsi_mod agpgart scsi_common usb_common
[ 1696.498201] CR2: ffff9e7989271cff
[ 1696.498203] ---[ end trace 0000000000000000 ]---
[ 1696.541839] RIP: 0010:kfpu_begin+0x2f/0x70 [icp]
[ 1696.541852] Code: 00 e8 35 b7 55 e6 fa 0f 1f 44 00 00 48 8b 1d d8 0f c6 ff e8 03 e8 fb e6 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[ 1696.541854] RSP: 0018:ffffad15e50075b0 EFLAGS: 00010096
[ 1696.541855] RAX: 00000000ffffffff RBX: ffff9e7984dae400 RCX: ffff9e798926f000
[ 1696.541857] RDX: 00000000ffffffff RSI: ffffffffa810adb1 RDI: ffffffffa81191e3
[ 1696.541858] RBP: ffff9e7ba31c0000 R08: 0000000000000000 R09: 0000000007af41f3
[ 1696.541859] R10: ffff9ef6fa272de8 R11: 0000000000000140 R12: 0000000000000000
[ 1696.541860] R13: 0000000000007fe0 R14: 0000000000000010 R15: ffffad15e5007980
[ 1696.541861] FS:  0000000000000000(0000) GS:ffff9ef6fa240000(0000) knlGS:0000000000000000
[ 1696.541863] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1696.541864] CR2: ffff9e7989271cff CR3: 000000807e810006 CR4: 0000000000772ee0
[ 1696.541865] PKRU: 55555554
[ 1696.541866] note: z_wr_iss[67531] exited with irqs disabled
[ 1696.541885] note: z_wr_iss[67531] exited with preempt_count 1

I also had a Windows VM running at the time of the crash (I almost always have one running for controlling the watercooling loop), though this time it didn't crash when starting the VM, but some time after.

Since both of my original crash and this one seems to indicate RIP: 0010:kfpu_begin, is it possible this is an issue with something in kfpu_begin rather than AVX512? There's also some mention about XRSTORS in SPR Errata (although this is for the server variant)...

rincebrain commented 1 year ago

Interestingly, I tried running a Windows KVM VM on my Ice Lake box with AVX512, running prime95, and scrubbing a pool using fletcher4 in a loop, and nothing has broken in a day.

(Oh, and kernel 6.2.16, just to be as close as I could get it. I may go try 6.2.15 to see if there's somehow a secret bug fixed in the point release, but I didn't see anything suspicious in the changelog...)

So maybe it really is somehow Sapphire Rapids specific...

filip-paczynski commented 1 year ago

Hi, this is just a wild guess, but maybe try to disable preemption with preempt=none kernel cmdline parameter. I do not have any specific reason for why this should work - just a hunch.

sirn commented 1 year ago

I did try running y-cruncher with all AVX512 tests turned on for 8 hours, 4 hours with VM running, and no crashes there either. It really seems to require some specific processor state to trigger this crash.

Today I updated the kernel to 6.3.6, and it took me around 7 reboots to trigger a crash, then two in rows. (Not sure if coincidence or the processor somehow stuck in that state across reboot, but after the second crash I did a power cycle, and four reboots later it still doesn't crash…)

Interestingly, with 6.3.6 my call trace now includes fpu_begin up to kernel handlers:

[  116.723673] BUG: unable to handle page fault for address: ffff9a0a21324cff
[  116.723677] #PF: supervisor write access in kernel mode
[  116.723679] #PF: error_code(0x0003) - permissions violation
[  116.723680] PGD 807f601067 P4D 807f601067 PUD 1085b4063 PMD 108914063 PTE 8000000121324161
[  116.723684] Oops: 0003 [#2] PREEMPT SMP NOPTI
[  116.723686] CPU: 11 PID: 18934 Comm: z_wr_iss Tainted: P      D    O       6.3.6_1 #1
[  116.723689] Hardware name: Supermicro Super Server/X13SWA-TF, BIOS 1.1a 04/25/2023
[  116.723690] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[  116.723698] Code: 00 e8 05 49 3e d2 fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 93 73 e5 d2 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[  116.723700] RSP: 0018:ffffa5c60372f988 EFLAGS: 00010082
[  116.723701] RAX: 00000000ffffffff RBX: ffff9a0a047dec00 RCX: ffff9a0a21322000
[  116.723702] RDX: 00000000ffffffff RSI: ffffffff94112eeb RDI: ffffffff94121573
[  116.723703] RBP: ffff9a0b355c1000 R08: ffffa5c60372faa0 R09: ffff9a0a088c45c0
[  116.723704] R10: ffff9a0a40d0c000 R11: 0000000000000cd7 R12: ffffa5c60372fac0
[  116.723705] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000008
[  116.723707] FS:  0000000000000000(0000) GS:ffff9a877a4c0000(0000) knlGS:0000000000000000
[  116.723708] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  116.723709] CR2: ffff9a0a21324cff CR3: 000000807e832001 CR4: 0000000000772ee0
[  116.723711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  116.723712] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  116.723713] PKRU: 55555554
[  116.723714] Call Trace:
[  116.723716]  <TASK>
[  116.723717]  ? __die+0x1f/0x70
[  116.723722]  ? page_fault_oops+0x159/0x460
[  116.723726]  ? fixup_exception+0x22/0x300
[  116.723729]  ? exc_page_fault+0xca/0x170
[  116.723733]  ? asm_exc_page_fault+0x22/0x30
[  116.723737]  ? kfpu_begin+0x2f/0x70 [zcommon]
[  116.723742]  ? kfpu_begin+0x1d/0x70 [zcommon]
[  116.723747]  fletcher_4_avx512f_native+0x18/0xa0 [zcommon]
[  116.723751]  ? abd_fletcher_4_iter+0x64/0xc0 [zcommon]
[  116.723758]  ? abd_iterate_func+0xe8/0x1c0 [zfs]
[  116.723837]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[  116.723843]  ? abd_fletcher_4_native+0x7b/0xc0 [zfs]
[  116.723906]  ? spl_kmem_cache_alloc+0x66/0x760 [spl]
[  116.723915]  ? percpu_counter_add_batch+0x2e/0xa0
[  116.723919]  ? zio_checksum_compute+0xc3/0x3e0 [zfs]
[  116.723970]  ? __kmem_cache_alloc_node+0x151/0x2a0
[  116.723974]  ? spl_kmem_alloc+0xb6/0x110 [spl]
[  116.723980]  ? spl_kmem_alloc+0xb6/0x110 [spl]
[  116.723985]  ? __kmalloc_node+0x4c/0x150
[  116.723988]  ? spl_kmem_alloc+0xb6/0x110 [spl]
[  116.723993]  ? zio_write_compress+0x745/0x9e0 [zfs]
[  116.724047]  ? zio_checksum_generate+0x49/0x70 [zfs]
[  116.724097]  ? zio_execute+0x80/0x120 [zfs]
[  116.724145]  ? taskq_thread+0x2e0/0x510 [spl]
[  116.724151]  ? __pfx_default_wake_function+0x10/0x10
[  116.724155]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[  116.724204]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[  116.724210]  ? kthread+0xe6/0x110
[  116.724212]  ? __pfx_kthread+0x10/0x10
[  116.724214]  ? ret_from_fork+0x29/0x50
[  116.724218]  </TASK>
[  116.724218] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables bridge stp llc bnep msr nls_iso8859_1 nls_cp437 vfat fat snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel hid_logitech_hidpp soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp mei_hdcp mei_pxp ipmi_ssif intel_rapl_msr snd_sof intel_rapl_common intel_uncore_frequency snd_sof_utils intel_uncore_frequency_common snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match i10nm_edac snd_soc_acpi nfit soundwire_bus x86_pkg_temp_thermal intel_powerclamp snd_soc_core snd_hda_codec_realtek snd_compress snd_hda_codec_generic snd_hda_codec_hdmi ac97_bus ledtrig_audio snd_pcm_dmaengine kvm_intel snd_hda_intel snd_intel_dspcfg i2c_designware_platform i2c_designware...
[  116.724254]  snd_intel_sdw_acpi snd_usb_audio kvm snd_hda_codec snd_usbmidi_lib irqbypass pmt_telemetry iTCO_wdt intel_lpss_pci rndis_host intel_pmc_bxt mei_gsc pmt_class intel_sdsi acpi_ipmi snd_hda_core snd_rawmidi iTCO_vendor_support cdc_ether intel_lpss rapl intel_cstate joydev input_leds atlantic igb mc snd_hwdep idxd i2c_i801 mei_me ipmi_si hid_logitech_dj evdev aquacomputer_d5next usbnet idma64 spi_intel_pci pcspkr dax_hmem intel_vsec idxd_bus snd_pcm mac_hid dca macsec spi_intel i2c_smbus ipmi_devintf mei virt_dma tpm_crb ipmi_msghandler tpm_tis tiny_power_button tpm_tis_core tpm rng_core button nct6775 nct6775_core hwmon_vid coretemp sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost vhost_iotlb tap hci_vhci bluetooth ecdh_generic rfkill ecc crc16 vfio_iommu_type1 vfio iommufd uhid dm_mod uinput userio ppp_generic slhc tun loop nvram btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic cuse fuse usbkbd usbmouse hid_generi...
[  116.724308]  zfs(PO) i915 zunicode(PO) zzstd(O) drm_buddy zlua(O) intel_gtt video zavl(PO) wmi ast crct10dif_pclmul crc32_pclmul crc32c_intel icp(PO) drm_shmem_helper polyval_clmulni drm_display_helper i2c_algo_bit ahci polyval_generic gf128mul libahci cec drm_kms_helper ghash_clmulni_intel xhci_pci zcommon(PO) rc_core xhci_pci_renesas syscopyarea sha512_ssse3 sysfillrect znvpair(PO) ttm sysimgblt xhci_hcd libata aesni_intel spl(O) crypto_simd drm cryptd usbcore scsi_mod agpgart scsi_common usb_common
[  116.724337] CR2: ffff9a0a21324cff
[  116.724339] ---[ end trace 0000000000000000 ]---
[  116.727668] pstore: backend (erst) writing error (-28)
[  116.727671] RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon]
[  116.727678] Code: 00 e8 05 49 3e d2 fa 0f 1f 44 00 00 48 8b 1d 98 af 00 00 e8 93 73 e5 d2 89 c0 48 8b 0c c3 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5b c3 cc cc cc cc 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 0f
[  116.727680] RSP: 0018:ffffa5c3875c3908 EFLAGS: 00010086
[  116.727682] RAX: 00000000ffffffff RBX: ffff9a0a047dec00 RCX: ffff9a0a2132d000
[  116.727683] RDX: 00000000ffffffff RSI: ffffffff94112eeb RDI: ffffffff94121573
[  116.727684] RBP: ffff9a0adf852000 R08: ffffa5c3875c3a20 R09: ffffffffc11ab760
[  116.727685] R10: 0000000000000000 R11: 000000000000448e R12: ffffa5c3875c3a40
[  116.727686] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000008
[  116.727688] FS:  0000000000000000(0000) GS:ffff9a877a4c0000(0000) knlGS:0000000000000000
[  116.727689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  116.727690] CR2: ffff9a0a21324cff CR3: 000000807e832001 CR4: 0000000000772ee0
[  116.727692] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  116.727693] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  116.727694] PKRU: 55555554

sirn commented 1 year ago

Hi, this is just a wild guess, but maybe try to disable preemption with preempt=none kernel cmdline parameter. I do not have any specific reason for why this should work - just a hunch.

Thanks for the suggestion! I'll try this.

Another member in Level1Techs also suggested disabling AMX with clearcpuid=600 since it seems to cause some troubles with XSAVES in KVM. It'll probably take a while to test, though, since it seemingly happens at a very random pace…

rincebrain commented 1 year ago

Oh.

I think I might understand the bad place, here.

If XSAVES will save XTILEDATA with the bitmask we're using, I don't think Linux's fpregs_state contains that, so we're going to be underestimating the buffer size we need and going bang.

So we might be well off rn setting the bit to not save/restore AMX, since we're not using them anywhere, and then refactoring how we allocate preservation state, because XTILEDATA is very big indeed, comparatively.

e: I could be wrong, I'm not an expert, but if we were undersizing the save/restore area, I think this might be what that explosion would look like...it could also be an errata, I'm not ruling that out. If it is the errata and we're mispreserving the AMX state, then assuming we're not touching the AMX state, masking off the AMX bits might still be a good plan.

e2: if you wanted a hacky experiment patch that doesn't do any of the checking for if we have the defines for this like would be needed if we actually merged it:

diff --git a/include/os/linux/kernel/linux/simd_x86.h b/include/os/linux/kernel/linux/simd_x86.h
index 660f0d42d..7e1f46e81 100644
--- a/include/os/linux/kernel/linux/simd_x86.h
+++ b/include/os/linux/kernel/linux/simd_x86.h
@@ -290,7 +290,7 @@ kfpu_begin(void)
         */
        union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
        if (static_cpu_has(X86_FEATURE_XSAVE)) {
-               kfpu_save_xsave(&state->xsave, ~0);
+               kfpu_save_xsave(&state->xsave, ~(XFEATURE_MASK_XTILE));
        } else if (static_cpu_has(X86_FEATURE_FXSR)) {
                kfpu_save_fxsr(&state->fxsave);
        } else {
@@ -319,18 +319,18 @@ kfpu_begin(void)
        union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
 #if defined(HAVE_XSAVES)
        if (static_cpu_has(X86_FEATURE_XSAVES)) {
-               kfpu_do_xsave("xsaves", &state->xsave, ~0);
+               kfpu_do_xsave("xsaves", &state->xsave, ~(XFEATURE_MASK_XTILE));
                return;
        }
 #endif
 #if defined(HAVE_XSAVEOPT)
        if (static_cpu_has(X86_FEATURE_XSAVEOPT)) {
-               kfpu_do_xsave("xsaveopt", &state->xsave, ~0);
+               kfpu_do_xsave("xsaveopt", &state->xsave, ~(XFEATURE_MASK_XTILE));
                return;
        }
 #endif
        if (static_cpu_has(X86_FEATURE_XSAVE)) {
-               kfpu_do_xsave("xsave", &state->xsave, ~0);
+               kfpu_do_xsave("xsave", &state->xsave, ~(XFEATURE_MASK_XTILE));
        } else if (static_cpu_has(X86_FEATURE_FXSR)) {
                kfpu_save_fxsr(&state->fxsave);
        } else {
@@ -396,7 +396,7 @@ kfpu_end(void)
        union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];

        if (static_cpu_has(X86_FEATURE_XSAVE)) {
-               kfpu_restore_xsave(&state->xsave, ~0);
+               kfpu_restore_xsave(&state->xsave, ~(XFEATURE_MASK_XTILE));
        } else if (static_cpu_has(X86_FEATURE_FXSR)) {
                kfpu_restore_fxsr(&state->fxsave);
        } else {
@@ -415,12 +415,12 @@ kfpu_end(void)
        union fpregs_state *state = zfs_kfpu_fpregs[smp_processor_id()];
 #if defined(HAVE_XSAVES)
        if (static_cpu_has(X86_FEATURE_XSAVES)) {
-               kfpu_do_xrstor("xrstors", &state->xsave, ~0);
+               kfpu_do_xrstor("xrstors", &state->xsave, ~(XFEATURE_MASK_XTILE));
                goto out;
        }
 #endif
        if (static_cpu_has(X86_FEATURE_XSAVE)) {
-               kfpu_do_xrstor("xrstor", &state->xsave, ~0);
+               kfpu_do_xrstor("xrstor", &state->xsave, ~(XFEATURE_MASK_XTILE));
        } else if (static_cpu_has(X86_FEATURE_FXSR)) {
                kfpu_restore_fxsr(&state->fxsave);
        } else {

sirn commented 1 year ago

Thanks, I'm testing this patch. Let's see for a few days if I'll have any crashes. For a reference, this is what my dmesg |grep x86/fpu looks like:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x400: 'PASID state'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.000000] x86/fpu: xstate_offset[10]: 2440, xstate_sizes[10]:    8
[    0.000000] x86/fpu: xstate_offset[17]: 2496, xstate_sizes[17]:   64
[    0.000000] x86/fpu: xstate_offset[18]: 2560, xstate_sizes[18]: 8192
[    0.000000] x86/fpu: Enabled xstate features 0x606e7, context size is 10752 bytes, using 'compacted' format.

sirn commented 1 year ago

I've been trying to reproduce the crash with the above patch applied, with multiple reboots and multiple VM relaunches over the past 2 days. So far, not a single crash yet.

admnd commented 1 year ago

Never noticed this on Zen 4 (supports AVX512 but not AMX) so this one seems to be an Intel specific-issue. Nice finding!

rincebrain commented 1 year ago

Not just Intel-specific, either, it doesn't happen on my Ice Lake box (which has AVX512F/BW, but not AMX).

sirn commented 1 year ago

I have not had a single crash in over a week after applying @rincebrain's patch (even after countless reboots & VM start/shutdown). I think XTILEDATA might indeed be the cause of the issue.

hohl commented 1 year ago

The system (the host) will become unresponsive soon after this happens, and any attempt to access the filesystem will simply freeze.

Does this show as slight IO wait in the beginning (once the first VM starts using a ZFS volume) that then grows over time until the entire file system gets unresponsive (although at the same time iostat shows no actual disk activity at all but zd0 as 100% busy)?

I stumbled over this issue while using ZFS with Proxmox on a new build of Intel Xeon Silver ~4116+~ 4416+ servers. File system getting unresponsive after a while with consistently growing IO wait—like nearly linearly growing in small discrete steps once the first VM is launched. There is nothing special about that VM (just a plain and empty Debian VM), and I can reproduce that every time I reinstall the entire host system.

Not sure if it‘s really the same issue at play, but the only error I found in the syslog is a similar page fault with RIP: 0010:kfpu_begin+0x2f/0x70 [zcommon], and disabling AMX support in the kernel (via clearcpuid=600) makes it go away entirely. So this Github issue seems to be the closest I have found so far.

rincebrain commented 1 year ago

Xeon Silver 4116 should be Skylake, not SPR, so not this issue, whatever's broken, I think.

I can't...imagine why clearing AMX would matter on those CPUs since I just double-checked, and they absolutely shouldn't have that.

(4416, on the other hand, if you misspoke, would indeed be SPR, and potentially blow up like this.)

hohl commented 1 year ago

Xeon Silver 4116 should be Skylake, not SPR, so not this issue, whatever's broken, I think.

Apologies for the typo, 4416+ (Sapphire Rapid). The 4116 would be quite a dated choice for a new build.

sirn commented 1 year ago

Just want to report that the spr nobiscuit patch works great, I have not had a single crash for over a month already :-)

Does this show as slight IO wait in the beginning (once the first VM starts using a ZFS volume) that then grows over time until the entire file system gets unresponsive (although at the same time iostat shows no actual disk activity at all but zd0 as 100% busy)?

I have never actually measured this, but some disk access was possible for a brief moment after the crash. I've also seen reports in Level1Techs forum running Xeon Gold 6438Y+ having this issue on Proxmox as well (with Linux guest), so at least this does seem to affect all SPRs.

gl-yziquel commented 8 months ago

Sigh. Got it by that one. The fix is in zfs 2.2, I'm told. Hasn't been backported to 2.1.x.

openzfs / zfs