Regression: intel-ucode 20191113 causes lockup on reboot

ahesford commented 5 years ago

On an Arch Linux installation, intel-ucode 20191113 causes a lockup when rebooting a Dell Precision 5820 workstation with an Intel Xeon W-2145 CPU. The system cold boots fine, but once the system is running, a reboot will cause a lockup when the kernel is reloaded. This is a regression from intel-ucode 20190918 and affects more than one bootloader and at least the linux, linux-zen and linux-lts kernels distributed by Arch. Because these three kernels are all affected and switching to the earlier intel-ucode package resolves the issue, I believe this is an upstream issue rather than a problem with the Arch package.

Additional info:

Affected version: 20191113
Last working version: 20190918

Bootloaders affected:

systemd-boot (from systemd 243, Arch package systemd-243.78-2)
rEFInd (version 0.11.3, Arch package refind-efi-0.11.3-1)

Note: rEFInd was not configured to apply the microcode patch at boot. Instead, the system was cold-booted from systemd-boot to apply the microcode patch, the boot manager was replaced with rEFInd, and the system was warm-rebooted using rEFInd. Thus, the reboot lockup does not appear to be caused by the act of loading the microcode during the warm boot; rather, the CPU locks up on a warm reset once the microcode has been applied at least once.

Kernels affected:

linux (5.3.11, Arch package linux-5.3.11.1-1)
linux longterm (4.19.84, Arch package linux-lts-4.19.84-1)
linux zen (5.3.11, Arch package linux-zen-5.3.11.1-1)

For systemd-boot, the loader entry is:

title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=UUID=[masked] rw
options consoleblank=600
options audit=0
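As a rough illustration of how the loader entry above is structured, the sketch below parses an entry of that shape and checks that the microcode image is the first initrd listed, which is the order the Arch setup relies on. The helper names (`parse_loader_entry`, `microcode_loaded_first`, `kernel_options`) are made up for this example and are not part of systemd-boot.

```python
def parse_loader_entry(text):
    """Parse a loader entry into ordered (key, value) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # key, then the rest of the line
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((key, value))
    return pairs


def microcode_loaded_first(pairs, ucode_name="intel-ucode.img"):
    """True if the first initrd line names the microcode image."""
    initrds = [value for key, value in pairs if key == "initrd"]
    return bool(initrds) and initrds[0].endswith(ucode_name)


def kernel_options(pairs):
    """Join all 'options' lines, in order, into one command line."""
    return " ".join(value for key, value in pairs if key == "options")
```

Feeding the entry from this report through `parse_loader_entry` and then `microcode_loaded_first` is one way to confirm that the microcode image really is being passed to the kernel before the main initramfs.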

For rEFInd, the loader entry was created automatically using refind-install and applied the same kernel arguments (root, consoleblank and audit) as the systemd loader.

Steps to reproduce:

Install Arch Linux and any of the kernels listed above (other kernels may be similarly affected).
Install package intel-ucode 20191113-1.
Configure systemd-boot to boot with a loader entry like that above, making sure to load the microcode during boot.
Cold boot the system; everything should boot as expected.
Invoke "shutdown -r now".
After systemd-boot selects the kernel to boot, the system should hang on the "SHA256 validated" message.
Forcibly power down the system.
[In my case, attempting to turn on the system at this point causes the fans to spin and the system to shut itself down immediately; powering on a second time brings up the system as expected.]
Replace intel-ucode 20191113-1 with version 20190918-1 and confirm that the system boots and reboots as expected.

hmh commented 5 years ago

Can you tell us the output of "cat /proc/cpuinfo" ?

esyr-rh commented 5 years ago

And grep -r . /sys/devices/system/cpu/vulnerabilities as well, please?

ahesford commented 5 years ago

Below is the content of /proc/cpuinfo for the first virtual CPU. This is an eight-core model with hyperthreading, so this block repeats 15 more times. The only differences are the frequencies (which obviously jump around), the core ID (which matches the processor index), and the apicid and initial apicid fields (which have the same value for each processor: twice the processor index for indices 0 through 7, and twice the processor index plus one for indices 8 through 15).

If you want all of the other blocks, please let me know.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
stepping    : 4
microcode   : 0x2000064
cpu MHz     : 1450.713
cache size  : 11264 KB
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips    : 7402.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
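For reference, the family, model and stepping fields in a block like the one above are enough to derive both the CPUID signature and the name of the matching file under intel-ucode/. The following is a minimal sketch under that assumption; the function names are invented for the example.

```python
def parse_cpuinfo_block(text):
    """Turn one /proc/cpuinfo block into a dict of stripped key/value strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def cpuid_signature(family, model, stepping):
    """Pack displayed family/model/stepping into the CPUID(1).EAX layout."""
    base_family = min(family, 0xF)
    ext_family = family - 0xF if family > 0xF else 0
    return (
        (ext_family << 20)
        | ((model >> 4) << 16)      # extended model (high nibble)
        | (base_family << 8)
        | ((model & 0xF) << 4)      # base model (low nibble)
        | (stepping & 0xF)
    )


def ucode_file_name(family, model, stepping):
    """File name convention used in the intel-ucode directory: ff-mm-ss in hex."""
    return "%02x-%02x-%02x" % (family, model, stepping)
```

With the values reported above (family 6, model 85, stepping 4), these helpers give the signature and file name that the rest of this thread refers to.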

ahesford commented 5 years ago

Below are the contents of /sys/devices/system/cpu/vulnerabilities.

/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
/sys/devices/system/cpu/vulnerabilities/itlb_multihit:KVM: Mitigation: Split huge pages
/sys/devices/system/cpu/vulnerabilities/mds:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
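When comparing reports from several machines, output like the above is easier to diff once it is reduced to a mapping from vulnerability name to status. A small sketch, assuming the `path:status` shape produced by the grep -r command requested earlier; the helper names are hypothetical.

```python
def parse_vulnerabilities(grep_output):
    """Map vulnerability name (last path component) to its status string."""
    status = {}
    for line in grep_output.splitlines():
        path, sep, text = line.partition(":")  # split at the first colon only
        if not sep:
            continue
        name = path.rsplit("/", 1)[-1]
        status[name] = text.strip()
    return status


def smt_vulnerable(status):
    """Names whose status mentions 'SMT vulnerable', sorted for stable output."""
    return sorted(name for name, text in status.items() if "SMT vulnerable" in text)
```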

esyr-rh commented 5 years ago

Hm, may I ask you to add the "debug" parameter to the kernel's command line?

ahesford commented 5 years ago

Adding "debug" does not cause a change of behavior after a warm reboot. The system locks up after the systemd-boot "SHA256 validated" message without any additional information. The dmesg output after the first (cold) boot with debug enabled is attached.

How else may I help to isolate this issue?

dmesg.debug.log

sclarkson commented 5 years ago

I'm experiencing this as well.

System information

Mobo: ASUS WS X299 SAGE
CPU: 9920x and 9820x
OS: Ubuntu 18.04.3

I've tried Ubuntu's kernel 5.0.0-36-generic, as well as the mainline 5.3.11. Both exhibit the problem. I've also tried the latest BIOS from ASUS.

Confirmed that upgrading to 20191112 from 20190918 caused the issue.

System hangs when GRUB tries to load the kernel after either running reboot on the command line or pressing the reset button on the motherboard.

esyr-rh commented 5 years ago

It seems that a new release has just been made available [1]; may I ask you to try it out?

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases/tag/microcode-20191115

hagar-dunor commented 5 years ago

Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.

esyr-rh commented 5 years ago

Right, there are no updates for 06-55-04 in microcode-20191115, so there's no point in testing 20191115, my apologies.

hmh commented 5 years ago

Looks server-die related (all reports are from HEDT parts or Xeon parts) ?

Kabylake desktop and mobile (0x806e9, 0x906e9) are not showing the reboot issue here. CoffeeLake mobile (0x906ea) did not show any reboot issues, either.

esyr-rh commented 5 years ago

Looks server-die related (all reports are from HEDT parts or Xeon parts) ?

Both reports are against CPUID 0x50654 parts, yes.

Skylake mobile and desktop (0x806e9, 0x906e9) are not showing the reboot issue here.

FYI, Skylake mobile/desktop have CPUID 0x406e3 and 0x506e3, respectively.
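Since the discussion above switches between CPUID signatures and microcode file names, a small decoder can help keep them straight. This is a sketch of the usual CPUID(1).EAX field layout; `decode_signature` and `ucode_file_name` are names chosen for this example only.

```python
def decode_signature(sig):
    """Split a CPUID signature into displayed (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def ucode_file_name(sig):
    """Name of the matching file in intel-ucode/, e.g. ff-mm-ss in hex."""
    return "%02x-%02x-%02x" % decode_signature(sig)
```

Applied to the signatures mentioned in this thread, the decoder shows that the reports so far all share one family/model/stepping triple, while the unaffected test machines map to different files.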

esyr-rh commented 5 years ago

Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.

May I ask you to try out the microcode-20190918 release, with 06-55-04 microcode revision 0x2000064? Thank you.

hmh commented 5 years ago

esyr-rh: oops, sorry about that! Wrong names but correct CPUIDs: tests were done on 0x806e9, 0x906e9, 0x906ea. None show the reboot issue. (Edited the incorrect post; it now has the proper processor names.)

ahesford commented 5 years ago

I can confirm that the issue persists on a Xeon W-2145 with 20191115, while the issue has never affected a Core i7-8705G in a Dell XPS 15 9575.

esyr-rh commented 5 years ago

I can confirm that the issue persists on a Xeon W-2145 with 20191115,

I apologise for a pointless question regarding 20191115, checking against 20190918 makes much more sense (as that's the release where the previous 0x2000064 revision of 06-55-04 microcode is provided), may I kindly ask to try to do so?

ahesford commented 5 years ago

I can confirm that the issue persists on a Xeon W-2145 with 20191115,

I apologise for a pointless question regarding 20191115, checking against 20190918 makes much more sense (as that's the release where the previous 0x2000064 revision of 06-55-04 microcode is provided), may I kindly ask to try to do so?

I'm not clear about what you'd like me to try; the 20190918 release works perfectly with my CPU.

esyr-rh commented 5 years ago

Sorry, I've forgotten about the fact that you have already provided this information; again, my apologies.

ahesford commented 5 years ago

Sorry, I've forgotten about the fact that you have already provided this information; again, my apologies.

No worries. Please let me know if I can try anything else to illuminate the problem.

ahesford commented 5 years ago

I may have a resolution: the Arch package is built using

iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/

The manpage for iucode_tool indicates that early firmware images must be 16-byte aligned and that the --write-earlyfw option enforces this. Altering the Arch PKGBUILD to use --write-earlyfw instead of -w changes the size of the microcode image and seems to fix the issue. I've warm-rebooted four times with the modified image and see no issues.

It looks to me like Gentoo is using --write-firmware instead of --write-earlyfw as well. Maybe this explains the problem seen by @hagar-dunor.

hagar-dunor commented 5 years ago

ahesford: I'm following this wiki and therefore using --write-earlyfw

To be more specific, this is exactly the command I type:

iucode_tool -S --write-earlyfw=/boot/early_ucode.cpio /lib/firmware/intel-ucode/*

and then I update the GRUB config file, which picks up the microcode.

Are you certain that you actually load the microcode with the modified Arch PKGBUILD? (It must be the first line in your "dmesg".)

esyr-rh: revision 0x2000064 doesn't show the problem (which I extracted using the same command above)

esyr-rh commented 5 years ago

I may have a resolution: the Arch package is built using

iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/

What if only 06-55-04 microcode is added, like iucode_tool --write-earlyfw kernel/x86/microcode/GenuineIntel.bin intel-ucode/06-55-04 (or -w, for that matter)?

ahesford commented 5 years ago

My mistake; I overlooked that --write-earlyfw creates the CPIO archive directly, but -w creates a binary image. Changing the Arch PKGBUILD as I suggest creates an invalid initrd that the microcode update driver ignores.
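The mix-up described above (a raw binary where a CPIO archive was expected) can be caught before rebooting. The sketch below walks an uncompressed "newc" CPIO archive and checks whether it contains kernel/x86/microcode/GenuineIntel.bin, the path used in the Arch build command quoted earlier. It is only an illustration: the function names are made up, and compressed or concatenated initrd images are not handled.

```python
UCODE_PATH = "kernel/x86/microcode/GenuineIntel.bin"
HEADER_LEN = 110  # 6-byte magic followed by 13 fields of 8 hex digits


def _align4(offset):
    """Round an offset up to the next multiple of four."""
    return (offset + 3) & ~3


def list_cpio_newc(data):
    """Return (name, data_offset, size) for each member of a newc archive.

    Returns an empty list if the data does not start with a newc header,
    which is what a raw microcode binary would look like to this parser.
    """
    entries = []
    pos = 0
    while pos + HEADER_LEN <= len(data):
        if data[pos:pos + 6] not in (b"070701", b"070702"):
            break
        fields = [int(data[pos + 6 + 8 * i:pos + 14 + 8 * i], 16) for i in range(13)]
        file_size, name_size = fields[6], fields[11]
        name_start = pos + HEADER_LEN
        # Names are NUL-terminated; some tools add extra NULs as padding.
        name = data[name_start:name_start + name_size].rstrip(b"\0").decode()
        data_start = _align4(name_start + name_size)
        if name == "TRAILER!!!":
            break
        entries.append((name, data_start, file_size))
        pos = _align4(data_start + file_size)
    return entries


def is_early_ucode_image(data):
    """True if the archive contains the Intel early-microcode member."""
    return any(name == UCODE_PATH for name, _, _ in list_cpio_newc(data))
```

Running `is_early_ucode_image(open(path, "rb").read())` on a candidate image is one quick way to tell a usable early initrd from a bare microcode blob. The returned data offset can also be inspected to see where the payload starts within the archive.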

@esyr-rh, a custom early initrd that contains only 06-55-04 still exhibits the warm-reboot issue.

hmh commented 5 years ago

The 16-byte alignment is an old requirement, it might have even been lifted from the IA32 manual nowadays. I need to hunt it down one of these days, and update iucode_tool accordingly...

whpenner commented 5 years ago

The 16-byte alignment has been and continues to be a requirement. This can be found in the Intel(R) 64 and IA-32 Architectures Software Developer's Manual, vol 3A, page 9-34 (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf). "Note that the microcode update must be aligned on a 16-byte boundary and the size of the microcode update must be 1-KByte granular."
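To make the quoted requirement concrete, here is a small sketch that reads the fixed header of an Intel microcode update and tests the two conditions from the note: a 16-byte-aligned start and a size that is a multiple of 1 KiB. The header layout used is the documented one (nine little-endian 32-bit fields followed by reserved bytes); the function names are invented for this example, and extended signature tables are ignored.

```python
import struct

# Header version, update revision, date, processor signature, checksum,
# loader revision, processor flags, data size, total size.
_HEADER = struct.Struct("<9I")


def parse_ucode_header(blob, offset=0):
    """Decode the fixed header fields of one microcode update."""
    (hdr_ver, revision, date, signature, _checksum,
     _loader_rev, pf_mask, data_size, total_size) = _HEADER.unpack_from(blob, offset)
    if data_size == 0:
        # A zero data size denotes the legacy fixed layout.
        data_size, total_size = 2000, 2048
    return {
        "header_version": hdr_ver,
        "revision": revision,
        # The date is stored as BCD in the form 0xMMDDYYYY.
        "date": "%04x-%02x-%02x" % (date & 0xFFFF, date >> 24, (date >> 16) & 0xFF),
        "signature": signature,
        "pf_mask": pf_mask,
        "data_size": data_size,
        "total_size": total_size,
    }


def satisfies_sdm_note(offset, total_size):
    """Check 16-byte alignment of the start and 1-KByte size granularity."""
    return offset % 16 == 0 and total_size % 1024 == 0
```

Here `offset` stands for wherever the update starts in the buffer handed to the loader; how that maps to a physical address depends on how the image is loaded, which this sketch does not model.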

whpenner commented 5 years ago

So, did I read this thread correctly and it is not a microcode issue?

ahesford commented 5 years ago

So, did I read this thread correctly and it is not a microcode issue?

I believe this is a microcode issue. My first attempt with the --write-earlyfw option to iucode_tool was incorrect, producing an invalid image that was ignored by the kernel loader. Subsequent attempts with proper use of --write-earlyfw produce images that are properly loaded but that continue to show the warm-reboot lockup issue.

hmh commented 5 years ago

@whpenner, thanks. I will keep the remark about it in the iucode-tool manual, then.

But current Intel processors do not care. It would be really nice to know if future processors will care, though.

For the record, they don't care about the 1KiB size either, but it is a good idea to ensure that padding is there just in case the dang thing will read-past-the-end and cause a fault.

wendigo commented 4 years ago

Same here. Microcode 3.20191115 on an i7-7820X CPU. The system boots fine, but after a warm reboot it gets stuck.

xwjabc commented 4 years ago

Same here. In dmesg:

microcode: microcode updated early to revision 0x2000065, date = 2019-09-05

Ubuntu 18.04.3 LTS on an i9 9900X CPU. The system boots fine, but after a warm reboot it gets stuck.

However, when I run apt list --installed | grep micro, it shows:

amd64-microcode/bionic-updates,bionic-security,now 3.20191021.1+really3.20181128.1~ubuntu0.18.04.1 amd64 [installed,automatic]
intel-microcode/bionic-updates,bionic-security,now 3.20191115.1ubuntu0.18.04.1 amd64 [installed,automatic]
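Because the package version and the revision actually loaded are two different things, it can help to pull the revision out of the kernel log rather than rely on the package list. A minimal sketch, assuming the message format quoted in the dmesg line above (other kernel versions may word it differently); `parse_early_update` is a name made up for this example.

```python
import re

# Matches the early-update message format quoted in this report.
_EARLY_RE = re.compile(
    r"microcode: microcode updated early to revision (0x[0-9a-fA-F]+), "
    r"date = (\d{4}-\d{2}-\d{2})"
)


def parse_early_update(dmesg_text):
    """Return (revision, date) from the first early-update line, or None."""
    match = _EARLY_RE.search(dmesg_text)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)
```

A result of `None` suggests that no early update was logged, which is the situation to rule out when testing a rebuilt microcode image.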

stevebeattie commented 4 years ago

@xwjabc yes, the intel-microcode 3.20191115.1ubuntu0.18.04.1 (and other 3.20191115.1 packages for other Ubuntu releases) includes:

sig 0x00050654, pf_mask 0xb7, 2019-09-05, rev 0x2000065, size 34816

which is the latest microcode from this repository for your processor class.
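Listing lines in the format shown above are easy to turn into structured data, which helps when checking which revision a package ships for a given signature. The sketch below assumes exactly that line format; `parse_listing` and `revision_for` are hypothetical helper names.

```python
import re

# Matches lines such as: sig 0x..., pf_mask 0x.., YYYY-MM-DD, rev 0x..., size N
_LISTING_RE = re.compile(
    r"sig (0x[0-9a-fA-F]+), pf_mask (0x[0-9a-fA-F]+), "
    r"(\d{4}-\d{2}-\d{2}), rev (0x[0-9a-fA-F]+), size (\d+)"
)


def parse_listing(text):
    """Return one dict per microcode listing line found in the text."""
    entries = []
    for match in _LISTING_RE.finditer(text):
        sig, pf_mask, date, rev, size = match.groups()
        entries.append({
            "signature": int(sig, 16),
            "pf_mask": int(pf_mask, 16),
            "date": date,
            "revision": int(rev, 16),
            "size": int(size),
        })
    return entries


def revision_for(entries, signature):
    """Highest listed revision for a signature, or None if it is absent."""
    revisions = [e["revision"] for e in entries if e["signature"] == signature]
    return max(revisions) if revisions else None
```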

stevebeattie commented 4 years ago

@whpenner we now have reports from Ubuntu users getting hit by this (e.g. https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1854764 ); the way that early load microcode is written to the initramfs in Debian and Ubuntu is with iucode-tool --write-earlyfw, and thus should be 16-byte aligned, so it does appear that this is a problem with the microcode itself.

whpenner commented 4 years ago

Intel has received reports of reboot failures on certain Skylake based Intel® Xeon® W and Intel® Core™ X-series single socket platforms following the OS load of processor microcode revision 0x65. We have received no reports and have no evidence that these failures affect Skylake based Intel® Xeon® Scalable Performance multi-socket platforms. We are debugging the issue to establish root cause. In the interim, processor microcode revision 0x64 remains available for use.

esyr-rh commented 4 years ago

Here is the revision 0x2000064 of 06-55-04 microcode, for the reference: https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/microcode-20190918/intel-ucode/06-55-04

hmh commented 4 years ago

@whpenner, thanks for the official guidance.

hmh commented 4 years ago

There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.

That leaves us in a very nasty position of either telling users to deal with it and never reboot, or going back to revision 0x200005e, which has the JCC erratum and a lot of other nasty issues as far as I know (please correct me if I am wrong about this).

Can Intel give us a tentative timeframe for a fix? Or guidance on the least dangerous workarounds available?

eworm-de commented 4 years ago

No complaints from Arch users...

Possibly those reporting the issue installed a fixed package, then did a warm reboot but had revision 0x65 loaded into their processors already?

hmh commented 4 years ago

No, I asked. They cold-booted into 0x64 :-(

Debian report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946515#37

Ubuntu report: https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1855784

ahesford commented 4 years ago

There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.

I definitely do not see the warm-reboot lockup with version 0x2000064 on the Xeon W-2145. Only the newer version causes the issue.

hmh commented 4 years ago

Time to compare the chip names and/or processor flags :-(

eworm-de commented 4 years ago

The microcode has been updated in release 20200609. Any changes regarding the lockup on reboot?

hmh commented 4 years ago

It should be fixed: we got reports that it was fixed several months ago in an update that was not distributed to the general public.

I will remove the update block on signature 0x50654 in Debian and Ubuntu because of that, and ship the microcode update revision included in this new release (20200609) for signature 0x50654.

(and no, there was nothing I could do about it, I asked for permission to distribute it, and did not get any sort of reply -- maybe there were some issues with that update, or maybe Intel had all hands tied to the fixes released today)

eworm-de commented 4 years ago

Thanks for the info! Will drop the workaround from Arch Linux package then.

ahesford commented 4 years ago

I just tried the new release on the affected system and it survived a warm reboot from the newer microcode. It appears this issue is fixed.

hmh commented 4 years ago

Yes. Note that removing the workaround is a security update in its own right.

intel / Intel-Linux-Processor-Microcode-Data-Files

Regression: intel-ucode 20191113 causes lockup on reboot #21