Closed ahesford closed 4 years ago
Can you tell us the output of "cat /proc/cpuinfo" ?
And grep -r . /sys/devices/system/cpu/vulnerabilities
as well, please?
Below is the content of /proc/cpuinfo for the first virtual CPU. This is an eight-core model with hyperthreading, so this block repeats 15 more times. The only differences are the frequencies (which obviously jump around), the core ID (which matches the processor index), and the apicid and initial apicid fields (which have the same value for each processor: twice the processor index for indices 0 through 7, and twice the processor index plus one for indices 8 through 15).
If you want all of the other blocks, please let me know.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
stepping : 4
microcode : 0x2000064
cpu MHz : 1450.713
cache size : 11264 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7402.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
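For anyone comparing their own machine against this report, here is a minimal sketch of how the fields above map to the microcode file name discussed in this thread (06-55-04). It assumes the file under intel-ucode/ is named after the family, model and stepping as two-digit hex values; the helper names are hypothetical, and the sample is a trimmed copy of the block above.

```python
def parse_cpuinfo_block(text):
    """Parse one 'key : value' block from /proc/cpuinfo into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def ucode_file_name(info):
    """Build the family-model-stepping name (assumed naming scheme)."""
    family = int(info["cpu family"])
    model = int(info["model"])
    stepping = int(info["stepping"])
    return f"{family:02x}-{model:02x}-{stepping:02x}"


# Trimmed copy of the block posted above.
SAMPLE = """\
processor : 0
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
stepping : 4
microcode : 0x2000064
"""

info = parse_cpuinfo_block(SAMPLE)
print(ucode_file_name(info), info["microcode"])
```

On a live system the same function could be fed the first block of /proc/cpuinfo instead of the sample string.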
Below are the contents of /sys/devices/system/cpu/vulnerabilities.
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
/sys/devices/system/cpu/vulnerabilities/itlb_multihit:KVM: Mitigation: Split huge pages
/sys/devices/system/cpu/vulnerabilities/mds:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
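In case it helps others post comparable data, here is a small sketch that turns the `grep -r .` output above into a name-to-status mapping and lists the entries that still report "SMT vulnerable". The function name is hypothetical and the sample is a subset of the lines above.

```python
def parse_vulnerabilities(grep_output):
    """Map each vulnerability name to its status from `grep -r .` output."""
    status = {}
    for line in grep_output.splitlines():
        # The path contains no colon, so the first colon separates path and text.
        path, sep, text = line.partition(":")
        if sep:
            status[path.rsplit("/", 1)[-1]] = text.strip()
    return status


# Subset of the output posted above.
SAMPLE = """\
/sys/devices/system/cpu/vulnerabilities/mds:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
"""

status = parse_vulnerabilities(SAMPLE)
smt_exposed = sorted(name for name, text in status.items() if "SMT vulnerable" in text)
print(smt_exposed)
```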
Hm, may I ask you to add the "debug" parameter to the kernel command line?
Adding "debug" does not cause a change of behavior after a warm reboot. The system locks up after the systemd-boot "SHA256 validated" message without any additional information. The dmesg output after the first (cold) boot with debug enabled is attached.
How else may I help to isolate this issue?
I'm experiencing this as well.
System information
Mobo: ASUS WS X299 SAGE
CPU: 9920x and 9820x
OS: Ubuntu 18.04.3
I've tried Ubuntu's kernel 5.0.0-36-generic, as well as the mainline 5.3.11. Both exhibit the problem. I've also tried the latest BIOS from ASUS.
Confirmed that upgrading to 20191112 from 20190918 caused the issue.
System hangs when GRUB tries to load the kernel after either running reboot on the command line or pressing the reset button on the motherboard.
It seems that a new release has just been made available [1]; may I ask you to try it out?
[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases/tag/microcode-20191115
Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.
Right, there are no updates for 06-55-04 in microcode-20191115, so there's no point in testing 20191115; my apologies.
Looks server-die related (all reports are from HEDT parts or Xeon parts) ?
Kabylake desktop and mobile (0x806e9, 0x906e9) are not showing the reboot issue here. CoffeeLake mobile (0x906ea) did not show any reboot issues, either.
Looks server-die related (all reports are from HEDT parts or Xeon parts) ?
Both reports are against CPUID 0x50654 parts, yes.
Skylake mobile and desktop (0x806e9, 0x906e9) are not showing the reboot issue here.
FYI, Skylake mobile/desktop have CPUID 0x406e3 and 0x506e3, respectively.
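Since signatures and file names are being mixed in this thread, here is a short sketch of how a CPUID signature such as 0x50654 decodes to the 06-55-04 name. It follows the usual CPUID leaf 1 EAX layout (stepping in bits 0-3, model in bits 4-7, family in bits 8-11, extended model in bits 16-19, extended family in bits 20-27); the function names are hypothetical.

```python
def decode_signature(sig):
    """Split a CPUID(1).EAX signature into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only applies to base families 0x6 and 0xF.
    model = (ext_model << 4) | base_model if base_family in (0x6, 0xF) else base_model
    return family, model, stepping


def signature_to_file_name(sig):
    family, model, stepping = decode_signature(sig)
    return f"{family:02x}-{model:02x}-{stepping:02x}"


# Signatures mentioned in this thread.
for sig in (0x50654, 0x806E9, 0x906E9, 0x906EA, 0x406E3, 0x506E3):
    print(hex(sig), signature_to_file_name(sig))
```

For 0x50654 this gives family 6, model 85 (0x55) and stepping 4, which agrees with the /proc/cpuinfo output posted earlier.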
Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.
May I ask to try out microcode-20190918 release, with 06-55-04 microcode revision 0x2000064? Thank you.
esyr-rh: oops, sorry about that! Wrong names but correct CPUIDs: the tests were done on 0x806e9, 0x906e9 and 0x906ea. None show the reboot issue. (I edited the incorrect post; it now has the proper processor names.)
I can confirm that the issue persists on a Xeon W-2145 with 20191115, while the issue has never affected a Core i7-8705G in a Dell XPS 15 9575.
I can confirm that the issue persists on a Xeon W-2145 with 20191115,
I apologise for a pointless question regarding 20191115; checking against 20190918 makes much more sense (as that is the release that provides the previous 0x2000064 revision of the 06-55-04 microcode). May I kindly ask you to try that?
I'm not clear about what you'd like me to try; the 20190918 release works perfectly with my CPU.
Sorry, I've forgotten about the fact that you have already provided this information; again, my apologies.
No worries. Please let me know if I can try anything else to illuminate the problem.
I may have a resolution: the Arch package is built using
iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/
The manpage for iucode_tool indicates that early firmware images must be 16-byte aligned and that the --write-earlyfw option enforces this. Altering the Arch PKGBUILD to use --write-earlyfw instead of -w changes the size of the microcode image and seems to fix the issue. I've warm-rebooted four times with the modified image and see no issues.
It looks to me like Gentoo is using --write-firmware instead of --write-earlyfw as well. Maybe this explains the problem seen by @hagar-dunor.
ahesford: I'm following this wiki and therefore using --write-earlyfw. To be more specific, this is exactly the command I type:
iucode_tool -S --write-earlyfw=/boot/early_ucode.cpio /lib/firmware/intel-ucode/*
and then I update the grub config file, which picks up the microcode.
Are you certain that you actually load the microcode with the modified Arch PKGBUILD? (It must be the first line in your "dmesg".)
esyr-rh: revision 0x2000064 doesn't show the problem (which I extracted using the same command above)
I may have a resolution: the Arch package is built using
iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/
What if only the 06-55-04 microcode is added, like iucode_tool --write-earlyfw kernel/x86/microcode/GenuineIntel.bin intel-ucode/06-55-04 (or -w, for that matter)?
My mistake; I overlooked that --write-earlyfw creates the CPIO archive directly, but -w creates a binary image. Changing the Arch PKGBUILD as I suggest creates an invalid initrd that the microcode update driver ignores.
@esyr-rh, a custom early initrd that contains only 06-55-04 still exhibits the warm-reboot issue.
The 16-byte alignment is an old requirement, it might have even been lifted from the IA32 manual nowadays. I need to hunt it down one of these days, and update iucode_tool accordingly...
The 16-byte alignment has been and continues to be a requirement. This can be found in the Intel(R) 64 and IA-32 Architectures Software Developer's Manual, vol 3A, page 9-34 (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf). "Note that the microcode update must be aligned on a 16-byte boundary and the size of the microcode update must be 1-KByte granular."
So, did I read this thread correctly and it is not a microcode issue?
I believe this is a microcode issue. My first attempt with the --write-earlyfw option to iucode_tool was incorrect, producing an invalid image that was ignored by the kernel loader. Subsequent attempts with proper use of --write-earlyfw produce images that are properly loaded but that continue to show the warm-reboot lockup issue.
@whpenner, thanks. I will keep the remark about it in the iucode-tool manual, then.
But current Intel processors do not care. It would be really nice to know if future processors will care, though.
For the record, they don't care about the 1KiB size either, but it is a good idea to ensure that padding is there just in case the dang thing will read-past-the-end and cause a fault.
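To make the two constraints from the SDM note concrete, here is a minimal sketch that checks 16-byte alignment of an offset and computes how much padding a blob would need to reach 1-KByte granularity. The helper names and the example offsets are hypothetical; the 34816-byte size is the one listed for the 0x50654 update elsewhere in this thread.

```python
ALIGNMENT = 16   # required alignment of the update, per the SDM note
GRANULE = 1024   # required size granularity, per the SDM note


def is_aligned(offset, alignment=ALIGNMENT):
    """True if offset falls on an alignment boundary."""
    return offset % alignment == 0


def padding_to_granule(size, granule=GRANULE):
    """Number of pad bytes needed to round size up to the next granule."""
    return (-size) % granule


# Size listed for the 0x50654 update in this thread: already 1-KByte granular.
print(padding_to_granule(34816))

# Hypothetical offsets of an update inside a larger image.
print(is_aligned(0x1230), is_aligned(0x1238))
```

Padding to the granule is cheap, which is why keeping it "just in case" costs little even if current processors do not enforce it.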
Same here. Microcode 3.20191115 on an i7-7820X CPU. The system boots fine, but after a warm reboot it gets stuck.
Same here. In dmesg:
microcode: microcode updated early to revision 0x2000065, date = 2019-09-05
Ubuntu 18.04.3 LTS on an i9 9900X CPU. The system boots fine, but after a warm reboot it gets stuck.
However, when I use
(home) ~$ apt list --installed | grep micro
it shows:
amd64-microcode/bionic-updates,bionic-security,now 3.20191021.1+really3.20181128.1~ubuntu0.18.04.1 amd64 [installed,automatic]
intel-microcode/bionic-updates,bionic-security,now 3.20191115.1ubuntu0.18.04.1 amd64 [installed,automatic]
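The package version and the loaded revision are two different numbers, so a quick way to see which revision is actually running is to pull it out of the dmesg line. A minimal sketch, assuming the message format quoted above (the function name is hypothetical):

```python
import re

# Matches the early-load message quoted in this thread.
PATTERN = re.compile(
    r"microcode updated early to revision (0x[0-9a-fA-F]+), date = (\d{4}-\d{2}-\d{2})"
)


def loaded_revision(dmesg_text):
    """Return (revision, date) from dmesg text, or None if no match."""
    match = PATTERN.search(dmesg_text)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


line = "microcode: microcode updated early to revision 0x2000065, date = 2019-09-05"
revision, date = loaded_revision(line)
print(hex(revision), date, revision == 0x2000065)
```

If the function returns None, no early update message was found in the text, which may simply mean the microcode was not loaded early on that boot.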
@xwjabc yes, the intel-microcode 3.20191115.1ubuntu0.18.04.1 (and other 3.20191115.1 packages for other Ubuntu releases) includes:
sig 0x00050654, pf_mask 0xb7, 2019-09-05, rev 0x2000065, size 34816
which is the latest microcode from this repository for your processor class.
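For reference, the fields in that listing come from the update header. Below is a rough sketch of reading them, assuming the 48-byte header layout described in the Intel SDM (header version, update revision, BCD date as 0xMMDDYYYY, processor signature, checksum, loader revision, processor flags, data size, total size, then 12 reserved bytes). The header built here is synthetic, filled with the values from the listing purely for illustration; the checksum and data size are placeholders, not values from a real file.

```python
import struct

# Nine little-endian 32-bit fields followed by 12 reserved bytes (48 bytes total).
HEADER = struct.Struct("<9I12x")


def parse_header(blob):
    """Decode the assumed 48-byte microcode update header."""
    (hdr_ver, rev, date, sig, checksum, loader_rev,
     pf_mask, data_size, total_size) = HEADER.unpack_from(blob)
    # The date is BCD-encoded as 0xMMDDYYYY.
    date_text = f"{date & 0xFFFF:04x}-{date >> 24:02x}-{(date >> 16) & 0xFF:02x}"
    return {"sig": sig, "pf_mask": pf_mask, "date": date_text,
            "rev": rev, "size": total_size}


def describe(fields):
    return (f"sig 0x{fields['sig']:08x}, pf_mask 0x{fields['pf_mask']:02x}, "
            f"{fields['date']}, rev 0x{fields['rev']:x}, size {fields['size']}")


# Synthetic header for illustration only (checksum 0 and data size are placeholders).
fake = HEADER.pack(1, 0x2000065, 0x09052019, 0x50654, 0, 1, 0xB7, 34768, 34816)
print(describe(parse_header(fake)))
```

A real check would read the first 48 bytes of the intel-ucode/06-55-04 file instead of the synthetic bytes.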
@whpenner we now have reports from Ubuntu users getting hit by this (e.g. https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1854764 ); the way that early load microcode is written to the initramfs in Debian and Ubuntu is with iucode-tool --write-earlyfw, and thus should be 16-byte aligned, so it does appear that this is a problem with the microcode itself.
Intel has received reports of reboot failures on certain Skylake based Intel® Xeon® W and Intel® Core™ X-series single socket platforms following the OS load of processor microcode revision 0x65. We have received no reports and have no evidence that these failures affect Skylake based Intel® Xeon® Scalable Performance multi-socket platforms. We are debugging the issue to establish root cause. In the interim, processor microcode revision 0x64 remains available for use.
Here is the revision 0x2000064 of 06-55-04 microcode, for the reference: https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/microcode-20190918/intel-ucode/06-55-04
@whpenner, thanks for the official guidance.
There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.
That leaves us in a very nasty position of either telling users to deal with it and never reboot, or going back to revision 0x200005e, which has the JCC erratum and a lot of other nasty issues as far as I know (please correct me if I am wrong about this).
Can Intel give us a tentative timeframe for a fix? Or guidance on the least dangerous workarounds available?
No complaints from Arch users...
Possibly those reporting the issue installed a fixed package, then did a warm reboot but had revision 0x65 loaded into their processors already?
No, I asked. They cold-booted into 0x64 :-(
Debian report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=946515#37
Ubuntu report: https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1855784
There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.
I definitely do not see the warm-reboot lockup with version 0x2000064 on the Xeon W-2145. Only the newer version causes the issue.
Time to compare the chip names and/or process flags :-(
The microcode has been updated in release 20200609. Any changes regarding the lockup on reboot?
It should be fixed: we got reports that it was fixed several months ago in an update that was not distributed to the general public.
I will remove the update block on signature 0x50654 in Debian and Ubuntu because of that, and ship the microcode update revision included in this new release (20200609) for signature 0x50654.
(and no, there was nothing I could do about it, I asked for permission to distribute it, and did not get any sort of reply -- maybe there were some issues with that update, or maybe Intel had all hands tied to the fixes released today)
Thanks for the info! Will drop the workaround from Arch Linux package then.
I just tried the new release on the affected system and it survived a warm reboot from the newer microcode. It appears this issue is fixed.
Yes, note that removing the workaround is itself a security update on its own.
On an Arch Linux installation, intel-ucode 20191113 causes a lockup when rebooting a Dell Precision 5820 workstation with an Intel Xeon W-2145 CPU. The system cold boots fine, but once the system is running, a reboot will cause a lockup when the kernel is reloaded. This is a regression from intel-ucode 20190918 and affects more than one bootloader and at least the linux, linux-zen and linux-lts kernels distributed by Arch. Because these three kernels are all affected and switching to the earlier intel-ucode package resolves the issue, I believe this is an upstream issue rather than a problem with the Arch package.
Additional info:
Affected version: 20191113
Last working version: 20190918
Bootloaders affected:
Note: rEFInd was not configured to apply the microcode patch at boot. Instead, the system was cold-booted from systemd-boot to apply the microcode patch, the boot manager was replaced with rEFInd, and the system was warm-rebooted using rEFInd. Thus, the reboot lockup does not appear to be caused by the act of loading the microcode during the failing boot; instead, the CPU locks up on a warm reset once the microcode has been applied at least once.
Kernels affected:
For systemd-boot, the loader entry is:
title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=UUID=[masked] rw
options consoleblank=600
options audit=0
For rEFInd, the loader entry was created automatically using refind-install and applied the same kernel arguments (root, consoleblank and audit) as the systemd loader.
Steps to reproduce: