Open jwrdegoede opened 3 years ago
CPUID is 0x30678 (FF-MM-SS 06-37-08), I presume?
@jwrdegoede I couldn't post to your blog, but I got the same issue exactly a week ago on my HP x2 10 (atom z8350) with Fedora. After the usual reboot for updates the pc was completely dead. Just the caps lock led turning on for a second when connecting the keyboard dock. I managed to recover it by formatting a usb stick with the official bios tool, and with the ctrl V combination at boot it started the bios recovery procedure.
CPUID is 0x30678 (FF-MM-SS 06-37-08), I presume?
Correct. For some extra info, I've tried to figure out which microcode version the BIOS installs on the Acer S1002, where the microcode update does not cause issues. If I boot that device with "dis_ucode_ldr" on the kernel commandline and then look up the microcode version in /proc/cpuinfo I get 0x82b. When I don't specify "dis_ucode_ldr" on the (working) S1002 I get the following kernel log messages related to microcode:
[ 0.000000] microcode: microcode updated early to revision 0x838, date = 2019-04-22
[ 5.314952] microcode: sig=0x30678, pf=0x2, revision=0x838
[ 5.315523] microcode: Microcode Update Driver: v2.2.
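So the 2 cases which I have are:
1. Acer S1002: the BIOS installs ucode 0x82b, and updating to 0x838 works.
2. Glavey TM800A550L: the BIOS installs ucode 0x830, and updating to 0x838 breaks things.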
Note I'm not claiming that the difference in the BIOS installed microcode is the reason things are failing on the Glavey tablet, but it is a possible cause for this.
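For readers who want to check how the sig=0x30678 value from the kernel log maps to the FF-MM-SS form 06-37-08 mentioned above, here is a minimal sketch of the standard CPUID signature decoding in shell arithmetic. The function name `decode_sig` is made up for illustration; it is not part of any tool discussed in this thread.

```shell
# Decode a CPUID signature into the family-model-stepping (FF-MM-SS) form
# that Intel microcode file names use.
# decode_sig is a hypothetical helper name, not an existing tool.
decode_sig() {
    sig=$(( $1 ))
    stepping=$(( sig & 0xf ))
    model=$(( (sig >> 4) & 0xf ))
    base_family=$(( (sig >> 8) & 0xf ))
    ext_model=$(( (sig >> 16) & 0xf ))
    ext_family=$(( (sig >> 20) & 0xff ))

    family=$base_family
    # The extended family field is only added when the base family is 0xf.
    if [ "$base_family" -eq 15 ]; then
        family=$(( base_family + ext_family ))
    fi
    # The extended model field is only used for base family 6 or 0xf.
    if [ "$base_family" -eq 6 ] || [ "$base_family" -eq 15 ]; then
        model=$(( (ext_model << 4) + model ))
    fi

    printf '%02x-%02x-%02x\n' "$family" "$model" "$stepping"
}
```

Calling `decode_sig 0x30678` prints 06-37-08, which matches the values quoted above.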
@jwrdegoede I couldn't post to your blog, but I got the same issue exactly a week ago on my HP x2 10 (atom z8350) with Fedora.
Interesting, is this microcode related too, or is the similarity just that you also got corrupt BIOS settings somehow?
I'm not sure if it was microcode related, but I find it curious that it happened at the same time, for the same family of processors. It's not a common fault after all.
Cherry Trail and Bay Trail are related but use a (somewhat) different generation of CPU cores, also the last microcode update for these devices was quite a while ago. I hit this now because I only tried Linux on the Glavey tablet recently. I doubt that your issue is related to the ucode issue which I'm seeing. What it does have in common is that Bay and Cherry Trail devices are both susceptible to having their BIOS settings corrupted relatively easily. If you want to discuss this further please drop me an email at hdegoede@redhat.com, so that we can keep this github discussion focussed on the Bay Trail ucode issue.
Ugh, I hit this again while I was testing Linux on a third tablet with a Z3735G CPU. At first things were fine, but then I updated the kernel from 5.12.0 to 5.13.0-rc1 and it stopped booting. Adding "dis_ucode_ldr" made it boot again. 5.13 has only one new commit under arch/x86/kernel/cpu/microcode: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7189b3c11903667808029ec9766a6e96de5012a5
I tried reverting this but it does not help.
FWIW this third tablet BIOS updates the microcode to 0x832 before booting the OS.
I just hit this again, on the same device. After re-installing Linux it took me a while to remember this issue. Is there anything we can do about this?
One more note, on a hunch I tried updating the microcode after Linux booted by doing:
echo 1 > /sys/devices/system/cpu/microcode/reload
And this works fine, so I suspect there is some bad interaction here between the early microcode loader and the BIOS on the TM800A550L tablet. Maybe the early microcode loader has certain expectations about the pre-boot memory layout or some such ?
Any suggestions how I can debug the early microcode loading code. Is there some way to make it log messages using e.g. EFI calls to show the messages on the EFI console?
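To repeat the late-load test and confirm that the revision really changed, something like the following sketch could be used. The helper names `ucode_rev` and `late_reload` are made up for illustration; the sysfs path is the one from the comment above. It assumes root privileges and that the microcode files are installed under /lib/firmware/intel-ucode, where the late loader looks for them.

```shell
# Print the microcode revision of the first CPU listed in a cpuinfo-style
# file (defaults to /proc/cpuinfo). ucode_rev is a hypothetical helper name.
ucode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Trigger a late microcode reload and show the revision before and after.
# Must be run as root. late_reload is a hypothetical helper name.
late_reload() {
    before=$(ucode_rev)
    echo 1 > /sys/devices/system/cpu/microcode/reload || return 1
    after=$(ucode_rev)
    echo "microcode revision: $before -> $after"
}
```

Booting with "dis_ucode_ldr" and then running `late_reload` as root shows whether the late path gets from the BIOS revision to the new one without the hang.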
It runs too early, last time I needed to debug the early update microcode driver, I had to add a static buffer to fill with debug data, and print it later. Maybe early printk on a system with an early console active can do it as well.
That said, do try a very recent upstream kernel, there were quite a few changes on that driver from what I recall from LKML mails, it might have increased compatibility with weird firmware, especially if it is EFI...
Last time I hit this I tried 5.16-rc5, I guess I could try 5.17-rc1 once it is out if you think that might help?
I don't have any reason to believe 5.17-rc would be any better than 5.16-rc5, unfortunately...
Quick update on this: I just hit this on a third tablet, an Acer Iconia One 7 B1-750.
Both the Glavey TM800A550L and the Acer Iconia One 7 B1-750 are tablets which ship with Android 4.4 x86 as factory OS. And although they use more or less standard UEFI firmware, at least the ACPI tables are very funky, e.g. filled with I2C devices that are not actually there, since the Android vendor kernel has everything hardcoded anyway. I guess the ucode update problem is related to the fw in these tablets being funky in other ways too. Any ideas?
Unfortunately I did not write down here what the 3rd Bay Trail device on which I hit this was, but I'm pretty sure it was a device with Android as factory OS too (as a hobby project I'm working on making these devices work with standard Linux).
And this works fine, so I suspect there is some bad interaction here between the early microcode loader and the BIOS on the TM800A550L tablet. Maybe the early microcode loader has certain expectations about the pre-boot memory layout or some such ?
Any suggestions how I can debug the early microcode loading code. Is there some way to make it log messages using e.g. EFI calls to show the messages on the EFI console?
Try the microcode driver in the newest kernel, lots of fixes went in... who knows.
Also, this is a very long shot, but the Linux early microcode loader fails to ensure 16-byte alignment on the microcode patch [when loading from early-initramfs] -- it has been like that since day one of the early loading support, and I don't think this has been -- or will be -- fixed.
The Intel manual used to (and maybe still does) require such 16-byte alignment, but apparently almost every non-ancient Intel x86-64 processor only cares for 4-byte alignment (at least almost all the time). Only Intel knows what happens if the start or end of the hot area of the microcode data crosses a page boundary due to misalignment, etc. It would not be the first CPU operation that is highly allergic to this kind of border condition.
We have long worked around this driver shortcoming in userspace when using iucode-tool to generate the initramfs, so you could try to use it to generate your early-initramfs.
Note that this is indeed a long shot, I have never seen any microcode loading issues be solved by forcing this alignment. Also, I do not know if there are alignment issues when you have the microcode early upload data hardcoded into the kernel in some other way the firmware loader supports.
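To make the alignment point concrete: the question is whether the byte offset at which the microcode data starts is a multiple of 16, and if not, how much padding would fix it. The sketch below shows only that arithmetic. It is not the kernel's or iucode-tool's actual code, `pad_needed` is a made-up helper name, and the offsets in the usage note are hypothetical.

```shell
# Print the number of padding bytes needed so that data starting at byte
# offset $1 begins on a multiple of $2 (default 16).
# pad_needed is a hypothetical helper name used only for illustration.
pad_needed() {
    offset=$(( $1 ))
    align=$(( ${2:-16} ))
    echo $(( (align - offset % align) % align ))
}
```

For example, `pad_needed 120` prints 8 (data at offset 120 needs 8 more bytes to reach 128), while `pad_needed 128` prints 0. With a 4-byte requirement, `pad_needed 120 4` also prints 0, which is the case where a patch satisfies 4-byte but not 16-byte alignment.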
While running Linux on a Glavey TM800A550L tablet I noticed that it hangs at boot, sometimes showing various color patterns on the display, suggesting that the processor is writing over random memory, including the framebuffer.
It took me a while to figure this out, but adding "dis_ucode_ldr" on the kernel commandline fixes this. This is quite a severe bug: at one point when I forgot to add "dis_ucode_ldr" on the kernel commandline, the CPU overwrote parts of the memory-mapped SPI flash which contains the EFI nvram variables, including the Setup EFI variable which contains the BIOS settings. After this the tablet would no longer boot at all. I eventually managed to unbrick it without needing an eeprom programmer by using DNX mode which still worked, see the blogpost which I wrote on this.
The BIOS on the troublesome tablet comes with ucode version 0x830 and the attempted update (which breaks things) tries to update it to 0x838. Perhaps there is a problem with the specific jump from 0x830 to 0x838?
Note I have an Acer S1002 tablet which also has a Z3735G processor and there the microcode update works fine.