intel / Intel-Linux-Processor-Microcode-Data-Files


microcode-20200609 release, intel-ucode 06-8e-0c/0x806ec revision=0xd6 causes freezes on warm boot #35

Open stevebeattie opened 4 years ago

stevebeattie commented 4 years ago

Per Debian bug 962757, Ubuntu bug 1883002, and internal testing, some systems are seeing freezes, particularly after warm reboots, with the 06-8e-0c/0x806ec revision 0xd6 from the 20200609 microcode release.

microcode: microcode updated early to revision 0xd6, date = 2020-04-23
microcode: sig=0x806ec, pf=0x80, revision=0xd6
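
For readers wondering how the `sig=0x806ec` value in the log relates to the `06-8e-0c` file name, here is a minimal sketch of the usual CPUID signature decoding (family, model, stepping). The function names are made up for illustration and are not part of any Intel or kernel tool.

```python
def decode_signature(sig):
    """Split a CPUID(1).EAX signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    family = base_family
    if base_family == 0xF:
        # The extended family is only added for base family 15.
        family += ext_family
    if base_family in (0x6, 0xF):
        # The extended model supplies the high nibble for families 6 and 15.
        model |= ext_model << 4
    return family, model, stepping


def ucode_file_name(sig):
    """Format the signature as family-model-stepping, e.g. 06-8e-0c."""
    family, model, stepping = decode_signature(sig)
    return f"{family:02x}-{model:02x}-{stepping:02x}"


if __name__ == "__main__":
    print(ucode_file_name(0x806EC))
```

The same decoding applies to the other signatures that appear later in this thread (0x806e9, 0x906ea, 0x906ed).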

The systems from the collected reports are:

- Dell Latitude 7400, i5-8265U (Debian bug)
- Dell Latitude 7300, i7-8665U (Ubuntu bug)
- Dell Latitude 5410, i5-8365U

Specifically, the reporter of the Ubuntu bug indicates that they were initially affected by the similar issue against 0xca reported in #24; a BIOS update from Dell addressed that, but the problem was reintroduced by the 20200609 update that moved from 0xca to 0xd6. That user's testing also indicated a much higher frequency of freezes with warm reboots (the freeze seen on the third system was also after a warm reboot). The Debian bug reporter also found their system stable with the 0xca version but not with the 0xd6 version.

paulmenzel commented 4 years ago

Does this only happen when running on battery, that is, without the power cable plugged in?

In our case, plugging in the power cable fixes the issue, but starting with maxcpus=1 and then bringing the CPUs online in GNU/Linux still sometimes causes freezes after one or two CPUs.
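
For anyone who wants to repeat the maxcpus=1 experiment, a rough sketch of bringing the remaining CPUs online one at a time through sysfs follows. It assumes the standard `/sys/devices/system/cpu/cpuN/online` files and root privileges; the function name and the delay parameter are hypothetical, chosen only for this example.

```python
import time
from pathlib import Path


def bring_cpus_online(sysfs_cpu_dir="/sys/devices/system/cpu", delay_s=1.0):
    """Online each offline CPU in numeric order; return the CPU numbers onlined."""
    cpu_dirs = sorted(
        (p for p in Path(sysfs_cpu_dir).glob("cpu[0-9]*") if p.name[3:].isdigit()),
        key=lambda p: int(p.name[3:]),
    )
    onlined = []
    for cpu_dir in cpu_dirs:
        online = cpu_dir / "online"
        if not online.exists():
            # The boot CPU often has no 'online' file at all.
            continue
        if online.read_text().strip() == "1":
            continue
        # Print before writing, so the last line shown names the CPU
        # that was being started if the machine freezes.
        print(f"onlining {cpu_dir.name}", flush=True)
        online.write_text("1")
        onlined.append(int(cpu_dir.name[3:]))
        time.sleep(delay_s)
    return onlined


if __name__ == "__main__":
    print(bring_cpus_online())
```

The pause between CPUs is only there to make it easier to tell which CPU was being started when a freeze occurs.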

hmh commented 4 years ago

Do you have any complaints about MSR 0x123 in the kernel logs, either when you resume from sleep-to-RAM (S3/suspend) or when you bring CPUs online?

vicamo commented 4 years ago

@hmh do you have an example? Didn't see anything containing "MSR" or "0x123".

alyf80 commented 4 years ago

Reporter of Ubuntu bug 1883002 here...

Yes, I have MSR 0x123 errors when CPUs are brought online during boot:

[    0.216842] unchecked MSR access error: RDMSR from 0x123 at rIP: 0xffffffffb0a78938 (native_read_msr+0x8/0x40)
[    0.216856] unchecked MSR access error: WRMSR to 0x123 (tried to write 0x0000000000000001) at rIP: 0xffffffffb0a78b24 

Complete kernel log attached.

(note: this is with microcode 0xca, which is loaded by the system firmware and currently works fine; if needed I can test with 0xd6)
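
For others checking their own logs, a small sketch that picks out these messages from saved `dmesg` output is shown here. The regular expression is based only on the message format quoted in this thread, and the function name is made up for the example.

```python
import re

# Matches lines such as:
#   unchecked MSR access error: RDMSR from 0x123 at rIP: ...
#   unchecked MSR access error: WRMSR to 0x123 (tried to write ...) at rIP: ...
MSR_ERROR = re.compile(
    r"unchecked MSR access error: (RDMSR from|WRMSR to) (0x[0-9a-fA-F]+)"
)


def find_msr_errors(log_text, msr=0x123):
    """Return (operation, line) pairs for unchecked accesses to the given MSR."""
    hits = []
    for line in log_text.splitlines():
        match = MSR_ERROR.search(line)
        if match and int(match.group(2), 16) == msr:
            operation = "read" if match.group(1).startswith("RDMSR") else "write"
            hits.append((operation, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    for operation, line in find_msr_errors(sys.stdin.read()):
        print(operation, line)
```

Saved as a script (the file name is up to you), it can be fed with `dmesg` output on standard input.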

dmesg.txt

paulmenzel commented 4 years ago

After upgrading a Dell XPS 13 9360 with an Intel i7-7500U from Ubuntu 19.10 to Ubuntu 20.04 (microcode update 0xd6) and looking through the logs, I am seeing the MSR messages at least once during the first resume from suspend. (No errors encountered.)

[  121.966360] x86: Booting SMP configuration:
[  121.966368] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  121.967890] unchecked MSR access error: RDMSR from 0x123 at rIP: 0xffffffff94478938 (native_read_msr+0x8/0x40)
[  121.967893] Call Trace:
[  121.967900]  update_srbds_msr+0x38/0x80
[  121.967903]  identify_secondary_cpu+0x7a/0x90
[  121.967907]  smp_store_cpu_info+0x4e/0x60
[  121.967910]  start_secondary+0x63/0x1c0
[  121.967915]  secondary_startup_64+0xa4/0xb0
[  121.967928] unchecked MSR access error: WRMSR to 0x123 (tried to write 0x0000000000000000) at rIP: 0xffffffff94478b24 (native_write_msr+0x4/0x30)
[  121.967929] Call Trace:
[  121.967932]  ? update_srbds_msr+0x61/0x80
[  121.967935]  identify_secondary_cpu+0x7a/0x90
[  121.967938]  smp_store_cpu_info+0x4e/0x60
[  121.967941]  start_secondary+0x63/0x1c0
[  121.967945]  secondary_startup_64+0xa4/0xb0
[  121.967960] microcode: sig=0x806e9, pf=0x80, revision=0xc6
[  121.969216] microcode: updated to revision 0xd6, date = 2020-04-27
[  121.969872] CPU1 is up
[  121.970115] smpboot: Booting Node 0 Processor 2 APIC 0x1
[  121.970483] microcode: sig=0x806e9, pf=0x80, revision=0xd6
[  121.971269] CPU2 is up
[  121.971515] smpboot: Booting Node 0 Processor 3 APIC 0x3
[  121.972558] CPU3 is up

@hmh, should I submit a separate bug report in Launchpad, or create one here?

  1. First resume: Ixpees_mem_dmesg.txt
  2. Second resume: Ixpees_mem_dmesg.txt

hmh commented 4 years ago

The MSR access is a kernel bug. It might be relatively harmless or not harmless at all, depending on just how much (and what) code is running on the AP before its microcode is updated. I haven't checked. But my 0x806e9 is coping with it well enough as long as I keep everything at the defaults (i.e. what the microcode has as a default for the new MSRs is actually what Linux is using).

This bug will not contribute to better stability when microcode updates are required, obviously. So, it is something to be fixed ASAP.
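
To see what value MSR 0x123 currently holds on a given CPU, something like the sketch below may help. It assumes the Linux `msr` driver is loaded (so that `/dev/cpu/N/msr` exists) and that it is run as root; on a CPU whose microcode does not implement the MSR, the read is expected to fail with an `OSError`. The function name is hypothetical.

```python
import os
import struct


def read_msr(msr_device_path, msr):
    """Read one 64-bit MSR value through an msr-driver style device file.

    The driver uses the file offset as the MSR address, so the value is
    read as 8 bytes at offset `msr`.
    """
    fd = os.open(msr_device_path, os.O_RDONLY)
    try:
        data = os.pread(fd, 8, msr)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from {msr_device_path} at {msr:#x}")
    return struct.unpack("<Q", data)[0]


if __name__ == "__main__":
    # Assumption: CPU 0 and the msr module loaded ("modprobe msr").
    value = read_msr("/dev/cpu/0/msr", 0x123)
    print(f"MSR 0x123 on CPU 0 = {value:#x}")
```

Comparing the value across CPUs before and after a suspend/resume cycle would show whether every core ends up with the same setting.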

hmh commented 4 years ago

@vicamo: you will see such illegal MSR access splats in the kernel log only when your UEFI/BIOS microcode is old enough to not have such an MSR, the new microcode (updated through Linux) adds the support for the new MSR, and Linux sees a need to try to read/write such MSRs early.

The bug is that it is not updating the secondary cores (read: not the core used for boot or to resume) early enough -- or that new code was added that is running too early, same thing.

It is easier to show up in the resume-from-suspend path, but the boot path also needs a look just in case.
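
As a rough way to spot this ordering problem in a saved kernel log, the sketch below flags every CPU for which an MSR 0x123 access error is logged before that CPU's "updated to revision" message. The patterns are taken from the log excerpts in this thread and may not match other kernel versions; the function name is made up for the example.

```python
import re

BOOTING = re.compile(r"smpboot: Booting Node \d+ Processor (\d+)")
MSR_ERR = re.compile(
    r"unchecked MSR access error: (?:RDMSR from|WRMSR to) 0x123\b"
)
UPDATED = re.compile(r"microcode: updated to revision 0x[0-9a-fA-F]+")


def cpus_touching_msr_before_update(log_text):
    """Return CPU numbers where an MSR 0x123 error precedes the microcode update."""
    flagged = []
    cpu = None
    msr_error_seen = False
    for line in log_text.splitlines():
        booting = BOOTING.search(line)
        if booting:
            # A new CPU is being started; reset the per-CPU state.
            cpu = int(booting.group(1))
            msr_error_seen = False
            continue
        if cpu is None:
            continue
        if MSR_ERR.search(line):
            msr_error_seen = True
        elif UPDATED.search(line) and msr_error_seen and cpu not in flagged:
            flagged.append(cpu)
    return flagged
```

CPUs that were already at the new revision (for example the sibling thread of a core that has been updated) log no update message and are therefore not flagged.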

alyf80 commented 4 years ago

@hmh: There are definitely other scenarios that cause those illegal accesses, as in the case of my logs above the microcode is not being updated by Linux.

hmh commented 4 years ago

@alyf80: noted... looks like there is more than one bug involved, kernel side. I have seen possibly related fixes in the latest round of stable kernels related to access to MSR 0x123 in situations where it shouldn't be accessed in the new microcode, so the case you described might have been addressed already. But I don't recall any patches related to such accesses being done before AP ucode update in the resume-from-S3 path.

In my Dell laptop with microcode revision 0xc6 in UEFI, one can clearly see the touch-MSR-0x123-before-AP-was-updated. You can also clearly see that both hyperthreads of the first core (thread 0 being the BSP) are already at revision 0xd6, unlike the two hyperthreads of the second core (AP) -- when it hits the second thread that shares the core of the BSP, it doesn't have to update anything and just says it is already at revision 0xd6.

kernel: microcode: microcode updated early to revision 0xd6, date = 2020-04-27
kernel: Linux version 4.19.0-9-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07)
...
kernel: PM: suspend entry (deep)
...
kernel: PM: Saving platform NVS memory
kernel: Disabling non-boot CPUs ...
kernel: smpboot: CPU 1 is now offline
kernel: smpboot: CPU 2 is now offline
kernel: smpboot: CPU 3 is now offline
... S3 SLEEP ...
kernel: ACPI: Low-level resume complete
kernel: ACPI: EC: EC started
kernel: PM: Restoring platform NVS memory
kernel: Enabling non-boot CPUs ...
kernel: x86: Booting SMP configuration:
kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
kernel: unchecked MSR access error: RDMSR from 0x123 at rIP: 0xffffffffa065ebc3 (native_read_msr+0x3/0x30)
kernel: Call Trace:
kernel:  update_srbds_msr+0x34/0x70
kernel:  smp_store_cpu_info+0x45/0x50
kernel:  start_secondary+0xa3/0x1f0
kernel:  secondary_startup_64+0xa4/0xb0
kernel: unchecked MSR access error: WRMSR to 0x123 (tried to write 0x0000000000000000) at rIP: 0xffffffffa065edb4 (native_write_msr+0x4/0x20)
kernel: Call Trace:
kernel:  update_srbds_msr+0x5d/0x70
kernel:  smp_store_cpu_info+0x45/0x50
kernel:  start_secondary+0xa3/0x1f0
kernel:  secondary_startup_64+0xa4/0xb0
kernel: microcode: sig=0x806e9, pf=0x80, revision=0xc6
kernel: microcode: updated to revision 0xd6, date = 2020-04-27
kernel:  cache: parent cpu1 should not be sleeping
kernel: CPU1 is up
kernel: smpboot: Booting Node 0 Processor 2 APIC 0x1
kernel: microcode: sig=0x806e9, pf=0x80, revision=0xd6
kernel:  cache: parent cpu2 should not be sleeping
kernel: CPU2 is up
kernel: smpboot: Booting Node 0 Processor 3 APIC 0x3
kernel:  cache: parent cpu3 should not be sleeping
kernel: CPU3 is up
kernel: ACPI: Waking up from system sleep state S3
kernel: ACPI: EC: interrupt unblocked
kernel: ACPI: EC: event unblocked

alyf80 commented 4 years ago

FYI, Dell released system firmware 1.9.1 which includes microcode revision 0xd6. With the new firmware, both the lockups and the MSR 0x123 errors are gone.

LM-HZG commented 3 years ago

> FYI, Dell released system firmware 1.9.1 which includes microcode revision 0xd6. With the new firmware, both the lockups and the MSR 0x123 errors are gone.

This seems not to be true for my system, a Dell 5591, at least (respective bug report [here](https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1882943), thanks @mirekingr for that). I've been struggling with this on my daily work machine since early May, and this bug has put me in serious trouble on various occasions (general Linux user embarrassment, data loss, and subsequent general system instability included) since the day a firmware update of some sort was rolled out on my machine.

$ dmesg | grep microcode says:

[    1.694792] microcode: sig=0x906ea, pf=0x20, revision=0xd6
[    1.695210] microcode: Microcode Update Driver: v2.2.
[16078.105692] microcode: sig=0x906ea, pf=0x20, revision=0xca
[16078.106927] microcode: updated to revision 0xd6, date = 2020-04-27
[16078.114714] microcode: sig=0x906ea, pf=0x20, revision=0xd6

So I doubt this has been solved at all, not even on firmware 0.1.11.1, which the system has been using for the last couple of days/weeks. I wonder if anyone at Intel has understood the seriousness of this bug. A mobile device that is only operable with a power cord attached in order to boot (either from scratch or from a sleep state) - really?

Sorry for the rant, but I can't believe it takes weeks to roll back whatever bad decision caused all this nuisance.

paulmenzel commented 3 years ago

> FYI, Dell released system firmware 1.9.1 which includes microcode revision 0xd6. With the new firmware, both the lockups and the MSR 0x123 errors are gone.

I can confirm the problem is fixed for the Dell Precision 3540.

> This seems not to be true for my system, a Dell 5591, at least (respective bug report [here](https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1882943), thanks @mirekingr for that).
>
> […]
>
> $ dmesg | grep microcode says:
>
> [    1.694792] microcode: sig=0x906ea, pf=0x20, revision=0xd6
> [    1.695210] microcode: Microcode Update Driver: v2.2.
> [16078.105692] microcode: sig=0x906ea, pf=0x20, revision=0xca
> [16078.106927] microcode: updated to revision 0xd6, date = 2020-04-27
> [16078.114714] microcode: sig=0x906ea, pf=0x20, revision=0xd6

Is the second set of messages from resuming? It looks very strange that on resume you have revision 0xca, while during boot you already have the newer revision 0xd6. Do the problems only happen after suspend?

[…]

Please contact Dell to check whether the firmware applies microcode updates when resuming. And please, open a separate bug report, but I think it has nothing to do with the upstream project.
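
To make this kind of oddity easier to spot in long logs, here is a minimal sketch that reports every `microcode: sig=..., revision=...` line whose revision is lower than one already seen earlier in the same log. It assumes `dmesg`-style timestamps in square brackets, as in the excerpt quoted above; the function name is made up for the example.

```python
import re

REVISION_LINE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+microcode: sig=0x[0-9a-fA-F]+, "
    r"pf=0x[0-9a-fA-F]+, revision=(0x[0-9a-fA-F]+)"
)


def revision_regressions(log_text):
    """Return (timestamp, highest_seen, reported) for revisions that went backwards."""
    regressions = []
    highest_seen = None
    for line in log_text.splitlines():
        match = REVISION_LINE.search(line)
        if not match:
            continue
        timestamp = float(match.group(1))
        revision = int(match.group(2), 16)
        if highest_seen is not None and revision < highest_seen:
            regressions.append((timestamp, highest_seen, revision))
        if highest_seen is None or revision > highest_seen:
            highest_seen = revision
    return regressions
```

A non-empty result suggests that a CPU came back (for example after resume) with older microcode than the system had at boot, which is the case worth raising with the firmware vendor.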

paulmenzel commented 3 years ago

> FYI, Dell released system firmware 1.9.1 which includes microcode revision 0xd6. With the new firmware, both the lockups and the MSR 0x123 errors are gone.

> I can confirm the problem is fixed for the Dell Precision 3540.

There are still problems, though, when booting without the power cable attached.

[…]

Matioupi commented 3 years ago

Hello, I'm facing a similar hang on an HP ZBook 15 G6 with the latest available BIOS version (1.06.00 rev A). The bug existed with Ubuntu 20.04 and is still there with a freshly installed 20.10.

Installed microcode package: intel-microcode/groovy,now 3.20200609.0ubuntu0.20.04.2 amd64

mathieu@ZBook15G6:~$ sudo dmesg |grep microcode
[ 5.236586] microcode: sig=0x906ed, pf=0x20, revision=0xd6
[ 5.236982] microcode: Microcode Update Driver: v2.2.

The bug is highly repeatable and occurs when booting the machine while it is hooked to the HP G2 Thunderbolt 3 dock. I reach the GRUB menu and the machine freezes after displaying the "loading initial ramdisk" message.

When booting the laptop on battery and hooking up the dock afterwards, it works properly. Let me know if I can provide other useful information.

esyr-rh commented 3 years ago

06-8e-0c microcode has been updated to revision 0xde in microcode-20201110 release, does the newer microcode revision help?

Matioupi commented 3 years ago

I've just upgraded the microcode and the BIOS (1.07.01 rev 1 from https://support.hp.com/fr-fr/drivers/selfservice/hp-zbook-15-g6-mobile-workstation/22892887)

mathieu@ZBook15G6:~$ sudo dmesg |grep microcode
[sudo] Mot de passe de mathieu :
[ 5.826529] microcode: sig=0x906ed, pf=0x20, revision=0xde
[ 5.827348] microcode: Microcode Update Driver: v2.2.

But no luck. Booting with the HP G2 Thunderbolt dock hooked up failed several times in a row. I attached 5 boot logs taken with the HP G2 Thunderbolt 3 dock (latest public firmware loaded).

bootlog1.txt bootlog2.txt bootlog3.txt bootlog4.txt bootlog5.txt

Some boots reached the login screen, some froze before that, and some froze after entering the login credentials.

I hope this helps; I can provide additional test results if needed.

Regards,

Mathieu

hmh commented 3 years ago

@Matioupi:

From the bootlogs, your machine is not updating the microcode at all: it seems to be already at revision 0xde in UEFI/BIOS. So, any regressions you observed were latent issues that a reboot exposed, but not related to the microcode update. Might have been something in the operating system, or an issue with the HP BIOS update you performed.

If your system has a dual-boot BIOS that still has the older version, could you boot with the old BIOS, and check if the microcode update happens? If the issues you observed with your dock were caused by the BIOS update, that might also fix them...
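
For reference, a quick way to check whether Linux applied a microcode update at all is to look for the "updated" messages in the kernel log. The sketch below assumes the two message formats that appear earlier in this thread ("microcode updated early to revision ..." and "updated to revision ..."); the function name is made up for the example.

```python
import re

UPDATE_MESSAGE = re.compile(
    r"microcode: (?:microcode updated early|updated) to revision (0x[0-9a-fA-F]+)"
)


def applied_updates(log_text):
    """Return the revisions Linux reports having loaded, in log order."""
    return [
        int(match.group(1), 16)
        for match in map(UPDATE_MESSAGE.search, log_text.splitlines())
        if match
    ]
```

An empty list means the log only reports the revision that was already loaded, which points at the firmware (or an earlier boot stage) as the source of that revision.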

Matioupi commented 3 years ago

@hmh: Hello, the issue of not booting when hooked to the HP G2 TB3 dock was already present with the previous BIOS / microcode, so it is not an issue related to this microcode. I was only reporting an issue that is not solved by this microcode / BIOS update. By the way, I'm not sure at all that the issue is microcode related, but the symptoms were close enough to those described by other users, so I posted here.

hmh commented 3 years ago

I see.

Anyway, we'd need someone with an outdated BIOS that does not have the current microcode (revision 0xde) and which had issues on reboot with previous microcode updates, to try the new one and report if the freeze-on-reboot is fixed...

Matioupi commented 3 years ago

I could revert to 1.06.00, but I never had any freeze on reboot on my laptop. Only issues I have are when booting with G2 dock hooked. I need to unhook it before booting, and replug it after boot.

hmh commented 3 years ago

@Matioupi: please don't revert your BIOS, if you never had any freezes, it would not tell us anything...

esyr-rh commented 3 years ago

New revision 0xea of the 06-8e-0c microcode file has been published as part of the microcode-20210608 release; it may be worth trying out.