linux-surface / linux-surface

Linux Kernel for Surface Devices

[SP9] ACPI error spam in dmesg after closing the type cover with the device plugged in #1082

Open peter-marshall5 opened 1 year ago

peter-marshall5 commented 1 year ago

If I close the type cover when my SP9 is plugged in, the dmesg log gets spammed with the following. The memory usage also goes up to around 7 GB with nothing open and the system slows down drastically. It also gets stuck when trying to reboot or shut down, so a hard shutdown is required.

Environment

`dmesg` output

```
[ 2694.375126] ACPI Error: No handler or method for GPE 51, disabling event (20221020/evgpe-839)
[ 2694.375133] ACPI Error: No handler or method for GPE 53, disabling event (20221020/evgpe-839)
[ 2694.375137] ACPI Error: No handler or method for GPE 54, disabling event (20221020/evgpe-839)
[ 2694.375141] ACPI Error: No handler or method for GPE 55, disabling event (20221020/evgpe-839)
[ 2694.375146] ACPI Error: No handler or method for GPE 56, disabling event (20221020/evgpe-839)
[ 2694.375150] ACPI Error: No handler or method for GPE 57, disabling event (20221020/evgpe-839)
[ 2694.375155] ACPI Error: No handler or method for GPE 60, disabling event (20221020/evgpe-839)
[ 2694.375164] ACPI Error: No handler or method for GPE 63, disabling event (20221020/evgpe-839)
[ 2694.375169] ACPI Error: No handler or method for GPE 64, disabling event (20221020/evgpe-839)
[ 2694.375173] ACPI Error: No handler or method for GPE 65, disabling event (20221020/evgpe-839)
[ 2694.375180] ACPI Error: No handler or method for GPE 67, disabling event (20221020/evgpe-839)
[ 2694.375184] ACPI Error: No handler or method for GPE 68, disabling event (20221020/evgpe-839)
[ 2694.375191] ACPI Error: No handler or method for GPE 6A, disabling event (20221020/evgpe-839)
[ 2694.375196] ACPI Error: No handler or method for GPE 6B, disabling event (20221020/evgpe-839)
[ 2694.375200] ACPI Error: No handler or method for GPE 6C, disabling event (20221020/evgpe-839)
[ 2694.375207] ACPI Error: No handler or method for GPE 6E, disabling event (20221020/evgpe-839)
[ 2694.375214] ACPI Error: No handler or method for GPE 70, disabling event (20221020/evgpe-839)
[ 2694.375218] ACPI Error: No handler or method for GPE 71, disabling event (20221020/evgpe-839)
[ 2694.375223] ACPI Error: No handler or method for GPE 72, disabling event (20221020/evgpe-839)
[ 2694.375230] ACPI Error: No handler or method for GPE 74, disabling event (20221020/evgpe-839)
[ 2694.375234] ACPI Error: No handler or method for GPE 75, disabling event (20221020/evgpe-839)
[ 2694.375239] ACPI Error: No handler or method for GPE 76, disabling event (20221020/evgpe-839)
[ 2694.375243] ACPI Error: No handler or method for GPE 77, disabling event (20221020/evgpe-839)
[ 2694.375290] ACPI Error: No installed handler for fixed event - PM_Timer (0), disabling (20221020/evevent-266)
[ 2694.375294] ACPI Error: No installed handler for fixed event - PowerButton (2), disabling (20221020/evevent-266)
[ 2694.375297] ACPI Error: No installed handler for fixed event - SleepButton (3), disabling (20221020/evevent-266)
[ 2694.375302] ACPI Error: Could not disable RealTimeClock events (20221020/evxfevnt-243)
```

RalphBariz commented 1 year ago

I can confirm the issue. The spam was bad: it hit disk, memory and CPU.

However, I temporarily worked around it by adding "intel_idle.max_cstate=0" to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.

Note: to check which idle driver is in use, run "cat /sys/devices/system/cpu/cpuidle/current_driver"; after the reboot it should print acpi_idle. (Do not forget to run "sudo update-grub" after changing the file and before rebooting.)

https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html?highlight=max_cstate#idle-states-control-via-kernel-command-line

Adding intel_idle.max_cstate=0 to the kernel command line disables the intel_idle driver and allows acpi_idle to be used
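The driver check described above can also be scripted. The following is a minimal sketch, not a tested tool: it reads the sysfs file quoted in the note, and the function names `current_idle_driver` and `workaround_active` are made up for illustration.

```python
from pathlib import Path

# Path quoted in the comment above; it is a default argument so the function
# can be pointed at a different file when testing.
CPUIDLE_DRIVER = Path("/sys/devices/system/cpu/cpuidle/current_driver")


def current_idle_driver(path=CPUIDLE_DRIVER):
    """Return the active cpuidle driver name, or "" if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def workaround_active(driver):
    """True when the driver is acpi_idle, the state the workaround aims for."""
    return driver == "acpi_idle"


driver = current_idle_driver()
print(f"cpuidle driver: {driver or 'unknown'}; workaround active: {workaround_active(driver)}")
```

On a machine without that sysfs file the sketch reports the driver as unknown instead of failing.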

I interpret that as using ACPI idle states rather than Intel C-states. I do not know the impact on battery life yet. However, it should not be that bad, since acpi_idle should do pretty much the same thing except for allowing "during sleep processing", which as far as I can see is not used anywhere anyway.

Note: limiting C-states to 2 or 1 does not help at all; it seems the intel_idle driver in general causes problems when waking up. But I didn't really debug into it.

Note: clean journalctl using "journalctl --vacuum-size=100M" to free the spammed disk space

Note: I just added this to the SP9 wiki page: https://github.com/linux-surface/linux-surface/wiki/Surface-Pro-9 I'm surprised that I could edit it without being a member of the project, and even without a pull request.

RalphBariz commented 1 year ago

Switching to acpi_idle only lowered the probability of hitting this symptom; it still occurred. Another attempt worked out better.

To /etc/systemd/sleep.conf add:

[Sleep]
AllowSuspend=yes
AllowHibernation=no
AllowSuspendThenHibernate=no
AllowHybridSleep=no

Hibernate and HybridSleep seem to be the cause of both the error spam and the device not coming up from sleep. Again, the effect on battery drain during sleep could be troubling. For me, since the device is mostly plugged in at the moment, it is more important that it is reliably usable when waking up.
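To double-check which sleep modes such a file leaves enabled, the `[Sleep]` section can be parsed with Python's standard `configparser`. This is only a sketch: it looks at a single file's text and does not account for drop-in directories or other files systemd may merge, and `sleep_modes` is a hypothetical helper name.

```python
import configparser

# Contents copied from the comment above.
SLEEP_CONF = """\
[Sleep]
AllowSuspend=yes
AllowHibernation=no
AllowSuspendThenHibernate=no
AllowHybridSleep=no
"""


def sleep_modes(text):
    """Map each Allow* key in the [Sleep] section to True or False."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the keys' original capitalisation
    parser.read_string(text)
    section = parser["Sleep"]
    return {key: section.getboolean(key) for key in section if key.startswith("Allow")}


print(sleep_modes(SLEEP_CONF))
```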

peter-marshall5 commented 1 year ago

That seems interesting. I'll try the workaround myself.

However, I also got this symptom to happen when the device was plugged in and set to ignore the lid switch. Maybe going into sleep takes the kernel / firmware out of a glitched state?

RalphBariz commented 1 year ago

While it reliably wakes up now, the ACPI GPE storm is not completely gone; I still get it, but only rarely (once since yesterday, compared to pretty much every time on wake-up before). However, I still see ways to get rid of it.

First, there is the acpi_osi=! acpi_osi='Windows 2015' kernel parameter, which tells the firmware that the OS is Windows (2015 = first Windows 10 version), so that it provides Linux with the Windows interfaces, which Linux should be compatible with. However, this will mean trying around with different Windows versions. But since we are talking about M$, it's highly probable that this is an intended bug to keep people using Windows on the Surface, and surely also to continue harvesting their data. Not rendering it useless for Linux, but making it unfit for everyday tablet use with Linux, is achieved exactly by messing up the firmware's suspend interfaces. No one could blame them for it, no one can prove it's done intentionally, and they can say... well, Linux... we don't call it a cancer anymore and pretend to love it, but we also do not put effort into supporting it on those devices because Windows is our biz. https://discovery.endeavouros.com/acpi-kernel-parameters/acpi-kernel-parameters-and-how-to-choose-them/2021/03/

Second, it is possible to prevent the ACPI GPE storm by masking the GPE events using acpi_mask_gpe=0x51. However, this might end up as a heck of a kernel command line, since multiple interrupts are affected.
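To get a feel for how long that command line would become, the arguments can be generated from the GPE numbers in the dmesg excerpt at the top of this issue. This is a sketch under two assumptions: that the parameter is repeated once per GPE, as the remark above implies, and that masking all of these GPEs is harmless, which nobody in this thread has tested. `mask_args` is a made-up helper name.

```python
# GPE numbers reported in the dmesg excerpt at the top of this issue (hex).
REPORTED_GPES = [
    0x51, 0x53, 0x54, 0x55, 0x56, 0x57, 0x60, 0x63, 0x64, 0x65, 0x67, 0x68,
    0x6A, 0x6B, 0x6C, 0x6E, 0x70, 0x71, 0x72, 0x74, 0x75, 0x76, 0x77,
]


def mask_args(gpes):
    """Build one acpi_mask_gpe=0xNN argument per GPE, sorted and de-duplicated."""
    return " ".join(f"acpi_mask_gpe=0x{gpe:02X}" for gpe in sorted(set(gpes)))


print(mask_args(REPORTED_GPES))
```

The resulting string would go into GRUB_CMDLINE_LINUX_DEFAULT like the other parameters mentioned in this thread.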

I kinda bet it will work by faking a Windows kernel to the firmware. I also reverted the sleep config to see whether hybrid sleep and hibernation then work as well.

peter-marshall5 commented 1 year ago

I really doubt that there would be intentional bugs meant to keep people using Windows. However, laptop firmware is known for sometimes applying bizarre "fixes" when booting a non-Windows OS.

Here are the strings matching "Windows" in the DSDT and SSDT tables:

Windows 2001
Windows 2001 SP1
Windows 2001 SP2
Windows 2001.1
Windows 2006
Windows 2009
Windows 2012
Windows 2013
Windows 2015

I might try digging through the ACPI DSDT tables to see what it does differently for Linux.
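A list like the one above can be pulled out of a raw table dump with a short script. The sketch below works on any bytes object; the sample bytes are synthetic stand-ins, not data from a real DSDT, and `osi_strings` is a hypothetical helper name. On a running system the raw tables are usually readable by root under /sys/firmware/acpi/tables/.

```python
import re

# Printable ASCII following "Windows "; strings in the table end at a NUL
# byte, so the match stops there on its own.
OSI_PATTERN = re.compile(rb"Windows [\x20-\x7e]+")


def osi_strings(table):
    """Return the distinct "Windows ..." strings in a raw ACPI table, sorted."""
    return sorted({m.group().decode("ascii") for m in OSI_PATTERN.finditer(table)})


# Synthetic bytes standing in for a table dump (not taken from a real DSDT).
sample = b"\x0d_OSI\x0dWindows 2012\x00\xa0\x0cWindows 2015\x00Windows 2012\x00"
print(osi_strings(sample))
```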

RalphBariz commented 1 year ago

Yeah, funny thing. While Windows 2015 is contained in the DSDT table, Windows 2020 isn't. However, Windows 2015 shows the same symptoms, while 2020 has run stable so far: it always wakes up, even from deep sleep, and on AC power without a GPE storm. acpi_osi=! alone also fails, so it's not as if Windows 2020 does the same as nothing. Why a device that can only run newer Windows 10 and Windows 11 does not contain their OSI strings in the DSDT table, yet obviously behaves differently (yes, correctly) when the latest Windows 10 string is passed... you might want to spend a thought or two on that. We could try the same with 2022, but I personally am satisfied to have it working. A long-term test will show whether that sustainably did the trick.

Regarding intentional bugs... well, I've been with Linux since the early days, the 90s. And yes, it was always a fight. Even in the Windows 8 era, updates regularly broke GRUB, or in earlier times LILO. You can imagine the mess of restoring it. That bug was always a big complaint towards M$, and it never got better, only worse. It only works now because M$ plays the Linux friend and, well, the EFI boot method made it harder to say whoops. Strategies change... the typical character and interests of managers and their usual Machiavellianism, not so much.

peter-marshall5 commented 1 year ago

Interesting. For me, adding acpi_osi="Windows 2020" to the cmdline does not seem to have any effect, and acpi_osi=! acpi_osi="Windows 2020" does not seem to work either. Suspend seems to always work for me, and the bug is only triggered by closing the type cover while the device is powered on and plugged in. It also doesn't seem to happen after the first suspend. Maybe the OS is expected to set up GPE events manually, so this may be fixed when the surface-gpe module supports the SP9.
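When a parameter appears to have no effect, it can help to confirm that it actually reached the kernel. The sketch below extracts every value of one parameter from a command-line string such as the contents of /proc/cmdline. The example line is made up apart from the two acpi_osi arguments quoted above, and `cmdline_values` is a hypothetical helper name.

```python
import shlex


def cmdline_values(cmdline, name):
    """Return every value given for `name=` on a kernel command line, in order."""
    prefix = name + "="
    return [tok[len(prefix):] for tok in shlex.split(cmdline) if tok.startswith(prefix)]


# Example line; only the acpi_osi arguments come from the comment above,
# the rest is made up for illustration.
example = 'BOOT_IMAGE=/vmlinuz root=/dev/sda2 quiet acpi_osi=! acpi_osi="Windows 2020"'
print(cmdline_values(example, "acpi_osi"))
```

`shlex.split` is used so that a quoted value containing a space stays in one piece.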

qzed commented 1 year ago

From my experience, MS doesn't do any OSI/OS matching shenanigans any more. So far, all ACPI related issues we've encountered were differences in how Windows and Linux treat things.

Regarding surface-gpe support: This module essentially configures the GPE associated with the lid/typecover to be wakeup capable. On the SP9, that is GPE 0x52. I'll add support for that, but I expect that this does not affect the issue. (The only effect it may have is that it will wake the device up if it gets triggered when suspended.)

The problem you're encountering seems to pretty much trigger every GPE around for some reason, however only some are handled. It complains about the ones without any associated handlers (notice how 52 is absent in the log above). So we need to figure out why pretty much every GPE is being triggered somehow.
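The observation about GPE 52 can be checked mechanically by collecting the GPE numbers from the "No handler or method" lines. A minimal sketch follows, using three lines copied from the dmesg excerpt at the top of this issue; `unhandled_gpes` is a made-up helper name, and the numbers are treated as hexadecimal because values such as 6A appear in the log.

```python
import re

GPE_LINE = re.compile(r"No handler or method for GPE ([0-9A-Fa-f]+), disabling event")


def unhandled_gpes(log):
    """Return the GPE numbers (as ints) named in "No handler or method" lines."""
    return sorted({int(m.group(1), 16) for m in GPE_LINE.finditer(log)})


# Three lines copied from the dmesg excerpt at the top of this issue.
sample = """\
[ 2694.375126] ACPI Error: No handler or method for GPE 51, disabling event (20221020/evgpe-839)
[ 2694.375133] ACPI Error: No handler or method for GPE 53, disabling event (20221020/evgpe-839)
[ 2694.375191] ACPI Error: No handler or method for GPE 6A, disabling event (20221020/evgpe-839)
"""

gpes = unhandled_gpes(sample)
print([f"0x{g:02X}" for g in gpes])
print("0x52 reported:", 0x52 in gpes)
```

Fed with the full `dmesg` output instead of the sample, the same function lists every GPE the kernel disabled for lack of a handler.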

qzed commented 1 year ago

> Regarding surface-gpe support: This module essentially configures the GPE associated with the lid/typecover to be wakeup capable. On the SP9, that is GPE 0x52. I'll add support for that, but I expect that this does not affect the issue. (The only effect it may have is that it will wake the device up if it gets triggered when suspended.)

Actually, can you provide the output of sudo dmidecode -t system for that (I don't need the serial number or UUID, just Manufacturer, Product Name, and SKU)?
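For sharing only the three requested fields, the dmidecode text can be filtered before posting. This is a sketch, not a vetted tool: the sample uses the values reported in the reply below plus a placeholder serial-number line that is purely illustrative, and `system_fields` is a hypothetical helper name. Only the single separator space after the colon is dropped, assuming dmidecode prints exactly one, so a value that itself begins with a space keeps it.

```python
WANTED = ("Manufacturer", "Product Name", "SKU Number")


def system_fields(dmidecode_output):
    """Pick the wanted fields out of `dmidecode -t system` text."""
    fields = {}
    for line in dmidecode_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in WANTED:
            # Drop only the single separator space after the colon.
            fields[key] = value[1:] if value.startswith(" ") else value
    return fields


# Layout and values as in the reply below; the Serial Number line is a
# placeholder added to show that unwanted fields are filtered out.
sample = """\
System Information
        Manufacturer:  Microsoft Corporation
        Product Name: Surface Pro 9
        SKU Number: Surface_Pro_9_2038
        Serial Number: (omitted)
"""
print(system_fields(sample))
```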

peter-marshall5 commented 1 year ago

No problem. Here are the relevant fields:

# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x000A, DMI type 1, 27 bytes
System Information
        Manufacturer:  Microsoft Corporation
        Product Name: Surface Pro 9
        SKU Number: Surface_Pro_9_2038

qzed commented 1 year ago

The next release should have wake-via-typecover support.

vbauerster commented 1 year ago

I suppose kernel >= 6.1 should solve this?

https://www.phoronix.com/news/Linux-6.1-MS-Surface-Pro-9

qzed commented 1 year ago

The ACPI error spam? Unfortunately not. The patches described in that article are the SAM/EC ones, so essentially typecover and battery support, which have been in our kernels for a while. I am still unsure where the GPE spam comes from, but it is probably something ACPI, CPU, or chipset related.

peter-marshall5 commented 10 months ago

Just saying that this is still not fixed as of kernel 6.4.4. Booting with the option acpi=noirq stops this issue from happening but also breaks type cover support.

RalphBariz commented 10 months ago

I haven't had a GPE storm for a while with deep sleep deactivated. However, I still have problems when shutting down, and wake-up is slow. I figured out that the battery saver BIOS setting might play a part in this. It apparently has a problem with the battery being reported as discharging while the device is connected to AC.

peter-marshall5 commented 4 months ago

I haven't had a GPE storm come up since applying these settings in my TLP config. Not sure exactly which option mitigates the issue yet. I am running Linux 6.7.1 in case that matters.

CPU_ENERGY_PERF_POLICY_ON_AC=balance_performance
CPU_ENERGY_PERF_POLICY_ON_BAT=balance_power
PCIE_ASPM_ON_BAT=powersupersave
PLATFORM_PROFILE_ON_AC=balanced
PLATFORM_PROFILE_ON_BAT=low-power
RUNTIME_PM_ON_AC=auto
RUNTIME_PM_ON_BAT=auto
SOUND_POWER_SAVE_CONTROLLER=

peter-marshall5 commented 3 weeks ago

I had to modify the GPE driver by adding a space at the beginning of the vendor string for it to load properly on my SP9. Suspend and resume using the lid switch works with it loaded, but it takes around 30 seconds to resume. I'm also getting a bit of GPE error spam in the dmesg logs right after resuming and the speakers start to make a periodic ticking noise. Maybe this is related to the root cause of the GPE spam issues above?

cwittenberg commented 2 weeks ago

This issue appears to be the same as #1446. I closed that one to avoid redundancy, but it may have relevant logs.

peter-marshall5 commented 2 days ago

Could this Linux commit be relevant?

peter-marshall5 commented 2 days ago

I found that the speaker clicking noise after suspending via the lid stops after a minute or so, and suspending via the lid works properly again afterward. (Suspending otherwise would cause the device to get stuck.)

peter-marshall5 commented 1 day ago

I just managed to get suspend via the lid working while playing around with disabling GPE interrupts via /sys/firmware/acpi/interrupts/gpeXX. I'll try to narrow down the affected interrupt.
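One way to narrow this down is to compare the per-GPE counters in that sysfs directory before and after triggering the problem. The sketch below assumes each gpeXX file starts with its event count followed by status words; that layout and the helper names `gpe_counts` and `busiest` are assumptions for illustration, not something confirmed in this thread.

```python
from pathlib import Path

GPE_DIR = Path("/sys/firmware/acpi/interrupts")


def gpe_counts(directory=GPE_DIR):
    """Return {name: count} for every gpeXX file whose count is non-zero."""
    counts = {}
    for entry in sorted(directory.glob("gpe[0-9A-Fa-f][0-9A-Fa-f]")):
        try:
            count = int(entry.read_text().split()[0])
        except (OSError, IndexError, ValueError):
            continue  # unreadable or unexpected format: skip rather than guess
        if count:
            counts[entry.name] = count
    return counts


def busiest(before, after):
    """Rank GPEs by how much their counter grew between two snapshots."""
    growth = {name: after[name] - before.get(name, 0) for name in after}
    return sorted((n for n in growth if growth[n] > 0), key=lambda n: -growth[n])
```

Taking one `gpe_counts()` snapshot before closing the type cover and one after, then passing both to `busiest`, would list the GPEs that fired in between, most active first.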

peter-marshall5 commented 12 hours ago

It seems like specifying acpi_sci=edge prevents the logs from being flooded with repeating ACPI errors, although the errors still appear once. It also seemingly prevents the lid switch from working more than once, though.