linux-surface / linux-surface

Linux Kernel for Surface Devices

Surface Laptop 5 does not sleep well #1010

Open mbrennwa opened 1 year ago

mbrennwa commented 1 year ago

My Surface Laptop 5 drains battery within a few hours of sleep (suspend). I am not sure if this is a bug or limitation of the surface kernel, or if it's just a configuration issue. Any thoughts on this?

Environment

NP-chaonay commented 1 year ago

Which USB devices are connected to the machine?

Also, could you capture the dmesg output (as text), covering the complete boot process and then a suspend that you trigger?

qzed commented 1 year ago

This does not sound like correct behavior and indeed something that we should try to fix. As @NP-chaonay mentioned, a dmesg log including one suspend cycle (or more) might help.

mbrennwa commented 1 year ago

I did a shutdown, fresh boot, then sleep/suspend, then wake up, then dmesg. The output is attached.

dmesg_after_suspend.txt

mbrennwa commented 1 year ago

Update:

The SL5 spent the night in sleep/suspend with very little battery drain. The difference was that I explicitly told it to go to sleep/suspend via Gnome shutdown UI.

When it did drain the battery before, I simply closed the lid while the laptop was running. According to Gnome Tweak Tool, the system is configured to suspend when the lid is closed. Obviously the lid close somehow did not trigger the suspend correctly. Is this related to the Surface kernel, or something else? How can I look into this?

NP-chaonay commented 1 year ago

@mbrennwa

Could you run the evtest command as root?

"it list device including ACPI lid"

Then select the device named "ACPI lid",

close the lid, and reopen it. Does the command output show the captured events?

mbrennwa commented 1 year ago

I don't quite understand what you want me to do with the "it list device including ACPI lid" part.

Below is what I did/found. I closed and opened the lid, but evtest did not report anything.

root@salami:~# evtest
No device specified, trying to scan all of /dev/input/event*
Available devices:
/dev/input/event0:  Lid Switch
/dev/input/event1:  Video Bus
/dev/input/event10: HDA Intel PCH Mic
/dev/input/event11: HDA Intel PCH Headphone
/dev/input/event12: HDA Intel PCH HDMI/DP,pcm=3
/dev/input/event13: HDA Intel PCH HDMI/DP,pcm=7
/dev/input/event14: HDA Intel PCH HDMI/DP,pcm=8
/dev/input/event15: HDA Intel PCH HDMI/DP,pcm=9
/dev/input/event16: HDA Intel PCH HDMI/DP,pcm=10
/dev/input/event17: HDA Intel PCH HDMI/DP,pcm=11
/dev/input/event18: HDA Intel PCH HDMI/DP,pcm=12
/dev/input/event19: HDA Intel PCH HDMI/DP,pcm=13
/dev/input/event2:  PC Speaker
/dev/input/event20: HDA Intel PCH HDMI/DP,pcm=14
/dev/input/event21: HDA Intel PCH HDMI/DP,pcm=15
/dev/input/event22: HDA Intel PCH HDMI/DP,pcm=16
/dev/input/event23: HDA Intel PCH HDMI/DP,pcm=17
/dev/input/event3:  gpio-keys
/dev/input/event4:  gpio-keys
/dev/input/event5:  Microsoft Surface 045E:09AE Keyboard
/dev/input/event6:  Surface Camera Front: Surface C
/dev/input/event7:  Surface Camera Front: Surface I
/dev/input/event8:  Microsoft Surface 045E:09AF Mouse
/dev/input/event9:  Microsoft Surface 045E:09AF Touchpad
Select the device event number [0-23]: 0
Input driver version is 1.0.1
Input device ID: bus 0x19 vendor 0x0 product 0x5 version 0x0
Input device name: "Lid Switch"
Supported events:
  Event type 0 (EV_SYN)
  Event type 5 (EV_SW)
    Event code 0 (SW_LID) state 0
Properties:
Testing ... (interrupt to exit)
^C
root@salami:~# 

NP-chaonay commented 1 year ago

Ok I found the problem

The problem is about ACPI

But I don't know how to fix it; perhaps the driver is not implemented.

mbrennwa commented 1 year ago

What do you mean by "about ACPI"?

qzed commented 1 year ago

@mbrennwa When you logged the suspend above, did you close the lid as well?

The evtest log above should show some lid events, which is what @NP-chaonay is referring to. The lid "device" is a standard ACPI lid device. As far as I can tell there is no significant difference between the SL4 and the SL5 implementation in the ACPI DSDT.

The only difference that I can think of is that we don't have SL5 support in the surface-gpe driver yet, but that should normally only affect waking via the lid. I've implemented that now for testing. Could you build and load the surface-gpe module from the devices/sl5 branch and see if evtest output some lid events with that?

brennmat commented 1 year ago

When you logged the suspend above, did you close the lid as well?

After telling GNOME to suspend the computer and after the display went black, I closed the lid.

Could you build and load the surface-gpe module from the devices/sl5 branch and see if evtest output some lid events with that?

I am not sure what to do, and how to do it. Are there any instructions out there?

qzed commented 1 year ago

I am not sure what to do, and how to do it. Are there any instructions out there?

Ah sorry. Here's how:

  1. Clone the module and checkout the branch:
    • git clone https://github.com/linux-surface/surface-gpe.git
    • cd surface-gpe
    • git checkout devices/sl5
  2. Build the module
    • cd module
    • make all
  3. Load the module
    • sudo modprobe surface_gpe

mbrennwa commented 1 year ago

Hmmmm, didn't quite work:

brennmat@salami:~/Ablage/surface-gpe/module$ make all
make -C /lib/modules/"6.0.12-surface"/build M=/home/brennmat/Ablage/surface-gpe/module modules
make[1]: Entering directory '/usr/src/linux-headers-6.0.12-surface'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
  You are using:           gcc (Debian 12.2.0-10) 12.2.0
make[1]: Leaving directory '/usr/src/linux-headers-6.0.12-surface'
brennmat@salami:~/Ablage/surface-gpe/module$ sudo modprobe surface_gpe

qzed commented 1 year ago

I think that's only a warning. Was the module built (i.e. is there a surface_gpe.ko file in the directory)?

Also I messed up and the command should be sudo insmod surface_gpe.ko and not modprobe. But you may have to unload the module provided by the kernel first. So run

  1. sudo modprobe -r surface_gpe
  2. sudo insmod surface_gpe.ko

mbrennwa commented 1 year ago

The surface_gpe.ko file is there. However:

brennmat@salami:~/Ablage/surface-gpe/module$ sudo insmod surface_gpe.ko 
insmod: ERROR: could not insert module surface_gpe.ko: Key was rejected by service

qzed commented 1 year ago

Ah, that means that secureboot is blocking the module. You can either manually sign the module or disable secureboot for testing it.

NP-chaonay commented 1 year ago

@mbrennwa

What do you mean by "about ACPI"?

A simple explanation: ACPI is a standard interface that lets software (such as the kernel) communicate with the hardware/mainboard. For example, without it the kernel would have to know every device component in order to issue shutdown/hibernate/suspend commands. ACPI removes much of that work from kernel/OS development, though not all of it.

For more detail: https://en.wikipedia.org/wiki/ACPI

@qzed

Yeah, I think surface_gpe has nothing to do with this, but I am still fairly sure that something is going on with ACPI,

since the SL5 is similar to the SP9, which already has some kind of ACPI problem (ref: https://github.com/linux-surface/linux-surface/wiki/Surface-Pro-9).

@mbrennwa It may be useful for both the SP9 and the SL5 if you could test whether the symptoms seen on the SP9 also occur on your device.

mbrennwa commented 1 year ago

Ah, that means that secureboot is blocking the module. You can either manually sign the module or disable secureboot for testing it.

I turned off Secure Boot on my machine, but then it refused to boot. It just showed the Surface logo and a red bar with an unlock logo on the top of the screen and then didn't move. It did not show the GRUB screen.

Sorry, I am a bit lost with this.

qzed commented 1 year ago

Hmm, that issue is new to me. Is there any error message on the screen? Did you change the boot entry / order? Any chance a reset (hold power for ~20s) helps?

mbrennwa commented 1 year ago

No error message. Did not change the boot entries / order. I pressed the power button for about 50 seconds, and the machine tried to reboot a few times. Even after this "treatment", the machine does not go beyond the screen with the Surface logo and the red bar with the "unlock" logo on top even after 5 minutes.

I turned on Secure Boot again, and now the machine boots again.

How can I sign the kernel module so I can test it? (Sorry, I am a total noob with this)

qzed commented 1 year ago

I think the easiest way for that would be if DKMS is set up already. Given that you're on Debian Sid, there's a good chance that this is the case (I think older Debian versions didn't do that). Could you check if dkms is installed (if not do that) and the following files are present:

/var/lib/dkms/mok.key
/var/lib/dkms/mok.pub

If they are: Enroll them as shown here. Skip the "Verifying if a module is signed" step and first enroll the MOK by rebooting.

After rebooting, you can try to install the surface_gpe module by running sudo make dkms-install inside the module/ directory. This should then 1) install the module to replace the one already in the kernel and 2) automatically sign the module.

To verify point 2, run modinfo surface_gpe. In the signer: field, this should read DKMS module signing key or something along those lines. Also check the filename: field, which should contain a path like updates/dkms to verify that it's indeed the one you just installed and not the one from the kernel.

A good idea is probably also to verify that the key has been enrolled. The sig_key: field shows the key used for signing. If you run mokutil --list-enrolled, it should show up there (you can just do sudo mokutil --list-enrolled | grep -i <key>).

After that, you can reboot or reload the module via modprobe -r surface_gpe and modprobe surface_gpe (this time really modprobe and not insmod).
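
The signing check described above can be scripted. What follows is only a rough sketch: it assumes that modinfo prints a sig_key: line and that the same key string shows up somewhere in the mokutil --list-enrolled output, as the grep suggestion above implies. The helper names extract_sig_key and key_is_enrolled are made up for illustration.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of any tool.

# Print the value of the "sig_key:" field from modinfo output on stdin.
extract_sig_key() {
    awk '$1 == "sig_key:" { print $2; exit }'
}

# Succeed if key $1 appears (case-insensitively) in the listing on stdin.
# An empty key is rejected, since it would match every line.
key_is_enrolled() {
    [ -n "$1" ] || return 1
    grep -qiF -- "$1"
}

# Intended use on a real system (mokutil needs root):
#   key="$(modinfo surface_gpe | extract_sig_key)"
#   sudo mokutil --list-enrolled | key_is_enrolled "$key" && echo "key enrolled"
```

If the last command prints nothing, the key is most likely not enrolled yet, and inserting the module would be rejected under Secure Boot.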

mbrennwa commented 1 year ago

dkms was not installed on my machine. After installing it, /var/lib/dkms/ is empty. I feel that I am not up to this task.

@qzed I might be able to provide remote access to my machine, but I guess we should communicate privately about this. Please get in touch with me if you're interested to go this way.

qzed commented 1 year ago

I think it might be possible that the key is only set up when dkms is used. Could you try to install the module via sudo make dkms-install and check /var/lib/dkms/ again?

Thanks for the offer, but I'm somewhat hesitant with something like this. Let's try to explore some other options first.

mbrennwa commented 1 year ago

No luck:

brennmat@salami:~$ sudo make dkms-install
make: *** No rule to make target 'dkms-install'.  Stop.

StollD commented 1 year ago

You are in your home directory, not in surface-gpe/module (surface-gpe being the repository you cloned)

mbrennwa commented 1 year ago

Oups! I was clueless about what I was doing...

Anyway, here's a bit of progress, but not yet there (the last grep command should not give empty output, I believe):

brennmat@salami:~/Ablage/surface-gpe/module$ sudo modinfo surface_gpe
filename:       /lib/modules/6.0.12-surface/updates/dkms/surface_gpe.ko
alias:          dmi:*:svnMicrosoftCorporation:pnSurface*:*
license:        GPL
description:    Surface GPE/Lid Driver
author:         Maximilian Luz <luzmaximilian@gmail.com>
srcversion:     C6D2256C434E4AC8BC1A50F
depends:        
retpoline:      Y
name:           surface_gpe
vermagic:       6.0.12-surface SMP preempt mod_unload modversions 
sig_id:         PKCS#7
signer:         DKMS module signing key
sig_key:        68:35:14:2A:BF:59:17:D5:FA:84:AE:32:DF:89:A5:2D:39:8A:B6:A0
sig_hashalgo:   sha512
signature:      80:93:4A:AD:83:95:11:C2:8F:E3:7F:2B:5A:3A:60:18:52:6A:87:16:
        E1:F2:B7:C2:C4:7B:2E:96:69:6B:E9:0F:F1:EB:25:95:2D:78:F2:76:
        B1:94:07:02:9F:79:80:CA:45:05:7B:6C:AD:C3:36:AE:36:99:67:2F:
        D7:37:C5:99:83:CF:43:E3:79:53:31:F2:81:1C:3D:3C:AF:98:16:D6:
        BF:E2:C0:46:9B:B1:A3:0D:15:7E:BE:4E:2E:2D:DC:8D:B4:85:96:36:
        09:77:3D:7C:A2:3A:57:38:00:0C:AE:C8:D7:29:37:47:2D:C2:84:E9:
        8C:CD:AA:FA:AF:0F:23:EE:98:CF:46:2E:5A:AD:43:06:E7:3E:93:9D:
        D2:A5:C2:3B:33:04:12:FC:38:F1:15:AB:46:31:FF:00:93:B6:5E:18:
        ED:91:A5:C2:6A:9E:DD:8F:89:A9:24:8B:69:D7:77:D0:B8:E9:08:9D:
        0D:FD:56:58:FA:71:D2:A6:C7:60:83:CC:B4:E3:D3:52:25:FD:00:7A:
        1B:D0:F1:87:80:92:21:02:6D:81:F9:45:38:23:53:AA:1B:31:46:BD:
        60:D7:8E:AD:6F:1F:FB:4C:C6:A2:78:F8:87:CD:7D:FF:25:3C:E4:FA:
        92:58:EF:98:3D:5A:81:02:96:AC:9C:B1:1E:F5:B1:0B
brennmat@salami:~/Ablage/surface-gpe/module$ sudo mokutil --list-enrolled | grep -i 68:35:14:2A:BF:59:17:D5:FA:84:AE:32:DF:89:A5:2D:39:8A:B6:A0
brennmat@salami:~/Ablage/surface-gpe/module$ 

qzed commented 1 year ago

So far so good! Does /var/lib/dkms/mok.pub exist now? The module is definitely signed with a key, so we just need to enroll it. If it does not exist we'll need to figure out where it is stored.

If it does exist/after we've found it, enroll the key via sudo mokutil --import /var/lib/dkms/mok.pub. This asks you for a one-time password, choose anything you like. You'll only need to remember that for the next boot. Then verify with sudo mokutil --list-new that the key is pending enrollment. Finally, reboot and the MOK manager (blue menu thing) should pop up. Choose "enroll MOK" (or something along those lines), confirm, enter the password you chose, and confirm/select ok until you can choose to reboot.

In the (hopefully very unlikely) event that the kernel refuses to boot after that: Try blacklisting the surface_gpe module. For that, select the boot entry in GRUB, press e and add module_blacklist=surface_gpe to the kernel command line (the line starting with linux, just add it to the end after the other options).

mbrennwa commented 1 year ago

Ok, I was able to enroll the key and test the lid switch again. Yes, the computer now seems to register the lid switch! However, when I re-open the lid after a few seconds, I cannot get the machine back to life. The keyboard lights are still on, while the screen is black (and stays black).

qzed commented 1 year ago

I assume it also doesn't react to normal power button presses?

Could you test two things and check if the issue still happens with either:

mbrennwa commented 1 year ago

Suspend via power menu or systemctl suspend suspends the machine and waking up works fine (both before and after modprobe -r surface_gpe).

However, I just noticed that my previous observation was not entirely accurate. The machine reacts to closing the lid only if I use evtest. Without running evtest, the machine does not react if I close the lid.

qzed commented 1 year ago

That is quite odd. evtest should only listen and not do anything / have any impact on other parts of the system. Any chance you have acpid or something similar installed? Also, can you check what the HandleLidSwitch option in /etc/systemd/logind.conf is set to and what the corresponding GNOME option is set to?

mbrennwa commented 1 year ago

brennmat@salami:~$ apt list | grep -i acpid
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
acpid/now 1:2.0.33-2+b1 amd64 [installed,local]
brennmat@salami:~$ cat /etc/systemd/logind.conf | grep -i HandleLid
#HandleLidSwitch=suspend
#HandleLidSwitchExternalPower=suspend
#HandleLidSwitchDocked=ignore

Gnome Tweaks: "Suspend when laptop lid is closed = on"
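
Note that all three HandleLid* lines in that output are commented out, so logind falls back to its built-in defaults. According to logind.conf(5), these are suspend for HandleLidSwitch and ignore for HandleLidSwitchDocked. The following sketch reports the effective value for one key. The helper name effective_setting is made up for illustration, and drop-in files under /etc/systemd/logind.conf.d/ are not considered.

```shell
#!/bin/sh
# Sketch only: effective_setting is a hypothetical helper, and drop-in
# files under /etc/systemd/logind.conf.d/ are deliberately ignored.

# Usage: effective_setting KEY DEFAULT < logind.conf
# Prints the last uncommented "KEY=value" found on stdin, else DEFAULT.
effective_setting() {
    awk -v key="$1" -v def="$2" '
        /^[ \t]*[#;]/ { next }                 # skip commented-out lines
        index($0, "=") > 0 {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim the value
            if (k == key) { found = v; seen = 1 }
        }
        END { print (seen ? found : def) }
    '
}

# Example:
#   effective_setting HandleLidSwitch suspend < /etc/systemd/logind.conf
```

The default has to be passed in by hand because the sketch does not know logind's built-in values.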

qzed commented 1 year ago

Ah, acpid could cause some problems. In particular when both acpid and Gnome try to interpret the lid switch event at the same time. Could you try again after uninstalling acpid?

brennmat commented 1 year ago

Ok, it's a new day, and with a fresh mind I did a few more tests in a somewhat more systematic way. I realised that the lid-close result depends on how long the lid remains closed. This may explain some of the erratic observations so far.

Without evtest running:

(a) Closing the lid, then open the lid after approx. 2 seconds: the machine does (try) to sleep

(b) Closing the lid, then open the lid after approx. 5 seconds: keyboard light is on, screen is black, and (usually) does not wake up. If I leave it for a few seconds, the keyboard light turns off. Pressing any key will bring back the keyboard light, but the screen usually remains black. The screen brightness button (F6) does not allow turning on the screen. In some (rare) cases the screen would turn on again after typing randomly on the keyboard, and the machine then seemed to work as expected. However, in most cases I needed to reset the machine by pressing the power button for a few seconds.

With evtest running: Same as (b) above, even if I re-open the lid immediately after closing it.

acpid was installed by default with my Debian system (same on another non-Surface laptop). After removing acpid from the SL5 I could not observe a clear change from the behaviours described above. I re-installed acpid for now.

Some new observation about power off: If I power off the machine from the Gnome power menu (not suspend/sleep), the machine seems to shut down, but the battery will drain a lot (much more than if I put it to suspend; battery was at 15% before power off yesterday night, and it was completely drained this morning). It seems the hardware is not powered down completely. Is this a related issue, or something else?

NP-chaonay commented 1 year ago

(b) Closing the lid, then open the lid after approx. 5 seconds: keyboard light is on, screen is black, and (usually) does not wake up. If I leave it for a few seconds, the keyboard light turns off. Pressing any key will bring back the keyboard light, but the screen usually remains black. The screen brightness button (F6) does not allow turning on the screen. In some (rare) cases the screen would turn on again after typing randomly on the keyboard, and the machine then seemed to work as expected.

This is the same as on my machine. Perhaps something is wrong in the design of the graphics stack, or it is even a GPU driver bug. Maybe you could try pressing random keys (except the power button) or pressing "ESC" repeatedly (I am not sure I remember correctly whether this works).

(a) Closing the lid, then open the lid after approx. 2 seconds: the machine does (try) to sleep

I wonder if this happens because the system is still in the process of suspending during the 2 seconds after the lid is closed, so opening the lid during that phase does not wake it up.

NP-chaonay commented 1 year ago

With evtest running: Same as (b) above, even if I re-open the lid immediately after closing it.

This is strange to me, and I cannot explain it at the moment.

Some new observation about power off: ...

See https://github.com/linux-surface/linux-surface/wiki/Surface-Pro-9

The problem you are facing is the same one that SP9 users have faced.

Have a look at the section about power operations to see how it differs from your case. Note that this is not solved yet, but I have described a workaround in that wiki page.

NP-chaonay commented 1 year ago

With evtest running: Same as (b) above, even if I re-open the lid immediately after closing it.

Another hypothesis: when evtest is not running, the sensor polling rate is low, so lid events are detected slowly,

but when you run evtest, it (the driver?) increases the polling rate.

(You could test this by disabling suspend-on-lid and checking the lid events in evtest. If this hypothesis holds, it should output data immediately, I think.)

See https://github.com/linux-surface/linux-surface/wiki/Surface-Pro-9

I have now created the device page for the SL5. You can still look at the SP9 page to compare the cases and test whether the SP9 problems also occur on your device. Feel free to reuse content from the SP9 page in the SL5 page to make writing the wiki easier.

NP-chaonay commented 1 year ago

@mbrennwa @aj3423 (as the SP9 user who first faced this problem)

I think we could open separate issues for these ACPI bugs, by the way.

For both of you, I have created a page for the Intel 12th-generation device problems: https://github.com/linux-surface/linux-surface/wiki/Intel-12th-Generation-Devices-Issues (to aj3423: there is no new content for you here; I just moved the existing content from the SP9 page, just to let you know)

qzed commented 1 year ago

Another hypothesis: when evtest is not running, the sensor polling rate is low, so lid events are detected slowly,

but when you run evtest, it (the driver?) increases the polling rate.

There shouldn't really be any polling. Everything is event-/interrupt-based.

(You could test this by disabling suspend-on-lid and checking the lid events in evtest. If this hypothesis holds, it should output data immediately, I think.)

This is a good idea. Can you check if everything works as expected (i.e. nothing breaks) when you disable suspend-on-lid.

And just to understand everything correctly: Suspend via the menu does work as expected, right? Absolutely no issues there and it's just via the lid? The reason I ask is because the only time I've seen a similar issue was when multiple parties (e.g. both acpid and Gnome) tried to suspend/do something on the lid event.

However, if we can rule that out and suspend via the menu works fine, I have no idea how to further debug this. This would indicate that for all intents and purposes, the drivers and devices suspend fine, and it's somehow just the lid event causing problems. For suspending, the lid event should be already successfully processed before the device enters suspend, so those are two largely disconnected things (since user-space decides what to do with the event, and user-space is the last party in the chain). Which would mean that the problem is at some point during waking via the lid event? Maybe there's a way to test that specifically...

qzed commented 1 year ago

Another note: If the keyboard backlight is on, the SAM driver is not suspended / has already been woken up.

mbrennwa commented 1 year ago

(You could test this by disabling suspend-on-lid and checking the lid events in evtest. If this hypothesis holds, it should output data immediately, I think.)

This is a good idea. Can you check if everything works as expected (i.e. nothing breaks) when you disable suspend-on-lid.

I turned off "Suspend when laptop lid is closed" in Gnome Tweaks, and yes, this did the trick to see evtest working:

brennmat@salami:~$ sudo evtest
No device specified, trying to scan all of /dev/input/event*
Available devices:
/dev/input/event0:  Lid Switch
/dev/input/event1:  Video Bus
/dev/input/event10: HDA Intel PCH Mic
/dev/input/event11: HDA Intel PCH Headphone
/dev/input/event12: HDA Intel PCH HDMI/DP,pcm=3
/dev/input/event13: HDA Intel PCH HDMI/DP,pcm=7
/dev/input/event14: HDA Intel PCH HDMI/DP,pcm=8
/dev/input/event15: HDA Intel PCH HDMI/DP,pcm=9
/dev/input/event16: HDA Intel PCH HDMI/DP,pcm=10
/dev/input/event17: HDA Intel PCH HDMI/DP,pcm=11
/dev/input/event18: HDA Intel PCH HDMI/DP,pcm=12
/dev/input/event19: HDA Intel PCH HDMI/DP,pcm=13
/dev/input/event2:  PC Speaker
/dev/input/event20: HDA Intel PCH HDMI/DP,pcm=14
/dev/input/event21: HDA Intel PCH HDMI/DP,pcm=15
/dev/input/event22: HDA Intel PCH HDMI/DP,pcm=16
/dev/input/event23: HDA Intel PCH HDMI/DP,pcm=17
/dev/input/event3:  gpio-keys
/dev/input/event4:  gpio-keys
/dev/input/event5:  Microsoft Surface 045E:09AE Keyboard
/dev/input/event6:  Microsoft Surface 045E:09AF Mouse
/dev/input/event7:  Microsoft Surface 045E:09AF Touchpad
/dev/input/event8:  Surface Camera Front: Surface C
/dev/input/event9:  Surface Camera Front: Surface I
Select the device event number [0-23]: 0
Input driver version is 1.0.1
Input device ID: bus 0x19 vendor 0x0 product 0x5 version 0x0
Input device name: "Lid Switch"
Supported events:
  Event type 0 (EV_SYN)
  Event type 5 (EV_SW)
    Event code 0 (SW_LID) state 0
Properties:
Testing ... (interrupt to exit)
Event: time 1671625516.487176, type 5 (EV_SW), code 0 (SW_LID), value 1
Event: time 1671625516.487176, -------------- SYN_REPORT ------------
Event: time 1671625517.412664, type 5 (EV_SW), code 0 (SW_LID), value 0
Event: time 1671625517.412664, -------------- SYN_REPORT ------------
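
For reading such a log: SW_LID value 1 means the lid is closed and value 0 means it is open, so the two events above are one close followed by a reopen about 0.9 seconds later. A small sketch for pulling the lid events out of a saved evtest log follows. The helper name lid_events is made up for illustration.

```shell
#!/bin/sh
# Sketch only: lid_events is a hypothetical helper name.

# Read evtest output on stdin and print one line per SW_LID event:
#   "<timestamp> closed" for value 1, "<timestamp> open" for value 0.
lid_events() {
    awk '
        /\(SW_LID\), value [01]$/ {
            ts = $3
            sub(/,$/, "", ts)                  # drop the trailing comma
            print ts, ($NF == 1 ? "closed" : "open")
        }
    '
}

# Example, on a log saved from a run of "sudo evtest /dev/input/event0":
#   lid_events < evtest.log
```

The device path /dev/input/event0 is the one shown in the listing above and may differ on other machines.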

And just to understand everything correctly: Suspend via the menu does work as expected, right? Absolutely no issues there and it's just via the lid?

That's what I thought, but since things seem to be a bit erratic, I double checked to be sure. This time, choosing "Suspend" from the Gnome power menu gave exactly the same issue as when I close the lid (keyboard light is on, screen is black, cannot get it back).

So I removed acpid again and rebooted the machine. Now suspending from the Gnome power menu worked as expected.

I am sorry, but the behaviour of all this is indeed a bit erratic, which probably does not help to isolate the issue.

NP-chaonay commented 1 year ago

This time, choosing "Suspend" from the Gnome power menu gave exactly the same issue as when I close the lid (keyboard light is on, screen is black, cannot get it back).

That happens even when the lid is not closed, right?

So I removed acpid again and rebooted the machine

Could you test the shutdown? Removing acpid should not fix that, though.

And please check whether suspend-on-lid works normally after removing acpid.

mbrennwa commented 1 year ago

This time, choosing "Suspend" from the Gnome power menu gave exactly the same issue as when I close the lid (keyboard light is on, screen is black, cannot get it back).

That happens even when the lid is not closed, right?

Yes. I have no external screen/mouse/keyboard connected, so the lid had to be open when I did the suspend from the Gnome UI.

So I removed acpid again and rebooted the machine

Could you test the shutdown? Removing acpid should not fix that, though.

What's the best way to test if the hardware is really shut down properly? All I did until now was to observe the battery drain over "power off" durations of a few hours.

And please check whether suspend-on-lid works normally after removing acpid.

Closing the lid still caused the same issue as described above even with acpid removed. I have not tested this many times, so there might be some erratic exceptions, though.

NP-chaonay commented 1 year ago

What's the best way to test if the hardware is really shut down properly? All I did until now was to observe the battery drain over "power off" durations of a few hours.

According to what an SP9 user said: "In this case, just hold the power button to force a power off."

This can be interpreted as follows: if the power is not fully off, you cannot boot the machine with a single press of the power button. You have to hold the button for a long time to trigger a hard shutdown first.

But if in your case it boots normally with a single press of the power button, you can leave it powered off for 1 hour and then check the battery percentage.

NP-chaonay commented 1 year ago

as described above

Sorry, but to make sure we have the same understanding, could you expand on this for me, please?

NP-chaonay commented 1 year ago

(regarding the word "erratic" used by mbrennwa)

Maybe testing three times could give less erratic results; this is my recommendation.

And do you have any external USB devices connected?

brennmat commented 1 year ago

I repeated the lid-close / lid-open three times without acpid installed, and it always behaved as described in this post.

I wonder if the issue is related to the lid-close or to the lid-open. If I leave the lid closed for a few seconds, I can see how the keyboard lights are turned on when I open the lid (I think that's how it should be, so I guess all is well until this moment). However, the screen remains dark. This might suggest that the handling of the lid-open event does not work correctly to turn on the screen.

StollD commented 1 year ago

You could try to SSH into the machine, both when the lid is closed (to make sure that it is suspended) and when you opened it again (to check if the kernel is still alive and maybe get a log).

brennmat commented 1 year ago

Ok, I logged in to my SL5 machine using ssh. Once I closed the lid, the login did not respond anymore (expected). After opening the lid, the machine did not come back. I also tried to establish a fresh ssh connection, but the machine did not respond. I am not sure if this means that the machine was completely stuck, or if it simply was not able to bring back the wifi from sleep.

qzed commented 1 year ago

Alright, so if I understand things correctly:

So I'm wondering if we can find some way to reproduce this without the lid. You could also try the debug modes described in https://docs.kernel.org/power/basic-pm-debugging.html#testing-suspend-to-ram-str (since we're interested in s2idle/s2ram just ignore the write to /sys/power/disk and use echo mem > /sys/power/state for s2idle/s2ram instead of hibernation; you can use cat on all of those files to show the available options). And check the dmesg log while doing that.
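
As a rough sketch of what those debug modes look like in practice: the kernel documentation describes a /sys/power/pm_test file (present only on kernels built with CONFIG_PM_DEBUG) that lists test levels such as freezer, devices, platform, processors and core. With a level selected, writing mem to /sys/power/state runs only part of the suspend path and then resumes. The helper names below are made up for illustration, and the loop has to be run as root.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical. /sys/power/pm_test exists
# only on kernels built with CONFIG_PM_DEBUG, and writing to it needs root.

# Print the test levels offered in a pm_test listing given as $1,
# e.g. "[none] core processors platform devices freezer", one per line,
# without the brackets and without "none".
pm_test_levels() {
    printf '%s\n' "$1" | tr -d '[]' | tr ' ' '\n' | grep -v -e '^none$' -e '^$'
}

# Run one partial suspend cycle per level given as argument, then reset.
run_pm_tests() {
    for level in "$@"; do
        echo "== pm_test level: $level =="
        echo "$level" > /sys/power/pm_test
        echo mem > /sys/power/state       # returns once the test cycle ends
        dmesg | tail -n 30
    done
    echo none > /sys/power/pm_test
}

# Example (as root), starting with the least invasive level:
#   pm_test_levels "$(cat /sys/power/pm_test)"
#   run_pm_tests freezer devices platform processors core
```

If one level hangs while the previous one passed, that narrows down which stage of the suspend/resume path is affected.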

mbrennwa commented 1 year ago

Your summary is pretty much spot on.

I am not 100% sure what you want me to do regarding the s2idle/s2ram debugging thing. Anyway, here's what I did.

I ran echo mem > /sys/power/state. The machine went to sleep with no issues. I was able to wake it up by pressing the power button twice. At the first press, the screen would quickly show for a fraction of a second what was on the screen immediately before going to sleep, then goes black again. At the second press, I got the login screen and the machine worked fine. I repeated this 4x, and the behaviour was always the same.

Here's the relevant dmesg output from one of these sleep/wake cycles as described above:

[  465.997887] surface_serial_hub serial0-0: event: unhandled event (rqid: 0x02, tc: 0x02, tid: 0x01, cid: 0x1a, iid: 0x01)
[  466.992578] surface_serial_hub serial0-0: event: unhandled event (rqid: 0x02, tc: 0x02, tid: 0x01, cid: 0x1a, iid: 0x01)
[  477.769537] PM: suspend entry (s2idle)
[  477.779799] Filesystems sync: 0.010 seconds
[  477.781147] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  477.782967] OOM killer disabled.
[  477.782968] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  477.784219] printk: Suspending console(s) (use no_console_suspend to debug)
[  477.809345] wlp0s20f3: deauthenticating from 3c:a6:2f:1c:e1:9a by local choice (Reason: 3=DEAUTH_LEAVING)
[  481.108555] i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
[  481.108560] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[  481.108761] surface_serial_hub serial0-0: rx: parser: invalid start of frame, skipping
[  481.125287] i915 0000:00:02.0: [drm] HuC authenticated
[  481.126384] i915 0000:00:02.0: [drm] GuC submission enabled
[  481.126386] i915 0000:00:02.0: [drm] GuC SLPC enabled
[  481.127166] i915 0000:00:02.0: [drm] GuC RC: enabled
[  481.404031] OOM killer enabled.
[  481.404037] Restarting tasks ... 
[  481.404630] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[  481.405384] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[  481.408976] done.
[  481.409000] random: crng reseeded on system resumption
[  481.409753] PM: suspend exit
[  482.405921] PM: suspend entry (s2idle)
[  482.418920] Filesystems sync: 0.012 seconds
[  482.419345] Freezing user space processes ... (elapsed 0.002 seconds) done.
[  482.422097] OOM killer disabled.
[  482.422099] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  482.423622] printk: Suspending console(s) (use no_console_suspend to debug)
[  483.808902] i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1
[  483.808909] i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9
[  483.825891] i915 0000:00:02.0: [drm] HuC authenticated
[  483.826606] i915 0000:00:02.0: [drm] GuC submission enabled
[  483.826611] i915 0000:00:02.0: [drm] GuC SLPC enabled
[  483.827479] i915 0000:00:02.0: [drm] GuC RC: enabled
[  484.019183] OOM killer enabled.
[  484.019188] Restarting tasks ... 
[  484.019607] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[  484.024316] done.
[  484.024325] random: crng reseeded on system resumption
[  484.024429] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[  484.024885] PM: suspend exit
[  487.160221] wlp0s20f3: authenticate with 3c:a6:2f:1c:e1:9a
[  487.206178] wlp0s20f3: Invalid HE elem, Disable HE
[  487.218062] wlp0s20f3: send auth to 3c:a6:2f:1c:e1:9a (try 1/3)
[  487.245385] wlp0s20f3: authenticated
[  487.274382] wlp0s20f3: associate with 3c:a6:2f:1c:e1:9a (try 1/3)
[  487.276947] wlp0s20f3: RX AssocResp from 3c:a6:2f:1c:e1:9a (capab=0x1511 status=0 aid=4)
[  487.286894] wlp0s20f3: associated
[  487.335617] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready
[  487.345503] wlp0s20f3: Limiting TX power to 20 (23 - 3) dBm as advertised by 3c:a6:2f:1c:e1:9a
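
To make the timing in such a log easier to see, the "PM: suspend entry" and "PM: suspend exit" lines can be paired up. The following is a minimal sketch that assumes dmesg prints its default bracketed timestamps in seconds. The helper name suspend_cycles is made up for illustration.

```shell
#!/bin/sh
# Sketch only: suspend_cycles is a hypothetical helper name.

# Read dmesg output (with "[ seconds]" timestamps) on stdin and print, for
# each "PM: suspend entry" ... "PM: suspend exit" pair, the entry time,
# the exit time, and the elapsed seconds between them.
suspend_cycles() {
    awk '
        {
            # The timestamp is the text between "[" and the first "]".
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*$/, "", ts)
        }
        /PM: suspend entry/ { entry = ts; have = 1 }
        /PM: suspend exit/ && have {
            printf "%s %s %.3f\n", entry, ts, ts - entry
            have = 0
        }
    '
}

# Example:
#   sudo dmesg | suspend_cycles
```

Applied to the excerpt above, this should report two cycles, with the second suspend entry about one second after the first exit. That would be consistent with the machine going back to sleep after the first power button press.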