jakeday / linux-surface

Linux Kernel for Surface Devices
2.59k stars 242 forks source link

Can't wake up from suspend #562

Closed tmarkov closed 5 years ago

tmarkov commented 5 years ago

OK, so for like a day now I have a new issue that I've been dealing with. I suspend the device, then I wake it up, then suspend again, and I can't wake up any more (I think it's been happening on the second time so far, but I can't be sure).

I didn't update kernel, I didn't change configuration, nothing... just the issue suddenly popped up (though well, that's not the first time an issue has suddenly popped up on the surface).

SB1, 5.2.5-pre1 journal.txt

kitakar5525 commented 5 years ago

I occasionally encounter this issue. This issue is caused by IPTS. We need to wait about 1 min for the subsequent log to be written (maybe because of vm.dirty_writeback_centisecs). It always happen after ipts_send_sensor_clear_mem_window_cmd get called.

It occurs not so frequent, but I come to think that it might be too early to comment out ipts lines on system-sleep/sleep script…

See https://github.com/jakeday/linux-surface/issues/544#issuecomment-525279890

tmarkov commented 5 years ago

I do unload ipts modules before suspending:

#!/bin/bash
case $1 in
  pre)
    modprobe -r intel_ipts
    modprobe -r mei_me
    modprobe -r mei
    ;;
  post)
    while [ $(cat /proc/acpi/button/lid/LID0/state | grep closed | wc -l) = 1 ]
    do
      echo $(date) >> /var/log/resuspend
      echo freeze > /sys/power/state
    done
    modprobe -r mwifiex_pcie
    modprobe mei_me
    modprobe intel_ipts
    modprobe mwifiex_pcie
    ;;
esac

As for the rarity... I didn't happen before, but since yesterday I have already had to force-restart three times out of like 6 suspends or so.

kitakar5525 commented 5 years ago

I do unload ipts modules before suspending:

Ah, your log shows that, too. Then, wait about 1 min or longer before forcibly shutting down your device and we will see what will be written to the log.

The interval between two suspend is short. It might be related.

tmarkov commented 5 years ago

It was only short for easy reproduction purposes - but one of the cases was after an overnight suspend.

Waiting longer after I press the button trying to wake up doesn't generate any new logs.

tmarkov commented 5 years ago
/lib/systemd/system-sleep/sleep pre
rtcwake -m freeze -s 10

also doesn't wake up. So it looks like the problem isn't that power button doesn't get detected, but that wakeup itself fails for me.

Since I seem to be the only one affected, I'll try reinstall Debian to see if clean install will fix it.

tmarkov commented 5 years ago

So this is also happening on older kernels (even though I used them with no such problems before). It also survived reinstall. And I've been keeping track now; so far it's been very consistently happening on the second suspend of the boot. Any idea what could be going on?

kitakar5525 commented 5 years ago

You can disable console suspend. Some logs may be printed out to the display on suspend.

echo 0 | sudo tee /sys/module/printk/parameters/console_suspend

Any idea what could be going on?

Usually, the cause of crash on suspend is IPTS or mwifiex modules. Also try distribution's 5.2 kernel (if available on debian)

tmarkov commented 5 years ago

Well, it happens also on mainline Debian 4.19 kernel. It kinda looks like something happened to my device.

Here's the log with console_suspend but I don't see anything:

Aug 30 10:27:36 todor-surface systemd-timesyncd[524]: Synchronized to time server for the first time 195.85.215.8:123 (0.debian.pool
Aug 30 10:28:35 todor-surface polkitd(authority=local)[585]: Registered Authentication Agent for unix-process:1974:18140 (system bus
Aug 30 10:28:35 todor-surface NetworkManager[1735]: <info>  [1567150115.9447] manager: sleep: sleep requested (sleeping: no  enabled
Aug 30 10:28:35 todor-surface NetworkManager[1735]: <info>  [1567150115.9449] manager: NetworkManager state is now ASLEEP
Aug 30 10:28:35 todor-surface polkitd(authority=local)[585]: Unregistered Authentication Agent for unix-process:1974:18140 (system b
Aug 30 10:28:36 todor-surface systemd[1]: Reached target Sleep.
Aug 30 10:28:36 todor-surface systemd[1]: Starting Suspend...
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9044] caught SIGTERM, shutting down normally.
Aug 30 10:28:36 todor-surface systemd[1]: Stopping Network Manager...
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9394] dhcp4 (wlp3s0): canceled DHCP transaction, DHCP client
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9394] dhcp4 (wlp3s0): state changed bound -> done
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9395] device (wlp3s0): DHCPv4: 480 seconds grace period star
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9401] dhcp6 (wlp3s0): canceled DHCP transaction
Aug 30 10:28:36 todor-surface NetworkManager[1735]: <info>  [1567150116.9534] exiting (success)
Aug 30 10:28:36 todor-surface systemd[1]: NetworkManager.service: Succeeded.
Aug 30 10:28:36 todor-surface systemd[1]: Stopped Network Manager.
Aug 30 10:28:36 todor-surface kernel: IPTS ipts_mei_cl_exit() is called
Aug 30 10:28:36 todor-surface kernel: ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: cmd : 0x80000005, status 9
Aug 30 10:28:37 todor-surface kernel: ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: error in reading m2h msg
Aug 30 10:28:37 todor-surface kernel: ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: !! end event loop !!
Aug 30 10:28:37 todor-surface kernel: IPTS removed
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: info: successfully disconnected from cc:bb:fe:83:d4:b4: reason code
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: FW in reset state
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: deleting the crypto keys
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: FW in reset state
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: deleting the crypto keys
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: FW in reset state
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: deleting the crypto keys
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: FW in reset state
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: deleting the crypto keys
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: FW in reset state
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: deleting the crypto keys
Aug 30 10:28:37 todor-surface wpa_supplicant[548]: wlp3s0: CTRL-EVENT-DISCONNECTED bssid=cc:bb:fe:83:d4:b4 reason=0
Aug 30 10:28:37 todor-surface wpa_supplicant[548]: dbus: wpa_dbus_property_changed: no property SessionLength in object /fi/w1/wpa_s
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex...
Aug 30 10:28:37 todor-surface kernel: mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Interface wlp3s0.IPv6 no longer relevant for mDNS.
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Leaving mDNS multicast group on interface wlp3s0.IPv6 with address fe80::b6ae:2bff:
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Interface wlp3s0.IPv4 no longer relevant for mDNS.
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Leaving mDNS multicast group on interface wlp3s0.IPv4 with address 192.168.100.4.
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Withdrawing address record for fe80::b6ae:2bff:fec2:9a7f on wlp3s0.
Aug 30 10:28:37 todor-surface avahi-daemon[541]: Withdrawing address record for 192.168.100.4 on wlp3s0.
Aug 30 10:28:37 todor-surface systemd[1]: Starting Load/Save RF Kill Switch Status...
Aug 30 10:28:37 todor-surface systemd[1]: Started Load/Save RF Kill Switch Status.
Aug 30 10:28:37 todor-surface systemd-sleep[1987]: Suspending system...
*pressing button*
*hard restart*
kitakar5525 commented 5 years ago

Hmm…

Here's the log with console_suspend but I don't see anything:

If you disable console_suspend, you will normally see some output in your display after waking up from suspend. You mean you did not see anything on your display?

Also, try

You can refer to the latest system-sleep/sleep script for 5.2.5-pre1 kernel here

Do you have Windows partition? If so,

tmarkov commented 5 years ago

Well, if anything is printed to the console, it certainly doesn't survive a hard restart. But on those occasions when the device does wake up, there's nothing.

kitakar5525 commented 5 years ago

Well, if anything is printed to the console, it certainly doesn't survive a hard restart.

Did you find any useful information there?

Like this:

pci 0000:03:00.0: Refused to change power state, currently in D3
nvme 0000:02:00.0: Refused to change power state, currently in D3
tmarkov commented 5 years ago

Sorry if I'm missing something obvious, but how do I check what was printed on the screen when I don't manage to wake up and have to do a hard restart?

kitakar5525 commented 5 years ago

I expected you will see a console output like the picture in the issue I opened before if your device has problem on resuming

tmarkov commented 5 years ago

Ah, no, pressing the power button (or any other wakeup method like rtcwake) has no effect whatsoever. Device just stays suspended.

kitakar5525 commented 5 years ago

Ah… I tried disabling console_suspend and I noticed it broke resuming from suspend when IPTS is enabled.

Disable IPTS by adding kernel parameter i915.enable_ipts=0 to your bootloader.

Then, enable extra debug output :

echo 0 | sudo tee /sys/module/printk/parameters/console_suspend
echo 1 | sudo tee /sys/power/pm_debug_messages
echo 1 | sudo tee /sys/module/printk/parameters/ignore_loglevel
echo 1 | sudo tee /sys/kernel/debug/clear_warn_once

and try suspend again.

If you can resume your device stably, your issue might be caused by IPTS. If your issue persists, the cause is not IPTS and I hope some interesting output will appear on your display.

tmarkov commented 5 years ago

No difference; the display stays turned off. All my observation point towards resume never getting initiated, so I'd guess ACPI problem.

kitakar5525 commented 5 years ago

OK. Then, we will see what will happen when firmware is updated.

StealthBadger747 commented 5 years ago

I know for me at least, I have to push the power button after opening the lid and then it takes a few seconds for the display to show.

kitakar5525 commented 5 years ago

To provide additional information,

I have to push the power button after opening the lid

It will be fixed by PR #540 (issue #473). If you're using SB2, try qzed's latest build if you're interested.

it takes a few seconds for the display to show

This behavior was added by commit 1d7fa8a. You can avoid the delay if you feel you don't need it by sudo sysctl kernel.resume_delay=0

tmarkov commented 5 years ago

I updated UEFI (other firmware already newest version somehow?), but the issue seems persistent.

tmarkov commented 5 years ago

I tried again with stock kernel, and it seems there rtcwake can do multiple suspend-resume cycles without issue. So it is somehow related to Jakeday's patches. I can re-confirm even old patched kernels (that used to work great) can't wake up from second suspend.

tmarkov commented 5 years ago

OK, so stock kernel experience:

If I use rtcwake -m freeze -s 5 I can do multiple suspend-resume syscles with no problem. On the other hand, if I use systemctl suspend, it works the first time without a hitch, the second time it takes like 10 seconds to wake up, and then it sleeps forever the third time. After it wakes up the second time, it does spit out some error messages in the journal related to wifi:

Sep 11 22:22:57 surface kernel: mwifiex_pcie 0000:03:00.0: None of the WOWLAN triggers enabled
Sep 11 22:22:57 surface kernel: mwifiex_pcie 0000:03:00.0: hs_activate_wait_q terminated
Sep 11 22:22:57 surface kernel: mwifiex_pcie 0000:03:00.0: cmd: failed to suspend
Sep 11 22:22:57 surface kernel: PM: pci_pm_suspend(): mwifiex_pcie_suspend+0x0/0xd0 [mwifiex_pcie] returns -14  
Sep 11 22:22:57 surface kernel: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x130 returns -14
Sep 11 22:22:57 surface kernel: PM: Device 0000:03:00.0 failed to suspend async: error -14
Sep 11 22:22:57 surface kernel: PM: Some devices failed to suspend, or early wake event detected
Sep 11 22:22:57 surface kernel: nvme nvme0: 4/0/0 default/read/poll queues
Sep 11 22:22:57 surface kernel: i2c_hid i2c-MSHW0030:00: failed to change power setting.
Sep 11 22:22:57 surface kernel: PM: dpm_run_callback(): i2c_hid_resume+0x0/0x100 [i2c_hid] returns -121    
Sep 11 22:22:57 surface kernel: PM: Device i2c-MSHW0030:00 failed to resume async: error -121
Sep 11 22:22:57 surface kernel: i2c_hid i2c-MSHW0030:00: failed to retrieve report from device.
Sep 11 22:22:57 surface kernel: i2c_hid i2c-MSHW0030:00: failed to set a report to device.
Sep 11 22:22:57 surface kernel: i2c_hid i2c-MSHW0030:00: failed to set a report to device.
Sep 11 22:22:57 surface kernel: i2c_hid i2c-MSHW0030:00: failed to retrieve report from device.
========= later, on resume ===========
Sep 11 22:23:09 surface kernel: mwifiex_pcie 0000:03:00.0: None of the WOWLAN triggers enabled
Sep 11 22:23:09 surface kernel: i2c_hid i2c-MSHW0030:00: failed to set a report to device.
Sep 11 22:23:09 surface kernel: i2c_hid i2c-MSHW0030:00: failed to set a report to device.
Sep 11 22:23:09 surface kernel: i2c_hid i2c-MSHW0030:00: failed to retrieve report from device.
Sep 11 22:23:09 surface kernel: i2c_hid i2c-MSHW0030:00: failed to change power setting.
Sep 11 22:23:09 surface kernel: mwifiex_pcie 0000:03:00.0: hs_activate_wait_q terminated
Sep 11 22:23:09 surface kernel: mwifiex_pcie 0000:03:00.0: cmd: failed to suspend
Sep 11 22:23:09 surface kernel: PM: pci_pm_suspend(): mwifiex_pcie_suspend+0x0/0xd0 [mwifiex_pcie] returns -14
Sep 11 22:23:09 surface kernel: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x130 returns -14
Sep 11 22:23:09 surface kernel: PM: Device 0000:03:00.0 failed to suspend async: error -14
Sep 11 22:23:09 surface kernel: PM: Some devices failed to suspend, or early wake event detected

Of course, this is stock kernel with unpatched wifi, so these may be completely different, unrelated errors.

It's also strange this doesn't happen when I use rtcwake (or echo freeze > /sys/power/state).

StollD commented 5 years ago

It's also strange this doesn't happen when I use rtcwake (or echo freeze > /sys/power/state).

Both commands bypass the session / login manager (systemd-logind), and suspend the kernel directly. systemd does some other stuff like locking your session before suspending the hardware.

systemd is also what executes the sleep script in /lib/systemd/system-sleep/, so that might be the cause of the issue.

tmarkov commented 5 years ago

Yes, but I did make the /lib/systemd/system-sleep/sleep script non-executable for this testing.

tmarkov commented 5 years ago

It looks like this https://answers.microsoft.com/en-us/surface/forum/surfpro4-surfupdate/surface-pro-4-stuck-in-uefi-loop-cant-change/803bed4e-1dd2-4622-ba54-155d51f8b8f9 has so far resolved the issue. Interesting that updating UEFI didn't - so I'm concerned if it'll stay resolved long-term...