system76 / firmware-open

System76 Open Firmware

tgl-u: S0ix causes EC lockup #424

Closed crawfxrd closed 1 year ago

crawfxrd commented 1 year ago

Since the S0ix changes in the EC, TGL-U units will lock up when trying to suspend.

Steps to reproduce

Expected behavior

Actual behavior

Additional info

Only output from EC is:

VWCTRL1 80
VWIDX47 11
VWCTRL1 80
VWIDX47 10
VWCTRL1 80
VWIDX47 11
VWCTRL1 80
VWIDX47 10
VWCTRL1 80
VWIDX47 11

(It spews this if the display turns off without attempting suspend.)

crawfxrd commented 1 year ago
[  384.033665] PM: suspend entry (s2idle)
[  384.043761] Filesystems sync: 0.010 seconds
[  384.045885] Freezing user space processes
[  384.060710] Freezing user space processes completed (elapsed 0.014 seconds)
[  384.060717] OOM killer disabled.
[  384.060719] Freezing remaining freezable tasks
[  384.061969] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  384.061977] printk: Suspending console(s) (use no_console_suspend to debug)
[  384.303111] ACPI: EC: interrupt blocked
[  395.559605] ACPI: EC: interrupt unblocked
[  398.347532] ACPI Error: AE_TIME, Returned by Handler for [EmbeddedControl] (20221020/evregion-300)
[  398.347537] ACPI Error: Timeout from EC hardware or EC device driver (20221020/evregion-310)

[  398.347548] No Local Variables are initialized for Method [_PSW]

[  398.347549] Initialized Arguments for Method [_PSW]:  (1 arguments defined for method invocation)
[  398.347550]   Arg0:   000000009c179d8f <Obj>           Integer 0000000000000000

[  398.347557] ACPI Error: Aborting method \_SB.LID0._PSW due to previous error (AE_TIME) (20221020/psparse-529)

etc.
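
For anyone comparing captures, the interval between the two EC interrupt messages in the log above can be pulled out with a short script. This is only a minimal sketch in Python, assuming dmesg lines carry the `[ seconds ]` prefix shown above; `ec_blocked_seconds` is a hypothetical helper name, not part of any existing tool.

```python
import re

# Matches the "[  384.303111] message" prefix used by dmesg.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def ec_blocked_seconds(dmesg_text):
    """Return seconds between 'EC: interrupt blocked' and the next
    'EC: interrupt unblocked', or None if the pair is not found."""
    blocked = None
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unprefixed lines
        stamp, msg = float(m.group(1)), m.group(2)
        if msg == "ACPI: EC: interrupt blocked":
            blocked = stamp
        elif msg == "ACPI: EC: interrupt unblocked" and blocked is not None:
            return stamp - blocked
    return None


# Lines copied from the log above.
sample = """\
[  384.303111] ACPI: EC: interrupt blocked
[  395.559605] ACPI: EC: interrupt unblocked
[  398.347532] ACPI Error: AE_TIME, Returned by Handler for [EmbeddedControl] (20221020/evregion-300)
"""

print(ec_blocked_seconds(sample))
```

This only reports the difference between the kernel timestamps; it says nothing about what the EC was doing in between.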

justanotherrandomuser386 commented 1 year ago

Got the same issue after flashing a recent firmware update, but the problem is much more severe than a suspend failure. The actual bug is that after a failed suspend attempt (think of closing the lid on a running laptop, a pretty common scenario with mobile computers) the laptop becomes unusable and unresponsive to the power button (long/short/eternal press), so you can't turn it on or off until you disassemble it and unplug the battery. After that the laptop works as usual. At least this time you do not need an external programmer to bring the machine back to life, yay!

curiousercreative commented 1 year ago

Also affected after flashing updated firmware:

jacobgkau commented 1 year ago

To be clear, this issue was discovered while performing regression testing prior to releasing an official firmware update for lemp10, and a version with the issue has not been released. The latest version of firmware released for lemp10 is 2022-11-21_b337ac6. This would not be affecting any "recent firmware update," only users who have manually flashed newer firmware from this repo.

curiousercreative commented 1 year ago

@jacobgkau yes and to be more precise, I flashed firmware myself. I haven't got back into the system yet to say for certain, but I believe I flashed 2022-11-21_b337ac6 with a modified coreboot (rebased to system76 branch) and modified EC (rebased to master).

curiousercreative commented 1 year ago

You probably already figured out as much, but it seemed to be EC that was responsible for the failure. Once I rebased EC from 2022-11-21_b337ac6, all is good. Thanks for the help on the chat server, disconnecting the battery works well to recover from this failure.

justanotherrandomuser386 commented 1 year ago

Yes, I just flashed it myself, willingly taking all the risks, so I'm not blaming anyone on the System76 team for lack of QA. The main point was that 'does not suspend' is not the same as 'cannot power off the machine by holding the power button'. Anyway, is there some kind of watchdog available in the EC to reset the controller in case of unresponsive firmware?

crawfxrd commented 1 year ago

Enable PWRSW WDT: https://github.com/system76/ec/pull/315

crawfxrd commented 1 year ago
VWIDX47 11

That is HOST_C10, now reported as of CB:73689. Are we supposed to ack it or something? I also see that it gets reset at PLTRST# and not ESPI_RESET#. Is that significant?

[17:17:40.788] VWCTRL1 80
[17:17:40.792] VWIDX47 11
[17:17:40.792] VW_HOST_C10 = 10
[17:17:40.796] VWCTRL1 80
[17:17:40.796] VWIDX47 10
[17:17:40.796] VW_HOST_C10 = 10
[17:17:40.801] VWCTRL1 80
[17:17:40.805] VWIDX47 11
[17:17:40.805] VW_HOST_C10 = 10
[17:17:40.805] VWCTRL1 80
[17:17:40.808] VWIDX47 10
[17:17:40.808] VW_HOST_C10 = 10
[17:17:40.813] VWCTRL1 80
[17:17:40.813] VWIDX47 10
[17:17:40.813] VW_HOST_C10 = 10
[17:17:40.887] VWCTRL1 80
[17:17:40.887] VWIDX47 11
[17:17:40.887] VW_HOST_C10 = 10
[17:17:40.890] VWCTRL1 80
[17:17:40.890] VWIDX47 10
[17:17:40.895] VW_HOST_C10 = 10

Opportunistic suspend goes crazy, constantly going in and out of C10. J_SSD2 (PCH port) is broken after C10.
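
The flapping in the capture above can be counted from an EC console log. A minimal sketch in Python, assuming the values are printed in hex, that the byte follows the usual eSPI virtual wire layout (upper nibble = valid bits, lower nibble = levels), and that HOST_C10 is bit 0 of index 47h. `host_c10_levels` and `count_c10_toggles` are hypothetical helper names, not part of the EC tooling.

```python
import re

# Matches "VWIDX47 11", with or without a "[hh:mm:ss.mmm]" prefix.
VW_RE = re.compile(r"VWIDX47\s+([0-9A-Fa-f]{2})\s*$")


def host_c10_levels(log_text):
    """Return the HOST_C10 level (0 or 1) for each valid VWIDX47 line."""
    levels = []
    for line in log_text.splitlines():
        m = VW_RE.search(line)
        if not m:
            continue  # VWCTRL1 and other lines are ignored
        value = int(m.group(1), 16)
        valid = (value >> 4) & 0x1  # assumed: bit 4 marks wire 0 as valid
        level = value & 0x1         # assumed: bit 0 is the HOST_C10 level
        if valid:
            levels.append(level)
    return levels


def count_c10_toggles(log_text):
    """Count how many times the HOST_C10 level changes between samples."""
    levels = host_c10_levels(log_text)
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)


# VWIDX47 lines copied from the capture above.
sample = """\
[17:17:40.792] VWIDX47 11
[17:17:40.796] VWIDX47 10
[17:17:40.805] VWIDX47 11
[17:17:40.808] VWIDX47 10
[17:17:40.813] VWIDX47 10
[17:17:40.887] VWIDX47 11
[17:17:40.890] VWIDX47 10
"""

print(host_c10_levels(sample))
print(count_c10_toggles(sample))
```

On the lines above this counts five level changes in roughly 100 ms.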

crawfxrd commented 1 year ago

GetTemp() is being called while the CPU is in C10.