system76 / firmware-open

System76 Open Firmware

lemp10 4.19 regression: TCSS/D3-related full-system hang after 20s with usb-c monitor #394

Open mvysin opened 1 year ago

mvysin commented 1 year ago

I have a lemp10 and since the 2022-11-21_b337ac6 release of firmware-open I've been experiencing an intermittent (but recently extremely reproducible) full-system freeze after about 20s of uptime. I filed support ticket 117801 but now that I've bisected the issue I figure this is a better place to follow up.

In short: I'm running a lemp10 with ubuntu 22.10, kernel version 6.0.12-76060006-generic, and two daisy-chained monitors connected to the laptop via usb-c. When I applied the b337ac6 release I immediately started observing an intermittent hard system freeze after about 20s of uptime, either from cold-boot or resume, when the monitors were connected. About a week ago, seemingly unprompted, it became extremely reproducible when both monitors were attached; as I started to try more things, it started to reproduce with /any/ monitor connected.

This seemed to be a very low-level freeze: sysrq keys stopped responding, the LEDs on my external keyboard did not respond, and nothing showed up in dmesg.

I bisected this in firmware-open first to commit https://github.com/system76/firmware-open/commit/6ff4ccfbcb665165b3230562a2b4bbca712641e2 and from there, noted it was:

Now, interestingly, the pop-os live usb does not exhibit the freeze even with the broken firmware-open version (and even same kernel version and seemingly-equivalent modules). I have no idea why, though given the commit in question I'd guess it does something different with power states. I'm happy to (slowly) debug further if there's anything of interest here, or if someone can suggest some areas to poke at that affect D3 sleep states.
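One way to look for a power-state difference between the installed system and the live USB is to dump the runtime power management state of every PCI device on each and compare the two listings. The sketch below is only a starting point: it reads the standard sysfs attributes `power/control` and `power/runtime_status`, and it is an assumption (not something established in this thread) that any difference it shows is related to the hang. The function name `pci_runtime_pm` is made up for illustration.

```python
from pathlib import Path


def read_attr(path: Path) -> str:
    """Return the stripped contents of a sysfs attribute, or 'n/a' if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"


def pci_runtime_pm(root: Path = Path("/sys/bus/pci/devices")) -> dict[str, tuple[str, str]]:
    """Map each PCI address to its (power/control, power/runtime_status) values.

    Hypothetical helper for illustration; 'root' is a parameter so the
    function can also be pointed at a copy of the sysfs tree.
    """
    result = {}
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        power = dev / "power"
        result[dev.name] = (
            read_attr(power / "control"),         # "auto" or "on"
            read_attr(power / "runtime_status"),  # e.g. "active", "suspended"
        )
    return result


if __name__ == "__main__":
    for addr, (control, status) in pci_runtime_pm().items():
        print(f"{addr}  control={control:<4}  runtime_status={status}")
```

Running this once on the installed system and once on the live USB, with the same monitor attached, and diffing the output would show whether any device is allowed to runtime-suspend on one but not the other.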

crawfxrd commented 1 year ago

Can you reproduce this with only a single external display?

I am using 2023-02-14_85f8a8b on lemp10. I'm not able to reproduce a system lockup with a single external display connected using DP-over-USB. I do see an error in journalctl after suspending:

i915 0000:00:02.0: drm_WARN_ON(dig_port->tc_mode != TC_PORT_DISCONNECTED)
WARNING: CPU: 7 PID: 3803 at drivers/gpu/drm/i915/display/intel_tc.c:711 intel_tc_port_sanitize+0x2f2/0x550 [i915]

system76/coreboot@c7998fda319b was dropped because I was told by an Intel developer it was 100% working:

This feature is used for PnP at runtime and had been fully validated for S0ix/rtd3.

mvysin commented 1 year ago

The trigger is the part I don't fully understand: I used to be able to reproduce this only with the chained monitors, and unplugging the second monitor avoided the freeze. That held steady for a few days.

But then (again, oddly, after making no configuration changes I can think of, or even running system updates) it started occasionally freezing even with a single monitor, and that behavior has persisted for a few days now. Sometimes it would even get into a weird state where I have to power cycle the single monitor before the system will recognize the monitor at all. So I also wouldn't be surprised if the monitor is contributing to this mess.

I have also been seeing that drm_WARN_ON warning for a few kernel versions now, but it happened before the firmware update as well.

Later today I can try against 85f8a8b as well to rule out any post-release change that may have fixed this some other way.

mvysin commented 1 year ago

I finally tried against 85f8a8b. With coreboot unmodified, I consistently reproduce the hang with either one monitor or two.

With coreboot modified by cherry-picking the c7998fd change, I get no hang.

I can't really think of anything special I've done regarding power management. I have what seem to be the Ubuntu-standard upowerd and thermald running, as well as system76-power. I haven't tweaked any cpupower settings that I'm aware of, system76-power says my profile is Balanced, and I have no exotic kernel command line options. This hang also happens from a cold start, so there's no recent sleep state involved.

I was digging around with s0ix and c10 states in the past, for the previous firmware release before https://github.com/system76/ec/commit/cc3effb6a451e43ce69e0f9133e76476e7aa2c37, so I'm not 100% sure there's no weird setting leftover from that, but I sure can't find anything.

What I can try with more time is an ubuntu live usb or a fresh install in another partition to rule out any non-default stuff.
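As a cheaper first check before a fresh install, the kernel command lines of the installed system and the live USB can be compared directly. This is a minimal sketch: it assumes the contents of `/proc/cmdline` were saved from each environment, and `diff_cmdline` is a hypothetical helper name. It splits on whitespace, so a quoted parameter containing spaces would be split incorrectly. The sample strings are illustrative, not taken from the affected machine.

```python
def diff_cmdline(installed: str, live: str) -> tuple[list[str], list[str]]:
    """Return (parameters only on the installed system, parameters only on the live system).

    Both arguments are the raw text of /proc/cmdline from each environment.
    """
    a = set(installed.split())
    b = set(live.split())
    return sorted(a - b), sorted(b - a)


# Illustrative input only; real values come from `cat /proc/cmdline` on each system.
installed = "BOOT_IMAGE=/vmlinuz quiet splash mem_sleep_default=deep"
live = "BOOT_IMAGE=/casper/vmlinuz quiet splash"

only_installed, only_live = diff_cmdline(installed, live)
print("only on installed system:", only_installed)
print("only on live USB:", only_live)
```

The `BOOT_IMAGE` entries are expected to differ. Anything else that appears on only one side is a candidate leftover setting worth removing and retesting.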