thesofproject / linux

Linux kernel source tree
Other
90 stars 129 forks source link

Lenovo Thinkpad X1 Carbon Gen1- DSP boot failure #5072

Closed plbossart closed 2 months ago

plbossart commented 3 months ago

Initially filed here: https://bugzilla.kernel.org/show_bug.cgi?id=218961

We have a Lenovo Thinkpad X1 Carbon Gen 10 in our lab and ever since 6.10.0-rc1 it has failed its first S2idle suspend, and has stopped getting S0iX on subsequent successful S2idle suspends. The issue in the first suspend fail is here in the audio driver (dmesg section shown):

sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA did not complete
sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware download failed
sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_READY_OK (6)
sof-audio-pci-intel-tgl 0000:00:1f.3: 0x00000005: module: ROM, state: FW_ENTERED, running
sof-audio-pci-intel-tgl 0000:00:1f.3: extended rom status:  0x5 0x0 0x4000 0x0 0x0 0x0 0x2560521 0x0
sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------
sof-audio-pci-intel-tgl 0000:00:1f.3: Failed to start DSP
sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read
sof-audio-pci-intel-tgl 0000:00:1f.3: error: dsp core reset failed: core_mask 1
sof-audio-pci-intel-tgl 0000:00:1f.3: failed to power down DSP during suspend
sof-audio-pci-intel-tgl 0000:00:1f.3: error: suspending dsp
sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to power down DSP during suspend -110
sof-audio-pci-intel-tgl 0000:00:1f.3: PM: pci_pm_suspend(): snd_sof_suspend [snd_sof] returns -110
sof-audio-pci-intel-tgl 0000:00:1f.3: PM: dpm_run_callback(): pci_pm_suspend returns -110
sof-audio-pci-intel-tgl 0000:00:1f.3: PM: failed to suspend async: error -110

This is the commit:

commit d5263dbbd8af026159b16a08a94bedfe51b5f67b Author: Pierre-Louis Bossart [pierre-louis.bossart@linux.intel.com](mailto:pierre-louis.bossart@linux.intel.com) Date: Thu Apr 4 13:54:47 2024 -0500

ASoC: SOF: Intel: don't ignore IOC interrupts for non-audio transfers

Reverting the commit thusly fixes things in everything up to 6.10.0-rc3.

%> git diff 6cbf086143cf9674c7f029e1cf435c65a537066a d5263dbbd8af026159b16a08a94bedfe51b5f67b > ../revert.patch %> cat ../revert.patch | patch -1 -R

I've attached the sleepgraph timeline of the fail. The dmesg log is accesible by clicking the "dmesg" button in the upper right hand corner. They "log" button shows all the system details.

tebrandt commented 3 months ago

also-info-thinkpad-x1.txt

plbossart commented 3 months ago

Thanks @tebrandt, the only thing I can think of is that maybe the timeout is too small Can you try updating the value to this:

#define HDA_CL_DMA_IOC_TIMEOUT_MS 3000
tebrandt commented 3 months ago

otcpl-thinkpad-x1_freeze.html.gz The delay didn't seem to help, it just failed after 3 seconds instead of half a second:

[ 411.863179] PM: suspend entry (s2idle) [ 411.864836] Filesystems sync: 0.001 seconds [ 411.866732] Freezing user space processes [ 411.868542] Freezing user space processes completed (elapsed 0.001 seconds) [ 411.868546] OOM killer disabled. [ 411.868550] Freezing remaining freezable tasks [ 411.869801] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 411.869808] printk: Suspending console(s) (use no_console_suspend to debug) [ 415.131118] sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA did not complete [ 415.131125] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------ [ 415.131126] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware download failed [ 415.131127] sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_READY_OK (6) [ 415.131138] sof-audio-pci-intel-tgl 0000:00:1f.3: 0x00000005: module: ROM, state: FW_ENTERED, running [ 415.131151] sof-audio-pci-intel-tgl 0000:00:1f.3: extended rom status: 0x5 0x0 0x4000 0x0 0x0 0x0 0x2560521 0x0 [ 415.131152] sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------ [ 415.131168] sof-audio-pci-intel-tgl 0000:00:1f.3: Failed to start DSP [ 415.131169] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110 [ 415.181713] sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read [ 415.181721] sof-audio-pci-intel-tgl 0000:00:1f.3: error: dsp core reset failed: core_mask 1 [ 415.181728] sof-audio-pci-intel-tgl 0000:00:1f.3: failed to power down DSP during suspend [ 415.181733] sof-audio-pci-intel-tgl 0000:00:1f.3: error: suspending dsp [ 415.181737] sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to power down DSP during suspend -110 [ 415.181746] sof-audio-pci-intel-tgl 0000:00:1f.3: PM: pci_pm_suspend(): snd_sof_suspend [snd_sof] returns -110 [ 415.181880] sof-audio-pci-intel-tgl 0000:00:1f.3: PM: dpm_run_callback(): pci_pm_suspend returns -110 [ 415.181909] sof-audio-pci-intel-tgl 0000:00:1f.3: PM: failed to suspend async: error -110 [ 415.182256] PM: suspend of devices aborted after 3272.710 msecs [ 415.182264] PM: start suspend of devices aborted after 3312.519 msecs [ 415.182266] PM: Some devices failed to suspend, or early wake event detected [ 415.274403] nvme nvme0: 16/0/0 default/read/poll queues [ 415.279227] nvme nvme0: Ignoring bogus Namespace Identifiers [ 416.029144] PM: resume of devices complete after 846.899 msecs [ 416.031206] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915]) [ 416.031307] OOM killer enabled. [ 416.031308] Restarting tasks ... done. [ 416.032707] random: crng reseeded on system resumption [ 416.074436] PM: suspend exit

plbossart commented 3 months ago

Oh, I completely missed that this happens during a suspend operation, now it's starting to make sense.

The PCI subsystem will pm_runtime resume the device, and that means booting firmware. If the failure happens only at that moment, then it's not a classic authentication failure, but more that something was turned off before the DSP boots.

It's possible that we stop dealing with IRQs at some point but still rely on IRQs for the code download.

I'll have to think more about this, IIRC we also added a synchronize_irq() at some point.

This will have to wait next week though, busy with other things atm.

tebrandt commented 3 months ago

I'll see if there's a firmware update for this system. I guess it's possible that there's a bug in firmware that was activated by this patch. I'll try to at least rule it out.

plbossart commented 3 months ago

It's also possible that relying on wait_for_completion() while suspending is not allowed or has side effects.

I am not sure why this shows up only now though, the code is like 6 month old. This suspend issue would have been detected by our CI - unless we have a test coverage of course...

plbossart commented 3 months ago

Well I just checked that our CI does not test s2idle :-(

2024-06-25 18:18:37 UTC [REMOTE_INFO] Current suspend/resume type mode: s2idle [deep]
plbossart commented 3 months ago

maybe test with this patch if there's a race between the wait for an interrupt and the suspend part where we disable the interrupts

logs.diff.txt

tebrandt commented 3 months ago

otcpl-lenovo-tix1-tgl_freeze.html.gz alsa-info-lenovo-tix1-tgl.txt

It seems that this issue is happening on another Lenovo machine as well, the Lenovo ThinkPad X1 Titanium Gen 1 (otcpl-lenovo-tix1-tgl). I've just confirmed that same regression commit is also the cause on this machine. Attached is a timeline with the dmesg log file, and the also-info. The first S2idle always fails.

tebrandt commented 3 months ago

otcpl-hp-spectre-tgl_freeze.html.gz alsa-info-hp-spectre-tgl.txt

I just ran a full scan of all our data and it turns out there are actually 3 machines (out of 50) affected by this regression. The list now includes the Hewlett Packard Spectre x360 Convertible 14-ea0xxx. Attached is a timeline with the included dmesg log file (dmesg button in upper right corner), and the alsa-info. The first S2idle always fails. Given that this affects 3 high end production machines, I think it should be reverted if possible, or at the very least removed from the stable kernel.

NOTE: the dmesg logs for each machine show slightly different info, they all culminate in a failure to suspend, but there are different levels of debug dmesg data. The differences might help to triangulate what's going wrong.

tebrandt commented 3 months ago

Well I just checked that our CI does not test s2idle :-(

2024-06-25 18:18:37 UTC [REMOTE_INFO] Current suspend/resume type mode: s2idle [deep]

Yea it's an easy add. You can use sleepgraph if you want to create a very simple S2idle test. It just echos freeze to /sys/power/state and verifies that it reaches S2idle. First install it on the test system:

$> git clone http://github.com/intel/pm-graph.git
$> cd pm-graph
$> sudo make install
$> "man sleepgraph" OR "sleepgraph -h" for help text

This simple script can check for pass or fail:

#!/bin/sh

cd /tmp
sudo rm -f sleepgraph.txt result.txt
sudo sleepgraph -m freeze -rtcwake 15 -result result.txt > sleepgraph.txt
CHECK=`cat /tmp/result.txt | grep "pass"`
if [ -n "$CHECK" ]; then
    echo "PASS"
else
    echo "FAIL"
fi
tebrandt commented 3 months ago

maybe test with this patch if there's a race between the wait for an interrupt and the suspend part where we disable the interrupts

logs.diff.txt

otcpl-hp-spectre-tgl_freeze.html.gz otcpl-lenovo-tix1-tgl_freeze.html.gz otcpl-thinkpad-x1_freeze.html.gz

I tested with the patch you provided on top of 6.10-rc5 and there didn't appear to be any change. I've included the sleepgraph timelines and each have the dmesg logs in the upper right hand corner "dmesg" button.

plbossart commented 3 months ago

the patch was only intended to give us logs @tebrandt, can you apply the patch and attached the updated dmesg logs using the S2idle transitions?

krinkinmu commented 3 months ago

Hi, I can reporoduce the issue on my laptop and bisected it to d5263dbbd8af02 ("ASoC: SOF: Intel: don't ignore IOC interrupts for non-audio transfers") as well.

I applied the patch from https://github.com/thesofproject/linux/issues/5072#issuecomment-2194218392 and after restart got the issue in the logs right away. See the attached dmesg logs in case it might help (keep in mind it's the logs right after a system reboot with the new kernel, I didn't use sleepgraph - I will play with it a bit more). dmesg.log

plbossart commented 3 months ago

Thank you @krinkinmu for the logs, there's indeed something fundamentally broken with interrupts disabled before attempting to boot firmware:

[   13.336618] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: before disabling interrupts
[   13.336655] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: before synchronize_irq
[   13.336659] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: synchronize_irq done
[   14.949513] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_cl_copy_fw: plb Code loader DMA starting
[   14.949533] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_cl_copy_fw: plb wait for IOC started
krinkinmu commented 3 months ago

I've also run sleepgraph, I can see in dmesg all the same errors as before, but also a "Unbalanced IRQ 196 wake disable" that seem to have something sof related in the stack trace.

Attaching the details below in case they might be useful as well.

dmesg.log kmu-ThinkPad-X1-Carbon-Gen-9_freeze_ftrace.txt kmu-ThinkPad-X1-Carbon-Gen-9_freeze_dmesg.txt result.txt sleepgraph.txt

plbossart commented 3 months ago

Thanks for all the results @krinkinmu this is puzzling. It's as if there's some sort of race between pm_runtime resume and system suspend. For PCI devices, the latter triggers the former, but from the logs the firmware download starts 1.6s after the system suspend started.

Can you try to apply this modified patch (remove the previous one) to see more information of which sequence happens when?

logs2.diff.txt

krinkinmu commented 3 months ago

Here are the logs after restart with the applied patch dmesg.log

plbossart commented 3 months ago

ok, so this is a pm_runtime resume issue

[   18.872679] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: start runtime_suspend 1
[   18.872687] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: before disabling interrupts
[   18.872709] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: before synchronize_irq
[   18.872713] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: synchronize_irq done
[   18.873434] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_suspend: plb: done runtime_suspend 1

all good

[   40.087398] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_resume: plb: start runtime_resume 1
[   40.097289] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_resume: plb: done runtime_resume 1

this is supposed to restart everything, but it only takes 10ms

[   40.171969] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_cl_copy_fw: plb Code loader DMA starting
[   40.172003] sof-audio-pci-intel-tgl 0000:00:1f.3: hda_cl_copy_fw: plb wait for IOC started
[   40.687471] sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA did not complete

and then this happens after the resume
krinkinmu commented 3 months ago

Also, in the interest of being specific, here is what I did to get those logs:

  1. build the kernel, install and reboot
  2. after reboot I triger a browser to play something

From what I can tell the hda_suspend is called right after restart and when I tried to play something hda_resume was called and system attempted to load firmware.

plbossart commented 3 months ago

@krinkinmu can you copy this file sof-dyndbg.conf.txt as /etc/modprobe.d/sof-dyndbg.conf, reboot and attach the logs. It's going to be quite verbose so you might need to allocate a larger console buffer.

krinkinmu commented 3 months ago

Sure, here you go dmesg.log

plbossart commented 3 months ago

@tebrandt @krinkinmu from the logs, it appears you are running the SOF firmware 2.0 version.

[    5.023504] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:0:0-b678a
[    6.652664] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:0:0-b678a

When I test on an HDaudio device, I see this log

[   31.480412] snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore supported, booting from IMR directly

which is missing in your case. The IMR boot is the recommended version, and it's possible that the failure to exit D3 with the firmware download is due to the use of the 'wrong' method.

Can you try and update your firmware? You should be using version 2.2.x

krinkinmu commented 3 months ago

Thank you, will try in a couple of hours - away from the laptop with the issue at the moment.

On Tue 2 Jul 2024, 15:36 Pierre-Louis Bossart, @.***> wrote:

@tebrandt https://github.com/tebrandt @krinkinmu https://github.com/krinkinmu from the logs, it appears you are running the SOF firmware 2.0 version.

[ 5.023504] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:0:0-b678a

[ 6.652664] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:0:0-b678a

When I test on an HDaudio device, I see this log

[ 31.480412] snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-tgl 0000:00:1f.3: IMR restore supported, booting from IMR directly

which is missing in your case. The IMR boot is the recommended version, and it's possible that the failure to exit D3 with the firmware download is due to the use of the 'wrong' method.

Can you try and update your firmware? You should be using version 2.2.x

— Reply to this email directly, view it on GitHub https://github.com/thesofproject/linux/issues/5072#issuecomment-2203390185, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI7Q47K7TN2KWCJNFBPOH3ZKK3GNAVCNFSM6AAAAABJRL3FEOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBTGM4TAMJYGU . You are receiving this because you were mentioned.Message ID: @.***>

krinkinmu commented 3 months ago

@plbossart yeah indeed, I took firmware from the sofproject respository (I use Ubuntu LTS which does not have anything fresher than 2.0 in its repositories) and things started working. dmesg.log Just in case, attaching dmesg logs again from the run with the updated firmware.

plbossart commented 3 months ago

@tebrandt can you try to apply the patch in https://github.com/thesofproject/linux/pull/5089, it should essentially revert the problematic commit and keep you in business with unmodified Ubuntu LTS w/ an updated kernel. Thanks!

tebrandt commented 3 months ago

the patch was only intended to give us logs @tebrandt, can you apply the patch and attached the updated dmesg logs using the S2idle transitions?

the dmesg logs are in the html, view them by clicking the "dmesg" button in the upper right hand corner, the "log" button shows all kinds of system details during the test.

tebrandt commented 3 months ago

Reverting a kernel patch in testing that isn't ever going to be reverted upstream is out of the question. So a firmware update is the only option. First I tried downloading the newer versions of the firmware-sof-signed ubuntu deb packages that include v2.2.6 here: https://packages.ubuntu.com/search?keywords=firmware-sof-signed I tried every one for 23.10 and 24.04 and none could install the firmware in a way that ubuntu 22.04 could actually find. They seem to install a different firmware file extension, so I'm guessing those packages are currently buggy.

So I switched gears to just installing straight from git over my existing firmware-sof-signed package with this script:

#!/bin/sh

cd /tmp
sudo mv /lib/firmware/intel/sof /tmp/
sudo mv /lib/firmware/intel/sof-tplg /tmp/
git clone https://github.com/thesofproject/sof-bin
cd sof-bin
sudo ./install.sh v2.2.x/v2.2-rc1

This worked. Once I did this, all three machines found the new firmware and S2idle began to work again just fine, S0iX also came back on subsequent S2idle suspends. So installing new firmware from the git repo is one working solution. I will put this info into the bugzilla so that others who run into this issue can have a fix.

Here's my concern. By not reverting this patch (or finding some way to add legacy support to old firmware) you're exposing a firmware bug that will cause S2idle to be unavailable on any machines with this hardware running kernel 6.10 or higher on any pre-2024 Linux OS release. This reflects poorly on Intel because both the kernel commit and firmware are Intel owned. I'd appreciate if you'd at least consider adding code that could detect the firmware version and provide some kind of legacy workaround to prevent losing S2idle. I'd be happy to test.

plbossart commented 3 months ago

that's precisely what I provided in PR 5089, when legacy firmware is used (that means IPC3) the wait_for_completion is NOT used.

krinkinmu commented 3 months ago

@plbossart and @tebrandt just FYI, I tested the changes in #5089 with old firmware and it works, if you want feel free to add tested-by to the change.

Attaching the dmesg logs from the successful run as well just in case. dmesg.log

plbossart commented 3 months ago

Tested-by: Mike Krinkin <krinkin.m.u@gmail.com> ?

krinkinmu commented 3 months ago

@plbossart yes, that's the correct tag.

plbossart commented 3 months ago

@plbossart yes, that's the correct tag.

Thanks a lot for the tests and quick answers, much appreciated.

tebrandt commented 2 months ago

that's precisely what I provided in PR 5089, when legacy firmware is used (that means IPC3) the wait_for_completion is NOT used.

My apologies, I misread what you said. I thought it was just a straight revert but in fact you have a patch that only reverts for versions of the firmware that fail the new functionality.

I just added the patch you provided to 6.10.0-rc5, rolled the firmware version back to v2.0, and after testing it successfully fixed the issue on both the HP Spectre and Lenovo Thinkpad. Both the first S2idle succeeds and S0iX is no longer impaired. I will add this patch to the bugzilla issue so that anyone who bumps into this can use it. Thanks for the fix!

Tested-by: Todd Brandt todd.e.brandt@linux.intel.com

knurd commented 2 months ago

Out of interest: when will this regression fix head upstream? It's not in -next yet afaics.

Side note: ideally it now would also get a "CC: stable…" tag, as it's a regression in 6.10 which is out now.

tebrandt commented 2 months ago

The fix appears to be available in upstream 6.11.0-rc1. I've rolled back the sof firmware on these machines to v2.0 to test it and verified that the issue no longer occurs. I've marked the bugzilla issue as closed. Thanks for the fix!