thesofproject / linux


[ARL-S] Audio firmware download failure after S3/S4 #5135

Open syedk008 opened 1 month ago

syedk008 commented 1 month ago

Describe the bug
Audio resume failed with a firmware load error:

sof-audio-pci-intel-mtl 0000:80:1f.3: Code loader DMA did not complete
sof-audio-pci-intel-mtl 0000:80:1f.3: ------------[ DSP dump start ]------------
sof-audio-pci-intel-mtl 0000:80:1f.3: Firmware download failed
sof-audio-pci-intel-mtl 0000:80:1f.3: fw_state: SOF_FW_BOOT_READY_OK (6)
sof-audio-pci-intel-mtl 0000:80:1f.3: 0x50000005: module: ROM_EXT, state: FW_ENTERED, running
sof-audio-pci-intel-mtl 0000:80:1f.3: Firmware state: 0x5, status/error code: 0x0
sof-audio-pci-intel-mtl 0000:80:1f.3: Core dump is not available due to invalid separator 0xc0de
sof-audio-pci-intel-mtl 0000:80:1f.3: ------------[ DSP dump end ]------------
sof-audio-pci-intel-mtl 0000:80:1f.3: Failed to start DSP
sof-audio-pci-intel-mtl 0000:80:1f.3: error: failed to boot DSP firmware after resume -110

To Reproduce
Update the BIOS settings for S3 and S4:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

Run suspend/resume multiple times with the command below:
sleep 10 && rtcwake -m mem -s 15

Reproduction Rate
5%

Impact
High impact

Environment
Kernel: sof_dev (commit: 7df4fc116381)
SOF: v2.10
Topology: sof-hda-generic-4ch.tplg
Platform: Ubuntu 24.04

syedk008 commented 1 month ago

dmesg.txt

dmesg file with sof dynamic debug enabled. Search for keyword "Code loader".

lgirdwood commented 1 month ago

@ssavati any chance you can write a small script that loops over sleep 10 && rtcwake -m mem -s 15 on MTL RVP for several hundred iterations?

@kv2019i @ujfalusi @plbossart IIUC, the FW is still running (or the memories are not cleared) when we try to reload the code. Could this mean we have not put the DSP into D3?
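A minimal sketch of such a loop, in Python, for anyone who wants to reproduce this locally. It is not a script used by anyone in this thread. The function name `suspend_loop` and its parameters are made up for illustration; it assumes root privileges, `rtcwake` and `dmesg` from util-linux, and that the kernel log is cleared first so a stale error is not picked up.

```python
import subprocess
import time


def suspend_loop(iterations, pre_sleep=10, wake_after=15,
                 run=subprocess.run, sleep=time.sleep):
    """Suspend/resume repeatedly and stop at the first SOF boot failure.

    Returns the 1-based iteration at which the error string from this
    issue shows up in dmesg, or None if all iterations passed.
    `run` and `sleep` are injectable so the loop can be dry-run.
    """
    marker = "failed to boot DSP firmware after resume"
    # Start from an empty kernel ring buffer (needs root).
    run(["dmesg", "--clear"], check=True)
    for i in range(1, iterations + 1):
        sleep(pre_sleep)
        # Same command as in the report: suspend to RAM, wake via RTC.
        run(["rtcwake", "-m", "mem", "-s", str(wake_after)], check=True)
        log = run(["dmesg"], capture_output=True, text=True, check=True).stdout
        if marker in log:
            return i
    return None
```

Calling `suspend_loop(300)` as root would cover the "several hundred iterations" asked for above. With a 5% reproduction rate, the failure would be expected well before that, if the platform reproduces it at all.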

kv2019i commented 1 month ago

In the above log, FW_READY is received, so FW has been loaded successfully:

[ 202.961859] snd_sof:snd_sof_run_firmware: sof-audio-pci-intel-mtl 0000:80:1f.3: booting DSP firmware ...
[ 203.265654] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:80:1f.3: ipc rx : 0x1b080000|0x0: GLB_NOTIFICATION|FW_READY

So it seems the DMA transfer was successful, but the host misses the transfer completion interrupt and raises the error (even though the FW was loaded and booted).
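The reasoning above can be applied mechanically to a dmesg capture. The following is a rough sketch in Python, not a tool from the SOF project. The function name `classify_boot` and the returned labels are hypothetical, and it assumes the text passed in covers a single resume attempt with SOF dynamic debug enabled (otherwise the FW_READY line is not logged).

```python
def classify_boot(log_text):
    """Classify one resume attempt from its dmesg text.

    Only looks for the two strings quoted in this issue:
    the code loader timeout and the FW_READY notification.
    """
    dma_timeout = "Code loader DMA did not complete" in log_text
    fw_ready = "GLB_NOTIFICATION|FW_READY" in log_text
    if not dma_timeout:
        return "boot ok" if fw_ready else "no FW_READY seen"
    if fw_ready:
        # The case described above: firmware booted, completion irq missed.
        return "IOC missed, firmware booted"
    return "DMA timeout, firmware not booted"
```

For the log attached to this issue, the expected label would be the "IOC missed, firmware booted" one, since both strings are present.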

ssavati commented 1 month ago

@ssavati any chance you can write a small script that loops over sleep 10 && rtcwake -m mem -s 15 on MTL RVP for several hundred iterations.

Sure will try on MTLP HDA config first.

@syedk008 We don't have an ARL-S setup. Is it possible to get one board for debugging?

kv2019i commented 1 month ago

@syedk008 can you try https://github.com/thesofproject/linux/pull/5136

plbossart commented 1 month ago

@kv2019i there's still something very strange if the DMA is programmed to generate an IOC interrupt upon the end of the transfer and we don't receive it.

For IPC3 this was root-caused to something odd in SOF 2.0, which was fixed in 2.2

To me this still points to something not quite correct on the firmware, or ROM side.

ujfalusi commented 1 month ago

@kv2019i there's still something very strange if the DMA is programmed to generate an IOC interrupt upon the end of the transfer and we don't receive it.

For IPC3 this was root-caused to something odd in SOF 2.0, which was fixed in 2.2

To me this still points to something not quite correct on the firmware, or ROM side.

@plbossart, this is the firmware booting, at this stage IPC version does not matter. We can load anything at this stage (another thing is that it is going to be rejected as not valid). I cannot believe that ROM booting can depend on the IPC protocol used by the transferred binary, which is not even started at this point.

plbossart commented 1 month ago

you missed the point @ujfalusi

the code loader sets up the DMA with the IOC bit set. If we don't get an interrupt, then something is wrong in the firmware or ROM handling.

We previously disabled the IPC3 because the problems we found were related to old firmware. I agree this has nothing to do with the IPC proper, but is related to the firmware infrastructure.

ujfalusi commented 1 month ago

@plbossart, the initial firmware loading has nothing to do with the payload itself. The DMA will load the amount of data and that's it. The firmware has nothing to do with this, it is ROM and second stage, soft ROM code. Can you point me to the firmware fix for IPC3 you mentioned?

plbossart commented 1 month ago

if the HDaudio DMA is programmed with the IOC bit set, then do we agree the IOC interrupt SHALL be generated?

We've seen in some cases of IPC3 firmware that it was not, see https://github.com/thesofproject/linux/issues/5072

I don't really care if we remove the wait for this interrupt, but the fact that different machines have different unexplained behaviors is concerning. What exactly makes ARL-S different from all our CI devices?

ssavati commented 1 month ago

@lgirdwood @syedk008 I have tried to reproduce the issue on MTL HDA.

I applied the BIOS settings below:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

With the above settings the system goes to “PM: suspend entry (deep)” but does not resume, and the device needs a restart.

On our devices, set "S0 Idle Low Power Idle Capability -> Enabled". With this the system goes to “PM: suspend entry (s2idle)” and resumes. I ran “sleep 10 && rtcwake -m mem -s 15” in a loop and it completed 200 iterations without any issue.

Tested on the config below:
Linux Branch: topic/sof-dev
Linux Commit: 7df4fc116381
SOF Branch: v2.10
SOF Commit: b15f1f1a3238
All our systems are on Ubuntu 22.04

ujfalusi commented 1 month ago

if the HDaudio DMA is programmed with the IOC bit set, then do we agree the IOC interrupt SHALL be generated?

Yes, we agree on this.

We've seen in some cases of IPC3 firmware that it was not, see #5072

That is exactly the same issue, I agree again.

I don't really care if we remove the wait for this interrupt, but the fact that different machines have different unexplained behaviors is concerning. What exactly makes ARL-S different from all our CI devices?

That I cannot explain, but the fact that it can also fail on some TGL devices is curious. We have no problems with not waiting for the IOC in the IPC3 case, but we want to wait for it if the payload is IPC4? Does this make sense? The 'lost' IOC has nothing to do with the IPC version, do you agree? If so, then why would we use a different mode to press the power button?

plbossart commented 1 month ago

"The 'lost' IOC has nothing to do with the IPC version, do you agree?"

We had evidence that some versions of SOF 2.2 firmware didn't work and some did. We ended up disabling the wait for all IPC3 devices to avoid having to special-case which versions didn't work. The blanket "all IPC3 devices" was a simplification, not a statement that IPC was involved.

kv2019i commented 1 month ago

Let's see the results and see whether #5136 helps with this issue. It's clear IOC complete should work, but it's not so clear whether this wait is something we need to have in the FW load sequence to begin with.

syedk008 commented 1 month ago

@lgirdwood @syedk008 I have tried to reproduce the issue on MTL HDA.

I applied the BIOS settings below:
• Go to Intel Advanced Menu -> ACPI Settings -> Wakeup system from S5 via RTC -> Enabled
• Go to Intel Advanced Menu -> ACPI Settings -> S0 Idle Low Power Idle Capability -> Disabled

With the above settings the system goes to “PM: suspend entry (deep)” but does not resume, and the device needs a restart.

On our devices, set "S0 Idle Low Power Idle Capability -> Enabled". With this the system goes to “PM: suspend entry (s2idle)” and resumes. I ran “sleep 10 && rtcwake -m mem -s 15” in a loop and it completed 200 iterations without any issue.

Tested on the config below:
Linux Branch: topic/sof-dev
Linux Commit: 7df4fc116381
SOF Branch: v2.10
SOF Commit: b15f1f1a3238
All our systems are on Ubuntu 22.04

@ssavati I see the same behavior with the sof-dev config file. Please use the attached config file, which is the one used in our BKC: arl_defconfig.txt

syedk008 commented 1 month ago

@syedk008 can you try #5136

Thanks. With this patch, I could not reproduce the issue. We will test more with this and let you know.

plbossart commented 1 month ago

@kv2019i I thought it was an interesting data point to see when the transfer is complete v. when we get the first response from firmware.

We can of course remove the wait_for_completion(), it's not strictly required, but that would be an acknowledgement that we have no idea how the code download works and what makes it fail.

ujfalusi commented 1 month ago

@kv2019i I thought it was an interesting data point to see when the transfer is complete v. when we get the first response from firmware.

Yes, it can be interesting, true.

Note: The boot flow charts I have seen never include waiting for the IOC; it is always load the FW and wait for FW_READY.

We can of course remove the wait_for_completion(), it's not strictly required, but that would be an acknowledgement that we have no idea how the code download works and what makes it fail.

We could do something like this:

  1. start the DMA transfer; in the IOC irq handler, set a flag that we received it for the code loader
  2. wait for FW_READY
  3. if FW_READY did not come, then we print a different error depending on IOC reception

This could be racy, but if we hard wait for the IOC we might get the FW_READY and things will be amazingly confused.
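To make the three steps proposed above concrete, here is a small user-space simulation in Python. It is only a sketch of the control flow, not kernel code and not what any of the linked PRs implement. The class `CodeLoader`, its method names and the error strings are all hypothetical, and `threading.Event` stands in for the flag and the FW_READY wait.

```python
import threading


class CodeLoader:
    """Toy model: record the IOC, but only ever block on FW_READY."""

    def __init__(self):
        self.ioc_seen = threading.Event()   # flag set by the IOC irq handler
        self.fw_ready = threading.Event()   # set when FW_READY arrives

    def on_ioc_irq(self):
        # Step 1 (second half): just note that the IOC was received.
        self.ioc_seen.set()

    def on_fw_ready(self):
        self.fw_ready.set()

    def boot(self, start_dma, timeout):
        self.ioc_seen.clear()
        self.fw_ready.clear()
        start_dma()                          # step 1: start the DMA transfer
        if self.fw_ready.wait(timeout):      # step 2: wait for FW_READY
            return "booted"
        # Step 3: FW_READY never came; report based on IOC reception.
        if self.ioc_seen.is_set():
            return "boot failed: IOC received, no FW_READY"
        return "boot failed: no IOC, no FW_READY"
```

In this model a missed IOC no longer fails the boot as long as FW_READY arrives, which matches the behavior reported in this issue. The IOC flag only changes which error is printed when the boot really fails.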

I would remove the IOC wait as a fix and iterate on it, probably along the lines of the example above, to get a few more data points on real boot failures. It is an interesting detail if the FW has not booted and the DMA has not sent the IOC interrupt.

Btw, the IOC is purely an HDA DMA affair; it has nothing to do with the FW or the type of data.

ujfalusi commented 1 month ago

@ssavati, have you tried removing the audio drivers and then doing the deep suspend on MTL? Does that work? I don't think MTL supports deep sleep; it has been deprecated on recent Intel platforms for some time.

kv2019i commented 1 month ago

@syedk008 Could you test with this alternative PR that adds more debug https://github.com/thesofproject/linux/pull/5142 (expected to fail but with more debug). If the results are as expected, I propose we proceed with https://github.com/thesofproject/linux/pull/5136 as the fix, and potentially follow-up with a PR like https://github.com/thesofproject/linux/pull/5141 to keep some of the debug capabilities even if IOC wait is removed.