thesofproject / sof

Sound Open Firmware
Other
528 stars 307 forks source link

[BUG] MTL firmware does not resume any more #9308

Closed marc-hb closed 1 month ago

marc-hb commented 1 month ago

As of July 22nd, the current status is: something (IMR heap?) overwrites the firmware code stored in the IMR and used for resuming. That's why not booting from IMR with sof_debug=0x80 avoids the crash.  


To Reproduce

Use audio. Wait. Try to use it again.

Reproduction Rate

100%

Impact

Show stopper

cc:

Screenshots or console output

Screenshot 2024-07-17 at 17 45 08

https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html https://sof-ci.01.org/softestpr/PR1220/build662/devicetest/index.html

[  161.664059] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000044
[  161.664078] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000048
[  161.667119] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000048
[  161.667125] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIW: 00000040 at  00000048
[  161.667133] kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-mtl 0000:00:1f.3: Current DSP power state: D0I0
[  161.667138] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 0 -> 2
[  161.667143] kernel: snd_sof:snd_sof_load_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: loading firmware
[  161.667146] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 2 -> 3
[  161.668181] kernel: snd_sof_pci_intel_mtl:mtl_dsp_pre_fw_run: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x1010000 successful
[  161.669221] kernel: snd_sof_pci_intel_mtl:mtl_dsp_pre_fw_run: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1d1c]=0x8151 successful
[  161.669227] kernel: snd_sof:snd_sof_run_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: booting DSP firmware
[  161.669233] kernel: snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: IMR restore supported, booting from IMR directly
[  161.670274] kernel: snd_sof_pci_intel_mtl:mtl_dsp_core_power_up: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000101 successful
[  161.670281] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: Primary core power up successful
[  161.670290] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x73214]=0x80000000 successful
[  161.670301] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x41 successful
[  161.670310] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x1 successful
[  161.670320] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x163200]=0x50000005 successful
[  163.714303] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  163.714323] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[  163.714330] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[  163.714343] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x50000005: module: ROM_EXT, state: FW_ENTERED, running
[  163.714356] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware state: 0x5, status/error code: 0x0
[  163.714438] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Unknown toolchain is used
[  163.714446] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops
[  163.714451] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: Exception Cause: AllocaCause, MOVSP instruction, if caller’s registers are not in the register file
[  163.714458] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EXCCAUSE 0x00000005 EXCVADDR 0x00000004 PS       0x00060020 SAR     0x00000004
[  163.714466] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC1     0xa0088553 EPC2     0x00000000 EPC3     0x00000000 EPC4    0x00000000
[  163.714474] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC5     0x00000000 EPC6     0x00000000 EPC7     0x00000000 DEPC    0x00000000
[  163.714480] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS2     0x00000000 EPS3     0x00000000 EPS4     0x00000000 EPS5    0x00000000
[  163.714486] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS6     0x00000000 EPS7     0x00000000 INTENABL 0x00000000 INTERRU 0x00000000
[  163.714492] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: stack dump from 0x00000000
[  163.714498] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: AR registers:
[  163.714506] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x0: a0041e68 a0102688 00000000 a0065c84
[  163.714512] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x10: a0066370 00000001 0000000a a01025c8
[  163.714519] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x20: a0065d65 a0102628 00000001 40150478
[  163.714525] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x30: a0065d65 a0102628 00000001 40150478
[  163.714532] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  163.714538] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed to boot DSP firmware after resume -5
[  163.714547] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 3 -> 4
[  163.714558] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -5
[  163.714569] kernel:  NoCodec-0: ASoC: error at __soc_pcm_open on NoCodec-0: -5
[  163.714578] kernel:  Port0: ASoC: error at dpcm_be_dai_startup on Port0: -5
[  163.714588] kernel:  Port0: ASoC: error at dpcm_fe_dai_startup on Port0: -5
marc-hb commented 1 month ago

Last working daily test run 43924

Start Time: 2024-07-16 13:08:34 UTC Linux Commit: 1998ade4783a KConfig Commit: 8189104a4f38 SOF Commit: 3051607efb4f Zephyr Commit: 650227d8c47f

First failing daily test run 43972

Start Time: 2024-07-17 13:21:12 UTC Linux Commit: c3ecd35f66c1 KConfig Commit: 8189104a4f38 SOF Commit: d629e521020c Zephyr Commit: 740d7f735e23

d629e521020c serhiy.katsyuba@intel.com // (origin/main) ipc4: copier: Add IPC4 channel map handler ffce2cbd93a4 serhiy.katsyuba@intel.com // ipc4: copier: Extend get_convertion_func() to support remapping eda60297367a serhiy.katsyuba@intel.com // dai-zephyr: Prioritize HW params channels over base config params 808200604295 serhiy.katsyuba@intel.com // ipc4: pcm_converter: Add channel remapping conversion functions 319978685d78 serhiy.katsyuba@intel.com // pcm_converter: Add channel map parameter 8b927ad49777 serhiy.katsyuba@intel.com // dai-zephyr: Use frames, not samples, for DMA copy bytes calculation 6e183cbf530c serhiy.katsyuba@intel.com // dai-zephyr: Fix to avoid using buffers with uninitialized stream params ee66620b2b35 serhiy.katsyuba@intel.com // ipc4: dai-zephyr: Do not reuse process func as multi-gateway channel copy 5a000e825bcc tomasz.m.leman@intel.com // west.yml: update zephyr to 740d7f735e2 3051607efb4f guennadi.liakhovetski@linux.intel.com // kcps: fix 0 module CPC case

_

c3ecd35f66c1 Vijendar.Mukunda@amd.com // !fixup ASoC: amd: acp: fix for unused-but-set-variable warning for amp_num
1998ade4783a pierre-louis.bossart@linux.intel.com // ASoC: SOF: sof-audio.h: optimize snd_sof_pcm_stream_pipeline_list

I checked all the corresponding Pull Requests and they all passed separately.

cc: @serhiy-katsyuba-intel

lyakh commented 1 month ago

@marc-hb the last passing test of 16.07 already includes LLEXT patches as you can also see by following https://github.com/thesofproject/sof/commit/3051607efb4f53a53a5f65bf09eb4d282cc016ec and by checking that daily test's logs

lyakh commented 1 month ago

can it be the 5a000e825bcc Zephyr upgrade?

fredoh9 commented 1 month ago

I only reverted LLEXT PR while keeping zephyr upgrade. FW re-loading worked fine. This is my revert branch, https://github.com/fredoh9/sof/tree/fix/revert

lyakh commented 1 month ago

I only reverted LLEXT PR while keeping zephyr upgrade. FW re-loading worked fine. This is my revert branch, https://github.com/fredoh9/sof/tree/fix/revert

@fredoh9 thanks for checking that! And that isn't surprising, since every PR, merged after LLEXT passed its testing too - on a base that didn't include LLEXT. But apparently one of those PRs conflicts with LLEXT.

kv2019i commented 1 month ago

@lyakh @tmleman can you check this, it's seems it's a combination of LLEXT and Zephyr update. Given where it fails, this could be a new side-effect of the power_down.S changes.

marc-hb commented 1 month ago

Partial quote from https://github.com/thesofproject/sof/pull/9291#issuecomment-2236750181

We can have cases where two PRs pass the tests independetly, but when they are both in the tree, the test fails

This is a well-known CI issue and Gitlab and Github have a solution for it: ...

marc-hb commented 1 month ago

Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.

This instead of reverting recent, unrelated commits that just happened to be in the wrong place at the wrong time:

The latter feels like a timebomb: which other, unrelated commits will just break it and crash again?

tmleman commented 1 month ago

Unfortunately, I have not yet managed to get to the bottom of these problems. However, I have determined the following:

My suspicion is that someone is writing to memory locations they shouldn't.

marc-hb commented 1 month ago

Given that the IMR context save has been disabled for the MTL, the FW boot process should be consistent each time.

@tmleman what do you mean by "IMR context save has been disabled for the MTL"

I see this: https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-playback-10sec

[  161.669233] kernel: snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-mtl
               0000:00:1f.3: IMR restore supported, booting from IMR directly
.
.
.
[  163.714446] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops
marc-hb commented 1 month ago

Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.

I just tested this and it works. In other words, adding the BIT(7) 0x80 to options snd-sof sof_debug=... completely avoids this crash.

https://github.com/thesofproject/linux/blob/topic/sof-dev/sound/soc/sof/sof-priv.h

Note IMR resume is being especially "evil" here because you don't need to reboot to drop the bit 0x80 but you need to reboot to add it. Maybe that's an error handling issue in the Linux kernel?

cc: @jxstelter

marc-hb commented 1 month ago

@tmleman what do you mean by "IMR context save has been disabled for the MTL"

Answering myself: 69ad96abbfa0 set CONFIG_ADSP_IMR_CONTEXT_SAVE=n (It's y for LNL)

marc-hb commented 1 month ago

My suspicion is that someone is writing to memory locations they shouldn't.

"Someone" in audio firmware or could it be "someone" elsewhere?

lyakh commented 1 month ago

@tmleman has explained to me that he had traced the problem down to IMR overwriting by the use of the IMR heap. That explains well why using (LLEXT) loadable modules triggers these problems. Indeed, "randomly" increasing some of the values in https://github.com/zephyrproject-rtos/zephyr/blob/f2b6490dee30b38dfb0ee31902177491c35ebacb/soc/intel/intel_adsp/ace/include/adsp_memory.h#L44-L68 fixes the problem. @tmleman could you provide a fix with some "meaningful" constants? And we need a strategy for keeping those up to date, or maybe generating them automatically.

marc-hb commented 1 month ago

Note Zephyr commit https://github.com/zephyrproject-rtos/zephyr/commit/6069f946be1bd502 has de-duplicated the adsp_memory.h file across ace15, ace20 and ace30 and moved some constants to device tree.

tmleman commented 1 month ago

@lyakh thank you for the fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack.

To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied:

L3_MEM_BASE_ADDR             = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE   = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE        = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE       = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE       = 0xa1049000 # The beginning of the memory copied to the hpsram.
                               0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE        = 0xa1110000
IMR_BOOT_LDR_STACK_BASE      = 0xa1120000
IMR_L3_HEAP_BASE             = 0xa1121000
The last copied address      = 0xa1133000 -> 0xa011a000

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively. CI caught the problem only in builds with assertions enabled, but also in a normal build, the heap initialization was overwriting our FW, it just didn't cause crashes.

lyakh commented 1 month ago

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively.

@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because it's that which overlaps with the firmware and where we overwrite its parts? And yes, we need to automate extracting these addresses, @lgirdwood proposes using a linker script.

lgirdwood commented 1 month ago

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively.

@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because it's that which overlaps with the firmware and where we overwrite its parts? And yes, we need to automate extracting these addresses, @lgirdwood proposes using a linker script.

We need short term fix today for Zephyr, but long term we should use Zephyr methods for memory reservations @dcpleung will @lyakh fix https://github.com/zephyrproject-rtos/zephyr/pull/76196 be ok for Zephyr today ?

dcpleung commented 1 month ago

I think as a hotfix that should be fine.

marc-hb commented 1 month ago

but long term we should use Zephyr methods for memory reservations

I just filed new issue https://github.com/zephyrproject-rtos/zephyr/issues/76247 to discuss and track a longer term solution.

marc-hb commented 1 month ago

Tentative Zephyr upgrade in sof/west.yml with the temporary fix:

kv2019i commented 1 month ago

With #9338 merged, @marc-hb @tmleman can we close this?

tmleman commented 1 month ago

With https://github.com/thesofproject/sof/pull/9338 merged, @marc-hb @tmleman can we close this?

@marc-hb opened a separate issue on the Zephyr side https://github.com/zephyrproject-rtos/zephyr/issues/76247. I think this one can be closed unless the FW grows faster than someone manages to fix it.

marc-hb commented 1 month ago

can we close this?

Once https://github.com/thesofproject/sof/pull/9347 is passing and merged.

kv2019i commented 1 month ago

@marc-hb wrote:

Once #9347 is passing and merged.

Ack, now done, closing this one. Thanks all!