Closed marc-hb closed 1 month ago
Last working daily test run 43924
Start Time: 2024-07-16 13:08:34 UTC Linux Commit: 1998ade4783a KConfig Commit: 8189104a4f38 SOF Commit: 3051607efb4f Zephyr Commit: 650227d8c47f
First failing daily test run 43972
Start Time: 2024-07-17 13:21:12 UTC Linux Commit: c3ecd35f66c1 KConfig Commit: 8189104a4f38 SOF Commit: d629e521020c Zephyr Commit: 740d7f735e23
d629e521020c serhiy.katsyuba@intel.com // (origin/main) ipc4: copier: Add IPC4 channel map handler
ffce2cbd93a4 serhiy.katsyuba@intel.com // ipc4: copier: Extend get_convertion_func() to support remapping
eda60297367a serhiy.katsyuba@intel.com // dai-zephyr: Prioritize HW params channels over base config params
808200604295 serhiy.katsyuba@intel.com // ipc4: pcm_converter: Add channel remapping conversion functions
319978685d78 serhiy.katsyuba@intel.com // pcm_converter: Add channel map parameter
8b927ad49777 serhiy.katsyuba@intel.com // dai-zephyr: Use frames, not samples, for DMA copy bytes calculation
6e183cbf530c serhiy.katsyuba@intel.com // dai-zephyr: Fix to avoid using buffers with uninitialized stream params
ee66620b2b35 serhiy.katsyuba@intel.com // ipc4: dai-zephyr: Do not reuse process func as multi-gateway channel copy
5a000e825bcc tomasz.m.leman@intel.com // west.yml: update zephyr to 740d7f735e2
3051607efb4f guennadi.liakhovetski@linux.intel.com // kcps: fix 0 module CPC case
_
c3ecd35f66c1 Vijendar.Mukunda@amd.com // !fixup ASoC: amd: acp: fix for unused-but-set-variable warning for amp_num
1998ade4783a pierre-louis.bossart@linux.intel.com // ASoC: SOF: sof-audio.h: optimize snd_sof_pcm_stream_pipeline_list
I checked all the corresponding Pull Requests and they all passed separately.
cc: @serhiy-katsyuba-intel
@marc-hb the last passing test of 16.07 already includes LLEXT patches as you can also see by following https://github.com/thesofproject/sof/commit/3051607efb4f53a53a5f65bf09eb4d282cc016ec and by checking that daily test's logs
can it be the 5a000e825bcc Zephyr upgrade?
I only reverted LLEXT PR while keeping zephyr upgrade. FW re-loading worked fine. This is my revert branch, https://github.com/fredoh9/sof/tree/fix/revert
@fredoh9 thanks for checking that! And that isn't surprising, since every PR merged after LLEXT passed its testing too, on a base that didn't include LLEXT. But apparently one of those PRs conflicts with LLEXT.
@lyakh @tmleman can you check this? It seems to be a combination of LLEXT and the Zephyr update. Given where it fails, this could be a new side-effect of the power_down.S changes.
Partial quote from https://github.com/thesofproject/sof/pull/9291#issuecomment-2236750181
We can have cases where two PRs pass the tests independently, but when they are both in the tree, the test fails.
This is a well-known CI issue and Gitlab and Github have a solution for it: ...
Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.
This instead of reverting recent, unrelated commits that just happened to be in the wrong place at the wrong time:
The latter feels like a timebomb: which other, unrelated commits will just break it and crash again?
Unfortunately, I have not yet managed to get to the bottom of these problems. However, I have determined the following:
My suspicion is that someone is writing to memory locations they shouldn't.
Given that the IMR context save has been disabled for the MTL, the FW boot process should be consistent each time.
@tmleman what do you mean by "IMR context save has been disabled for the MTL"?
[ 161.669233] kernel: snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: IMR restore supported, booting from IMR directly
.
.
.
[ 163.714446] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops
Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.
I just tested this and it works. In other words, adding BIT(7), i.e. 0x80, to "options snd-sof sof_debug=..." completely avoids this crash.
https://github.com/thesofproject/linux/blob/topic/sof-dev/sound/soc/sof/sof-priv.h
Note IMR resume is being especially "evil" here because you don't need to reboot to drop the bit 0x80 but you need to reboot to add it. Maybe that's an error handling issue in the Linux kernel?
cc: @jxstelter
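For readers unfamiliar with bitmask module parameters, here is a minimal sketch of what "adding BIT(7)" to an existing sof_debug value means. The helper names and the constant label are hypothetical and chosen only for illustration; only the bit position (7, i.e. 0x80) and the "options snd-sof sof_debug=" prefix come from this thread.

```python
# Illustrative only: BIT7 is a label made up for this sketch, not a kernel name.
BIT7 = 1 << 7  # BIT(7) == 0x80, the bit discussed in this thread


def with_bit7(sof_debug: int) -> int:
    """Return sof_debug with bit 7 set; all other bits are left unchanged."""
    return sof_debug | BIT7


def modprobe_line(sof_debug: int) -> str:
    """Format a modprobe options line with bit 7 OR-ed into the given value."""
    return f"options snd-sof sof_debug={with_bit7(sof_debug):#x}"


if __name__ == "__main__":
    # Example: a setup that already uses sof_debug=0x1 keeps that bit.
    print(modprobe_line(0x1))  # options snd-sof sof_debug=0x81
```

OR-ing rather than replacing the value keeps any debug bits that were already in use on the test machine.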
@tmleman what do you mean by "IMR context save has been disabled for the MTL"?
Answering myself: 69ad96abbfa0 set CONFIG_ADSP_IMR_CONTEXT_SAVE=n (It's y for LNL)
My suspicion is that someone is writing to memory locations they shouldn't.
"Someone" in audio firmware or could it be "someone" elsewhere?
@tmleman has explained to me that he had traced the problem down to IMR overwriting by the use of the IMR heap. That explains well why using (LLEXT) loadable modules triggers these problems. Indeed, "randomly" increasing some of the values in https://github.com/zephyrproject-rtos/zephyr/blob/f2b6490dee30b38dfb0ee31902177491c35ebacb/soc/intel/intel_adsp/ace/include/adsp_memory.h#L44-L68 fixes the problem. @tmleman could you provide a fix with some "meaningful" constants? And we need a strategy for keeping those up to date, or maybe generating them automatically.
Note Zephyr commit https://github.com/zephyrproject-rtos/zephyr/commit/6069f946be1bd502 has de-duplicated the adsp_memory.h file across ace15, ace20 and ace30 and moved some constants to device tree.
@lyakh thank you for the fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack.
To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied:
L3_MEM_BASE_ADDR = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE = 0xa1049000 # The beginning of the memory copied to the hpsram.
0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE = 0xa1110000
IMR_BOOT_LDR_STACK_BASE = 0xa1120000
IMR_L3_HEAP_BASE = 0xa1121000
The last copied address = 0xa1133000 -> 0xa011a000
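The overlap can be checked with a few lines of arithmetic. The sketch below uses only the addresses from the map above; it assumes the "last copied address" marks the exclusive end of the copied range, and the function names are made up for illustration.

```python
# Addresses taken from the IMR map in this comment.
IMR_COPY_START = 0xA1049000           # IMR_BOOT_LDR_DATA_BASE, start of the copied range
IMR_COPY_END = 0xA1133000             # last copied address (assumed exclusive end)
HPSRAM_DEST = 0xA0030000              # where IMR_COPY_START lands in HPSRAM
IMR_BOOT_LDR_STACK_BASE = 0xA1120000
IMR_L3_HEAP_BASE = 0xA1121000


def in_copied_image(addr: int) -> bool:
    """True if addr lies inside the IMR range that holds the firmware copy."""
    return IMR_COPY_START <= addr < IMR_COPY_END


def imr_to_hpsram(addr: int) -> int:
    """Translate an IMR address to its HPSRAM destination (same offset)."""
    return addr - IMR_COPY_START + HPSRAM_DEST


if __name__ == "__main__":
    print(in_copied_image(IMR_BOOT_LDR_STACK_BASE))  # True
    print(in_copied_image(IMR_L3_HEAP_BASE))         # True
    print(hex(imr_to_hpsram(IMR_COPY_END)))          # 0xa011a000
```

Both the boot loader stack base and the L3 heap base fall inside the copied range, and the translated end address matches the 0xa011a000 shown in the map.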
The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively. CI caught the problem only in builds with assertions enabled, but in a normal build the heap initialization was also overwriting our FW; it just didn't cause crashes.
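As a rough sanity check on the new placement, the following sketch computes how much room the moved regions leave past the end of the copied firmware. It assumes the last copied address from the map (0xa1133000) is the end of the image; the helper name is hypothetical.

```python
LAST_COPIED_ADDR = 0xA1133000   # end of the firmware copy in IMR, from the map
OLD_HEAP_BASE = 0xA1121000      # IMR_L3_HEAP_BASE before the fix
NEW_STACK_BASE = 0xA1150000     # IMR_BOOT_LDR_STACK_BASE after the fix
NEW_HEAP_BASE = 0xA1151000      # IMR_L3_HEAP_BASE after the fix


def headroom(image_end: int, region_base: int) -> int:
    """Bytes between the end of the image and a region base.

    A negative result means the region starts inside the image.
    """
    return region_base - image_end


if __name__ == "__main__":
    print(hex(headroom(LAST_COPIED_ADDR, NEW_STACK_BASE)))  # 0x1d000
    print(headroom(LAST_COPIED_ADDR, OLD_HEAP_BASE) < 0)    # True
```

With these numbers the image can grow by 0x1d000 bytes (116 KiB) before it reaches the relocated stack, which is why the constants still need to be kept in step with the firmware size.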
The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively.
@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because that is what overlaps with the firmware and overwrites parts of it? And yes, we need to automate extracting these addresses; @lgirdwood proposes using a linker script.
We need a short-term fix today for Zephyr, but long term we should use Zephyr methods for memory reservations. @dcpleung will @lyakh's fix https://github.com/zephyrproject-rtos/zephyr/pull/76196 be OK for Zephyr today?
I think as a hotfix that should be fine.
but long term we should use Zephyr methods for memory reservations
I just filed a new issue https://github.com/zephyrproject-rtos/zephyr/issues/76247 to discuss and track a longer term solution.
Tentative Zephyr upgrade in sof/west.yml with the temporary fix:
With https://github.com/thesofproject/sof/pull/9338 merged, @marc-hb @tmleman can we close this?
@marc-hb opened a separate issue on the Zephyr side https://github.com/zephyrproject-rtos/zephyr/issues/76247. I think this one can be closed unless the FW grows faster than someone manages to fix it.
can we close this?
Once https://github.com/thesofproject/sof/pull/9347 is passing and merged.
@marc-hb wrote:
Once #9347 is passing and merged.
Ack, now done, closing this one. Thanks all!
As of July 22nd, the current status is: something (IMR heap?) overwrites the firmware code stored in the IMR and used for resuming. That's why not booting from IMR with sof_debug=0x80 avoids the crash.
To Reproduce
Use audio. Wait. Try to use it again.
Reproduction Rate
100%
Impact
Show stopper
cc:
9268
Screenshots or console output
https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html https://sof-ci.01.org/softestpr/PR1220/build662/devicetest/index.html