Open lyakh opened 2 years ago
I modified the code a bit:
case SOF_DAI_INTEL_ALH:
/* SDW HW FIFO always requires 32bit MSB aligned sample data for
* all formats, such as 8/16/24/32 bits.
*/
dev->ipc_config.frame_fmt = SOF_IPC_FRAME_S32_LE;
t1 = platform_timer_get(NULL);
dd->dma_buffer->stream.frame_fmt = dev->ipc_config.frame_fmt;
t2 = platform_timer_get(NULL);
comp_info(dev, "DMA buffer %p t-diff %u", dd->dma_buffer, (unsigned)(t2 - t1));
to make the location between the two calls to platform_timer_get()
obvious. Here's the disassembly:
be030fd4: 1062d2 s32i a13, a2, 64
be030fd7: 1bebe5 call8 be04ce94 <platform_timer_get>
be030fda: 1123e2 l32i a14, a3, 68
be030fdd: 207aa0 or a7, a10, a10
be030fe0: 1022d2 l32i a13, a2, 64
be030fe3: 0a0c movi.n a10, 0
be030fe5: 176ed2 s32i a13, a14, 92
be030fe8: 1beaa5 call8 be04ce94 <platform_timer_get>
be030feb: 04dd mov.n a13, a4
be030fed: ffaec1 l32r a12, be030ea8 (20002528 <log_entry$20773_360_14>)
be030ff0: 3e0c movi.n a14, 3
be030ff2: c2f8 l32i.n a15, a2, 48
be030ff4: c08a70 sub a8, a10, a7
be030ff7: b2b8 l32i.n a11, a2, 44
be030ff9: 2a0c movi.n a10, 2
be030ffb: 01b9 s32i.n a11, a1, 0
be030ffd: 11a9 s32i.n a10, a1, 4
be030fff: 0b0c movi.n a11, 0
be031001: 06ad mov.n a10, a6
be031003: 112392 l32i a9, a3, 68
be031006: 2199 s32i.n a9, a1, 8
be031008: 3189 s32i.n a8, a1, 12
be03100a: 2454a5 call8 be055554 <_log_sofdict>
Was this discovered while debugging recent regressions #6075 and/or #6091? Just trying to connect some dots, sorry if unrelated.
Was this discovered while debugging recent regressions #6075 and/or #6091? Just trying to connect some dots, sorry if unrelated.
@marc-hb #6075 , yes
Further, the same problem also appears on TGL, where the code was modified to add that NULL-dereference, also when built with crosstool gcc and with the Zephyr toolchain. Apparently that code should run during initial DAI configuration and during potential run-time DAI re-configuration. The ipc_dai_data_config()
function is only called from dai_params()
in dai-legacy.c and dai-zephyr.c, and in both cases dd->dma_buffer
is first allocated there - during initialisation, or re-used - during re-configuration.
Update: just checked, that doing the same - adding a low address (e.g. 0x10) dereference in main()
behaves in one of two ways: (1) causes an exception when reading from it - as expected or (2) causes no problem when writing to it.
@lyakh what happens when 1) your pointer is volatile ? OR 2) You write back the cache after the write ?
Update Nr.2: on BYT neither writes to nor reads from low addresses cause exceptions. Update Nr.3: actually using 0 instead of a low address has the same effect
@lyakh what happens when
1. your pointer is volatile ? OR 2. You write back the cache after the write ?
@lgirdwood sorry, I checked assembly. I'm rather sure those actual addresses (0x10, 0, etc.) are used, which of course are outside of any RAM addresses - cached or uncached.
@mmaka1 any idea about this?
@lyakh what happens when
1. your pointer is volatile ? OR 2. You write back the cache after the write ?
@lgirdwood sorry, I checked assembly. I'm rather sure those actual addresses (0x10, 0, etc.) are used, which of course are outside of any RAM addresses - cached or uncached.
Can you check anyway, this could be misconfiguration of the cache_attr for the HW (see the linker script for value and pls check it's value is aligned with HW).
Can you check anyway, this could be misconfiguration of the cache_attr for the HW (see the linker script for value and pls check it's value is aligned with HW).
@lgirdwood I see what you mean... IIRC @andyross and @ceolin checked and double-checked those flags for Zephyr, do you think such an effect could be possible via their misconfiguration?
Unfortunately the cAVS/MTL platforms put registers in that low 512M block, so we can't use the region protection bits to deny access. That region is marked rw/uncached. Whether or not an address faults (some do, for sure) is entirely up to the bus hardware.
Once there's hardware available with an MMU, there will be more options. Also on some of the devices (I want to say cAVS 1.5 can do this but 1.8+ can't?) we have the "RPO with Translation" feature, which allows the top 3 bits of memory to remap each 512M region. That would allow us (with a lot of work to the devicetree definitions!) to move the registers somewhere else and mark the null region invalid. But it would only work on the older hardware, IIRC.
Just checked: indeed, the core-isa.h files in the sof tree have XCHAL_HAVE_XLT_CACHEATTR==1 for BYT and APL, but all other platforms (AMD/NXP/MediaTek included) are 0. So that's not really much of an option.
Can you check anyway, this could be misconfiguration of the cache_attr for the HW
FWIW: the cacheattr register actually doesn't exist on platforms with Region Protection Option, the HAL code to set it actually interprets the register value at runtime. We have a smaller, plausibly-too-optimized-for-its-own-good version here: https://github.com/zephyrproject-rtos/zephyr/blob/main/include/zephyr/arch/xtensa/cache.h#L163
Just for curiosity, can you check what is the access mode for this address ? Not familiar with this part of SoF code, but I think glb_addr_attr
should give you it.
Unfortunately the cAVS/MTL platforms put registers in that low 512M block, so we can't use the region protection bits to deny access. That region is marked rw/uncached. Whether or not an address faults (some do, for sure) is entirely up to the bus hardware.
Once there's hardware available with an MMU, there will be more options. Also on some of the devices (I want to say cAVS 1.5 can do this but 1.8+ can't?) we have the "RPO with Translation" feature, which allows the top 3 bits of memory to remap each 512M region. That would allow us (with a lot of work to the devicetree definitions!) to move the registers somewhere else and mark the null region invalid. But it would only work on the older hardware, IIRC.
We are using identity map without translation, aren't we ?
Right. The code is aware of the "with translation" distinction (because it has to set up the top three bits differently), but for compatibility we don't try to do any mapping.
I may be missing something, but only baytrail and apollolake have support for translation.
src/platform/apollolake/include/arch/xtensa/config/core-isa.h:580:#define XCHAL_HAVE_XLT_CACHEATTR 1 /* region prot. w/translation */
src/platform/baytrail/include/arch/xtensa/config/core-isa.h:478:#define XCHAL_HAVE_XLT_CACHEATTR 1 /* region prot. w/translation */
src/platform/cannonlake/include/arch/xtensa/config/core-isa.h:580:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/haswell/include/arch/xtensa/config/core-isa-bdw.h:588:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/haswell/include/arch/xtensa/config/core-isa-hsw.h:588:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/icelake/include/arch/xtensa/config/core-isa.h:580:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/imx8/include/arch/xtensa/config/core-isa.h:614:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/imx8m/include/arch/xtensa/config/core-isa.h:612:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/imx8ulp/include/arch/xtensa/config/core-isa.h:645:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/meteorlake/include/arch/xtensa/config/core-isa.h:659:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/mt8186/include/arch/xtensa/config/core-isa.h:760:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/mt8195/include/arch/xtensa/config/core-isa.h:669:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/suecreek/include/arch/xtensa/config/core-isa.h:580:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
src/platform/tigerlake/include/arch/xtensa/config/core-isa.h:603:#define XCHAL_HAVE_XLT_CACHEATTR 0 /* region prot. w/translation */
Right. Thus "not much of an option". :)
that all sounds wonderful... But reading from those addresses does generate an exception, so the DSP "knows" that they are invalid? I looked at some of the registers and the lowest addresses I see begin soon after 0x1000 so about the low 4k of addresses seem to be invalid? So it's the bus hardware bug, that it raises exceptions on reading from there but not on writing to that area?
That matches my experience, though I never tried to probe the exact behavior. For sure I've gotten some reasonable-looking exceptions on null pointer dereferences. It makes perfect sense that the hardware designers would put a poison region at the bottom of memory. Frustrating if it only works for reads, though.
that all sounds wonderful... But reading from those addresses does generate an exception, so the DSP "knows" that they are invalid? I looked at some of the registers and the lowest addresses I see begin soon after 0x1000 so about the low 4k of addresses seem to be invalid? So it's the bus hardware bug, that it raises exceptions on reading from there but not on writing to that area?
Did you were able to get the access mode for these address that are behaving like this ? There are access mode that raises exceptions in fetch but not in load and store operations.
Describe the bug A line of code clearly has to dereference a pointer, where the above line prints it as NULL, and no exception is generated, when built with xcc.
To Reproduce Use https://github.com/thesofproject/sof/commit/95b128e4f614acf844c1fd9388232a44f8255ae6 and build an XTOS based firmware for TGL, then run it on an SDW platform. Observe the following log:
specifically see this line:
[ 223855.052083] ( 10.078125) c0 dai 1.2 src/ipc/ipc3/dai.c:142 INFO DMA buffer (nil)
which is produced by this code: https://github.com/thesofproject/sof/blob/95b128e4f614acf844c1fd9388232a44f8255ae6/src/ipc/ipc3/dai.c#L142-L143 where the next line after the print dereferences the NULL pointer, and no exception happens. A full log is attached.
Reproduction Rate 100%
Expected behavior A DSP exception should be triggered.
Impact The compiler seems to have a severe bug
Environment 1) Name of the topology file
sof-coherent-null.txt