thesofproject / sof

Sound Open Firmware
Other
561 stars 318 forks source link

[BUG][IPC3][Zephyr] IPC tx error when trying to do aplay/arecord after DSP suspend #6582

Closed LaurentiuM1234 closed 8 months ago

LaurentiuM1234 commented 1 year ago

Describe the bug A clear and concise description of what the bug is.

IPC either gets timeout or gets an error when trying to do aplay/arecord after DSP suspend when enabling DMA trace with DMA domain on i.MX using Zephyr build.

What have you tried to diagnose or workaround this issue?

Tried the following: 1) Adding a delay of 200us (by setting EDF_SCHEDULE_DELAY to 200) when adding work to edf_schedule => this seems to somehow make things a bit better as I was able to run the testing script for about 2 hours with no errors (usually it takes a few minutes for the IPC timeout issue to occur)

To Reproduce 1) Enable the DMA domain on Zephyr by setting CONFIG_DMA_DOMAIN to y. 2) Remove the ifndef __ZEPHYR__ check in src/platform/imx8/platform.c, platform_init() function from the following sequence:

#ifndef __ZEPHYR__ /* remove this */
#if CONFIG_TRACE
    /* Initialize DMA for Trace*/
    trace_point(TRACE_BOOT_PLATFORM_DMA_TRACE);
    dma_trace_init_complete(sof->dmat);
#endif
    /* show heap status */
    heap_trace_all(1);
#endif /* __ZEPHYR__ */ /* also remove this */

3) Run the following test script:

#!/bin/bash                                                                                                                                                                                                    

set -e

SLEEP_VALUE=3

for j in {1..1000}
do
# test consecutive aplays with 3s sleep between
for i in {1..10}
do
    aplay -Dhw:1,0 -d 2 piano2.wav
    sleep $SLEEP_VALUE
done

# test aplay followed by arecord followed by 3s sleep
for i in {1..10}
do
     aplay -Dhw:1,0 -d 2 piano2.wav
     arecord -Dhw:1,0 -f S32_LE -c 2 -r 48000 -d 2 sample.wav
     sleep $SLEEP_VALUE
done

# test consecutive aplays with 3s sleep between
for i in {1..10}
do
     arecord -Dhw:1,0 -f S32_LE -c 2 -r 48000 -d 2 sample.wav
     sleep $SLEEP_VALUE
done
done

Reproduction Rate 100%

Expected behavior The IPCs shouldn't get an error.

Impact annoyance - the logs don't work properly because of it

Environment 1) Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).

Screenshots or console output Sample 1: (happened in the beginning, without running the test script)

[   31.428868] sof-audio-of-imx8 556e8000.dsp: ipc tx timed out for 0x30110000 (msg/reply size: 12/12)
[   31.437994] sof-audio-of-imx8 556e8000.dsp: ------------[ DSP dump start ]------------
[   31.445972] sof-audio-of-imx8 556e8000.dsp: IPC timeout
[   31.451413] sof-audio-of-imx8 556e8000.dsp: fw_state: SOF_FW_BOOT_COMPLETE (6)
[   31.458738] sof-audio-of-imx8 556e8000.dsp: invalid header size 0x5480. FW oops is bogus
[   31.466884] sof-audio-of-imx8 556e8000.dsp: unexpected fault 0x000064c2 trace 0x000064c2
[   31.475016] sof-audio-of-imx8 556e8000.dsp: ------------[ DSP dump end ]------------
[   31.482828] sof-audio-of-imx8 556e8000.dsp: failed to free widget PIPELINE.2.SAI1.IN
[   32.004845] sof-audio-of-imx8 556e8000.dsp: ipc tx timed out for 0x90050000 (msg/reply size: 8/12)
[   32.013861] sof-audio-of-imx8 556e8000.dsp: IPC timeout
[   32.019124] sof-audio-of-imx8 556e8000.dsp: DMA_TRACE_FREE failed with error: -110
[   32.548851] sof-audio-of-imx8 556e8000.dsp: ipc tx timed out for 0x40010000 (msg/reply size: 76/12)
[   32.557965] sof-audio-of-imx8 556e8000.dsp: IPC timeout
[   32.563219] sof-audio-of-imx8 556e8000.dsp: ctx_save IPC error: -110, proceeding with suspend

Sample 2: (happend after 11 minutes of running the test script)

[  721.734474] sof-audio-of-imx8 556e8000.dsp: reply size (100) exceeds the buffer size (20)
[  721.742713] sof-audio-of-imx8 556e8000.dsp: ipc tx error for 0x30010000 (msg/reply size: 100/20): -22
[  721.752043] sof-audio-of-imx8 556e8000.dsp: Failed to setup widget PGA2.0
[  721.758918] sof-audio-of-imx8 556e8000.dsp: Failed to restore pipeline after resume -22
[  721.766993] sof-audio-of-imx8 556e8000.dsp: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 556e8000.dsp: -22
[  721.777573]  sai1-wm8960-hifi: __soc_pcm_open() failed (-22)
[  721.783262]  Port0: ASoC: dpcm_be_dai_startup() failed at sai1-wm8960-hifi (-22)
[  721.790709]  Port0: dpcm_fe_dai_startup() failed (-22)

Note 1) Without enabling DMA trace, I was able to run the test script for at least 12 hours without the problem occurring (stopped because I thought it was enough, not because of the problem appearing) 2) There are no notable errors reported by sof-logger.

lgirdwood commented 1 year ago

@LaurentiuM1234 you may find a timer drift after suspend. i.e. the next task may have been scheduled before suspend but the time after resume means this task time is now in the past ? @teburd fyi.

lgirdwood commented 1 year ago

@kv2019i fyi

LaurentiuM1234 commented 1 year ago

@LaurentiuM1234 you may find a timer drift after suspend. i.e. the next task may have been scheduled before suspend but the time after resume means this task time is now in the past ? @teburd fyi.

Hm, as far as I know there is a time between the moment at which the DSP clocks get re-enabled and the FW gets loaded in which something nasty (possibly something timer-related like you said) could happen but what I don't understand is how enabling the DMA trace leads to this? As far as I can tell, the DMA trace gets disabled before cutting off DSP clocks and the edf_workq thread should be stuck endlessly waiting for an IRQ in a while loop so there shouldn't be anything left "running" (although I am not sure about the system workq)

EDIT: Also, regarding this statement:

Hm, as far as I know there is a time between the moment at which the DSP clocks get re-enabled and the FW gets loaded in which something nasty (possibly something timer-related like you said) could happen

If this were to happen, shouldn't loading the FW still lead to the FW entering an expected state (the initial state)?

EDIT2: Please let me know if there's anything wrong with my understanding.

lgirdwood commented 1 year ago

So I dont know about the i.MX arch PM logic, but if your DSP is doing a clean reboot after suspend then all FW status and configs should be in a reset state ? i.e. there is no memory of the previous config/state prior to the suspend in the FW (however, the driver may still have previous context ?).