thesofproject / sof

Sound Open Firmware
Other
533 stars 308 forks source link

[BUG][IPC4] IO error with ipc timed out when GLB_CREATE_PIPELINE on both TGLU_UP_HDA_IPC4ZPH and ADLP_RVP_NOCODEC_IPC4 #5825

Closed XiaoyunWu6666 closed 2 years ago

XiaoyunWu6666 commented 2 years ago

Describe the bug IO error on both TGLU_UP_HDA_IPC4ZPH and ADLP_RVP_NOCODEC_IPC4 which run sof IPC4 Zephyr test configuration. IPC timed out when GLB_CREATE_PIPELINE and create module failure happened on both platforms.

To Reproduce TPLG=/lib/firmware/intel/avs-tplg/cavs-tgl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_IPC4 ~/sof-test/test-case/check-playback.sh -d 3 -l 1 -r 1 -F OR TPLG=/lib/firmware/intel/avs-tplg/cavs-tgl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_IPC4 ~/sof-test/test-case/check-suspend-resume.sh -l 50

Environment Kernel Branch: topic/sof-dev Kernel Commit: db34128c2a8d SOF Branch: main SOF Commit: 6c4f3ea42f0d Zephyr Commit: 782480313edc Platform: cavs-tgl-nocodec.tplg for ADLP_RVP_NOCODEC_IPC4 ; cavs-mixin-mixout-hda.tplg for TGLU_UP_HDA_IPC4ZPH

Take cavs-tgl-nocodec.tplg as example image

Screenshots or console output Take cavs-tgl-nocodec.tplg as example [dmesg]

[  126.641632] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Create widget copier.host.3.1 instance 0 - pipe 3 - core 0
[  126.641646] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x40000004|0x30015: MOD_INIT_INSTANCE [data size: 84]
[  126.641955] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx reply: 0x60000000|0x30015: MOD_INIT_INSTANCE
[  126.642066] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx done : 0x40000004|0x30015: MOD_INIT_INSTANCE [data size: 84]
[  126.642080] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: widget copier.host.3.1 setup complete
[  126.642085] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Create widget pipeline.4 instance 0 - pipe 4 - core 0
[  126.642090] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pipeline: 4 memory pages: 2
[  126.642096] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x11040002|0x0: GLB_CREATE_PIPELINE
[  127.143643] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x11040002|0x0
[  127.143676] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to create module pipeline.4
[  127.143697] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x12030000|0x0: GLB_DELETE_PIPELINE
[  127.647558] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x12030000|0x0
[  127.647591] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to free pipeline widget pipeline.3
[  127.647607] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Failed to set up connected widgets
[  127.648303] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 1 dir 0 cmd 1
[  127.648870] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x14001e successful
[  127.648884] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 3 state 3
[  127.648895] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13030003|0x0: GLB_SET_PIPELINE_STATE
[  128.151515] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x13030003|0x0
[  128.151547] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to pause pipeline 3
RanderWang commented 2 years ago

A random bug. Still debugging

XiaoyunWu6666 commented 2 years ago

in inner daily IPC4 12776 on model TGLU_UP_HDA_IPC4ZPH in testcase check-playback-all-formats , the IO error happened again with FW ERROR Unsupported operation requested [dmesg]

[  164.267912] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 1 state 3
[  164.267919] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13010003|0x0: GLB_SET_PIPELINE_STATE
[  164.268005] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[  164.268020] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx done : 0x13010003|0x0: GLB_SET_PIPELINE_STATE
[  164.268026] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 1 state 4
[  164.268031] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13010004|0x0: GLB_SET_PIPELINE_STATE
[  164.268326] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx reply: 0x33000007|0x0: GLB_SET_PIPELINE_STATE
[  164.268330] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW reported error: 7 - Unsupported operation requested
[  164.268346] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc error for msg 0x13010004|0x0
[  164.268356] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to set state 4 for pipeline 1
[  164.268364] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at soc_component_trigger on 0000:00:1f.3: -22
[  164.268373] kernel:  HDA Analog: ASoC: trigger FE cmd: 1 failed: -22
[  164.268383] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 0 dir 0 cmd 0
[  164.268394] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140000 successful
[  164.268402] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 1 state 2
lgirdwood commented 2 years ago

@RanderWang pls check with @lyakh on Zephyr lock debugging (and enable it and retest).

XiaoyunWu6666 commented 2 years ago

After Rander's PR #5843 merged , seemed less testcases on TGL UP HDA are affected . For example , check playback/capture all formats testcase won't fail on it.

But still ipc time out happened when setting pipeline state when playback for 100 times on both ADL RVP NOCODEC and TGL UP HDA. TPLG=/lib/firmware/intel/avs-tplg/cavs-mixin-mixout-hda.tplg MODEL=TGLU_UP_HDA_IPC4ZPH ~/sof-test/test-case/check-playback.sh -d 1 -l 100 -r 1 on TGL UP HDA and TPLG=/lib/firmware/intel/avs-tplg/cavs-tgl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_IPC4 ~/sof-test/test-case/check-playback.sh -d 1 -l 100 -r 1 to reproduce

kv2019i commented 2 years ago

Managed to reproduce this on cAVS2.5 nocodec system. I enable full Zephyr logs and FW logs show:

6902998250257 ipc: comp 3 not found 6902998251157 failed to process msg 19 status 9 6902998251797 ipc4: 0 failed err 9 6902998268744 ipc: comp 3 not found 6902998269364 failed to process msg 19 status 9 6902998269990 ipc4: 0 failed err 9 6903017548544 ipc: comp 3 not found 6903017549165 failed to process msg 19 status 9 6903017549792 ipc4: 0 failed err 9

Kernel side log

[180375.031196] sof-audio-pci-intel-tgl 0000:00:1f.3: pipeline: 4 memory pages: 2 [180375.031237] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx : 0x11040002|0x0: GLB_CREATE_PIPELINE [180375.537481] sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x11040002|0x0 [180375.537495] sof-audio-pci-intel-tgl 0000:00:1f.3: failed to create module pipeline.4

But at least no DSP crash. I can @RanderWang continue to look at this now that I managed to reproduce this.

lgirdwood commented 2 years ago

But at least no DSP crash. I can @RanderWang continue to look at this now that I managed to reproduce this.

So we should reply with an error but it looks like we are not ?

kv2019i commented 2 years ago

@mwasko Here's one reproducible IPC4 bug.

mwasko commented 2 years ago

@wszypelt please run the described flow and verify if it reproduce on MTL hardware

kv2019i commented 2 years ago

Update: it seems with https://github.com/thesofproject/sof/pull/5844 and https://github.com/thesofproject/sof/pull/5887 , I can no longer reproduce this case on cAVS2.5 nocodec platform. FYI to @mwasko @wszypelt @RanderWang . Without these fixes, the issues is easy to hit (on every run of the sof-test check-playback.sh case).

XiaoyunWu6666 commented 2 years ago

ipc error when trying to set pipeline running see inner result 13187 model ADLP_RVP_NOCODEC_IPC4 when check-capture-100times

[ 1275.590622] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13070003|0x0: GLB_SET_PIPELINE_STATE
[ 1275.590795] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx reply: 0x33000000|0x0: GLB_SET_PIPELINE_STATE
[ 1275.590804] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx done : 0x13070003|0x0: GLB_SET_PIPELINE_STATE
[ 1275.590806] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 7 state 4
[ 1275.590812] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13070004|0x0: GLB_SET_PIPELINE_STATE
[ 1275.600522] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx reply: 0x33000006|0x0: GLB_SET_PIPELINE_STATE
[ 1275.600525] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW reported error: 6 - Unknown error while processing the request
[ 1275.600546] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc error for msg 0x13070004|0x0
[ 1275.600569] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to set state 4 for pipeline 7
[ 1275.600571] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at soc_component_trigger on 0000:00:1f.3: -22
[ 1275.600573] kernel:  Port0: ASoC: trigger FE cmd: 1 failed: -22
[ 1275.600576] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: trigger stream 0 dir 1 cmd 0
[ 1275.600582] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x20140000 successful
[ 1275.600584] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 7 state 2
[ 1275.600585] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13070002|0x0: GLB_SET_PIPELINE_STATE
[ 1276.102288] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x13070002|0x0
[ 1276.102327] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to set state 2 for pipeline 7
[ 1276.102343] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at soc_component_trigger on 0000:00:1f.3: -110
[ 1276.102362] kernel:  Port0: ASoC: trigger FE cmd: 0 failed: -110
[ 1276.103605] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: free stream 0 dir 1
[ 1276.103620] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc4 set pipeline 7 state 2
[ 1276.103634] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13070002|0x0: GLB_SET_PIPELINE_STATE
[ 1276.606297] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x13070002|0x0
[ 1276.606337] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to set state 2 for pipeline 7
[ 1276.606386] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: sof_ipc4_route_free: unbind modules copier.SSP.8.1 -> copier.host.7.1
[ 1276.606403] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x46000004|0x10004: MOD_UNBIND
[ 1277.110235] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x46000004|0x10004
[ 1277.110273] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to unbind modules copier.SSP.8.1 -> copier.host.7.1
[ 1277.110299] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x12080000|0x0: GLB_DELETE_PIPELINE
[ 1277.614228] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x12080000|0x0
[ 1277.614266] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to free pipeline widget pipeline.8
[ 1277.614295] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x12070000|0x0: GLB_DELETE_PIPELINE
[ 1278.118201] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x12070000|0x0
[ 1278.118240] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to free pipeline widget pipeline.7
[ 1278.118259] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Failed to free connected widgets
[ 1278.118290] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_hw_free on 0000:00:1f.3: -110
[ 1278.118578] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: close stream 0 dir 1

so does model TGLU_UP_HDA_IPC4ZPH when check-playback-100times

lgirdwood commented 2 years ago

Update: it seems with #5844 and #5887 , I can no longer reproduce this case on cAVS2.5 nocodec platform. FYI to @mwasko @wszypelt @RanderWang . Without these fixes, the issues is easy to hit (on every run of the sof-test check-playback.sh case).

Ok, so it looks like we need a fix to replace PR #5844 with a new PR. @mwasko fyi.

wszypelt commented 2 years ago

@XiaoyunWu6666 If possible, I would like to ask for a topology description and full test logs so that we can reproduce issue in our tests on MTL

XiaoyunWu6666 commented 2 years ago

Sure , taking ADLP RVP NOCODEC in inner daily 13433 in testcase check-playback-100times as an example. @wszypelt

Here is the zip file where the cavs-tgl-nocodec.tplg used by ADLP RVP NOCODEC is used. cavs-tgl-nocodec (3).zip

Complete dmesg logs is here. dmesg.txt

serhiy-katsyuba-intel commented 2 years ago

@kv2019i, @RanderWang, I observe a problem with k_sleep() in IPC processing task. That is a call inside ipc_wait_for_compound_msg() (for example, see this PR #5887). In the latest code that line is k_sleep(Z_TIMEOUT_US(250)); two weeks ago it was k_sleep(Z_TIMEOUT_TICKS(10000)); and somewhat earlier it was wait_delay(10000);

With k_sleep(Z_TIMEOUT_TICKS(10000)); version I observe such a problem: after calling k_sleep() the current thread is suspended and another thread is executed, the thread that called k_sleep() is never resumed and so the code after k_sleep() is never executed. I waited for quite a long time like few minutes -- there was no return from k_sleep(). The problem seems 100% reproducible.

With k_sleep(Z_TIMEOUT_US(250)); (a latest version introduced by #5887) the above problem is still exists but is very rare. The reproduction rate becomes about 15%. I think I've tried about 15 times and was able to reproduce the problem only 2 times. The problem same as in paragraph above: IPC task thread is stuck in k_sleep() forever and CPU core is executing another thread.

With wait_delay(10000); (it's just a busy loop, no yield here), the same problem also occurs, but with about 20% reproduction rate. Seems like while executing busy loop inside wait_delay() IPC task thread is preempted with another thread and never resumed and so no code is executed after wait_delay().

(1) So it looks like we have a problem with IPC task thread is not capable to be resumed after suspension. Any suggestion what could cause such a problem?

(2) Another thing: msg_data.delayed_reply is modified by one thread and checked in while loop by another in ipc_wait_for_compound_msg(). In theory a compiler has a right to optimize reading of this variable in while (msg_data.delayed_reply) { ... } -- the compiler has no clue that this variable could be modified from another thread and so could optimize out memory reading and while loop will run forever. Seems it is safer to declare that variable as volatile. However, I tested code without volatile and while loop works just fine and we do not have problem with absent volatile. But anyway decided to mention about missing volatile here.

P.S.: I've tested on FPGA MTL cSOF+Zephyr.

Tagging: @lgirdwood, @mwasko, @rfredzim.

RanderWang commented 2 years ago

@serhiy-katsyuba-intel Can you try https://github.com/thesofproject/sof/pull/5844 ? This was a fix for another ipc issue

serhiy-katsyuba-intel commented 2 years ago

@serhiy-katsyuba-intel Can you try #5844 ? This was a fix for another ipc issue

5844 is not directly applicable for MTL as we use IPC driver from src/drivers/intel/ace/ipc.c, the fix applies to src/drivers/intel/cavs/ipc.c.

Anyway, I have additional observations. When multiple pipelines are running it looks like IPC task thread cannot get CPU time. I reproduce it on FPGA. At the end of the test host sends PAUSE, I see IPC interrupt is triggered and IPC task is scheduled without any error reported but that IPC task is never executed. Seems we have a problem that high priority pipeline tasks prevent lower priority IPC task to get CPU time. This is consistent with k_sleep() problem described above. The difference is that k_sleep() problem happens at the beginning of test when switching from PAUSE to RUNNING. Seems sometimes after yield inside k_sleep() and start of multiple pipeline tasks original k_sleep() thread (lower priority IPC task) cannot be resumed.

rdurkalx commented 2 years ago

Hi, I would like to reproduce this bug on MTL platform. But I can't open cavs-tgl-nocodec (3).tplg file, can You write me what topolgy from https://github.com/thesofproject/sof/tree/main/tools/topology/topology1 You are using to reproduce this bug ?

serhiy-katsyuba-intel commented 2 years ago

can You write me what topolgy from https://github.com/thesofproject/sof/tree/main/tools/topology/topology1 You are using to reproduce this bug ?

Try any test with many pipelines. More pipelines more chances it will not work: IPC_TIMEOUT should appear at the end of test when host tries to PAUSE running pipelines. I'm reproducing with my modified firmware (which is a work in progress) with test_01_18_mixin_single_output. But that is a bad example, do not try it, as this test works fine with original non-modified firmware, it fails with my firmware as I added use of addition buffer. There is a chance that IPC_TIMEOUT I experience on MTL is not the same issue as this GitHub report describes for TGL. So it's better to make exactly the same test on MTL as it was done on TGL.

rdurkalx commented 2 years ago

Thanks for the reply. So as I understand it correctly, I should run whichever test from TGL that runs a lot of pipelines ?

serhiy-katsyuba-intel commented 2 years ago

Further investigation shows that the cause of the problem in my case was logging. Logging (call to comp_err() ) takes about 159 us on FPGA. The test was executed without any parameters to receive logs from DSP, so in theory logging should do almost nothing. But this is not that case. In comparison a call to a function that basically does nothing (buffer_acquire() ) takes about 4 us. In my case there were 3 calls to comp_err() inside 2 _copy() functions called from 2 pipelines. That was enough to do not fit into 1 ms LL scheduler period. After commenting out calls to comp_dbg() inside _copy() functions the problem disappeared.

lgirdwood commented 2 years ago

@serhiy-katsyuba-intel good findings. Do we have any assert() in the comp_err() paths ? IIUC, @kv2019i found an performance issue with assert flow with the Zephyr build config and XCC. Are we also fully using the native Zephyr LOG() calls from the comp_err call (or do we still have some legacy trace logic running)?

serhiy-katsyuba-intel commented 2 years ago

Are we also fully using the native Zephyr LOG() calls from the comp_err call (or do we still have some legacy trace logic running)?

comp_err still calls legacy trace logic. embargo branch did not yet switch to use Zephyr LOG.

Sorry, I was busy with high priority task (and will continue for a couple of days) and was not able to debug the issue further :-/

kv2019i commented 2 years ago

A category of IPC timeout bugs have been fixed now in mainline (IPC message lost but subsequence messages were ok).

These two tests now pass in daily test for these platforms: TPLG=/lib/firmware/intel/avs-tplg/cavs-tgl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_IPC4 ~/sof-test/test-case/check-playback.sh -d 3 -l 1 -r 1 -F TPLG=/lib/firmware/intel/avs-tplg/cavs-tgl-nocodec.tplg MODEL=ADLP_RVP_NOCODEC_IPC4 ~/sof-test/test-case/check-suspend-resume.sh -l 50

Closing this. Let's follow remaining issues in other bugs.