Open marc-hb opened 4 months ago
The kernel logs are unfortunately missing in https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100, I don't know why. I checked the kernel logs on the device directly and something strange caught my attention: it looks like the DMIC SFX2 pipeline under pause/release test is being freed and restarted WHILE the test is still running!?
The firmware logs are available on the web but I don't know how to match the timestamps.
...
t=45791 ms: arecord: (84/100) Found === PAUSE === , pausing for 173 ms
t=46092 ms: arecord: (85/100) Found volume ### | xx%, recording for 152 ms
t=46217 ms: arecord: (85/100) Found === PAUSE === , pausing for 190 ms
t=50287 ms: arecord: (86/100) Found volume ### | xx%, recording for 166 ms
t=50304 ms: arecord: (86/100) Found === PAUSE === , pausing for 119 ms
t=50436 ms: arecord: (87/100) Found volume ### | xx%, recording for 150 ms
t=50437 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.
<abrupt log stop while test is still running>
Should that WARNING be fatal? Maybe it would help collect the logs?
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46010004|0x6: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.14 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.14.2 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.19.1:0 -> module-copier.19.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules module-copier.19.1:0 -> gain.20.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x66000000|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.19 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.20.1:0 -> host-copier.28.capture:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.20.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.20 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget host-copier.28.capture freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_prepare: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: prepare stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: hw params stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: period_bytes:0x17700
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: periods:4
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-lnl 0000:00:1f.3: copier dai-copier.DMIC.NoCodec-6.capture, type 28
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:snd_sof_get_nhlt_endpoint_data: sof-audio-pci-intel-lnl 0000:00:1f.3: dai index 0 nhlt type 2 direction 1 dev type 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Looking for configuration:
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: vbus_id=0 link_type=2 dir=1, dev_type=0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: ch=4 fmt=32 rate=48000 (vbps is ignored for DMIC 32bit format)
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint count=8
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_check_ep_match: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint: vbus_id=0 link_type=2 dir=1 dev_type = 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format count=1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format: ch=4 fmt=32/32 rate=48000
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_init_input_audio_fmt: sof-audio-pci-intel-lnl 0000:00:1f.3: matched audio format index for 48000Hz, 32bit, 4 channels: 2
Jul 16 21:13:04 jf-lnlm-rvp-nocodec-3 ubuntu[11727]: test failed, CI session https://sof-ci.ostc.intel.com/api/session/379913 now
trying to reboot DUT ubuntu@jf-lnlm-rvp-nocodec-3.jf.intel.com over ssh
The DMIC SFX2 pipeline is a branch from a common part, so it's a special case that could be problematic
I am not sure how well DPCM deals with such capture branches, and it wouldn't be surprising if the SOF states are incorrect.
Need @ranj063 and @ujfalusi to chime in.
This is not just DMIC SFX2, many examples:
https://sof-ci.01.org/sofpr/PR9313/build6571/devicetest/index.html
This is not just DMIC SFX2, example: https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50
I'm not sure what we are looking at here, but if I read it right the test fails because the aplay hw:0,2 is failing to die, right?
In the kernel log the last trigger to this stream is a START (at 852.134468
).
hw:0,3 is toggling between PAUSE and RUNNING t hen finally hw:0,3 is closed, but I see no attempt to stop hw:0,2
What is interesting regarding to hw:0,2 is:
844.769414
open
844.769906
hw_params
844.797258
start
... toggling between paused/running
850.271098
stopped
852.114512
prepare + hw+params w/o closing it
852.134468
start
I don't know what the test is doing, but it does not appear to be trying to stop the playback.
I don't know what the test is doing, but it does not appear to be trying to stop the playback.
This test passes in all other configurations and ~I think it may even~ it passes in that configuration sometimes. EDIT: for instance it was perfectly fine in daily 44012 (July 18th)
(Thanks for analyzing the logs!)
This failure looks a little bit different:
t=1547 ms: cmd1 arecord Amp feedback: (3/50) Found === PAUSE === , pausing for 47 ms
t=1806 ms: cmd1 arecord Amp feedback: (4/50) Found volume ### | xx%, recording for 36 ms
t=1931 ms: cmd1 arecord Amp feedback: (4/50) Found === PAUSE === , pausing for 39 ms
t=2140 ms: cmd3 aplay Jack Out: (2/50) Found === PAUSE === , pausing for 32 ms
t=2140 ms: cmd3 aplay Jack Out: ERROR: found VOLUME while paused!
t=2140 ms: cmd3 aplay Jack Out: aplay: do_pause:1567: pause push error: File descriptor in bad state
It does not seem to have any overrun!?
This looks like memory corruption of some kind. Not the first time I see it. No overrun there.
[ 1225.549551] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa01149c0 0xa00eabf4U priority 0 flags 0x0
[ 1225.549570] <inf> ll_schedule: zephyr_domain_register: zephyr_domain_register domain->type 1 domain->clk 5 domain->ticks_per_ms 38400 period 1000
[ 1225.549668] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.550661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.551661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.552383] <inf> ipc: ipc_cmd: rx : 0x13000004|0x0
[ 1225.552391] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 8
[ 1225.552396] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0115400 0xa00eabf4U priority 0 flags 0x0
[ 1225.552700] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
In the same software configuration, the HDA model has mass overruns:
This one has this and then silence, no logs.
t=28189 ms: arecord: (67/100) Found volume ### | xx%, recording for 186 ms
t=28206 ms: arecord: (67/100) Found === PAUSE === , pausing for 131 ms
t=28337 ms: arecord: (68/100) Found volume ### | xx%, recording for 139 ms
t=28337 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.
Here's my observation with one of the failure instances (https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50)
The multiple pause resume combination that gets started is aplay on Port 0 (stream 0) and the DeepBuffer Port 0 (stream 31). Pause/release iterations start almost simultaneously for both the streams and proceed normally up until the stream 31 finishes 15 iterations. After it is released in the 15th iteration, the FW runs into an overrun causing aplay to try to recover from the xrun and from then onwards all we see is the stream prepare, stream trigger start followed by an immediate stream trigger stop indicating that aplay cannot recover from the xrun but it keeps trying endlessly. But parallelly, stream 0 continues with its pause/release iterations without any issues.
So the question is why does the FW run into an overrun after 15 pause/releases? I am afraid this issue looks very similar to the issue #5080 in the sense that there's a random xrun in the FW after many iterations and there's nothing unusual in the sequences leading up to it.
what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?
If the xrun happens only in a case where pipelines are mixed, this could be a case of a) an initial condition where real-time deadlines are not met and b) a problematic state management where the state of the common part is not well handled.
IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.
@plbossart, what is interesting is that the issue (afaik) only can be reproduced on LNL.
we've seen in the SoundWire case errors that were different on LNL, the timing of the transitions seems a bit different and that seems to open the door to race conditions we didn't see or notice before.
what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?
@plbossart yes, there're several instances of the xrun happening with pause/release with just one stream (DMIC Raw)
what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?
If the xrun happens only in a case where pipelines are mixed, this could be a case of a) an initial condition where real-time deadlines are not met and b) a problematic state management where the state of the common part is not well handled.
IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.
whats interesting in this particular case is that the Port 0 stream is going on with its test just fine while the other stream gets stuck randomly. My initial suspicion was on link DMA but now I feel like it might be the host DMA thats the problem as the logs show that after releasing we're reporting that there's nothing available to copy from the host DMA buffer. But this makes it even harder to understand because in the case of pause/release, the host DMA is left untouched, it just keeps running forever until the pipeline is completely stopped.
https://github.com/thesofproject/sof-test/pull/1226 makes a huge difference, please review.
The failure logs have simplified and changed A LOT since my "fail fast" test fixes in https://github.com/thesofproject/sof-test/pull/1226 was merged.
Summary of the results of daily test 44477 (Start Time: 2024-08-02 13:07:34 UTC)
aplay: do_pause:1567: pause push error: File descriptor in bad state
This failure has a file descriptor in bad state
and an overrun:
https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100 Same in daily 44550?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100
This one has only a "non-zero exit status"
This one passed on the same system:
HDA seems to have a higher pass rate and NOCODEC the worst, example: https://sof-ci.01.org/linuxpr/PR5133/build4270/devicetest/index.html NOCODEC has more to test and the test runs longer.
In these runs, some tests passed and some tests failed in every LNL model: https://sof-ci.01.org/sofpr/PR9351/build6896/devicetest/index.html , https://sof-ci.01.org/linuxpr/PR5110/build4324/devicetest/index.html
So failures seem intermittent in every configuration.
Daily run 44926 is interesting with 5 failures out of 6 test runs and they don't all look the same.
Same in https://sof-ci.01.org/sofpr/PR9299/build7192/devicetest/index.html
EDIT: in daily run 44967?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50, the whole system crashed and become unresponsive. The journalctl
logs end abruptly.
I spent a long time looking for an existing bug and I found many similar issues (see list below) but I think this one is either brand new or never filed yet. Part of the problem was that the Expect part of the pause-resume test was utterly buggy; that didn't help. Now that I rewrote it in https://github.com/thesofproject/sof-test/pull/1218 we can finally pay less attention to the test code and a bit more attention to the product.
The
WARNING: received == PAUSE == while in state recording! Ignoring.
message is very surprising. ~It showed up more than once~. EDIT: this message is now gone since "fail fast" test fix https://github.com/thesofproject/sof-test/pull/1226. Never look past the very first error which is usually nowfile descriptor in bad state
.2024-07-16T13:31:16-07:00
Linux Branch: topic/sof-dev Commit: 1998ade4783a Kconfig Branch: master Kconfig Commit: 8189104a4f38 SOF Branch: main Commit: 3051607efb4f Zephyr Commit:650227d8c47f
https://sof-ci.01.org/softestpr/PR1218/build632/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50 https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50 https://sof-ci.01.org/sofpr/PR9305/build6551/devicetest/index.html
https://sof-ci.01.org/softestpr/PR1224/build686/devicetest/index.html
https://sof-ci.01.org/sofpr/PR9335/build6748/devicetest/index.html
The logs don't all look the same but here's a typical one:
cc: