thesofproject / linux

Linux kernel source tree
Other
91 stars 133 forks source link

[LNL] hang, timeout or crash in pause-release #5109

Open marc-hb opened 4 months ago

marc-hb commented 4 months ago

I spent a long time looking for an existing bug and I found many similar issues (see list below) but I think this one is either brand new or never filed yet. Part of the problem was that the Expect part of the pause-resume test was utterly buggy; that didn't help. Now that I rewrote it in https://github.com/thesofproject/sof-test/pull/1218 we can finally pay less attention to the test code and a bit more attention to the product.

The WARNING: received == PAUSE == while in state recording! Ignoring. message is very surprising. ~It showed up more than once~. EDIT: this message is now gone since "fail fast" test fix https://github.com/thesofproject/sof-test/pull/1226. Never look past the very first error which is usually now file descriptor in bad state.

2024-07-16T13:31:16-07:00

Linux Branch: topic/sof-dev Commit: 1998ade4783a Kconfig Branch: master Kconfig Commit: 8189104a4f38 SOF Branch: main Commit: 3051607efb4f Zephyr Commit:650227d8c47f

https://sof-ci.01.org/softestpr/PR1218/build632/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50 https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50 https://sof-ci.01.org/sofpr/PR9305/build6551/devicetest/index.html

https://sof-ci.01.org/softestpr/PR1224/build686/devicetest/index.html

https://sof-ci.01.org/sofpr/PR9335/build6748/devicetest/index.html

The logs don't all look the same but here's a typical one:

t=7412 ms: cmd3 arecord Port0 2nd Capture: (16/50) Found volume ### | xx%, recording for 38 ms
t=7538 ms: cmd1 arecord Port0: (19/50) Found   === PAUSE ===  ,  pausing for 36 ms
t=7420 ms: cmd3 arecord Port0 2nd Capture: (16/50) Found   === PAUSE ===  ,  pausing for 22 ms
t=7553 ms: cmd3 arecord Port0 2nd Capture: (17/50) Found volume ### | xx%, recording for 47 ms
t=7553 ms: cmd3 arecord Port0 2nd Capture: WARNING: received == PAUSE == while in state recording! Ignoring.
t=7794 ms: cmd1 arecord Port0: (20/50) Found volume ### | xx%, recording for 36 ms
t=7919 ms: cmd1 arecord Port0: (20/50) Found   === PAUSE ===  ,  pausing for 44 ms
t=8170 ms: cmd1 arecord Port0: (21/50) Found volume ### | xx%, recording for 27 ms
t=8295 ms: cmd1 arecord Port0: (21/50) Found   === PAUSE ===  ,  pausing for 26 ms
t=8546 ms: cmd1 arecord Port0: (22/50) Found volume ### | xx%, recording for 29 ms

t=18714 ms: cmd1 arecord Port0: (49/50) Found volume ### | xx%, recording for 20 ms
t=18839 ms: cmd1 arecord Port0: (49/50) Found   === PAUSE ===  ,  pausing for 39 ms
t=19090 ms: cmd1 arecord Port0: (50/50) Found volume ### | xx%, recording for 29 ms
t=19215 ms: cmd1 arecord Port0: (50/50) Found   === PAUSE ===  ,  pausing for 33 ms
t=19215 ms: cmd1 arecord Port0: WARNING: volume was always 00%!
t=19215 ms: cmd1 arecord Port0: SUCCESS: /home/ubuntu/sof-test/case-lib/apause.exp arecord -D hw:0,0 -r 48000 -c 2 -f S16_LE -vv -i /dev/null

2024-07-16 07:53:47 UTC [REMOTE_ERROR] Still have expect process not finished after wait for 250
11568 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
11570 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
2024-07-16 07:53:47 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-07-16 07:53:47 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-07-16 07:53:47 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh
2024-07-16 07:53:48 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2024-07-16 07:53:48 UTC [REMOTE_INFO] nlines=24148 /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-16-07:45:35-16732/mtrace.txt
+ grep -B 2 -A 1 -i --word-regexp -e ERR -e ERROR -e '' -e OSError /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-16-07:45:35-16732/mtrace.txt
2024-07-16 07:53:50 UTC [REMOTE_INFO] ktime=1779 sof-test PID=10390: ending
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Process(es) started by /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh are still active, killing these process(es):
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Catch pid: 11568 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Kill cmd:'expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null' by kill -9
/home/ubuntu/sof-test/case-lib/hijack.sh: line 204: 11568 Killed                  "$TOPDIR"/case-lib/apause.exp "$shortname" "$repeat_count" "$rnd_min" "$rnd_range" "$cmd" -D "$dev" -r "$rate" -c "$channel" -f "$fmt" -vv -i "$file"
2024-07-16 07:53:50 UTC [REMOTE_INFO] Test Result: FAIL!
[ 1682.256786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1683.280786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1684.304788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2120, overruns 0
[ 1685.328786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1686.352788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1687.376788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2123, overruns 0
[ 1688.400788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2121, overruns 0
[ 1689.424788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2343, overruns 0
[ 1690.448788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2120, overruns 0
[ 1691.472788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1692.496788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2130, overruns 0
[ 1693.520788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2122, overruns 0
[ 1694.544788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1695.568788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1696.592788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1697.616788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1698.640788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2123, overruns 0
[ 1699.664788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2119, overruns 0
[ 1700.688786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2116, overruns 0
[ 1701.712788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1702.736788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2124, overruns 0
[ 1703.760788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1704.784788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1705.808786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1706.832788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1707.856786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1708.880786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1709.904788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0

etc.

cc:

marc-hb commented 4 months ago

The kernel logs are unfortunately missing in https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100, I don't know why. I checked the kernel logs on the device directly and something strange caught my attention: it looks like the DMIC SFX2 pipeline under pause/release test is being freed and restarted WHILE the test is still running!?

The firmware logs are available on the web but I don't know how to match the timestamps.

...
t=45791 ms: arecord: (84/100) Found   === PAUSE ===  ,  pausing for 173 ms
t=46092 ms: arecord: (85/100) Found volume ### | xx%, recording for 152 ms
t=46217 ms: arecord: (85/100) Found   === PAUSE ===  ,  pausing for 190 ms
t=50287 ms: arecord: (86/100) Found volume ### | xx%, recording for 166 ms
t=50304 ms: arecord: (86/100) Found   === PAUSE ===  ,  pausing for 119 ms
t=50436 ms: arecord: (87/100) Found volume ### | xx%, recording for 150 ms

t=50437 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.

<abrupt log stop while test is still running>

Should that WARNING be fatal? Maybe it would help collect the logs?

Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46010004|0x6: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.14 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.14.2 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.19.1:0 -> module-copier.19.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules module-copier.19.1:0 -> gain.20.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x66000000|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.19 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.20.1:0 -> host-copier.28.capture:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.20.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.20 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget host-copier.28.capture freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_prepare: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: prepare stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: hw params stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: period_bytes:0x17700
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: periods:4
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-lnl 0000:00:1f.3: copier dai-copier.DMIC.NoCodec-6.capture, type 28
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:snd_sof_get_nhlt_endpoint_data: sof-audio-pci-intel-lnl 0000:00:1f.3: dai index 0 nhlt type 2 direction 1 dev type 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Looking for configuration:
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3:   vbus_id=0 link_type=2 dir=1, dev_type=0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3:   ch=4 fmt=32 rate=48000 (vbps is ignored for DMIC 32bit format)
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint count=8
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_check_ep_match: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint: vbus_id=0 link_type=2 dir=1 dev_type = 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format count=1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format: ch=4 fmt=32/32 rate=48000
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_init_input_audio_fmt: sof-audio-pci-intel-lnl 0000:00:1f.3: matched audio format index for 48000Hz, 32bit, 4 channels: 2
Jul 16 21:13:04 jf-lnlm-rvp-nocodec-3 ubuntu[11727]: test failed, CI session https://sof-ci.ostc.intel.com/api/session/379913 now 
trying to reboot DUT ubuntu@jf-lnlm-rvp-nocodec-3.jf.intel.com over ssh
plbossart commented 4 months ago

The DMIC SFX2 pipeline is a branch from a common part, so it's a special case that could be problematic

image

I am not sure how well DPCM deals with such capture branches, and it wouldn't be surprising if the SOF states are incorrect.

Need @ranj063 and @ujfalusi to chime in.

marc-hb commented 4 months ago

This is not just DMIC SFX2, many examples:

https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

https://sof-ci.01.org/sofpr/PR9313/build6571/devicetest/index.html

https://sof-ci.01.org/sofpr/PR9310/build6583/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

ujfalusi commented 4 months ago

This is not just DMIC SFX2, example: https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

I'm not sure what we are looking at here, but if I read it right the test fails because the aplay hw:0,2 is failing to die, right? In the kernel log the last trigger to this stream is a START (at 852.134468). hw:0,3 is toggling between PAUSE and RUNNING t hen finally hw:0,3 is closed, but I see no attempt to stop hw:0,2

What is interesting regarding to hw:0,2 is: 844.769414 open 844.769906 hw_params 844.797258 start ... toggling between paused/running 850.271098 stopped 852.114512 prepare + hw+params w/o closing it 852.134468 start

I don't know what the test is doing, but it does not appear to be trying to stop the playback.

marc-hb commented 4 months ago

I don't know what the test is doing, but it does not appear to be trying to stop the playback.

This test passes in all other configurations and ~I think it may even~ it passes in that configuration sometimes. EDIT: for instance it was perfectly fine in daily 44012 (July 18th)

(Thanks for analyzing the logs!)

marc-hb commented 4 months ago

This failure looks a little bit different:

https://sof-ci.01.org/sofpr/PR9304/build6579/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

t=1547 ms: cmd1 arecord Amp feedback: (3/50) Found   === PAUSE ===  ,  pausing for 47 ms
t=1806 ms: cmd1 arecord Amp feedback: (4/50) Found volume ### | xx%, recording for 36 ms
t=1931 ms: cmd1 arecord Amp feedback: (4/50) Found   === PAUSE ===  ,  pausing for 39 ms
t=2140 ms: cmd3 aplay Jack Out: (2/50) Found   === PAUSE ===  ,  pausing for 32 ms
t=2140 ms: cmd3 aplay Jack Out: ERROR: found VOLUME while paused!
t=2140 ms: cmd3 aplay Jack Out: aplay: do_pause:1567: pause push error: File descriptor in bad state

It does not seem to have any overrun!?

marc-hb commented 4 months ago

This looks like memory corruption of some kind. Not the first time I see it. No overrun there.

https://sof-ci.01.org/sofpr/PR9314/build6595/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

[ 1225.549551] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa01149c0 0xa00eabf4U priority 0 flags 0x0
[ 1225.549570] <inf> ll_schedule: zephyr_domain_register: zephyr_domain_register domain->type 1 domain->clk 5 domain->ticks_per_ms 38400 period 1000
[ 1225.549668] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.550661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.551661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.552383] <inf> ipc: ipc_cmd: rx   : 0x13000004|0x0
[ 1225.552391] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 8
[ 1225.552396] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0115400 0xa00eabf4U priority 0 flags 0x0
[ 1225.552700] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

In the same software configuration, the HDA model has mass overruns:

https://sof-ci.01.org/sofpr/PR9314/build6595/devicetest/index.html?model=LNLM_RVP_HDA&testcase=multiple-pause-resume-50


This one has this and then silence, no logs.

t=28189 ms: arecord: (67/100) Found volume ### | xx%, recording for 186 ms
t=28206 ms: arecord: (67/100) Found   === PAUSE ===  ,  pausing for 131 ms
t=28337 ms: arecord: (68/100) Found volume ### | xx%, recording for 139 ms
t=28337 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.

https://sof-ci.01.org/softestpr/PR1222/build676/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100

ranj063 commented 4 months ago

Here's my observation with one of the failure instances (https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50)

The multiple pause resume combination that gets started is aplay on Port 0 (stream 0) and the DeepBuffer Port 0 (stream 31). Pause/release iterations start almost simultaneously for both the streams and proceed normally up until the stream 31 finishes 15 iterations. After it is released in the 15th iteration, the FW runs into an overrun causing aplay to try to recover from the xrun and from then onwards all we see is the stream prepare, stream trigger start followed by an immediate stream trigger stop indicating that aplay cannot recover from the xrun but it keeps trying endlessly. But parallelly, stream 0 continues with its pause/release iterations without any issues.

So the question is why does the FW run into an overrun after 15 pause/releases? I am afraid this issue looks very similar to the issue #5080 in the sense that there's a random xrun in the FW after many iterations and there's nothing unusual in the sequences leading up to it.

plbossart commented 4 months ago

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

If the xrun happens only in a case where pipelines are mixed, this could be a case of a) an initial condition where real-time deadlines are not met and b) a problematic state management where the state of the common part is not well handled.

IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.

ujfalusi commented 4 months ago

@plbossart, what is interesting is that the issue (afaik) only can be reproduced on LNL.

plbossart commented 4 months ago

we've seen in the SoundWire case errors that were different on LNL, the timing of the transitions seems a bit different and that seems to open the door to race conditions we didn't see or notice before.

ranj063 commented 4 months ago

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

@plbossart yes, there're several instances of the xrun happening with pause/release with just one stream (DMIC Raw)

ranj063 commented 4 months ago

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

If the xrun happens only in a case where pipelines are mixed, this could be a case of a) an initial condition where real-time deadlines are not met and b) a problematic state management where the state of the common part is not well handled.

IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.

whats interesting in this particular case is that the Port 0 stream is going on with its test just fine while the other stream gets stuck randomly. My initial suspicion was on link DMA but now I feel like it might be the host DMA thats the problem as the logs show that after releasing we're reporting that there's nothing available to copy from the host DMA buffer. But this makes it even harder to understand because in the case of pause/release, the host DMA is left untouched, it just keeps running forever until the pipeline is completely stopped.

marc-hb commented 4 months ago

https://github.com/thesofproject/sof-test/pull/1226 makes a huge difference, please review.

marc-hb commented 3 months ago

The failure logs have simplified and changed A LOT since my "fail fast" test fixes in https://github.com/thesofproject/sof-test/pull/1226 was merged.

Summary of the results of daily test 44477 (Start Time: 2024-08-02 13:07:34 UTC)

marc-hb commented 3 months ago

This failure has a file descriptor in bad state and an overrun:

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100 Same in daily 44550?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100

This one has only a "non-zero exit status"

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_HDA&testcase=multiple-pause-resume-50

This one passed on the same system:

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_HDA&testcase=check-pause-resume-capture-100

HDA seems to have a higher pass rate and NOCODEC the worst, example: https://sof-ci.01.org/linuxpr/PR5133/build4270/devicetest/index.html NOCODEC has more to test and the test runs longer.

marc-hb commented 3 months ago

In these runs, some tests passed and some tests failed in every LNL model: https://sof-ci.01.org/sofpr/PR9351/build6896/devicetest/index.html , https://sof-ci.01.org/linuxpr/PR5110/build4324/devicetest/index.html

So failures seem intermittent in every configuration.

marc-hb commented 3 months ago

Daily run 44926 is interesting with 5 failures out of 6 test runs and they don't all look the same.

Same in https://sof-ci.01.org/sofpr/PR9299/build7192/devicetest/index.html

EDIT: in daily run 44967?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50, the whole system crashed and become unresponsive. The journalctl logs end abruptly.