thesofproject / linux

Linux kernel source tree
Other
90 stars 129 forks source link

[LNL] Multiple Pipeline Playback/capture stress test fails. #5086

Open ssavati opened 3 months ago

ssavati commented 3 months ago

Multiple Pipeline Playback/capture t stress fails for HDA/NOCODEC/SDW. The issue reproduction is low but happens sporadically during CI stress test.

Steps HDA Test command TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg MODEL=LNLM_RVP_HDA SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p -l 200

NOCDEC Tes tcommand TPLG=/lib/firmware/intel/development/sof-lnl-nocodec.tplg MODEL=LNLM_RVP_NOCODEC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p -l 200

config Linux Branch: topic/sof-dev Linux Commit: c75a95d34885

SOF Branch: stable-v2.10 SOF Commit: b15f1f1a3238

Issue reproduction : sporadic and low

2024-06-25 22:24:06 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:06 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156308 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156301 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354239"
2024-06-25 22:24:07 UTC [REMOTE_INFO] ===== Testing: (Loop: 31/200) =====
2024-06-25 22:24:07 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:07 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:07 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:07 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:07 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:07 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:07 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:08 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:08 UTC [REMOTE_INFO] Letting playback/capture run for 5s
2024-06-25 22:24:13 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:13 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156408 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156400 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354246"
2024-06-25 22:24:14 UTC [REMOTE_INFO] ===== Testing: (Loop: 32/200) =====
2024-06-25 22:24:14 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:14 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:14 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:14 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:14 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:14 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:14 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:14 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:14 UTC [REMOTE_INFO] Letting playback/capture run for 5s
2024-06-25 22:24:19 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:19 UTC [REMOTE_COMMAND] pkill -9 aplay arecord
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156499 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
declare -- cmd="journalctl_cmd --since=@1719354253"
2024-06-25 22:24:21 UTC [REMOTE_INFO] ===== Testing: (Loop: 33/200) =====
2024-06-25 22:24:21 UTC [REMOTE_INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg to run the test case
2024-06-25 22:24:21 UTC [REMOTE_INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-25 22:24:21 UTC [REMOTE_INFO] Run command to get pipeline parameters
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic-ace1-4ch.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-25 22:24:21 UTC [REMOTE_INFO] Testing: HDA Analog [hw:0,0]
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-25 22:24:21 UTC [REMOTE_INFO] Testing: Deepbuffer HDA Analog [hw:0,31]
2024-06-25 22:24:21 UTC [REMOTE_INFO] wait 0.5s for aplay_opts()
2024-06-25 22:24:21 UTC [REMOTE_COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
aplay: main:831: audio open error: Device or resource busy
2024-06-25 22:24:21 UTC [REMOTE_INFO] checking pipeline status
2024-06-25 22:24:21 UTC [REMOTE_INFO] Letting playback/capture run for 5s
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156506 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
2024-06-25 22:24:26 UTC [REMOTE_INFO] checking pipeline status again
2024-06-25 22:24:26 UTC [REMOTE_ERROR] Running process count is 1, but 2 is expected
156597 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
/home/ubuntu/sof-test/test-case/../case-lib/lib.sh: line 750: 156597 Killed                  aplay $SOF_ALSA_OPTS $SOF_APLAY_OPTS "$@"
2024-06-25 22:24:26 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-06-25 22:24:26 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-06-25 22:24:26 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:202
2024-06-25 22:24:31 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2024-06-25 22:24:31 UTC [REMOTE_INFO] nlines=4914 /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-25-22:20:33-2957/mtrace.txt
+ grep -B 2 -A 1 -i --word-regexp -e ERR -e ERROR -e '' -e OSError /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-25-22:20:33-2957/mtrace.txt
2024-06-25 22:24:37 UTC [REMOTE_INFO] ktime=14535 sof-test PID=153125: ending
2024-06-25 22:24:37 UTC [REMOTE_INFO] Test Result: FAIL!
ssavati commented 3 months ago

"pkill -9 aplay arecord" cmd not able to terminate as one of aplay/arecord instance is stuck/hang/blocked. So next iteration fails to start new instances playback/capture

lyakh commented 3 months ago

if https://github.com/thesofproject/sof-test/pull/1213 is merged, testing with it should tell us whether any aplay / arecord processes are blocked in the kernel. As suggested by @lgirdwood enabling some kernel soft lockup debugging options could help too

plbossart commented 3 months ago

it's already enabled @lyakh, see https://github.com/thesofproject/kconfig/blob/8189104a4f38167aaaa4e70eece20ba4b6c19ef1/kconfig-sof-default.sh#L21 and https://github.com/thesofproject/kconfig/blob/master/lock-stall-defconfig

jsarha commented 2 months ago

I've now tried to reproduce this on both HDA and nocodec LNL setup. On HDA I only have 5 successful test runs, but on nocodec I just had 17 successful runs (about 7 hours) in a row. So the reproduction rate is very low. I still intend to catch up the nocodec numbers on HDA, but its starting to look like this issue does not show itself in reasonable time.

jsarha commented 2 months ago

Now I have run 19 successful runs (about 8 hours) on LNL HDA RVP. So that case too does not appear to reproduce. @lgirdwood , @kv2019i , should we close this one?

BTW, both tests were run with relatively recent daily build.

kv2019i commented 2 months ago

Ack @jsarha . @ssavati ok to close or is this still seen in larger test plans? So either close or keep open as P3.

marc-hb commented 1 month ago

if https://github.com/thesofproject/sof-test/pull/1213 is merged, testing with it should tell us whether any aplay / arecord processes are blocked in the kernel.

This just happened in daily test run /44499?model=LNLM_RVP_NOCODEC&testcase=multiple-pipeline-all-50

[ 5141.981465] kernel: task:aplay           state:R stack:10456 pid:60957 tgid:60957 ppid:60953  flags:0x00000006
[ 5141.981468] kernel: Call Trace:
[ 5141.981469] kernel:  
[ 5141.981470] kernel:  __schedule+0x311/0xa80
[ 5141.981473] kernel:  schedule+0x3e/0x130
[ 5141.981474] kernel:  schedule_timeout+0xb1/0x130
[ 5141.981476] kernel:  ? __pfx_process_timeout+0x10/0x10
[ 5141.981480] kernel:  __snd_pcm_lib_xfer+0x6ab/0x900 [snd_pcm]
[ 5141.981488] kernel:  ? __pfx_interleaved_copy+0x10/0x10 [snd_pcm]
[ 5141.981492] kernel:  ? __snd_pcm_lib_xfer+0x1a2/0x900 [snd_pcm]
[ 5141.981496] kernel:  ? __pfx_default_write_copy+0x10/0x10 [snd_pcm]
[ 5141.981500] kernel:  ? __pfx_default_wake_function+0x10/0x10
[ 5141.981503] kernel:  snd_pcm_common_ioctl+0xf7a/0x1600 [snd_pcm]
[ 5141.981507] kernel:  ? ioctl_has_perm.constprop.0.isra.0+0xba/0x110
[ 5141.981511] kernel:  snd_pcm_ioctl+0x27/0x40 [snd_pcm]
[ 5141.981515] kernel:  __x64_sys_ioctl+0x8e/0xd0
[ 5141.981517] kernel:  do_syscall_64+0x9e/0x1a0
[ 5141.981519] kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 5141.981520] kernel: RIP: 0033:0x7f42f861794f
[ 5141.981521] kernel: RSP: 002b:00007ffe8f159ed0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 5141.981522] kernel: RAX: ffffffffffffffda RBX: 000055b4ddb26010 RCX: 00007f42f861794f
[ 5141.981523] kernel: RDX: 00007ffe8f159f30 RSI: 0000000040184150 RDI: 0000000000000004
[ 5141.981524] kernel: RBP: 000055b4ddb25f90 R08: 0000000000000000 R09: 000055b4ddb288c0
[ 5141.981524] kernel: R10: 000055b4ddb2e000 R11: 0000000000000246 R12: 0000000000000000
[ 5141.981525] kernel: R13: 000055b4ddb288c0 R14: 7fffffffffffffff R15: 0000000000000010