thesofproject / linux

Linux kernel source tree
Other
91 stars 133 forks source link

[MTL][SDW] rmmod stuck on kmod-load-unload tests #5015

Closed plbossart closed 3 months ago

plbossart commented 6 months ago

I've seen repeated issues with the kmod-load-unload not completing on MTL Dell SKU 0CC7

Updating the firmware didn't help. I can still log-in remotely but the device has to be rebooted for audio tests

 starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 17:26:54 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 17:26:54 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 17:26:55 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh

WARNING: running as root is not supported

Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP    snd_usb_audio   not loaded
SKIP    snd_hda_intel   not loaded
SKIP    snd_sof_pci_intel_tng   not loaded
SKIP    snd_sof_pci_intel_skl   not loaded
SKIP    snd_sof_pci_intel_apl   not loaded
SKIP    snd_sof_pci_intel_tgl   not loaded
SKIP    snd_sof_pci_intel_icl   not loaded
SKIP    snd_sof_pci_intel_cnl   not loaded
SKIP    snd_sof_pci_intel_lnl   not loaded
RMMOD   snd_sof_pci_intel_mtl
[  427.318607] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  427.422240] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[  427.422258] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-mtl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  427.422264] Message payload: 00000000: 00000001 00000000
[  614.731740] INFO: task rmmod:7293 blocked for more than 122 seconds.
[  614.731760]       Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[  614.731764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  614.731767] task:rmmod           state:D stack:13384 pid:7293  tgid:7293  ppid:7292   flags:0x00004002
[  614.731780] Call Trace:
[  614.731784]  <TASK>
[  614.731792]  __schedule+0x385/0xb00
[  614.731812]  schedule+0x3a/0x130
[  614.731817]  schedule_timeout+0x131/0x140
[  614.731828]  __wait_for_common+0xb9/0x1d0
[  614.731833]  ? __pfx_schedule_timeout+0x10/0x10
[  614.731841]  remove_one+0xe8/0x110
[  614.731850]  simple_recursive_removal+0x1e0/0x280
[  614.731856]  ? __pfx_remove_one+0x10/0x10
[  614.731862]  debugfs_remove+0x44/0x70
[  614.731872]  snd_sof_device_remove+0xfb/0x180 [snd_sof]
[  614.731904]  sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[  614.731913]  pci_device_remove+0x3f/0xb0
[  614.731922]  device_release_driver_internal+0x1a9/0x210
[  614.731931]  driver_detach+0x4b/0x90
[  614.731936]  bus_remove_driver+0x70/0xf0
[  614.731941]  pci_unregister_driver+0x3f/0x90
[  614.731948]  __do_sys_delete_module+0x1e0/0x2c0
[  614.731961]  do_syscall_64+0x9d/0x1a0
[  614.731967]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  614.731975] RIP: 0033:0x7f7c360577cb
[  614.731980] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  614.731985] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[  614.731988] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[  614.731990] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[  614.731992] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[  614.731994] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[  614.732004]  </TASK>
[  614.732009] 
               Showing all locks held in the system:
[  614.732022] 1 lock held by khungtaskd/154:
[  614.732025]  #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[  614.732101] 1 lock held by dmesg/3479:
[  614.732104]  #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[  614.732117] 2 locks held by rmmod/7293:
[  614.732120]  #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[  614.732130]  #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280

[  614.732144] =============================================

[  737.608739] INFO: task rmmod:7293 blocked for more than 245 seconds.
[  737.608753]       Not tainted 6.9.0-rc5-test-07960-ga15f31d16f50 #205
[  737.608757] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  737.608760] task:rmmod           state:D stack:13384 pid:7293  tgid:7293  ppid:7292   flags:0x00004002
[  737.608770] Call Trace:
[  737.608773]  <TASK>
[  737.608779]  __schedule+0x385/0xb00
[  737.608795]  schedule+0x3a/0x130
[  737.608800]  schedule_timeout+0x131/0x140
[  737.608811]  __wait_for_common+0xb9/0x1d0
[  737.608816]  ? __pfx_schedule_timeout+0x10/0x10
[  737.608824]  remove_one+0xe8/0x110
[  737.608831]  simple_recursive_removal+0x1e0/0x280
[  737.608836]  ? __pfx_remove_one+0x10/0x10
[  737.608843]  debugfs_remove+0x44/0x70
[  737.608850]  snd_sof_device_remove+0xfb/0x180 [snd_sof]
[  737.608878]  sof_pci_remove+0x1d/0x50 [snd_sof_pci]
[  737.608886]  pci_device_remove+0x3f/0xb0
[  737.608894]  device_release_driver_internal+0x1a9/0x210
[  737.608902]  driver_detach+0x4b/0x90
[  737.608907]  bus_remove_driver+0x70/0xf0
[  737.608912]  pci_unregister_driver+0x3f/0x90
[  737.608918]  __do_sys_delete_module+0x1e0/0x2c0
[  737.608930]  do_syscall_64+0x9d/0x1a0
[  737.608936]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  737.608943] RIP: 0033:0x7f7c360577cb
[  737.608947] RSP: 002b:00007ffc0a379158 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  737.608952] RAX: ffffffffffffffda RBX: 000055a9a1e8f700 RCX: 00007f7c360577cb
[  737.608954] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a9a1e8f768
[  737.608957] RBP: 00007ffc0a379180 R08: 1999999999999999 R09: 0000000000000000
[  737.608959] R10: 00007f7c360c8ac0 R11: 0000000000000206 R12: 0000000000000000
[  737.608961] R13: 00007ffc0a3793e0 R14: 000055a9a1e8f700 R15: 0000000000000000
[  737.608970]  </TASK>
[  737.608975] 
               Showing all locks held in the system:
[  737.608986] 1 lock held by khungtaskd/154:
[  737.608988]  #0: ffffffff89d6a0c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x120
[  737.609019] 1 lock held by dmesg/3479:
[  737.609021]  #0: ffff93f70174a0e0 (&user->lock){....}-{3:3}, at: devkmsg_read+0x6c/0x230
[  737.609032] 2 locks held by rmmod/7293:
[  737.609035]  #0: ffff93f7029251b0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x3d/0x210
[  737.609045]  #1: ffff93f70655beb8 (&sb->s_type->i_mutex_key#2){....}-{3:3}, at: simple_recursive_removal+0x15f/0x280

[  737.609059] =============================================
plbossart commented 6 months ago

reproduced with kernel topic/sof-dev d99d9a0ab917 ALSA: hda: intel-sdw-acpi: use acpi_get_local_u64_address() SOF main : 737d4d41fbaa app: add winconsole_overlay.conf

TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/run-all-tests.sh -l1

plbossart commented 6 months ago

The minimal sequence to make the problem appear is this in test-case/run-all-tests.sh

testlist=" multiple-pause-resume kmod-load-unload "

Without the multiple-pause-resume no errors are reported.

plbossart commented 6 months ago

this also happens when I remove everything except 'Jack Out"

 starting test_multiple-pause-resume 
+ test_multiple-pause-resume
+ /root/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
2024-05-24 20:18:56 UTC Sub-Test: [INFO] /root/sof-test/test-case/multiple-pause-resume.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg to run the test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' in test case
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Run command to get pipeline parameters
2024-05-24 20:18:56 UTC Sub-Test: [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg -f 'type:any & ~pcm:Amplifier Reference' -b ' pcm:HDA Digital,HDMI1,HDMI2,HDMI3,Deepbuffer Jack Out,Speaker,Amp feedback,Jack In' -s 0 -e
2024-05-24 20:18:56 UTC Sub-Test: [INFO] Starting /usr/local/bin/mtrace-reader.py >& /root/sof-test/logs/multiple-pause-resume/2024-05-24-16:18:56-14946/mtrace.txt &
2024-05-24 20:18:57 UTC Sub-Test: [WARNING] pipeline count is 1, don't need to run this case
 ---------- 
 ---------- 
 starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /root/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-05-24 20:19:00 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-05-24 20:19:00 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-05-24 20:19:00 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh

WARNING: running as root is not supported

Specified filename /sys/kernel/debug/sof/trace does not exist.
SKIP    snd_usb_audio   not loaded
SKIP    snd_hda_intel   not loaded
SKIP    snd_sof_pci_intel_tng   not loaded
SKIP    snd_sof_pci_intel_skl   not loaded
SKIP    snd_sof_pci_intel_apl   not loaded
SKIP    snd_sof_pci_intel_tgl   not loaded
SKIP    snd_sof_pci_intel_icl   not loaded
SKIP    snd_sof_pci_intel_cnl   not loaded
SKIP    snd_sof_pci_intel_lnl   not loaded
RMMOD   snd_sof_pci_intel_mtl
plbossart commented 6 months ago

looks like a test script problem maybe. I can't reproduce this with the simpler sequence

sudo TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-mtl-rt713-l0-rt1316-l12.tplg ~/sof-test/test-case/multiple-pause-resume.sh -l 1 -r 25
~/sof-test/test-case/check-kmod-load-unload.sh -l 2

maybe it's the debugfs stuff that we read now that causes a problem while removing debugfs. Or we have an open debugfs file.

Looks like no one is using run-all-tests.sh?

plbossart commented 5 months ago

Just saw this as well on LNL

starting test_kmod-load-unload 
+ test_kmod-load-unload
+ /home/ubuntu/sof-test/test-case/check-kmod-load-unload.sh -l 1
2024-06-19 08:24:57 UTC Sub-Test: [INFO] ===== Starting iteration 1 of 1 =====
2024-06-19 08:24:57 UTC Sub-Test: [INFO] wait dsp power status to become suspended
2024-06-19 08:24:57 UTC Sub-Test: [INFO] run kmod/sof-kmod-remove.sh
Specified filename /sys/kernel/debug/sof/trace does not exist.
RMMOD   snd_usb_audio
SKIP    snd_hda_intel   not loaded
SKIP    snd_sof_pci_intel_tng   not loaded
SKIP    snd_sof_pci_intel_skl   not loaded
SKIP    snd_sof_pci_intel_apl   not loaded
SKIP    snd_sof_pci_intel_tgl   not loaded
SKIP    snd_sof_pci_intel_icl   not loaded
SKIP    snd_sof_pci_intel_cnl   not loaded
RMMOD   snd_sof_pci_intel_lnl
[  277.387525] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  277.539501] snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[  277.541983] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[  277.542024] snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[  277.542037] Message payload: 00000000: 00000001 00000000
[  277.547478] snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[  277.547489] snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[  277.547493] soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[  277.552306] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[  277.565812] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[  277.568162] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[  277.669458] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[  280.639869] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[  280.639910] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[  280.640037] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[  281.306958] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
marc-hb commented 4 months ago

The pause-resume tests had many implementation issues, I just rewrote it in https://github.com/thesofproject/sof-test/pull/1218 Can you try again?

It's not surprising that kmod-load-unload could hang if the pause-resume test left something bad behind. Examples:

CI is generally more robust that run-all-tests.sh so it's less likely to leave something bad behind. Not impossible but less likely in CI.

marc-hb commented 4 months ago

@plbossart can we close after https://github.com/thesofproject/sof-test/pull/1218 ?

plbossart commented 3 months ago

I haven't seen this problem in a very very long time, closing