thesofproject / linux

Linux kernel source tree
Other
89 stars 129 forks source link

sof_sdw: init platform error / mc_probe failure - no audio device (dmic_num issue) #5124

Open marc-hb opened 1 month ago

marc-hb commented 1 month ago

Recent regression.

Spotted on both MTL and LNL.

Earliest failure spotted so far: Start Time: 2024-07-28 13:09:12 UTC Linux Commit: dc9dd7b28159 KConfig Commit: 8189104a4f38 SOF Commit: dc28dbdc6a4c

~The failure is NOT deterministic~, for instance these did not crash with the same commits: https://sof-ci.01.org/sofpr/PR9156/build6782/devicetest/index.html https://sof-ci.01.org/sofpr/PR9156/build6783/devicetest/index.html

EDIT: actually deterministic, just BIOS configuration differences?

There were only two very recent changes since this crash started happening:

Last week there was also this one but it was merged longer ago:

dc9dd7b28159 brent.lu@intel.com // (origin/topic/sof-dev) ASoC: SOF: Intel: hda: refactoring topology name fixup for SDW mach aef91c6e3e9f brent.lu@intel.com // ASoC: SOF: Intel: hda: refactoring topology name fixup for HDA mach <==== guilty a31c32f7844c brent.lu@intel.com // ASoC: Intel: Skylake: remove HDA machine driver support 96ffc595b370 pierre-louis.bossart@linux.intel.com // ASoC: Intel: boards: always check the result of acpi_dev_get_first_match_dev() 1958a09d1354 peter.ujfalusi@linux.intel.com // ASoC: SOF: ipc4-topology: Preserve the DMA Link ID for ChainDMA on unprepare 17096415a97c peter.ujfalusi@linux.intel.com // ASoC: SOF: ipc4-topology: Only handle dai_config with HW_PARAMS for ChainDMA 0dff6ac871fa yung-chuan.liao@linux.intel.com // ASoC: Intel: sof_sdw: add rt1320 amp support

Sample failure: https://sof-ci.01.org/sofpr/PR9338/build6778/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=verify-kernel-boot-log

Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link SDW2-Playback, id 2
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link SDW2-Capture, id 3
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link SDW1-Capture, id 4
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: sof_sdw sof_sdw: Ignoring PCH DMIC
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link iDisp1, id 5
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link iDisp2, id 6
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: snd_soc_sdw_utils:asoc_sdw_init_dai_link: sof_sdw sof_sdw: create dai link iDisp3, id 7
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: ------------[ cut here ]------------
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: WARNING: CPU: 1 PID: 285 at sound/soc/intel/boards/sof_sdw.c:1281 mc_probe+0x155c/0x15b0 [snd_soc_sof_sdw]
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Modules linked in: snd_soc_sof_sdw(+) snd_soc_sdw_utils snd_soc_intel_hda_dsp_common snd_sof_probes snd_sof_ipc_msg_injector squashfs snd_soc_rt1316_sdw snd_soc_rt711_sdca snd_soc_rt715_sdca regmap_sdw_mbq regmap_sdw snd_soc_dmic snd_sof_pci_intel_lnl snd_sof_pci_intel_mtl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match asix(+) snd_soc_acpi soundwire_generic_allocation phylink usbnet snd_soc_core mei_gsc_proxy wmi_bmof snd_compress xe snd_intel_dspcfg processor_thermal_device_pci drm_ttm_helper snd_intel_sdw_acpi gpu_sched processor_thermal_device drm_suballoc_helper processor_thermal_wt_hint snd_hda_codec drm_gpuvm drm_exec i2c_algo_bit processor_thermal_rfim processor_thermal_rapl snd_hda_core intel_rapl_common drm_buddy processor_thermal_wt_req x86_pkg_temp_thermal intel_powerclamp soundwire_bus ttm mei_me
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  drm_display_helper processor_thermal_power_floor mei processor_thermal_mbox drm_kms_helper video int3403_thermal int340x_thermal_zone int3400_thermal wmi acpi_thermal_rel snd_usb_audio snd_usbmidi_lib snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore fuse drm efivarfs xhci_pci intel_lpss_pci intel_lpss idma64 xhci_hcd mfd_core fan
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: CPU: 1 PID: 285 Comm: systemd-udevd Not tainted 6.10.0-rc5-gdc9dd7b28159 #dev
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3173.D85.2405131446 05/13/2024
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RIP: 0010:mc_probe+0x155c/0x15b0 [snd_soc_sof_sdw]
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Code: 41 5b f2 e9 1e f6 ff ff 4c 8b 74 24 70 e9 85 f0 ff ff b9 f4 ff ff ff eb a5 31 c9 e9 7f f0 ff ff 90 0f 0b 90 e9 ea f6 ff ff 90 <0f> 0b 90 48 8b 7c 24 40 e8 57 b3 e8 f1 48 8b 3c 24 e8 4e b3 e8 f1
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RSP: 0018:ffffac2881107ab0 EFLAGS: 00010287
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RAX: ffff94ded46bb6e8 RBX: ffff94ded33fa010 RCX: 000000000000010e
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RDX: ffff94ded46bb898 RSI: 0000000000000008 RDI: ffff94e22eddc8c0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RBP: ffff94ded8f4f748 R08: ffffffffb4766f48 R09: 00000000ffffdfff
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: R10: 0000000000000001 R11: ffffffffb4736f60 R12: 0000000000000000
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: R13: 0000000000000003 R14: ffff94ded46b5828 R15: ffffac2881107b80
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: FS:  00007f237f6b38c0(0000) GS:ffff94e22ec00000(0000) knlGS:0000000000000000
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: CR2: 00007f179115fae0 CR3: 0000000107fd8001 CR4: 0000000000f70ef0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: PKRU: 55555554
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Call Trace:
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? __warn+0x83/0x170
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? mc_probe+0x155c/0x15b0 [snd_soc_sof_sdw]
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? report_bug+0x18e/0x1a0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? handle_bug+0x3d/0x70
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? exc_invalid_op+0x18/0x70
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? asm_exc_invalid_op+0x1a/0x20
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? mc_probe+0x155c/0x15b0 [snd_soc_sof_sdw]
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  platform_probe+0x5c/0xd0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  really_probe+0xd6/0x380
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  __driver_probe_device+0x78/0x150
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  driver_probe_device+0x1e/0x90
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  __driver_attach+0xda/0x1d0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? __pfx___driver_attach+0x10/0x10
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  bus_for_each_dev+0x7c/0xd0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  bus_add_driver+0x115/0x240
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  driver_register+0x60/0x120
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? __pfx_sof_sdw_driver_init+0x10/0x10 [snd_soc_sof_sdw]
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  do_one_initcall+0x63/0x2c0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? trace_kmalloc+0x20/0xb0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? kmalloc_trace_noprof+0x107/0x210
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  ? do_init_module+0x23/0x240
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  do_init_module+0x64/0x240
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  init_module_from_file+0x8b/0xd0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  idempotent_init_module+0x18d/0x240
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  __x64_sys_finit_module+0x5e/0xb0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  do_syscall_64+0x9e/0x1a0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  entry_SYSCALL_64_after_hwframe+0x77/0x7f
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RIP: 0033:0x7f237fdad88d
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RSP: 002b:00007ffd2078ba08 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RAX: ffffffffffffffda RBX: 000055a0465a86b0 RCX: 00007f237fdad88d
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RDX: 0000000000000000 RSI: 000055a0465a0660 RDI: 0000000000000010
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: R10: 0000000000000010 R11: 0000000000000246 R12: 000055a0465a0660
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: R13: 000055a04659e860 R14: 0000000000000000 R15: 000055a0465a6d60
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel:  
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: ---[ end trace 0000000000000000 ]---
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: sof_sdw sof_sdw: init platform error
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: sof_sdw sof_sdw: init platform error
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: sof_sdw sof_sdw: error -ENOMEM: snd_soc_register_card failed -12
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: sof_sdw sof_sdw: probe with driver sof_sdw failed with error -12
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: asix 3-3:1.0 (unnamed net_device) (uninitialized): PHY [usb-003:002:10] driver [Asix Electronics AX88772C] (irq=POLL)
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: Asix Electronics AX88772C usb-003:002:10: attached PHY driver (mii_bus:phy_addr=usb-003:002:10, irq=POLL)
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: asix 3-3:1.0 eth0: register 'asix' at usb-0000:00:14.0-3, ASIX AX88772B USB 2.0 Ethernet, 34:29:8f:90:5f:b5
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: usbcore: registered new interface driver asix
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: asix 3-3:1.0 enx34298f905fb5: renamed from eth0
Jul 28 13:15:47 ba-lnlm-rvp-sdw-01 kernel: asix 3-3:1.0 enx34298f905fb5: configuring for phy/internal link mode
Jul 28 13:15:49 ba-lnlm-rvp-sdw-01 kernel: asix 3-3:1.0 enx34298f905fb5: Link is Up - 100Mbps/Full - flow control rx/tx
Jul 28 13:15:50 ba-lnlm-rvp-sdw-01 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
Jul 28 13:15:50 ba-lnlm-rvp-sdw-01 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
Jul 28 13:15:50 ba-lnlm-rvp-sdw-01 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
Jul 28 13:15:50 ba-lnlm-rvp-sdw-01 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15

Also observed in daily tests 44332 and 44336) (LNLM_SDW_AIOC ba-lnlm-rvp-sdw-01), 44336, 44363?model=MTLP_SDW_AIOC&testcase=verify-kernel-boot-log (ba-mtlp-sdw-aioc-03)

bardliao commented 1 month ago

The issue only happens on ba-lnlm-rvp-sdw-01, but not jf-lnlm-rvp-sdw-1. And I checked the dmesg found that

[ 8.481076] snd_soc_sof_sdw:sof_card_dai_links_create: sof_sdw sof_sdw: sdw 5, ssp 0, dmic 2, hdmi 0, bt: 0

in ba-lnlm-rvp-sdw-01.

We have rt714 working as DMIC, so we need to disable the PCH DMIC from BIOS.

bardliao commented 1 month ago

I added options snd_sof_intel_hda_generic dmic_num=0 as a temporary solution.

ssavati commented 1 month ago

ba-mtlp-sdw-aioc-03 also recovred with BIOS settings change. can you please close this issue

marc-hb commented 1 month ago

Thanks @bardliao!

dmic 2 We have rt714 working as DMIC, so we need to disable the PCH DMIC from BIOS.

cc: @brentlu

ba-mtlp-sdw-aioc-03 also recovred with BIOS settings change. can you please close this issue

Let's first get clarity on whether this is a workaround or fix. Configuring all devices is not convenient and the price of such minor misconfiguration seems way too high.

brentlu commented 1 month ago

Thanks @bardliao!

dmic 2 We have rt714 working as DMIC, so we need to disable the PCH DMIC from BIOS.

cc: @brentlu

ba-mtlp-sdw-aioc-03 also recovred with BIOS settings change. can you please close this issue

Let's first get clarity on whether this is a workaround or fix. Configuring all devices is not convenient and the price of such minor misconfiguration seems way too high.

Yes it's the side effect of "ASoC: SOF: Intel: hda: refactoring topology name fixup for SDW mach". In original design dmic_num is reported only when mach->link_mask <= 2 bits. Now it's always reported to machine driver.

Please check if following PR could solve this issue and sorry for the regression. https://github.com/thesofproject/linux/pull/5125

marc-hb commented 1 month ago

sorry for the regression.

Not at all, this is exactly how Continous Integration is supposed to work: catching regressions immediately! https://www.bing.com/images/search?q=relative+cost+of+defects If tests were never failing then they would not be trying hard enough.

Thanks for the super quick fixup you already submitted in #5125 = also how Continuous Integration is supposed to work!

Should @ssavati re-enable the DMICs in the BIOS for testing #5125? I still don't really understand what the technical problem is... is there some documentation we can look at? I mean anything higher level and faster to read than the source code :-)

marc-hb commented 1 month ago

The failure is NOT deterministic, for instance these did not crash with the same commits...

The issue only happens on ba-lnlm-rvp-sdw-01, but not jf-lnlm-rvp-sdw-1.

So @bardliao you're saying it is actually deterministic: it appeared non-deterministic only because of BIOS configuration differences across devices?

Sorry I should have asked earlier.

bardliao commented 1 month ago

Please check if following PR could solve this issue and sorry for the regression. https://github.com/thesofproject/linux/pull/5125

Thanks @brentlu I confirmed https://github.com/thesofproject/linux/pull/5125 works.

So @bardliao you're saying it is actually deterministic: it appeared non-deterministic only because of BIOS configuration differences across devices?

The issue is due to dmic_num != 0 when a sdw DMIC is present, and https://github.com/thesofproject/linux/pull/5125 is to set dmic_num = 0 when there are more than 2 sdw links are used. But, this will not work when we use a multi-function sdw codec like rt722 which contains DMIC function and only use one sdw link. Thus, I still think we should disable PCH DMIC when SDW DMIC is used.

plbossart commented 1 month ago

Agree with @bardliao, if there is a SoundWire codec that deals with microphones then we cannot also have DMICs enabled. I think this will be a recurring error though, if there is an NHLT table copy/pasted from previous programs by OEMs along with a SoundWire mic codec, then we will have a problem.

I am not sure how to go about this though, we cannot solve this at the machine driver level since it would be too late. We need to nuke the dmic count before the topology is chosen. That can only be done by checking if any of the SoundWire devices has a mic function and force dmic_num = 0 to prevent the selection of a non-supported topology.

That's not simple.