Closed plbossart closed 4 months ago
Commit https://github.com/thesofproject/linux/pull/5010/commits/2772fe4c59b3ee25ceb4c5788b25a9664223b3d5 (merged into https://github.com/thesofproject/linux/commit/e7e81a5e790e) crashed ba-lnlm-rvp-sdw-03
really bad and made it unavailable for CI for at least 7 hours (from Wed 2024-05-22 21:15:35 UTC to Thu 2024-05-23 03:18:11 UTC) because that devices (and other in BA) don't have remote power control yet and can't be power-cycled by CI.
Can you please avoid too "risky" pull requests until all devices are fully equipped?
cc: @ssavati
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_slave_set_frequency: rt711-sdca sdw:0:0:025d:0711:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_slave_set_frequency: rt715-sdca sdw:0:1:025d:0714:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_handle_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling initialization completion for Slave 7
May 22 21:01:16 ba-lnlm-rvp-sdw-03 systemd[1]: Stopped target Sound Card.
May 22 21:01:16 ba-lnlm-rvp-sdw-03 acpid[382]: input device has been disconnected, fd 10
May 22 21:01:16 ba-lnlm-rvp-sdw-03 systemd[897]: Stopped target Sound Card.
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
May 22 21:01:16 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x47000000|0x0: MOD_SET_DX [data size: 8]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_sdw_remove: plb: before unregister_functions
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000040
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: #PF: supervisor read access in kernel mode
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: #PF: error_code(0x0000) - not-present page
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: PGD 0 P4D 0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: CPU: 2 PID: 20385 Comm: rmmod Not tainted 6.9.0-rc5-ge7e81a5e790e #pr5010
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: Hardware name: Intel Corporation Lunar Lake Client Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.2470.D84.2312061937 12/06/2023
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RIP: 0010:device_del+0x23/0x3d0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 31 f6 41 56 41 55 4c 8d af 80 00 00 00 41 54 55 48 89 fd 53 48 83 ec 20 <4c> 8b 67 40 4c 89 ef 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 31
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RSP: 0018:ffff9729410b7ac8 EFLAGS: 00010282
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RBP: 0000000000000000 R08: ffffffffa835e638 R09: 0000000100035a29
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: R10: ffffffffa8287260 R11: ffffffffa835e638 R12: ffff91ae45837000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: R13: 0000000000000080 R14: 0000000000000080 R15: 0000000000000000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: FS: 00007f4979861c40(0000) GS:ffff91b5a0c00000(0000) knlGS:0000000000000000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: CR2: 0000000000000040 CR3: 0000000116252006 CR4: 0000000000f70ef0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: PKRU: 55555554
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: Call Trace:
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: <TASK>
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? __die+0x24/0x70
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? page_fault_oops+0x15a/0x450
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? irq_work_queue+0x32/0x60
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? vprintk_emit+0x103/0x2f0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? local_clock+0x15/0x30
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? exc_page_fault+0x68/0x1e0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? asm_exc_page_fault+0x26/0x30
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? device_del+0x23/0x3d0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? _dev_info+0x70/0x90
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sdca_dev_unregister_functions+0x36/0x60 [snd_soc_sdca]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: rt1316_sdw_remove+0x3e/0x60 [snd_soc_rt1316_sdw]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sdw_drv_remove+0x54/0x80 [soundwire_bus]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_release_driver_internal+0x1a1/0x210
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: bus_remove_device+0xd4/0x140
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_del+0x151/0x3d0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? __mutex_unlock_slowpath+0x44/0x290
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_unregister+0x17/0x60
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sdw_delete_slave+0xad/0xc0 [soundwire_bus]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? __pfx_sdw_delete_slave+0x10/0x10 [soundwire_bus]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_for_each_child+0x59/0xa0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sdw_bus_master_delete+0x1e/0x60 [soundwire_bus]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: auxiliary_bus_remove+0x1c/0x30
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_release_driver_internal+0x1a1/0x210
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: bus_remove_device+0xd4/0x140
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_del+0x151/0x3d0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sdw_intel_exit+0x89/0x110 [soundwire_intel]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: hda_dsp_remove+0x72/0x180 [snd_sof_intel_hda_generic]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: snd_sof_device_remove+0x119/0x180 [snd_sof]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: sof_pci_remove+0x1e/0x50 [snd_sof_pci]
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: pci_device_remove+0x3f/0xb0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: device_release_driver_internal+0x1a1/0x210
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: driver_detach+0x4b/0x90
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: bus_remove_driver+0x70/0xf0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: pci_unregister_driver+0x3f/0x90
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: __do_sys_delete_module+0x1a5/0x2a0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: ? __x64_sys_close+0x3c/0x80
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: do_syscall_64+0xa8/0x1b0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: entry_SYSCALL_64_after_hwframe+0x77/0x7f
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RIP: 0033:0x7f4979989aeb
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: Code: 73 01 c3 48 8b 0d 45 33 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 15 33 0f 00 f7 d8 64 89 01 48
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RSP: 002b:00007ffead476c78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
May 22 21:01:17 ba-lnlm-rvp-sdw-03 kernel: RAX: ffffffffffffffda RBX: 00005565dbe3e780 RCX: 00007f4979989aeb
Can you please avoid too "risky" pull requests until all devices are fully equipped?
You would one know if a PR is risky or not?
Don't put a device in CI if it's not fully supported by the CI infrastructure...
You would one know if a PR is risky or not?
... when it says "experiments" in the name and causes "kernel NULL pointer dereferences", which I believe are fairly rare.
Don't put a device in CI if it's not fully supported by the CI infrastructure...
It's better to have a device that must be manually rebooted from time to time compared to not having it at all. Manually rebooting it is a small (and temporary) price to pay for significantly more bandwidth. We're running really short in some areas.
You would one know if a PR is risky or not?
... when it says "experiments" in the name and causes "kernel NULL pointer dereferences", which I believe are fairly rare.
Race conditions happen. I couldn't reproduce the problem on my own device with the scripts until I found a simpler recipe that shows it 100% of the time.
Don't put a device in CI if it's not fully supported by the CI infrastructure...
It's better to have a device that must be manually rebooted from time to time compared to not having it at all. Manually rebooting it is a small (and temporary) price to pay for significantly more bandwidth. We're running really short in some areas.
Those devices pay for themselves given the average engineer hourly rate.
For the record, here's the race condition fix
diff --git a/drivers/soundwire/slave.c b/drivers/soundwire/slave.c
index 4e886a37910a..f5ecae748106 100644
--- a/drivers/soundwire/slave.c
+++ b/drivers/soundwire/slave.c
@@ -71,6 +71,14 @@ int sdw_slave_add(struct sdw_bus *bus,
list_add_tail(&slave->node, &bus->slaves);
mutex_unlock(&bus->bus_lock);
+ /*
+ * SDCA basic properties are used in the SoundWire driver
+ * probe. To avoid any race condition, they must be updated before
+ * device_register(), which will trigger the driver probe.
+ */
+ sdca_lookup_interface_revision(slave);
+ sdca_lookup_functions(slave);
+
ret = device_register(&slave->dev);
if (ret) {
dev_err(bus->dev, "Failed to add slave: ret %d\n", ret);
@@ -88,9 +96,6 @@ int sdw_slave_add(struct sdw_bus *bus,
}
sdw_slave_debugfs_init(slave);
- sdca_lookup_interface_revision(slave);
- sdca_lookup_functions(slave);
-
return ret;
}
EXPORT_SYMBOL(sdw_slave_add);
in some cases, sdca_lookup_functions() was called AFTER the codec probe, which caused a kernel oops on remove.
Those devices pay for themselves given the average engineer hourly rate.
Come on, you know it's never that simple.
Anyway I think we keeping adding more hardware right now.
closing, will re-add a new branch
Tentative branch to show how simple it'd be to create subdevices for each SDCA function, using the auxiliary bus.
Tested on a MTL device with 2x SmartAmp, UAJ and HID with the following options added:
The code gets to a probe for all the drivers:
What we are missing is a callback to tell the functions that the init_io is complete on attachment. suspend-resume can be added as well.
@bardliao @ujfalusi @kv2019i let me know what you think.