thesofproject / linux

Linux kernel source tree
Other
89 stars 129 forks source link

[LNL] SoundWire: pm_runtime resume timed out - interrupts lost #4943

Closed marc-hb closed 2 weeks ago

marc-hb commented 4 months ago

Spotted by chance on jf-lnlm-rvp-sdw-1 while testing something totally unrelated:

https://sof-ci.01.org/softestpr/PR1178/build324/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-sof-logger (internal run ID 40150)

2024-04-22 22:40:16 UTC [ERROR] Caught kernel log error
===========================>>
[  173.964280] kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
[  173.979644] kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
[  173.979746] kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110

(Spotted while testing unrelated https://github.com/thesofproject/sof-test/pull/1178)

marc-hb commented 4 months ago
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
Apr 22 22:40:00 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
Apr 22 22:40:01 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
Apr 22 22:40:04 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-2: Slave attached, programming device number
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave Addr: 31025d131601
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x1, version 0x3
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-2: Slave already registered, reusing dev_num:1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
A
plbossart commented 4 months ago

Looks like a hardware/board issue to me

The last log for the rt714 is this https://sof-ci.ostc.intel.com/#/result/planresultdetail/40150?model=LNLM_SDW_AIOC&testcase=verify-kernel-boot-log

[    5.655780] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x20000001

This means the device was assigned an address and was no longer seen as device 0. But then nothing happened after that.

@bardliao what do you think?

bardliao commented 4 months ago

@plbossart Device number has been assigned. And there is no trace in rt715_sdca_io_init(). I think it is probably why we don't see the next action in the log.

[    5.654922] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[    5.654946] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
...
[    5.655694] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6

To me, it is more like a codec driver error. See change below. We need to wait for initialization_complete instead of enumeration_complete when the codec get resume.

diff --git a/sound/soc/codecs/rt715-sdca-sdw.c b/sound/soc/codecs/rt715-sdca-sdw.c
index d3fb02e8f31a..c8dabb9b16b5 100644
--- a/sound/soc/codecs/rt715-sdca-sdw.c
+++ b/sound/soc/codecs/rt715-sdca-sdw.c
@@ -234,10 +234,10 @@ static int __maybe_unused rt715_dev_resume(struct device *dev)
        if (!slave->unattach_request)
                goto regmap_sync;

-       time = wait_for_completion_timeout(&slave->enumeration_complete,
+       time = wait_for_completion_timeout(&slave->initialization_complete,
                                           msecs_to_jiffies(RT715_PROBE_TIMEOUT));
        if (!time) {
-               dev_err(&slave->dev, "%s: Enumeration not complete, timed out\n", __func__);
+               dev_err(&slave->dev, "%s: Initialization not complete, timed out\n", __func__);
                sdw_show_ping_status(slave->bus, true);

                return -ETIMEDOUT;
plbossart commented 4 months ago

I saw that confusion @bardliao but if the enumeration didn't complete by the deadline, then the initialization would not complete either?

marc-hb commented 4 months ago

Seen again today on a different device, this time it's ba-lnlm-rvp-sdw-01

https://sof-ci.01.org/softestpr/PR1075/build329/devicetest/index.html (internal run 40213)

bardliao commented 4 months ago

I saw that confusion @bardliao but if the enumeration didn't complete by the deadline, then the initialization would not complete either?

Fair point. Indeed, I can't find soundwire sdw-master-0-1: Slave attached, programming device number from https://github.com/thesofproject/linux/issues/4943#issuecomment-2071101627. And it happened again with different DUTs/codecs. https://github.com/thesofproject/linux/issues/4943#issuecomment-2073642246. Can it be AC timing issue again?

plbossart commented 4 months ago

@bardliao possibly an AC timing problem indeed. With PR https://github.com/thesofproject/linux/pull/4915, we modified the default for DOAIS from 0x1 to 0x3 (hardware default). It's worth checking if reverting this change 'fixes' the issue.

plbossart commented 4 months ago

reverting the DOAIS setting does not trigger any obvious change, so we'll have to track where and when this issue re-occurs, if at all.

plbossart commented 4 months ago

similar problem reported on LNL RVP:

https://sof-ci.01.org/linuxpr/PR4857/build2721/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload

[ 1964.576041] kernel: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_dev_resume: Initialization not complete, timed out
[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
[ 1964.591437] kernel: rt711-sdca sdw:0:0:025d:0711:01: intel_resume_child_device: pm_runtime_resume failed: -110
[ 1964.591584] kernel: soundwire sdw-master-0-0: sdw_intel_exit: intel_resume_child_device failed: -110

It looks like the RT711 is attached on link0 but somehow we never deal with it. That's starting to look like an interrupt that's not detected.

It was the same pattern in the initial bug report above but this time for link1.

Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: rt715_dev_resume: Enumeration not complete, timed out
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: rt715-sdca sdw:0:1:025d:0714:01: intel_resume_child_device: pm_runtime_resume failed: -110
Apr 22 22:40:06 jf-lnlm-rvp-sdw-1 kernel: soundwire sdw-master-0-1: sdw_intel_exit: intel_resume_child_device failed: -110
plbossart commented 4 months ago

@bardliao FYI. My two hypotheses are a) this comes from a bad setup with delayed interrupt enable b) there's an edge interrupt from two separate links at the same time and we miss one of the two.

plbossart commented 4 months ago

The mechanism for interrupts is that we have ONE interrupt, and then we deal with all the links with this loop

    list_for_each_entry(link, &ctx->link_list, list)
        sdw_cdns_irq(irq, link->cdns);

If we have an interrupt on link2, we could very well miss an interrupt on link0 if it happens AFTER we deal with link0 in the list walk. If we look at the reports so far we seem to miss the interrupt for link0 and link1, in all cases we don't seem to have problems dealing with the amplifiers on link2-3.

bardliao commented 4 months ago

The mechanism for interrupts is that we have ONE interrupt, and then we deal with all the links with this loop

  list_for_each_entry(link, &ctx->link_list, list)
      sdw_cdns_irq(irq, link->cdns);

If we have an interrupt on link2, we could very well miss an interrupt on link0 if it happens AFTER we deal with link0 in the list walk. If we look at the reports so far we seem to miss the interrupt for link0 and link1, in all cases we don't seem to have problems dealing with the amplifiers on link2-3.

Good finding. And it explains why the issue only seen with AIOC. However, I am curious why we only see the issue on LNL?

plbossart commented 4 months ago

@bardliao I am not sure this theory is correct, if it is that would mean the solution we implemented years ago was never robust... For LNL we start the links earlier, and IIRC we don't power-up all the links at once as we do for earlier generations. That could explain differences in timing during the enumeration stage.

bardliao commented 4 months ago

@bardliao I am not sure this theory is correct, if it is that would mean the solution we implemented years ago was never robust... For LNL we start the links earlier, and IIRC we don't power-up all the links at once as we do for earlier generations. That could explain differences in timing during the enumeration stage.

Hope we can find something with https://github.com/thesofproject/linux/pull/4976. But if it is the case, how should we fix it?

plbossart commented 4 months ago

If I am correct, we can loop multiple times to reduce the possibility of such issues, but it would not be a 100% robust fix. If we still miss an interrupt, we could add a recovery procedure when the resume timeout happens, so that we force the sdw_interrupt thread to be revisited again. That would be really ugly.

charleskeepax commented 4 months ago

Question 1 is probably, is the interrupt definitely edge triggered? And secondly, does it have to be? I would assume it is here and there is no option to switch it to level triggered, which is generally much easier to get right. The normal handling for overlapping edge triggered IRQs works roughly like this:

  1. IRQ comes in, the IRQ is left enabled, the state is set to handling, and the handler is started.
  2. If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.
  3. When the first IRQ handler completes, we see the state is pending so the state is set back to handling, the IRQ is unmasked and the handler is re-entered.

Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

plbossart commented 4 months ago

@charleskeepax we have ONE interrupt, but it's a combined status for IPC, streams, SoundWire and wakes. The solution we used so far was the same as Windows and it's been quite reliable since CML. What is possible is that with the change to use the HDaudio multi-link stuff, we handle things differently and the combination of events across links is possibly questionable. It's not really that the interrupt generation changed.

charleskeepax commented 4 months ago

I have had very occasional enumeration failures with cs42l43 on the TGL UpX, that has been on my list to investigate for a quite a while. This is probably a good candidate for a cause. If I follow the code correctly in sound/soc/sof/intel/hda.c, it is masking the IRQ in the handler, and unmasking it at the bottom of the thread. That is a pretty questionable thing to be doing with an edge triggered IRQ, will depend on the exact details of the hardware but generally that would be broken as you lose any IRQs that happen during the time the handler is running. Also keep in mind this could be a result of there being more IRQs. The less IRQs there are the less likely you are to lose some, so maybe one of those other changes increased the IRQ volume, or just changed the time such that two IRQs were more likely to happen close to each other.

charleskeepax commented 4 months ago

But first step would really be to check the hardware behaviour, we still haven't had any luck getting docs for the Cadence IP, although I assume this bit falls outside of that anyway.

plbossart commented 4 months ago

It's not a Cadence problem.

plbossart commented 4 months ago

we did change the timing on LNL, now the links are started serially whereas in previous solutions we have to power them up at the same time. I think the serial part creates a case where a link has an alert while handling another, it would be less likely in the simultaneous case since all devices would be handled at the same time.

marc-hb commented 4 months ago

1) 1.1 IRQ comes in, the IRQ is left enabled, 1.2 the state is set to handling, ... 2) If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.

What happens if 2) comes between 1.1 and 1.2?

3) ... so 3.2 the state is set back to handling, 3.3 the IRQ is unmasked and...

Looks like there is another short window between 3.2 and 3.3 for another race here...

The normal handling for overlapping edge triggered IRQs works roughly like this ... Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

The generic issue should indeed be a "solved problem": do you have a pointer to a good and not "rough" reference documentation? Unless edge-triggered interrupts are broken by design :-D

charleskeepax commented 4 months ago
  1. 1.1 IRQ comes in, the IRQ is left enabled, 1.2 the state is set to handling, ...
  2. If another IRQ comes in, we see the state is already handling so the IRQ is masked and the state is set to pending.

What happens if 2) comes between 1.1 and 1.2?

I believe the state variable would need to be either handled as an atomic or be guarded with a spin lock. I think the IRQ code uses a spin lock if memory serves but will try to find time to have a look tomorrow.

  1. ... so 3.2 the state is set back to handling, 3.3 the IRQ is unmasked and...

Looks like there is another short window between 3.2 and 3.3 for another race here...

Not sure this one actually counts since until the IRQ is unmasked you know no new handlers will kick off but depends a bit on how you order the code. Probably safest to wrap all access to the state in a spin lock.

The normal handling for overlapping edge triggered IRQs works roughly like this ... Assuming one has the IRQ stuff all setup correctly this should be handled by the IRQ core.

The generic issue should indeed be a "solved problem": do you have a pointer to a good and not "rough" reference documentation? Unless edge-triggered interrupts are broken by design :-D

They are not broken by design, but they sure are much harder to get right than level triggered IRQs. Partly why it's a good idea to outsource the handling to the IRQ subsystem. I think I had a couple emails from Thomas Gleixner saved somewhere I will try to dig out.

charleskeepax commented 4 months ago

So yeah checking the code (handle_edge_irq in kernel/irq/chip.c), it does indeed protect things with a spinlock.

The IRQ docs contains a description of the edge IRQ flow:

https://www.kernel.org/doc/html/v4.12/core-api/genericirq.html#default-edge-irq-flow-handler

Here is a thread on the mailing list where someone was attempting to do something similar with an edge IRQ masking it in the handler, unmasking it in the thread and Thomas explaining the issues with that:

https://lore.kernel.org/all/alpine.LFD.2.02.1210121440160.2779@ionos/#R

@marc-hb EDIT: in this newer page, the formatting is not broken: https://www.kernel.org/doc/html/v6.8/core-api/genericirq.html#default-edge-irq-flow-handler (thanks!)

plbossart commented 4 months ago

We have confirmation from our hardware folks that all IRQ handling is level-based. So we have to look for other explanations for the behavior we're seeing.

charleskeepax commented 4 months ago

Nice, that does generally make life much easier and the current code masking in the handler and unmasking at the end of the thread should be totally safe with a level IRQ. Since you should just fall right back into the handler again. It does make it much less clear what is going on though :-)

marc-hb commented 4 months ago

Looks similar but with rt711_sdca_dev_resume() instead:

https://sof-ci.01.org/sofpr/PR9115/build4618/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload-after-playback

[ 2015.947252] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[ 2015.975402] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2015.975955] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 2015.975977] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 2015.975981] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2016.136402] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 2016.136700] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 2019.699760] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[ 2019.699820] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[ 2019.700110] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[ 2021.409075] kernel: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_dev_resume: Initialization not complete, timed out
[ 2021.424492] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
[ 2021.424536] kernel: rt711-sdca sdw:0:0:025d:0711:01: intel_resume_child_device: pm_runtime_resume failed: -110
[ 2021.424658] kernel: soundwire sdw-master-0-0: sdw_intel_exit: intel_resume_child_device failed: -110
[ 2021.424994] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.1: clock source 0 LVSCTL 0x0
[ 2021.425067] kernel: soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: initializing enumeration and init completion for Slave 7
[ 2021.425088] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.1: mclk 19200000 max 4800000 row 50 col 4
[ 2021.426733] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 2021.426899] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
[

EDIT: another one https://sof-ci.01.org/sofpr/PR9159/build4831/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pipeline-all

https://sof-ci.01.org/sofpr/PR9159/build4831/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-capture-all-formats

https://sof-ci.01.org/sofpr/PR9159/build4831/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload

plbossart commented 4 months ago

Another one reported in https://github.com/thesofproject/sof/issues/9122 https://sof-ci.01.org/sofpr/PR9115/build4618/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload-after-playback

plbossart commented 3 months ago

I haven't seen this error reported in the last two weeks. @marc-hb @fredoh9 do we have a mechanism to look for strings in test logs. A match on "PING status" would be enough

fredoh9 commented 3 months ago

We can scan test results into internal storage server directly. Need to narrow scan scope to avoid taking long time. I can try this this afternoon and let you know.

marc-hb commented 3 months ago

Last observation from me 8 days ago: https://github.com/thesofproject/linux/issues/4943#issuecomment-2107989123

fredoh9 commented 3 months ago

Last seen is 41430 (generic_test_new number) "Start Time: 2024-05-17 08:27:34 UTC"

40617/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 1925.680979] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40686/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40703/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  247.913812] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40703/LNLM_SDW_AIOC/multiple-pipeline-capture/dmesg.html:[ 2120.781727] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
40889/LNLM_SDW_AIOC/check-kmod-load-unload/dmesg.html:[ 2005.563315] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
40948/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2021.424492] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
40988/LNLM_SDW_AIOC/check-capture-50rounds/dmesg.html:[ 1681.461390] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
41016/LNLM_SDW_AIOC/check-alsabat-headset-capture-599/dmesg.html:[ 2167.503016] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41089/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[  365.663480] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41118/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2118.263892] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41136/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2018.861151] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41224/LNLM_SDW_AIOC/check-signal-stop-start-capture-10/dmesg.html:[  690.746202] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41253/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1709.624393] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41269/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2036.784993] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
41264/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1730.617024] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
41299/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  250.471737] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41311/LNLM_SDW_AIOC/check-sof-logger/dmesg.html:[  249.447548] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
41346/LNLM_SDW_AIOC/check-capture-all-formats/dmesg.html:[  290.918537] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41437/LNLM_SDW_AIOC/check-kmod-load-unload-after-playback/dmesg.html:[ 2050.687592] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41405/LNLM_SDW_AIOC/check-runtime-pm-status/dmesg.html:[ 1628.698374] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-3: PING status: 0x1
41430/LNLM_SDW_AIOC/multiple-pipeline-all/dmesg.html:[ 1840.828359] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-1: PING status: 0x1
marc-hb commented 3 months ago

May 18th above https://sof-ci.01.org/sofpr/PR9159/build4831/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload

June 12th: https://sof-ci.01.org/softestpr/PR1180/build512/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-sof-logger

plbossart commented 3 months ago

seen today on ba-lnlm-rvp-sdw-03 boot -28 2379f50512f242698128c0505c5f9e90 Thu 2024-06-06 16:00:15 UTC—Thu 2024-06-06 16:04:19 UTC

dmesg.log

plbossart commented 3 months ago

seen in https://sof-ci.ostc.intel.com/#/result/planresultdetail/42290?model=LNLM_SDW_AIOC&testcase=check-sof-logger

ssavati commented 3 months ago

@marc-hb @plbossart
This issue constantly reproducing for below test case

TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/check-kmod-load-unload.sh -l 50

I have observed issue mostly occurring after 21th iteration (3 times observed) load_unload_50times_log1.txt

marc-hb commented 3 months ago

Keep in mind that kmod-load-unload is part of check-sof-logger

bardliao commented 3 months ago

I am thinking will on board rt711 + external rt1316/rt714 cause any issue? Can we test on board rt711 only or use external rt711?

plbossart commented 3 months ago

@bardliao is certainly correct that we should keep one unmodified RVP with just the on-board RT711 codec, to help triangulate issues with board reworks and integration of external AIOC.

@fredoh9 do you know if the 'nocodec' boards used on LNL can be retargeted for SoundWire tests?

That said, the problem did occur with link0 and RT711, so I am not sure if this is only a board/hardware problem. Since this happens only on resume, there is a non-zero likelihood that we're missing something that's not restored properly.

plbossart commented 2 months ago

seen again in daily tests planresultdetail/42570?model=LNLM_SDW_AIOC&testcase=check-runtime-pm-status

plbossart commented 2 months ago

I ran 500 iterations of playback/pm_runtime suspend without seeing this enumeration problem with https://github.com/thesofproject/linux/pull/5065

I did see a bunch of xruns reported in issue #5066, but that's a separate problem.

@bardliao @lgirdwood we are entering the second order twilight zone...

plbossart commented 2 months ago

I managed to reproduce the issue with a non-stop test:

while true; do 
    TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg MODEL=LNLM_SDW_AIOC SOF_TEST_INTERVAL=5 ~/sof-test/test-case/multiple-pipeline.sh -f p 2; 
done

The first error came after 3800s, over an hour. I am not sure how we are going to make progress on this one. The link2 was powered-up 95 times before the failure....

log.txt.gz

@bardliao @lgirdwood this is going to be fun!

marc-hb commented 2 months ago

To stop the while loop immediately after the first failure:

while true; do
   VAR1=VAL1 test.sh || break
done

Or:

while VAR1=VAL1 test.sh; do
   :
done
marc-hb commented 2 months ago

multiple-pipeline.sh -h has two parameters that could speed up reproduction:

    -w parameter |  --wait parameter
    duration of one (sub)test iteration
    Default Value: 5

    -l parameter |  --loop parameter
    loop count
    Default Value: 1
plbossart commented 2 months ago

It took 649 iterations to reproduce the problem!

2024-06-17 20:13:08 UTC [INFO] Test Result: PASS!
iteration 648 done
2024-06-17 20:13:13 UTC [INFO] ktime=12651 sof-test PID=218993: starting
2024-06-17 20:13:18 UTC [INFO] Starting /usr/local/bin/mtrace-reader.py >& /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-17-20:13:13-5762/mtrace.txt &
2024-06-17 20:13:18 UTC [INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-06-17 20:13:18 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-17 20:13:18 UTC [INFO] Run command to get pipeline parameters
2024-06-17 20:13:18 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:playback & ~pcm:Amplifier Reference & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-17 20:13:18 UTC [INFO] ===== Testing: (Loop: 1/1) =====
2024-06-17 20:13:18 UTC [INFO] /home/ubuntu/sof-test/test-case/multiple-pipeline.sh will use topology /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg to run the test case
2024-06-17 20:13:18 UTC [INFO] Pipeline list to ignore is specified, will ignore 'pcm=HDA Digital' in test case
2024-06-17 20:13:18 UTC [INFO] Run command to get pipeline parameters
2024-06-17 20:13:18 UTC [COMMAND] sof-tplgreader.py /usr/lib/firmware/intel/sof-ipc4-tplg/sof-lnl-rt711-l0-rt1316-l23-rt714-l1.tplg -f 'type:playback  & ~pcm:HDMI' -b ' pcm:HDA Digital' -s 0 -e
2024-06-17 20:13:18 UTC [INFO] Testing: Speaker [hw:0,2]
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,2 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] Testing: Jack Out [hw:0,0]
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] Testing: Deepbuffer Jack Out [hw:0,31]
2024-06-17 20:13:18 UTC [INFO] wait 0.5s for aplay_opts()
2024-06-17 20:13:18 UTC [COMMAND] aplay   -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:18 UTC [INFO] checking pipeline status
2024-06-17 20:13:18 UTC [INFO] Letting playback/capture run for 5s
2024-06-17 20:13:23 UTC [INFO] checking pipeline status again
2024-06-17 20:13:23 UTC [ERROR] Running process count is 2, but 3 is expected
219161 aplay -D hw:0,0 -c 2 -r 48000 -f S16_LE /dev/zero -q
219170 aplay -D hw:0,31 -c 2 -r 48000 -f S16_LE /dev/zero -q
2024-06-17 20:13:24 UTC [INFO] Starting func_exit_handler(1)
2024-06-17 20:13:24 UTC [ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-06-17 20:13:24 UTC [ERROR]  func_error_exit()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh
2024-06-17 20:13:24 UTC [ERROR]  ps_checks()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:148
2024-06-17 20:13:24 UTC [ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pipeline.sh:202
2024-06-17 20:13:26 UTC [INFO] pkill -TERM -f mtrace-reader.py
2024-06-17 20:13:26 UTC [INFO] nlines=129 /home/ubuntu/sof-test/logs/multiple-pipeline/2024-06-17-20:13:13-5762/mtrace.txt
2024-06-17 20:13:28 UTC [INFO] ktime=12666 sof-test PID=218993: ending
2024-06-17 20:13:28 UTC [INFO] Test Result: FAIL!
[12656.153835] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
[12656.153842] soundwire_intel:intel_link_power_up: soundwire_intel soundwire_intel.link.2: first link up, programming SYNCPRD
[12656.153972] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.1: clock source 0 LVSCTL 0x0
[12656.153986] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
[12656.154006] soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: initializing enumeration and init completion for Slave 7
[12656.154072] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.0: clock source 0 LVSCTL 0x0
[12656.154128] soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: initializing enumeration and init completion for Slave 6
[12656.154172] soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
[12656.154228] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.1: mclk 19200000 max 4800000 row 50 col 4
[12656.154272] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.0: mclk 19200000 max 4800000 row 50 col 4
[12656.154306] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
[12656.154316] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
[12656.154355] soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
[12656.155635] cdns_update_slave_status_work: 2 callbacks suppressed
[12656.155639] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2
[12656.155681] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-0: Slave attached, programming device number
[12656.155700] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[12656.155737] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
[12656.155806] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
[12656.155951] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-0: SDW Slave Addr: 30025d071101
[12656.155955] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-0: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0711, unique_id 0x0, version 0x3
[12656.155959] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-0: Slave already registered, reusing dev_num:6
[12656.155968] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave Addr: 30025d071401
[12656.155971] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0714, unique_id 0x0, version 0x3
[12656.155974] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-1: Slave already registered, reusing dev_num:7
[12656.155995] soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
[12656.156306] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
[12656.156309] soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x0, version 0x3
[12656.156312] soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-3: Slave already registered, reusing dev_num:1
[12656.156324] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.0: Msg ignored for Slave 0
[12656.156327] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-0: No more devices to enumerate
[12656.156336] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[12656.156399] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-1: No more devices to enumerate
[12656.156474] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000001
[12656.156510] soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6
[12656.156525] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x20000001
[12656.156558] soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling enumeration completion for Slave 7
[12656.156632] soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.3: Msg ignored for Slave 0
[12656.156634] soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-3: No more devices to enumerate
[12656.156755] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x21
[12656.156836] soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling enumeration completion for Slave 1
[12656.156844] soundwire_bus:sdw_slave_set_frequency: rt711-sdca sdw:0:0:025d:0711:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157120] soundwire_bus:sdw_slave_set_frequency: rt715-sdca sdw:0:1:025d:0714:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157311] soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:3:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[12656.157748] soundwire_bus:sdw_handle_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling initialization completion for Slave 7
[12656.177031] snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[12656.177038] soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[12656.358042] snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[12656.367475] snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[12656.367487] snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[12656.367490] soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[12656.371690] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[12656.380101] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.381118] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[12656.412867] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x0
[12656.412876] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x0
[12656.412879] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.412882] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12656.882461] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[12656.883993] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.885729] soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[12656.919436] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x5
[12656.919446] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x3
[12656.919451] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.919454] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12656.920011] snd_soc_rt711_sdca:rt711_sdca_interrupt_callback: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_interrupt_callback control_port_stat=4, sdca_cascade=1
[12656.954731] snd_soc_rt711_sdca:rt711_sdca_headset_detect: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_headset_detect, detected_mode=0x5
[12656.954741] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, jack_type=0x3
[12656.954744] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, btn_type=0x0
[12656.954746] snd_soc_rt711_sdca:rt711_sdca_jack_detect_handler: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_detect_handler, scp_sdca_stat1=0x1, scp_sdca_stat2=0x0
[12659.847445] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[12660.131857] soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-0: clock stop prepare done slave:15
[12661.260889] rt1316-sdca sdw:0:2:025d:1316:01: rt1316_dev_resume: Initialization not complete, timed out
[12661.276146] soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-2: PING status: 0x1
[12661.276171] rt1316-sdca sdw:0:2:025d:1316:01: ASoC: error at snd_soc_pcm_component_pm_runtime_get on sdw:0:2:025d:1316:01: -110
[12661.276279]  SDW2-Playback: ASoC: error at __soc_pcm_open on SDW2-Playback: -110
[12661.276309]  Speaker: ASoC: error at dpcm_be_dai_startup on Speaker: -110
[12661.276335]  Speaker: ASoC: error at dpcm_fe_dai_startup on Speaker: -110

so far it's always link2 that shows the problem, not sure why. Maybe it's because it's the first link that gets used with this script.

plbossart commented 2 months ago

My theory is wrong, there are examples of failures where the first link to be used is NOT the one that fails, example copied from https://sof-ci.01.org/linuxpr/PR4857/build2721/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=check-kmod-load-unload shows link1/rt715 is started first but link0/rt711 fails to resume.

[ 1959.256937] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.1: clock source 0 LVSCTL 0x0
[ 1959.256949] kernel: soundwire_intel:intel_link_power_up: soundwire_intel soundwire_intel.link.1: first link up, programming SYNCPRD
[ 1959.257044] kernel: soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: initializing enumeration and init completion for Slave 6
[ 1959.257065] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.1: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257217] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.2: clock source 0 LVSCTL 0x0
[ 1959.257284] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: initializing enumeration and init completion for Slave 1
[ 1959.257306] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.2: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257444] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.3: clock source 0 LVSCTL 0x0
[ 1959.257658] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: initializing enumeration and init completion for Slave 1
[ 1959.257675] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.3: mclk 19200000 max 4800000 row 50 col 4
[ 1959.257765] kernel: soundwire_intel:intel_shim_vs_set_clock_source: soundwire_intel soundwire_intel.link.0: clock source 0 LVSCTL 0x0
[ 1959.258299] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: initializing enumeration and init completion for Slave 7
[ 1959.258329] kernel: soundwire_cadence:cdns_init_clock_ctrl: soundwire_intel soundwire_intel.link.0: mclk 19200000 max 4800000 row 50 col 4
[ 1959.258940] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2
[ 1959.259073] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-1: Slave attached, programming device number
[ 1959.259135] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x2
[ 1959.259178] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-2: Slave attached, programming device number
[ 1959.259201] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x1
[ 1959.259309] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave Addr: 30025d071401
[ 1959.259313] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-1: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x0714, unique_id 0x0, version 0x3
[ 1959.259318] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-1: Slave already registered, reusing dev_num:6
[ 1959.259439] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave Addr: 31025d131601
[ 1959.259442] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-2: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x1, version 0x3
[ 1959.259446] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-2: Slave already registered, reusing dev_num:1
[ 1959.259668] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x2
[ 1959.259711] kernel: soundwire_bus:sdw_handle_slave_status: soundwire sdw-master-0-3: Slave attached, programming device number
[ 1959.259729] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.1: Msg ignored for Slave 0
[ 1959.259734] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-1: No more devices to enumerate
[ 1959.259980] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.1: Slave status change: 0x2000001
[ 1959.260013] kernel: soundwire_bus:sdw_modify_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling enumeration completion for Slave 6
[ 1959.260074] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave Addr: 30025d131601
[ 1959.260085] kernel: soundwire_bus:sdw_extract_slave_id: soundwire sdw-master-0-3: SDW Slave class_id 0x01, mfg_id 0x025d, part_id 0x1316, unique_id 0x0, version 0x3
[ 1959.260100] kernel: soundwire_bus:sdw_assign_device_num: soundwire sdw-master-0-3: Slave already registered, reusing dev_num:1
[ 1959.260267] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.2: Msg ignored for Slave 0
[ 1959.260278] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-2: No more devices to enumerate
[ 1959.260360] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.2: Slave status change: 0x21
[ 1959.260393] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling enumeration completion for Slave 1
[ 1959.260479] kernel: soundwire_cadence:cdns_fill_msg_resp: soundwire_intel soundwire_intel.link.3: Msg ignored for Slave 0
[ 1959.260485] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-3: No more devices to enumerate
[ 1959.260673] kernel: soundwire_bus:sdw_slave_set_frequency: rt715-sdca sdw:0:1:025d:0714:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.260697] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.3: Slave status change: 0x21
[ 1959.260796] kernel: soundwire_bus:sdw_modify_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling enumeration completion for Slave 1
[ 1959.260805] kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:2:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.261408] kernel: soundwire_bus:sdw_slave_set_frequency: rt1316-sdca sdw:0:3:025d:1316:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[ 1959.261440] kernel: soundwire_bus:sdw_handle_slave_status: rt715-sdca sdw:0:1:025d:0714:01: signaling initialization completion for Slave 6
[ 1959.274495] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:2:025d:1316:01: rt1316_io_init hw_init complete
[ 1959.274499] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:2:025d:1316:01: signaling initialization completion for Slave 1
[ 1959.275046] kernel: snd_soc_rt1316_sdw:rt1316_io_init: rt1316-sdca sdw:0:3:025d:1316:01: rt1316_io_init hw_init complete
[ 1959.275048] kernel: soundwire_bus:sdw_handle_slave_status: rt1316-sdca sdw:0:3:025d:1316:01: signaling initialization completion for Slave 1
[ 1959.278233] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1959.278536] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x64000000|0x3060004c: MOD_LARGE_CONFIG_SET
[ 1959.278555] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x44000000|0x3060004c: MOD_LARGE_CONFIG_SET [data size: 76]
[ 1959.278559] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1959.438214] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x67000000|0x0: MOD_SET_DX
[ 1959.438494] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x47000000|0x0: MOD_SET_DX [data size: 8]
[ 1963.028494] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-1: clock stop prepare done slave:15
[ 1963.028662] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-3: clock stop prepare done slave:15
[ 1963.028754] kernel: soundwire_bus:sdw_bus_wait_for_clk_prep_deprep: soundwire sdw-master-0-2: clock stop prepare done slave:15
[ 1964.576041] kernel: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_dev_resume: Initialization not complete, timed out
[ 1964.591372] kernel: soundwire_bus:sdw_show_ping_status: soundwire sdw-master-0-0: PING status: 0x1
[ 1964.591437] kernel: rt711-sdca sdw:0:0:025d:0711:01: intel_resume_child_device: pm_runtime_resume failed: -110
[ 1964.591584] kernel: soundwire sdw-master-0-0: sdw_intel_exit: intel_resume_child_device failed: -110
fredoh9 commented 2 months ago

@fredoh9 do you know if the 'nocodec' boards used on LNL can be retargeted for SoundWire tests?

I can change one nocodec RVP to AIOC model today

marc-hb commented 2 months ago

Another one today. The logs look a little bit different

https://sof-ci.01.org/sofpr/PR9185/build5843/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=test-speaker

plbossart commented 2 months ago

Another one today. The logs look a little bit different

https://sof-ci.01.org/sofpr/PR9185/build5843/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=test-speaker

It's really the same issue.

ssavati commented 1 month ago

@plbossart @kv2019i Can we close this issue? I'm not observing this issue. if anyone observing sporadically let me know we can keep this issue with reduce priority and remove “production” label