Very frequent rt711-sdca "parity errors" on tglu-sku0a32-sdca / 6.10-rc5 regression?

marc-hb commented 3 days ago

Starting from daily test run 43136, most tests fail on tglu-sku0a32-sdca most of the time, always because a parity error.

rt711-sdca sdw:0:0:025d:0711:01: Parity error detected

Last successful daily test run: 43016 on June 23rd with kernel commit e9eea140a1d9

First failing daily test run 43136 on June 25th with kernel commit c75a95d34885.

There are more than 1000 commits between the two, most of them apparently coming from 6.10-rc5

5081

[  113.727660] kernel: soundwire_bus:sdw_program_device_num: soundwire sdw-master-0-0: No more devices to enumerate
[  113.727684] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000001
[  113.727698] kernel: soundwire_bus:sdw_modify_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling enumeration completion for Slave 6
[  113.727881] kernel: soundwire_bus:sdw_slave_set_frequency: rt711-sdca sdw:0:0:025d:0711:01: Configured bus base 1, scale 3, mclk 19200000, curr_freq 4800000
[  113.829317] kernel: soundwire_cadence:cdns_check_attached_status_dwork: soundwire_intel soundwire_intel.link.0: Peripheral 6 status: 2
[  113.910742] kernel: snd_soc_rt711_sdca:rt711_sdca_calibration: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_calibration calibration complete, ret=0
[  113.915449] kernel: snd_soc_rt711_sdca:rt711_sdca_jack_init: rt711-sdca sdw:0:0:025d:0711:01: in rt711_sdca_jack_init enable
[  113.915458] kernel: snd_soc_rt711_sdca:rt711_sdca_io_init: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_io_init hw_init complete
[  113.915463] kernel: soundwire_bus:sdw_handle_slave_status: rt711-sdca sdw:0:0:025d:0711:01: signaling initialization completion for Slave 6
[  113.915517] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x4000000
[  113.924650] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: open stream 0 dir 0
[  113.924652] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period min 192 max 16384 bytes
[  113.924654] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period count 2 max 16
[  113.924655] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: buffer max 65536 bytes
[  113.924803] kernel: snd_soc_rt711_sdca:rt711_sdca_pcm_hw_params: rt711-sdca sdw:0:0:025d:0711:01: rt711_sdca_pcm_hw_params rt711-sdca-aif1
[  113.925461] kernel: rt711-sdca sdw:0:0:025d:0711:01: Parity error detected
[  113.925869] kernel: snd_sof:ipc3_log_header: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx: 0x80010000: GLB_DAI_MSG: CONFIG
[  113.926110] kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: hw params stream 0 dir 0
[  113.926117] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x40000 successful
[  113.926131] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x40000 successful
[  113.926134] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: period_bytes:0x4000
[  113.926135] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: periods:4
[  113.926137] kernel: soundwire_cadence:cdns_update_slave_status_work: soundwire_intel soundwire_intel.link.0: Slave status change: 0x2000000
[  113.926158] kernel: snd_sof:sof_ipc3_pcm_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: stream_tag 1

cc: @bardliao , @plbossart , @ssavati , @shumingfan

plbossart commented 3 days ago

@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...

bardliao commented 2 days ago

@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...

Sure, I will try to bisect with my TGL device.

marc-hb commented 2 days ago

FWIW, jf-cml-sku0983-sdw-1 uses RT711 too and has no failure.

plbossart commented 2 days ago

not the same version of RT711, CML uses the 'old' version and TGL the SDCA version. the -sdca suffix matters!

bardliao commented 2 days ago

This seems to be related to https://github.com/thesofproject/linux/pull/5074. I did some test today and confirmed that no issue on topic/sof-dev-rebase and broonie/for-next branch. And can reproduce the issue after applying https://github.com/thesofproject/linux/pull/5074 on top of the topic/sof-dev-rebase branch.

plbossart commented 2 days ago

Humm very interesting.

the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.

I can't quite figure out what the deferred enumeration check would have to do with the parity error.

Can you try with this that last commit @bardliao ?

bardliao commented 2 days ago

Humm very interesting.

the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.

I can't quite figure out what the deferred enumeration check would have to do with the parity error.

Can you try with this that last commit @bardliao ?

Sure, I am guessing it is related to slave->first_interrupt_done.

bardliao commented 2 days ago

@plbossart Apply the "soundwire: cadence: re-check Peripheral status with delayed_work" commit can reproduce the issue.

plbossart commented 1 day ago

can you check the logs @bardliao, in theory if you see a parity error then it should be reported in this delayed work part

    for (i = 0; i <= SDW_MAX_DEVICES; i++) {
        status[i] = val & 0x3;
        if (status[i])
            dev_dbg(cdns->dev, "Peripheral %d status: %d\n", i, status[i]);
        val >>= 2;
    }

If there's really a parity issue this would be status 0x2 (alert).

bardliao commented 1 day ago

@plbossart Yes, [ 33.214013] soundwire_cadence:cdns_check_attached_status_dwork: soundwire_intel soundwire_intel.link.0: Peripheral 6 status: 2 And sdw_handle_slave_alerts is called twice.

[   33.214031] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts
[   33.294601] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts

Then

[   33.302491] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[   33.302494] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 0 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[   33.302496] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[   33.302553] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true
[   33.302616] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[   33.302617] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 1 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[   33.302618] rt711-sdca sdw:0:0:025d:0711:01: Parity error detected
[   33.302621] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[   33.302678] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true

diff --git a/drivers/soundwire/bus.c b/drivers/soundwire/bus.c
index 263ca32f0c5c..53df723fe258 100644
--- a/drivers/soundwire/bus.c
+++ b/drivers/soundwire/bus.c
@@ -1618,6 +1618,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
        bool parity_check;
        bool parity_quirk;

+       dev_info(&slave->dev, "bard: %s\n", __func__);
        sdw_modify_slave_status(slave, SDW_SLAVE_ALERT);

        ret = pm_runtime_get_sync(&slave->dev);
@@ -1654,6 +1655,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
        }

        do {
+               dev_info(&slave->dev, "bard: buf %#x\n", buf);
                slave_notify = false;

                /*
@@ -1665,6 +1667,8 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
                        parity_quirk = !slave->first_interrupt_done &&
                                (slave->prop.quirks & SDW_SLAVE_QUIRKS_INVALID_INITIAL_PARITY);

+                       dev_info(&slave->dev, "bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done %d slave->prop.scp_int1_mask %#x slave->prop.quirks %#x\n",
+                               slave->first_interrupt_done, slave->prop.scp_int1_mask, slave->prop.quirks);
                        if (parity_check && !parity_quirk)
                                dev_err(&slave->dev, "Parity error detected\n");
                        clear |= SDW_SCP_INT1_PARITY;
@@ -1754,6 +1758,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
                }

                /* Ack interrupt */
+               dev_info(&slave->dev, "bard: %s clear %#x\n", __func__, clear);
                ret = sdw_write_no_pm(slave, SDW_SCP_INT1, clear);
                if (ret < 0) {
                        dev_err(&slave->dev,
@@ -1763,6 +1768,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)

                /* at this point all initial interrupt sources were handled */
                slave->first_interrupt_done = true;
+               dev_info(&slave->dev, "bard: %s set first_interrupt_done = true\n", __func__);

                /*
                 * Read status again to ensure no new interrupts arrived

My theory is that we got the alert interrupt and right after that cdns_check_attached_status_dwork is called. Full log: dmesg_parity.txt

bardliao commented 1 day ago

@plbossart Potential fix: https://github.com/thesofproject/linux/pull/5095

plbossart commented 1 day ago

cancel_work_sync() does two things a) it cancels the work if it didn't start already or, b) it waits for the work to be finished.

the b) part makes sense here, but what about a)?

If that if we cancel a scheduled workqueue before it's run, we will keep the interrupts disabled forever, no?

How would we deal with the alert if we first deal with the attachment dwork?

plbossart commented 1 day ago

the scenario that worries me is

a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed

-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT

the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.

Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?

plbossart commented 1 day ago

Alternate mutex-based solution in https://github.com/thesofproject/linux/pull/5096 @bardliao - completely untested, just to show what I had in mind.

bardliao commented 1 day ago

the scenario that worries me is

a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed

-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT

the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.

Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?

We will check Peripheral status after the interrupts are disabled in cdns_check_attached_status_dwork, right? And we can get the alert. Am I missing something?

plbossart commented 23 hours ago

yes we'll get the peripheral status, but we will not deal with the interrupt registers, e.g. MCP_INTSTAT will not be cleared. I think we will create a mess if we deal with interrupt registers in two places. All we wanted was to read and update the status, this can be done in a simpler way I think

marc-hb commented 11 hours ago

Just spotted a couple of parity errors today on ba-adlp-rvp-sdw-02 in the daily test run.

43530?model=ADLP_RVP_SDW-ipc4&testcase=check-playback-50rounds

They are nowhere near as frequent though.

thesofproject / linux

Very frequent rt711-sdca "parity errors" on tglu-sku0a32-sdca / 6.10-rc5 regression? #5090

5081