Open marc-hb opened 3 days ago
@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...
@bardliao is this something you can try to bisect with your TGL device? I don't have one available so can't do it myself...
Sure, I will try to bisect with my TGL device.
FWIW, jf-cml-sku0983-sdw-1 uses RT711 too and has no failure.
not the same version of RT711, CML uses the 'old' version and TGL the SDCA version. the -sdca suffix matters!
This seems to be related to https://github.com/thesofproject/linux/pull/5074. I did some test today and confirmed that no issue on topic/sof-dev-rebase and broonie/for-next branch. And can reproduce the issue after applying https://github.com/thesofproject/linux/pull/5074 on top of the topic/sof-dev-rebase branch.
Humm very interesting.
the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.
I can't quite figure out what the deferred enumeration check would have to do with the parity error.
Can you try with this that last commit @bardliao ?
Humm very interesting.
the first two commits of PR #5074 are ace-2.x specific so that would point to the last patch with the deferred workqueue.
I can't quite figure out what the deferred enumeration check would have to do with the parity error.
Can you try with this that last commit @bardliao ?
Sure, I am guessing it is related to slave->first_interrupt_done.
@plbossart Apply the "soundwire: cadence: re-check Peripheral status with delayed_work" commit can reproduce the issue.
can you check the logs @bardliao, in theory if you see a parity error then it should be reported in this delayed work part
for (i = 0; i <= SDW_MAX_DEVICES; i++) {
status[i] = val & 0x3;
if (status[i])
dev_dbg(cdns->dev, "Peripheral %d status: %d\n", i, status[i]);
val >>= 2;
}
If there's really a parity issue this would be status 0x2 (alert).
@plbossart Yes, [ 33.214013] soundwire_cadence:cdns_check_attached_status_dwork: soundwire_intel soundwire_intel.link.0: Peripheral 6 status: 2
And sdw_handle_slave_alerts
is called twice.
[ 33.214031] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts
[ 33.294601] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts
Then
[ 33.302491] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[ 33.302494] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 0 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[ 33.302496] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[ 33.302553] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true
[ 33.302616] rt711-sdca sdw:0:0:025d:0711:01: bard: buf 0x1
[ 33.302617] rt711-sdca sdw:0:0:025d:0711:01: bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done 1 slave->prop.scp_int1_mask 0x3 slave->prop.quirks 0x1
[ 33.302618] rt711-sdca sdw:0:0:025d:0711:01: Parity error detected
[ 33.302621] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts clear 0x1
[ 33.302678] rt711-sdca sdw:0:0:025d:0711:01: bard: sdw_handle_slave_alerts set first_interrupt_done = true
diff --git a/drivers/soundwire/bus.c b/drivers/soundwire/bus.c
index 263ca32f0c5c..53df723fe258 100644
--- a/drivers/soundwire/bus.c
+++ b/drivers/soundwire/bus.c
@@ -1618,6 +1618,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
bool parity_check;
bool parity_quirk;
+ dev_info(&slave->dev, "bard: %s\n", __func__);
sdw_modify_slave_status(slave, SDW_SLAVE_ALERT);
ret = pm_runtime_get_sync(&slave->dev);
@@ -1654,6 +1655,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
}
do {
+ dev_info(&slave->dev, "bard: buf %#x\n", buf);
slave_notify = false;
/*
@@ -1665,6 +1667,8 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
parity_quirk = !slave->first_interrupt_done &&
(slave->prop.quirks & SDW_SLAVE_QUIRKS_INVALID_INITIAL_PARITY);
+ dev_info(&slave->dev, "bard: buf & SDW_SCP_INT1_PARITY, slave->first_interrupt_done %d slave->prop.scp_int1_mask %#x slave->prop.quirks %#x\n",
+ slave->first_interrupt_done, slave->prop.scp_int1_mask, slave->prop.quirks);
if (parity_check && !parity_quirk)
dev_err(&slave->dev, "Parity error detected\n");
clear |= SDW_SCP_INT1_PARITY;
@@ -1754,6 +1758,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
}
/* Ack interrupt */
+ dev_info(&slave->dev, "bard: %s clear %#x\n", __func__, clear);
ret = sdw_write_no_pm(slave, SDW_SCP_INT1, clear);
if (ret < 0) {
dev_err(&slave->dev,
@@ -1763,6 +1768,7 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
/* at this point all initial interrupt sources were handled */
slave->first_interrupt_done = true;
+ dev_info(&slave->dev, "bard: %s set first_interrupt_done = true\n", __func__);
/*
* Read status again to ensure no new interrupts arrived
My theory is that we got the alert interrupt and right after that cdns_check_attached_status_dwork
is called.
Full log: dmesg_parity.txt
@plbossart Potential fix: https://github.com/thesofproject/linux/pull/5095
cancel_work_sync() does two things a) it cancels the work if it didn't start already or, b) it waits for the work to be finished.
the b) part makes sense here, but what about a)?
If that if we cancel a scheduled workqueue before it's run, we will keep the interrupts disabled forever, no?
How would we deal with the alert if we first deal with the attachment dwork?
the scenario that worries me is
a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed
-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT
the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.
Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?
Alternate mutex-based solution in https://github.com/thesofproject/linux/pull/5096 @bardliao - completely untested, just to show what I had in mind.
the scenario that worries me is
a) an alert occurs b) sdw_cdns_irq() detects the alert and schedules the workqueue c) the dwork is scheduled and it cancels the workqueue before it's executed
-> cdns_update_slave_status_work() will not be executed and as a result the interrupt will not be cleared with a write to MCP_INTSTAT
the open is whether would be handled a second time. Because the device did not change status, the INTSTATn bit will not change so the interrupt will not be handled.
Looks like we need a different way of preventing the two workqueues from executing concurrently, maybe the only solution is a mutex?
We will check Peripheral status after the interrupts are disabled in cdns_check_attached_status_dwork
, right? And we can get the alert. Am I missing something?
yes we'll get the peripheral status, but we will not deal with the interrupt registers, e.g. MCP_INTSTAT will not be cleared. I think we will create a mess if we deal with interrupt registers in two places. All we wanted was to read and update the status, this can be done in a simpler way I think
Just spotted a couple of parity errors today on ba-adlp-rvp-sdw-02 in the daily test run.
43530?model=ADLP_RVP_SDW-ipc4&testcase=check-playback-50rounds
They are nowhere near as frequent though.
Starting from daily test run 43136, most tests fail on tglu-sku0a32-sdca most of the time, always because a parity error.
Last successful daily test run: 43016 on June 23rd with kernel commit e9eea140a1d9
First failing daily test run 43136 on June 25th with kernel commit c75a95d34885.
There are more than 1000 commits between the two, most of them apparently coming from 6.10-rc5
5081
cc: @bardliao , @plbossart , @ssavati , @shumingfan