thesofproject / linux

Linux kernel source tree
Other
91 stars 134 forks source link

SoundWire locking issues #5052

Open plbossart opened 5 months ago

plbossart commented 5 months ago

There are 2 theoretical cases where the current locking is broken.

stream locking

the use of sdw_acquire_bus_lock() can lead to an AB/BA deadlock.

If two streams use the same link and the the m_rt information is added in opposite order, this loop would dead-lock. Example Stream A uses link0 and link1, and Stream B uses link1 and link0.

The following loop would result in StreamA lock link0 Stream B link link1 Deadlock.

        /* Iterate for all Master(s) in Master list */
    list_for_each_entry(m_rt, &stream->master_list, stream_node) {
        bus = m_rt->bus;

        mutex_lock(&bus->bus_lock);
    }

SyncArm/SyncGo locks on Intel platforms

On Intel platforms, the mechanism is to do a 'Sync Arm' in the pre-bank switch for all m_rt, then do a 'Sync Go' in the post_bank_switch callback of the first m_rt.

Assuming Stream A uses link 0 and link1, and stream B uses link2 and link3, it's theoretically possible to have the following sequence:

Stream A link0 sync arm Stream A link1 sync arm Stream B link2 sync arm Stream A link0 sync go -> link 0, 1, 2 would start when only link0 and link1 should start... Stream B link3 sync arm Stream B link2 sync go

It's anyone's guess when might happen after the first sync go. At worst, link3 never starts. At best, link3 starts but wouldn't be synchronized with link2. Both are bad options.

It's not clear if these two cases explain the problems we're seeing in #4823 but it's rather of a concern that this locking looks rather broken.

@bardliao @ujfalusi let me know if this makes any sense.

plbossart commented 5 months ago

https://sof-ci.ostc.intel.com/#/result/planresultdetail/42223?model=LNLM_SDW_AIOC&testcase=check-playback-3times is also interesting, we can see a regular transfer starting before a bank switch completes

[  397.255338] kernel: soundwire sdw-master-0-0: do_transfer_defer
[  397.255342] kernel: soundwire_intel soundwire_intel.link.0: intel_sync_go_unlocked: started
[  397.255348] kernel: soundwire_intel soundwire_intel.link.0: intel_sync_go_unlocked: done
[  397.255349] kernel: soundwire_intel soundwire_intel.link.0: intel_post_bank_switch: sync_go wait start
[  397.257453] kernel: soundwire_intel soundwire_intel.link.0: intel_post_bank_switch: sync_go wait done
[  397.279332] kernel: soundwire sdw-master-0-0: sdw_transfer  <<<<  THIS SHOULD NOT HAPPEN!!!!
[  400.270302] kernel: soundwire sdw-master-0-0: Controller Timed out on bank switch
[  400.270416] kernel: soundwire sdw-master-0-0: multi link bank switch failed: -110
[  400.270474] kernel: _sdw_prepare_stream: do_bank_switch failed for stream subdevice #0-Playback: -110
[  400.270620] kernel: sdw_release_bus_lock: releasing locks for stream subdevice #0-Playback
[  400.270629] kernel: sdw_release_bus_lock: released locks for stream subdevice #0-Playback
[  400.270633] kernel:  SDW0-Playback: ASoC: error at snd_soc_link_prepare on SDW0-Playback: -110
[  400.270737] kernel:  SDW0-Playback: ASoC: error at __soc_pcm_prepare on SDW0-Playback: -110
[  400.270768] kernel:  Jack Out: ASoC: error at dpcm_be_dai_prepare on Jack Out: -110
[  400.270798] kernel:  Jack Out: ASoC: error at dpcm_fe_dai_prepare on Jack Out: -110
plbossart commented 5 months ago

Another possible locking issue:

a) the bank switch starts and the bus takes the msg->lock b) an interrupt happens and the workqueue schedules a read from IntStat0 register, which is blocked by the msg->lock c) the manager writes the bank switch command, which is blocked by the read that won't go out.

So I think we need to disable interrupts coming from the peripherals during a bank switch.

Edit: after talking with @bardliao this isn't really possible.

bardliao commented 5 months ago

Another possible locking issue:

a) the bank switch starts and the bus takes the msg->lock b) an interrupt happens and the workqueue schedules a read from IntStat0 register, which is blocked by the msg->lock c) the manager writes the bank switch command, which is blocked by the read that won't go out.

So I think we need to disable interrupts coming from the peripherals during a bank switch.

I added 3 sec sleep after do_transfer_defer, and keep pressing button key while headphone playback. It looks like the sdw_transfer will be called after bank switch in normal case. Not sure what happens in the failed case.

[   61.834921] sdw_acquire_bus_lock: acquiring locks for stream subdevice #0-Playback
[   61.834923] sdw_acquire_bus_lock: acquired locks for stream subdevice #0-Playback
[   61.834925] soundwire sdw-master-0-0: sdw_transfer
[   61.835091] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835166] soundwire sdw-master-0-0: sdw_transfer
[   61.835382] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835446] soundwire sdw-master-0-0: sdw_transfer
[   61.835505] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835522] soundwire sdw-master-0-0: sdw_transfer
[   61.835608] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835624] soundwire sdw-master-0-0: sdw_transfer
[   61.835677] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835692] soundwire sdw-master-0-0: sdw_transfer
[   61.835805] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.835867] soundwire sdw-master-0-0: sdw_transfer
[   61.835983] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.836046] soundwire sdw-master-0-0: sdw_transfer
[   61.836125] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.836185] soundwire sdw-master-0-0: sdw_transfer
[   61.836286] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.836365] soundwire sdw-master-0-0: sdw_transfer
[   61.836580] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.836590] soundwire sdw-master-0-0: sdw_transfer
[   61.836650] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   61.836673] soundwire sdw-master-0-0: do_transfer_defer
[   61.836679] soundwire sdw-master-0-0: bard: before sleep 3 sec
[   64.838910] soundwire sdw-master-0-0: bard: after sleep 3 sec
[   64.839577] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   64.839654] soundwire sdw-master-0-0: sdw_transfer
[   64.840052] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   64.840133] sdw_release_bus_lock: releasing locks for stream subdevice #0-Playback
[   64.840174] sdw_release_bus_lock: released locks for stream subdevice #0-Playback
bardliao commented 5 months ago

Same test as https://github.com/thesofproject/linux/issues/5052#issuecomment-2160611644 with https://github.com/thesofproject/linux/pull/5047

[   43.664000] sdw_acquire_bus_lock: acquiring locks for stream subdevice #0-Playback
[   43.664002] sdw_acquire_bus_lock: acquired locks for stream subdevice #0-Playback
[   43.664006] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x130 flags 0x1
[   43.664009] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.664211] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.664281] soundwire sdw-master-0-0: sdw_transfer: done
[   43.664296] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x102 flags 0x0
[   43.664301] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.664477] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.664553] soundwire sdw-master-0-0: sdw_transfer: done
[   43.664556] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x102 flags 0x1
[   43.664561] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.664611] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.664621] soundwire sdw-master-0-0: sdw_transfer: done
[   43.664623] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x103 flags 0x1
[   43.664625] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.664678] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.664687] soundwire sdw-master-0-0: sdw_transfer: done
[   43.664689] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x132 flags 0x1
[   43.664690] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.664768] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.664830] soundwire sdw-master-0-0: sdw_transfer: done
[   43.664834] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x134 flags 0x1
[   43.664837] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665017] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665083] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665088] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x138 flags 0x1
[   43.665092] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665322] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665394] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665398] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x135 flags 0x1
[   43.665402] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665470] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665481] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665482] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x137 flags 0x1
[   43.665484] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665532] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665542] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665543] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x133 flags 0x1
[   43.665544] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665595] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665604] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665605] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x136 flags 0x1
[   43.665607] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   43.665664] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   43.665673] soundwire sdw-master-0-0: sdw_transfer: done
[   43.665682] soundwire_intel soundwire_intel.link.0: intel_sync_arm: started
[   43.665687] soundwire_intel soundwire_intel.link.0: intel_sync_arm: done
[   43.665691] soundwire sdw-master-0-0: do_transfer_defer
[   43.665699] soundwire sdw-master-0-0: bard: before sleep 3 sec
[   46.726949] soundwire sdw-master-0-0: bard: after sleep 3 sec
[   46.726983] soundwire_intel soundwire_intel.link.0: intel_sync_go_unlocked: started
[   46.727002] soundwire_intel soundwire_intel.link.0: intel_sync_go_unlocked: done
[   46.727004] soundwire_intel soundwire_intel.link.0: intel_post_bank_switch: sync_go wait start
[   46.727537] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   46.729114] soundwire_intel soundwire_intel.link.0: intel_post_bank_switch: sync_go wait done
[   46.729124] soundwire sdw-master-0-0: sdw_transfer: dev 6 addr 0x120 flags 0x1
[   46.729128] soundwire sdw-master-0-0: sdw_transfer: after mutex
[   46.729277] soundwire_intel soundwire_intel.link.0: sdw_cdns_irq int_status 0x80000004
[   46.729323] soundwire sdw-master-0-0: sdw_transfer: done
[   46.729327] sdw_release_bus_lock: releasing locks for stream subdevice #0-Playback
[   46.729330] sdw_release_bus_lock: released locks for stream subdevice #0-Playback
bardliao commented 5 months ago

a) the bank switch starts and the bus takes the msg->lock b) an interrupt happens and the workqueue schedules a read from IntStat0 register, which is blocked by the msg->lock c) the manager writes the bank switch command, which is blocked by the read that won't go out.

Are you referring to mutex_lock(&bus->msg_lock);? mutex_lock(&bus->msg_lock); is used by do_bank_switch and sdw_transfer. So in theory, sdw_transfer will wait for do_bank_switch when an interrupt happens, right?

plbossart commented 5 months ago

In theory yes, but there's a corner case as described above

a) the bank switch starts and the bus takes the msg->lock b) an interrupt happens and the workqueue schedules a read from IntStat0 register, which is blocked by the msg->lock c) the manager writes the bank switch command, which is blocked by the read that won't go out.

b) is blocked by a) and c) is blocked by b)

plbossart commented 5 months ago

wait @bardliao are you saying that with a 3s sleep time you never see any bank switch errors?

bardliao commented 5 months ago

wait @bardliao are you saying that with a 3s sleep time you never see any bank switch errors?

No, I tested it manually which can not reproduce the issue anyway.