The SPI sharing utilities are broken for fallible chipselect pins

GrantM11235 commented 8 months ago

This affects all the spi sharing utilities in some way, but I would like to highlight the worst case scenario:

You are using the mutex or critical section impl
In thread 1 you do a transaction to device A, but the cs pin fails to deassert
Before thread 1 even has a chance to react to the error, it is preempted by thread 2
Thread 2 does a transaction to device B while the cs for device A is still active
This leads to data corruption at best, and physical damage at worst

Option 1: Infallible only

CS: OutputPin<Error = Infallible>

This is the simplest option. In fact it is even simpler than the existing impls because it allows us to remove the DeviceError enum.

You can still use fallible chipselect pins if you wrap them in some sort of adapter that panics on failure. We could even provide such an adapter.

Option 2: track "poisoned" chipselects

I have some idea how to implement this, it would involve adding a shared flag to each bus, as well as another flag for each SpiDevice. The impl would refuse to do any spi transactions until the offending chipselect is fixed.

This would be a lot of extra overhead and complexity that is completely unnecessary for 99.9% of users. I don't think the compiler will be able to remove the unnecessary overhead.

Option 3: do nothing

Document the broken error handling, how to work around it, and where it is impossible to work around.

If we want to be thorough, the amount of warnings and disclaimers will probably outweigh the amount of actual code:

expand me!

- how to detect a failed chipselect - `DeviceError::Cs`, obviously, but also - `DeviceError::Spi`, because it can "hide" chipselect errors - basically any `SpiDevice` error - downstream errors such as `DisplayError::BusWriteError` from `display_interface` (but ironically not `DisplayError::CSError`) - how to properly recover from a chipselect error - **before** using any device on the bus: - get access to the `SpiDevice` by destroying whichever driver was using it - get access to the chipselect by destroying the `SpiDevice` - do whatever you need to do to fix and deassert the chipselect pin - in multitheaded/concurrent code, it is impossible to ensure that this is done before another thread uses a device on the bus. It is not possible to correctly handle chipselect failure in this case - what happens if you use a device on the bus before fixing the chipselect error - this is even a minor problem for `ExclusiveDevice` because the framing will be messed up - if two chipselects are active at once - the other device will receive garbage data - both devices will fight to drive MISO, causing data corruption and possibly even physical damage

Conclusions

If you only have infallible chipselect pins, or if you want to panic on chipselect failure, option 1 is the best.

If you actually want to try to gracefully recover from chipselect errors, option 2 is the only real option.

Option 3 isn't good for either group. It is slightly inconvenient for the first group, and it is almost impossible to use correctly for the second group

My recommendation is option 1 because it is the best option for most people. If someone really needs option 2, they can write it for themself, it doesn't need to be in our crate.

Sh3Rm4n commented 8 months ago

If I understand it correctly, the CS assertion on hardware somewhat resembles a lock in software on the SPI device, so if the CS deassert fails, the software lock should not be unlocked either or at least let this not happen:

Thread 2 does a transaction to device B while the cs for device A is still active

If and how to recover the situation, I'm not sure about. Panicking should be avoided. Reporting the chip select error can be useful. The user is at least able to shut down the SPI to avoid any further damage to the hardware (if at risk).

ryankurte commented 8 months ago

i'm not super convinced making CS pins infallible / wrapping in panics is the right way to handle this... i can't think of any situations with fallible IOs in which you're likely to be able to -recover- from a CS failure, particularly because if a write fails you're not going to be able to tell what state the pin is in, so you'd need to basically tear down and re-init the bus.

the problem with wrapping in a panic is that it stops the error handling path that would let you at least tear things down neatly, it seems to me like a CS failure should poison the whole shared bus (within the critical section).

GrantM11235 commented 8 months ago

i'm not super convinced making CS pins infallible / wrapping in panics is the right way to handle this

It's not always the right way to handle it, but I think it is the best way to handle it for the vast majority of cases.

I think #574 strikes a good balance. It is useful to most people, has no unnecessary overhead, and is honest about it's limitations.

Anyone who does need to do more complex error handling is free to write their own SpiDevice impl. After all, there is nothing special about the impls in embedded-hal-bus, they are just some utilities that we think will be useful to other people.

If there is enough interest, we could even add another set of SpiDevice impls to embedded-hal-bus that do poisoning, but I don't think enough people would use it to justify the mild maintenance burden.

rust-embedded / embedded-hal