Open cbiffle opened 3 months ago
I may have found a clue here.
I worked out how to reproduce this fairly reliably, and set up the Saleae in analog mode. During an apparently reasonable transaction with the disk, just before we decided to do a reset sequence, we have this:
This is a glitch in SDA while the disk is in control of its level (we can tell it's the disk from the voltage offset, since the SP is on the far side of a level translator that introduces a ground offset). It occurs while SCL is high, specifically immediately after SCL's rising edge. Outside of START and STOP conditions, SDA is not supposed to change while SCL is high in I2C.
Now, as measured at the logic analyzer, this glitch doesn't reach the 0.3 VDD threshold for being considered "not zero" (0.99 V for VDD = 3.3 V). However, the logic analyzer's sample rate is limited, it seems to apply some analog filtering, and the SP is on the far side of a level shifter, so the glitch may have had a larger magnitude from the SP's perspective than the capture shows.
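To make the threshold arithmetic concrete, here is a minimal sketch that classifies a measured voltage against the 0.3 VDD / 0.7 VDD input thresholds from the I2C specification. The `offset` parameter and the example voltages are assumptions for illustration, not measurements from this capture; the point is only that a modest offset across the level translator could push a sub-threshold glitch out of the guaranteed-low region.

```python
def input_thresholds(vdd):
    """Return (V_IL max, V_IH min) using the I2C 0.3*VDD / 0.7*VDD rule."""
    return 0.3 * vdd, 0.7 * vdd


def classify(voltage, vdd, offset=0.0):
    """Classify a measured voltage as seen by a receiver.

    `offset` is a hypothetical extra voltage (e.g. from a level translator)
    added to the measured value before comparing against the thresholds.
    """
    v_il, v_ih = input_thresholds(vdd)
    seen = voltage + offset
    if seen <= v_il:
        return "low"
    if seen >= v_ih:
        return "high"
    return "undefined"  # between thresholds: receiver may read either level


if __name__ == "__main__":
    print(input_thresholds(3.3))
    print(classify(0.8, 3.3))              # as measured at the analyzer
    print(classify(0.8, 3.3, offset=0.3))  # with an assumed 0.3 V offset
```

With these made-up numbers, a 0.8 V glitch reads as a solid low at the analyzer but lands in the undefined region once the assumed offset is added.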
This is relevant because of how the STM32 I2C controller defines Bus Error:
A bus error is detected when a START or a STOP condition is detected and is not located after a multiple of 9 SCL clock pulses. A START or a STOP condition is detected when a SDA edge occurs while SCL is high.
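As a rough illustration of the rule quoted above, the following sketch scans a list of sampled `(scl, sda)` levels and flags any START or STOP that does not fall after a multiple of 9 complete SCL pulses. This is one possible reading of the sentence, not a model of the actual peripheral: the function names, the sample format, and the choice to count a pulse on SCL's falling edge are all assumptions, and the real hardware may count differently.

```python
def find_bus_errors(samples):
    """Return [(sample_index, "START" | "STOP"), ...] for misplaced conditions.

    samples: list of (scl, sda) logic levels (0 or 1) in time order.
    """
    errors = []
    pulses = None      # complete SCL pulses since the last START; None = idle
    seen_high = False  # has SCL risen since the last START or falling edge?
    prev = None
    for i, (scl, sda) in enumerate(samples):
        if prev is not None:
            p_scl, p_sda = prev
            if p_scl == 0 and scl == 1:
                seen_high = True
            elif p_scl == 1 and scl == 0:
                # A pulse is complete when SCL falls after having risen.
                if pulses is not None and seen_high:
                    pulses += 1
                seen_high = False
            elif scl == 1 and sda != p_sda:
                # SDA edge while SCL stays high: START (falling) or STOP (rising).
                kind = "START" if sda == 0 else "STOP"
                if pulses is not None and pulses % 9 != 0:
                    errors.append((i, kind))
                pulses = 0 if kind == "START" else None
                seen_high = False
        prev = (scl, sda)
    return errors


def build_frame(bits, glitch_bit=None):
    """Build START + clocked bits + STOP; optionally glitch SDA during one bit."""
    samples = [(1, 1), (1, 0), (0, 0)]  # idle, START, SCL low
    for n, b in enumerate(bits):
        if n == glitch_bit:
            # SDA briefly flips and returns while SCL is high.
            samples += [(0, b), (1, b), (1, 1 - b), (1, b), (0, b)]
        else:
            samples += [(0, b), (1, b), (1, b), (0, b)]
    samples += [(0, 0), (1, 0), (1, 1)]  # STOP
    return samples


# Address 0x6A with the read bit set, followed by an ACK (low) from the target.
ADDR_READ_ACK = [1, 1, 0, 1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    print(find_bus_errors(build_frame(ADDR_READ_ACK)))
    print(find_bus_errors(build_frame(ADDR_READ_ACK, glitch_bit=8)))
```

The clean frame produces no errors. Placing the glitch in the ACK bit is an arbitrary choice for the example; the rising half of the glitch looks like a STOP after 8 pulses, which this model reports as a bus error.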
If this glitch had enough swing at the SP for it to be recognized as a rising edge, then the controller is likely to have set the BERR flag. Our I2C driver's reaction to the BERR condition is currently to generate a bus-wiggle reset sequence. (I think this is an overreaction in any case where the lines are not stuck, so maybe I'll fix that later.)
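A gentler policy might look something like the sketch below: on BERR, only fall back to the bus recovery sequence if a line has actually been held low for longer than some timeout, and otherwise just clear the flag and retry. The names, the inputs, and the 35 ms threshold (borrowed from the SMBus clock-low timeout) are assumptions for illustration; this is not the current driver's structure.

```python
from enum import Enum


class Action(Enum):
    CLEAR_AND_RETRY = 1  # clear the error flag and retry the transaction
    BUS_RECOVERY = 2     # take over the pins and run the wiggle sequence


# Assumed threshold for calling a line "stuck" (hypothetical choice).
STUCK_AFTER_US = 35_000


def on_bus_error(scl_low_us, sda_low_us, stuck_after_us=STUCK_AFTER_US):
    """Pick a reaction to BERR given how long each line has been low."""
    if scl_low_us >= stuck_after_us or sda_low_us >= stuck_after_us:
        return Action.BUS_RECOVERY
    return Action.CLEAR_AND_RETRY


if __name__ == "__main__":
    print(on_bus_error(scl_low_us=900, sda_low_us=0))
    print(on_bus_error(scl_low_us=0, sda_low_us=50_000))
```

Under this policy, a device stretching SCL for roughly 900 us would lead to a retry rather than a reset.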
And sure enough, just off the right side of the capture image, the SP begins a reset sequence.
Currently we're missing some counters that would let me distinguish BusError from other cases, so this analysis is what I've got. But it would explain why we sometimes appear to generate reset sequences even in cases where we don't appear to have lost arbitration.
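The kind of counters I have in mind could be as simple as the following sketch, which tallies reset sequences by the condition that triggered them. The cause names are placeholders, not identifiers from the driver.

```python
from collections import Counter

# Hypothetical cause labels; the real driver would use its own error type.
BUS_ERROR = "bus_error"
ARBITRATION_LOST = "arbitration_lost"
TIMEOUT = "timeout"

reset_causes = Counter()


def record_reset(cause):
    """Count one bus reset sequence attributed to `cause`."""
    reset_causes[cause] += 1


if __name__ == "__main__":
    record_reset(BUS_ERROR)
    record_reset(BUS_ERROR)
    record_reset(ARBITRATION_LOST)
    print(dict(reset_causes))
```

With something like this exposed, a reset that follows a glitch like the one above would show up under the bus error count instead of being indistinguishable from lost arbitration.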
This is another follow-on to #1821. I'll repost the first trace from that issue for reference:
In this trace, we make it one I2C frame into a read transaction against the device at 0x6A (which happens to be an NVMe-MI interface, but I don't think that's material), when suddenly we take over the bus pins and do the bus recovery wiggle sequence, normally reserved for emergencies where SCL is stuck.
In this case, SCL is being held low by the disk for the first ~900us. This is within its rights. The problem is us -- immediately after the ack of its address, we're freaking out about it and triggering a bus reset, which is causing other knock-on problems (see #1821 and #1822).
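For context, the recovery wiggle is roughly the bus-clear procedure from the I2C specification: clock SCL up to nine times until the device releases SDA, then issue a STOP. The sketch below runs that loop against a toy bus; `FakeBus` and its methods are invented for the example and do not correspond to our driver. Note that the procedure is meant for a stuck SDA. It does nothing useful when a device is merely stretching SCL, which is the situation in this trace.

```python
class FakeBus:
    """Toy bus: a device holds SDA low until it has seen enough SCL pulses."""

    def __init__(self, pulses_until_release):
        self.remaining = pulses_until_release
        self.stop_sent = False

    def read_sda(self):
        return 0 if self.remaining > 0 else 1

    def pulse_scl(self):
        if self.remaining > 0:
            self.remaining -= 1

    def send_stop(self):
        self.stop_sent = True


def recover_bus(bus, max_pulses=9):
    """Clock SCL until SDA is released, then send STOP.

    Returns the number of pulses used, or None if SDA is still low after
    max_pulses pulses.
    """
    for used in range(max_pulses + 1):
        if bus.read_sda() == 1:
            bus.send_stop()
            return used
        if used < max_pulses:
            bus.pulse_scl()
    return None


if __name__ == "__main__":
    print(recover_bus(FakeBus(pulses_until_release=3)))
    print(recover_bus(FakeBus(pulses_until_release=20)))
```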
I currently can't explain why we're doing this. I think fixing #1821 and #1822 will at least remove one layer of obfuscation and may reveal the source, but someone more familiar with the driver architecture might want to have a look.