Avoid occasional hang in NRF52I2C::waitForStop

martinwork commented 2 years ago

NRF52I2C::waitForStop can hang in the while loop https://github.com/lancaster-university/codal-nrf52/blob/master/source/NRF52I2C.cpp#L128. This PR replaces the while loop with a short timeout for loop, and adds code to reset I2C if STOPPED doesn't happen.

Could we shorten the timeout in waitForStop, from the current 10 seconds (10us * 1 million) to, say, 50ms or 100ms? The longest successful operation I have seen is about 12ms.

Would it be useful to add NRF52I2C::setTimeout/getTimeout?

microbit-carlos commented 2 years ago

@finneyj this is something that need your review. Let us know if you need a call to discuss it.

microbit-carlos commented 2 years ago

@martinwork when you get a chance, would you be able to add a bit of info here about when this error happens and a hex with code to reproduce? The main discussion is quite long and complex and in a private repo.

martinwork commented 2 years ago

Here is the test program, hex and source: i2c-serial.zip, It tests datalogging with serial mirror. After flashing the hex, press button A to start. It usually hangs after about 3 minutes (LEDs stop flashing), but can survive longer. See comments in the source.

finneyj commented 2 years ago

Thanks @martinwork. Sorry for slow response here.

If I understand correctly, it seems this patch:

factors out clearing the I2C bus (and associated diagnostics) into a new function
adds calls to that function before every read/write operation.
adds a 1mS timeout when the I2C peripheral is instructed to assert a STOP condition on the bus
forces a much deeper hardware reset of the I2C peripheral when we detect an I2C bus hang

Generally this all sounds fine and code looks good to me. A couple of questions though:

1) Did you mean to timeout the STOP after 1ms? It sounds sensible, but I've never profiled how long this could take in normal conditions. I guess you've checked that the "normal" path executes most of the time, and the exception path (with the hard rest of the hardware peripheral) is followed less often?

2) You discuss a proposal above to reduce the waitForStop() timeout. The numbers you're suggesting sound fairly sensible to me, but we do need to bear in mind that this code is also used for external I2C peripherals and also for devices other than micro:bit. I think the timeout was so high because the last time I investigated it, the failure case seemed super rare (like once every few days in a stress test)... so being super conservative seemed to carry low cost. I'm not against reducing to say 100ms if we have compelling evidence. I'm curious that you say "It usually hangs after about 3 minutes" above. This is much more common than I saw before. I seem to recall previously that the common fail case reported an error, not a timeout. I assume you're testing with a micro:bit v2 here. Is it a V2.00 or V2.2?

martinwork commented 2 years ago

Thanks @finneyj This PR was to flag up that there was a way to reset the peripheral to get past the hang.

With "normal" micro:bit logging regimes the hang is less common, but when data logging, even hanging after a few days could be annoying! I think I saw it happen randomly with a fairly normal test, so looked for a way to trigger it more often. The test program in https://github.com/lancaster-university/codal-nrf52/pull/40#issuecomment-1028901832 usually hangs after a few minutes.

I haven't seen a hang with just accelerometer etc., but I haven't found a micro:bit model and interface firmware combination which doesn't hang during data logging. I haven't managed to figure out why it hangs or how to avoid the hang.

factors out clearing the I2C bus (and associated diagnostics) into a new function adds calls to that function before every read/write operation.

The new function, clearError, only clears ERRORSRC and NRF_TWIM_EVENT_ERROR, not the I2C bus. The addition is clearing ERRORSRC, and DMESGing the error, to make it clearer when errors occur.

Maybe ERRORSRC only needs to be cleared when debugging, and the whole thing should be made conditional on DMESG, so the non-debug version is unchanged?

adds a 1mS timeout when the I2C peripheral is instructed to assert a STOP condition on the bus

Did you mean to timeout the STOP after 1ms? It sounds sensible, but I've never profiled how long this could take in normal conditions. I guess you've checked that the "normal" path executes most of the time, and the exception path (with the hard rest of the hardware peripheral) is followed less often?

The hang occurs in the tight while loop that waits for STOPPED, when the RESUME/STOP is used to try to recover after a transfer has failed, so we need to timeout the tight while loop and try something else.

It's a long time since I was doing the tests! I have a feeling that RESUME/STOP works in case of an error, but doesn't work when waitForStop reaches its 10s timeout.

I've just updated this PR to wait longer for STOPPED, and added back some DMESG. I'm building https://github.com/martinwork/microbit-v2-samples/tree/i2c-recover with codal-microbit-v2 branch daplink-interface-versioning. See today's test results below.

After previous tests I noted... The hang seems to occur during the read half of a MICROBIT_USB_FLASH_WRITE_CMD transaction, after the write part has seen the "no stop" case: https://github.com/lancaster-university/codal-nrf52/blob/master/source/NRF52I2C.cpp#L151. But the "no stop" case is very common, and only rarely followed by a read failure. V2.2 and V2.0 seem similar.

You discuss a proposal above to reduce the waitForStop() timeout.

Yes, and to make it configurable. I was hoping others have a better idea of what the default timeout should be. I have no compelling evidence that it can safely be less than 10s. If we have to wait 10s, and then another 10s for RESUME/STOP, it could still be useful to avoid the hang. On a long log we might not notice, but shorter logs would still see an occasional huge glitch. Maybe the default could be 10s, and MicroBitXYZ clients could reduce it and reset it around their calls?

we do need to bear in mind that this code is also used for [...] devices other than micro:bit.

Should I make the whole PR depend on a config macro?

Testing an unmodified retail V2.0, it looks like RESUME/STOP completes in 10-20us after an error, but doesn't work during logging after 277s.

NRF52I2C error 1 ERRORSRC 0x2
NRF52I2C STOPPED 1
relocate vtor to 0x0 -> 0x20002200 0x20002200
Time (milliseconds),delta,txbuf,a,b,c,d,e,f,g,h,i,j
5491,0,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
...
218803,16,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
NRF52I2C RECOVERED
TRANSACT: [I2C READ ERROR: -1010]

a second try ends

323776,16,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
NRF52I2C RECOVERED
TRANSACT: [I2C READ ERROR: -1010]

A retail V2.20 with latest interface firmware is similar:

NRF52I2C error 1 ERRORSRC 0x2
NRF52I2C STOPPED 1
relocate vtor to 0x0 -> 0x20002200 0x20002200
...
581983,7,0,1023,1023,1023,1023,1023,1023,1023,1023,1023,1023
NRF52I2C RECOVERED
TRANSACT: [I2C READ ERROR: -1010]

finneyj commented 2 years ago

Thanks @martinwork.

I fully agree - we need to recover from this hang. I dig look into this last year sometime, when the bug first emerged. I suspect it's a hardware race condition, where one of the I2C hardware events gets lost. I vaguely remember it happening when there was a lot of IRQ activity going on that used the same hardware event bus as the I2C peripheral (or similar)... Anyhow, the cause seems to be deep in the hardware, not the software.

I don't think we should put a conditional compile block around this. We should aim to make the I2C code as stable as possible whilst maintenance as much performance as we can, and have that in all builds.

Sorry, I don't think we need to wait 10s before resetting the bus if the I2C peripheral fails to assert a STOP condition. I was just querying is the 1ms delay was enough under normal conditions - it seems from your tests above that it is, (a timeout of 1ms seems fine if it normally responds within 20uS). Apologies if I sent you off in the wrong direction there!

For the timeout in the main waitForStop() timeout, why don't we just make hat a CONFIG option, with a default set to 10s? We can then override that in the microbit-v2 target to be something shorter, whilst leaving the default the same for all other targets?

martinwork commented 2 years ago

Sorry, I don't think we need to wait 10s before resetting the bus if the I2C peripheral fails to assert a STOP condition.

No, but you were absolutely right to make me want to check! I can't remember how I arrived at 1ms. I can't find any mention of how long it might take. Maybe waiting 10ms wouldn't hurt?

For the timeout in the main waitForStop() timeout, why don't we just make hat a CONFIG option, with a default set to 10s? We can then override that in the microbit-v2 target to be something shorter, whilst leaving the default the same for all other targets?

Making the default a config option sounds right, but I don't know what the micro:bit default would be! Retaining the super long default for the unknown case, but enabling it to be temporarily reduced for known cases, seems like the cautious approach. That would enable a good workaround for datalogging.

martinwork commented 2 years ago

@finneyj I have removed the unnecessary changes that were just for debugging, and added configs for the timeouts., default 10s for the main timeout, then 1s for the RESUME/STOP before resetting.

Retesting without DMESG enabled... With V2.0: hang after 110s, reset OK With V2.0: hang after 521s, reset OK With V2.20: hang after 119s, reset OK With V2.20: hang after 3066s, reset OK

scc110-2022 commented 2 years ago

OK, thanks @martinwork. Looks good, happy for this to be merged @JohnVidler.

Deciding if we want to override default timeout for micro:bit v2 is a policy question we can now easily take further down the road if this I2C "feature" affects the experience too much.

Thanks for the contribution!

finneyj commented 2 years ago

oops - Just for visibility, that was my comment above @martinwork @JohnVidler. I was accidentally using a University account by mistake - sorry. :)

-- Joe

lancaster-university / codal-nrf52

Avoid occasional hang in NRF52I2C::waitForStop #40