raspberrypi / pico-sdk

BSD 3-Clause "New" or "Revised" License
3.7k stars 918 forks source link

Intermittent i2c slave mode errors, interrupts fired in the wrong order #1102

Open chrisckc opened 1 year ago

chrisckc commented 1 year ago

Background:

I have been having trouble setting up reliable communications between 2 Pico boards using i2c

I started off using Earle Philhower's Arduino-Pico RP2040 core for testing this as having previously had issues with random delays using interrupts on input pins with ArduinoCore-MBED I had to abandon it for a previous use case.

I have been unable to achieve reliable i2c communications between the 2 Pico boards, the wires are short and I went from 4.7k pull-ups to 1k pull-ups with no difference. I am seeing missing data which is mostly the last 1 or 2 bytes of the transmission, very occasionally I also see a missing byte at the start of the data block. The error rate varies according to various other factors, I have seen anything from 10% to 0.06 % error rate, the most common being around 1 to 2%

I created a test harness project for this and replicated a minimal version of the circuit on a breadboard using 2 brand new Pico's. I have tried many things and so far been unable to get a zero failure rate. My scope confirms that all the data is always being sent correctly, the errors occur when the receiving Pico reports less than expected number of bytes in the onReceive interrupt.

The errors occur randomly, I have tried different i2c clock speeds, 100KHz, 400KHz default and 1MHz, each speed has an impact on the failure rate. Other factors such as if the serial port is transmitting data during i2c data reception also makes a difference. The gap between each transmission block also has an impact on the error rate, I have tried anything from zero delay to a 1000uS delay, which seems to reduce the error rate noticably.

I substituted a Teensy4.0 in place of the receiving Pico using the exact same code and managed to achieve a zero failure rate, even at 1MHz.

Everything I have tried so far is described over on the Arduino-Pico repo: https://github.com/earlephilhower/arduino-pico/issues/979

I became convinced that this is an SDK or even a hardware issue, so I modified my test code on the receiving Pico to achieve the same using the i2c slave functions in the SDK.

Having now explained how I got to this point, here is the separate issue I am logging in relation to the SDK:

I am sending the data to the receiving Pico at 10 Hz, the data consists of 2 blocks of data separated by a 1mS delay, the first block consists of 84 bytes, the second consists of 48 bytes.

To implement the i2c slave mode using the SDK functions, I found a small library which was written against the SDK, to save time in setting it up: https://github.com/vmilea/pico_i2c_slave

After doing this, I am still getting errors, now using the SDK directly so I have now effectively ruled out an issue with Arduino-Pico or the official ArduinoCore-MBED for which I was seeing these errors when using both. (although ArduinoCore-MBED had other issues)

This is my understanding off how i2c slave mode works using the SDK: Using the i2c functions in the SDK, the receive interrupt handler is fired immediately after each byte is received on the wire, with the occasional small delay observed on my scope. The Wire API as implemented in Arduino-Pico is aggregating these interrupts and firing a single interrupt (onReceive) the end of the transmission to match the intended Arduino Wire API behaviour.

The receive interrupt handler as used by the SDK provides an event variable to distinguish what the interrupt is for. Inside the receive interrupt handler there needs to be a switch case statement on this event to check if the data has been received, has been requested or has finished being sent. To debug this I setup 2 debug pins on each of these conditions so I can capture when each data byes arrives and when the transmission has finished. There is another debug pin to signal the error condition, the missing data.

The problem:

What I have found so far is that when the error condition occurs, there is an extra firing of the i2c receive interrupt immediately after the interrupt which signalled the end of the transmission. The extra interrupt firing is a "data received" event, indicating that the "end of transmission" event interrupt occurred before the interrupt signalling the reception of the last byte. This looks like an out-of-order interrupt firing which is happening sometimes.

After the error condition occurs, there is then a spurious "end of transmission" interrupt at the start of the reception of the next transmission.

For successful data reception, the "end of transmission" interrupt occurs 3.4uS after the interrupt for the last byte, just slightly longer than the i2c clock period of 2.7uS when using a 400KHz i2c clock (actually 365KHz as measured).

Scope traces for Successful data reception:

This is what a successful data reception looks like, the yellow trace is the error debug pin, the purple trace is the i2c clock, the blue trace is the interrupt firing for the event matching data reception, the green trace is the interrupt firing for the event matching the end of the transmission: ScreenImg-42 In the above scope image, the gap between each block of data is 1mS

Zoomed in further: ScreenImg-41

and further, here you can clearly see the green low pulse indicating the end of transmission occurring after the last blue pulse for the last data byte, which is correct and what should be expected: ScreenImg-40

The above looks sensible and allows data reception to be easily handled in user code.

Scope traces for failed data reception, out-of-order interrupt:

Here it can be seen the the green low pulse which is the interrupt firing for the event matching the end of the transmission is occurring before the last blue low pulse which is the interrupt firing for the last data reception event: ScreenImg-43

This is what happens for the next data transmission (after 1uS delay) after the out-of-order interrupt seen above: The cursors are measuring 9 clock pulses which is the number required to transmit 1 byte of data, or in this case the address byte: ScreenImg-44

As can be seen above, at some point during the reception of the address byte, the interrupt event for the end of data transmission is firing. (green low pulse)

Interestingly, at the end of the above transmission block, it then behaves normally: ScreenImg-45

Here is a zoomed out view of 2 sets of transmission block pairs, separated by the 1mS delay, it can be seen that when the yellow error marker occurs, there is an extra "end of transmission" interrupt firing at the start of then next transmission, after the 1mS delay: ScreenImg-38

chrisckc commented 1 year ago

I have updated my test harness code originally created for Arduino-Pico and the official ArduinoCore-MBED.

There is now a new project "i2c-receiver-sdk" which uses the SDK slave mode functions. It is still using Arduino-Pico to save time having to rewrite it for the pure SDK. The main thing is that the Wire API is non longer being used, the interrupt is being fired by the SDK slave handling code.

https://github.com/chrisckc/TestHarness-I2C/tree/master

chrisckc commented 1 year ago

UPDATE: Check this related issue: https://github.com/earlephilhower/arduino-pico/issues/979

This issue appears to be related to the the order of checking interrupt status flags and/or not capturing the value of hw->intr_stat at the start of the interrupt handler amongst other related possibilities.

andygpz11 commented 1 year ago

Hi Chris, we are now looking at this.

BRs ... Andrew

andygpz11 commented 1 year ago

Hi @chrisckc, I've have cloned TestHarness-I2C Is there a missing file with a main() that calls i2c-transmitter/src/main-sender.cpp::setup() and i2c-transmitter/src/main-sender.cpp::loop() please? I actually have a build going anyway but I'd rather use yours please. Apologies if I missed something. Thanks

chrisckc commented 1 year ago

Hi @andygpz11 , I originally wrote it for Earle Philhower's RP2040 core Arduino-Pico using PlatformIO. To rule out any issue with the Arduino Wire API implementation in Arduino-Pico, I created a version of i2c-receiver which bypassed that and used the SDK directly via a 3rd party library written for the Pico-SDK. As the issues are on the receiving side I left the i2c-transmitter code as it was.

If you use PlatformIO it should compile, it looks for a setup() and main() in a cpp file inside the src folder and does the Arduino manipulation on it. All SDK functions can be used in Arduino-Pico so I didn't see a good reason to make it run on pure SDK. Thanks!

andygpz11 commented 1 year ago

Hi Chris, thanks for your reply - all points noted.

I am not familiar with PlatformIO or indeed Arduino so my starting point is to use the Pico SDK (& Valentin's library), with your test case and debug, and work towards the above environments.

I will keep you posted and most likely push my work in due course. BRs.

chrisckc commented 1 year ago

For a pure SDK test harness, I thought it would be good to use the method in the SPI master-slave example, which I modified for testing the SPI issues that I logged after logging this issue. I was just going to swap-out the SPI code for i2c in that example so it just repeatedly transfers the sequential 256 byte buffer to the slave Pico at an adjustable rate and i2c speed.

The test harness I originally created for this issue has reached a limit now in terms of the level of debugging required, it was merely a snapshot recreation of what I was trying to achieve at the time.

chrisckc commented 1 year ago

I have been running this new setup at a 1MHz i2c clock, eliminating the breadboard: 20221206_154513

I noticed something strange with the results, if I start it up and leave it running on batteries for a new hours, then connect it to the MacBook to check the serial output, I see either zero, 1 or 2 errors reported initially, and then see errors appearing every 5 or 10 seconds continuously, suggesting that the USB data output is causing errors in the i2c communications.

I have seen the timing of USB comms in relation to i2c comms affecting the error rate previously when using Arduino-Pico, however that particular test became lost in the wash when other factors where found to have a much bigger impact.

This is using the latest pushed code in i2c-receiver-sdk which uses the modified i2c library where the IRQ handler inside i2c_slave.c has been modified to this:

#pragma GCC push_options
#pragma GCC optimize ("O0")
static void __not_in_flash_func(i2c_slave_irq_handler)(i2c_slave_t *slave) {
    i2c_inst_t *i2c = slave->i2c;
    i2c_hw_t *hw = i2c_get_hw(i2c);

    uint32_t intr_stat = hw->intr_stat;
    if (intr_stat == 0) {
        return;
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RX_FULL_BITS) {
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_RECEIVE);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RD_REQ_BITS) {
        hw->clr_rd_req;
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_REQUEST);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_TX_ABRT_BITS) {
        hw->clr_tx_abrt;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_START_DET_BITS) {
        hw->clr_start_det;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_STOP_DET_BITS) {
        hw->clr_stop_det;
        finish_transfer(slave);
    }
}
#pragma GCC pop_options
andygpz11 commented 1 year ago

Hi Chris, thanks for the update, all noted.

(Not that you asked) I/we have been investigating/considering this and as a part of that we are looking to improve the Pico SDK I2C slave code, possibly by incorporating Valentin's I2C slave library. A master/slave I2C comms example, as you suggest, would definitely be useful.

I have a slight concern I might end up with a pure SDK example which is 100% reliable and we then have trouble bridging the gap between what I see and what you see on Arduino.

My current feeling, although I do not have a definitive answer, is that the errors you see are caused by a scheduling / concurrency issue(s) rather than electrical ones. Your hardware arrangements appear very satisfactory and I2C is not a fast protocol to start with.

I expect to provide a further update when I have something more concrete.

BRs ... Andrew

lurch commented 1 year ago

I have seen the timing of USB comms in relation to i2c comms affecting the error rate previously

The RP2040 is a dual-core microcontroller - perhaps running USB comms on one core and i2c comms on the other core, would resolve that problem? :shrug: (this is just a random uninformed guess)

chrisckc commented 1 year ago

When I originally spotted the issue on the project I was working on, I was using both cores, core0 for receiving i2c and core1 for outputting USB serial to the console.

This was the first thing I ditched before I wrote the test harness to try and rule out other factors.

Up until yesterday I thought the modifications made to Valentin's I2C slave library as posted above had solved the issues as I was seeing zero errors at 400KHz left running overnight on batteries or a power bank, previously this test had resulted in many errors reported. I also did not notice any errors when it was then plugged into the laptop for checking the USB serial output.

However after leaving it running at 1MHz overnight and also seeing no errors until the moment I plugged in the USB cable to monitor the output, it is clear that there is something else going on here as well.

As another test I have reverted back to 400KHz, left it running overnight with no errors, plugged the USB cable in and then left it running like that for much longer than I have done previously, periodically checking the output every now and then. I saw errors appearing every 30mins or so, I hadn't noticed this before because I was just plugging it in to quickly check the output, not leaving it plugged in for long enough to see errors appearing during the USB serial output.

It is clear that the USB Serial output is having an effect on the i2c data reception, the effect is much more obvious at 1MHz than at 400KHz.

chrisckc commented 1 year ago

I have a slight concern I might end up with a pure SDK example which is 100% reliable and we then have trouble bridging the gap between what I see and what you see on Arduino.

I did start work on a pure SDK test harness, similar to the SPI master-salve example. I will let you know what I find when that is done, I will of course push it up to GitHub as well.

chrisckc commented 1 year ago

I just want to add that we should not lose sight of the unexpected reason that the code below works:

static inline void finish_transfer(i2c_slave_t *slave) {
    if (slave->transfer_in_progress) {
        slave->handler(slave->i2c, I2C_SLAVE_FINISH);
        slave->transfer_in_progress = false;
    }
}

#pragma GCC push_options
#pragma GCC optimize ("O0")
static void __not_in_flash_func(i2c_slave_irq_handler)(i2c_slave_t *slave) {
    i2c_inst_t *i2c = slave->i2c;
    i2c_hw_t *hw = i2c_get_hw(i2c);

    uint32_t intr_stat = hw->intr_stat;
    if (intr_stat == 0) {
        return;
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RX_FULL_BITS) {
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_RECEIVE);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_RD_REQ_BITS) {
        hw->clr_rd_req;
        slave->transfer_in_progress = true;
        slave->handler(i2c, I2C_SLAVE_REQUEST);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_TX_ABRT_BITS) {
        hw->clr_tx_abrt;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_START_DET_BITS) {
        hw->clr_start_det;
        finish_transfer(slave);
    }
    if (intr_stat & I2C_IC_INTR_STAT_R_STOP_DET_BITS) {
        hw->clr_stop_det;
        finish_transfer(slave);
    }
}
#pragma GCC pop_options

The above code effectively ignores the start flag when making the determination that a transfer has started, slave->transfer_in_progress = true; is set when a byte is received rather than being set by the start flag as was the case in the Arduino-Pico Wire API implementation. It was also important that the byte-received flag was checked first, especially before the stop flag is checked. My modification to Valentin's I2C slave library was the re-ordering of these checks

Ignoring the start flag is a bit hacky and probably not was was intended?

After the Arduino-Pico Wire implementation was updated in a recent commit to match the behaviour of the above code, the error rate dropped by several orders of magnitude. The gcc options wrapping as suggested by Earle Philhower (on the same issue I originally logged in that repo) then mopped up the remaining errors to get to the point it is at now where the errors caused by USB serial have become visible.

chrisckc commented 1 year ago

UPDATE: I wrote a test harness for this using pure SDK with works like the SPI master-slave example: https://github.com/chrisckc/TestHarness-I2C-Pico-SDK.git EDIT: I now have 2 the way communications working, using the technique demonstrated in the example included with the i2c-slave library.

The errors only exist on the receiver, slave Pico. The sender, master Pico uses "i2c_read_timeout_us" to read data back from the slave Pico which does not require interrupts in the user code, the same as "i2c_write_blocking". It is only the receiver, slave Pico which needs to use an ISR to receive data and respond to read requests from the master.

I have left the original i2c_slave library ISR code active inside the i2c slave library to demonstrate the errors appearing. To enable to working version of the IRQ handler, go to line 61 in the i2c_slave library: pico_i2c_slave/i2c_slave/i2c_slave.c and uncomment my modified version of the i2c_slave_irq_handler function, the re-ordering of the IRQ flag checks seems to be the main thing that dramatically reduces the error rate. It should not be considered as a true fix for this because it still does not handle "I2C_IC_INTR_STAT_R_RESTART_DET" which I presume is the i2c "repeated start" condition? I can't find any info on that in the docs. I have not added anything to i2c_slave_irq_handler, just re-ordered things.

It is not clear why this seems to fix the issue as surely it is not using the flags as intended? It looks like the way the handler is operating gets around the issue of the IRQ's sometimes appearing to fire in the wrong order, or maybe some IRQ's being delayed so much that it never happens and the next IRQ is fired with multiple flags set?

andygpz11 commented 1 year ago

Hi Chris, thanks for the update. Have you notified the original author of this code? Until I have my SDK version complete I can't test this and I don't want guess at what might be happening. I will certainly try your modification when I am able to though.

chrisckc commented 1 year ago

Ok, no problem, I will notify the library author, i will log an issue on their repo. I was originally waiting until I had a solution rather than an unsatisfying hack.

chrisckc commented 1 year ago

UPDATE:

Remember that the original version of the i2c_slave_irq_handler function provided by Valentin's I2C slave library resulted in many errors, even when there was no simultaneous USB Serial comms. The improved version of the i2c_slave_irq_handler made the error rate several orders of magnitude better, almost zero. When I enable an option to spit out data over USB Serial while I2C data is being received, I start to see a few errors again.

Ok with that said, using the original problematic version of the i2c_slave_irq_handler function, I modified my test code to use the second core for receiving I2C data. I have had it running for an hour and it is so far working flawlessly, not seen any errors yet.

I am not using any mutex protections, not ideal, just a quick and easy test. The way the code is structured I can see that there should be no chance of simultaneous writes to any vars shared between the 2 cores. There could be simultaneous reads, or instances where a var is modified by Core1 and read by Core0 at the same time but they should not cause any issues in regards to stale or corrupt data.

updated code here: https://github.com/chrisckc/TestHarness-I2C-Pico-SDK

To test this in I2C_receiver.c set USE_SECOND_CORE to true #define USE_SECOND_CORE (true) // enable the second core for I2C data reception

xadhoom commented 1 year ago

Just my 2c here. Was experiencing similar issue and stumbled into this report.

My setup is a rpi zero which acts as a master and reads from a pico which is the slave (mainly for adc and gpio counters stuff).

The pico was connected also via usb to my pc for code loading, but usb is not used at all at runtime (disabled stdio on usb and uart, no prints in the code).

My test program asked data in a loop, and I was not able to complete 2k iteration that failed with i2c timeouts.

After moving my i2c slave init stuff into core1 as above, everything works as expected, I'm not able to reproduce the issue anymore.

denisvm commented 1 year ago

This also happened to me when using both I2C and USB (as virtual serial port). The issue disappeared when I stopped using USB on the I2C_SLAVE_RECEIVE handler.