raspberrypi / pico-sdk

UART communication issues between 2 Pico's, random bytes missing and phantom byte at startup #1144

Open chrisckc opened 1 year ago

chrisckc commented 1 year ago

The Test Setup:

I have 2 Pico's wired together to test UART communications. I am using UART0 with GP0 -> GP1 and GP1 -> GP0, and with GND and VSYS wired together.

I built a Test Harness based on these examples: https://github.com/raspberrypi/pico-examples/tree/master/spi/spi_master_slave https://github.com/raspberrypi/pico-examples/blob/master/uart/uart_advanced/uart_advanced.c

I modified the way it operates so that it first sends a separate, single byte, data transfer before the buffer is transferred. The output and input buffers have also been reduced from 256 bytes to 255 so I can send the buffer size as a single byte before sending the buffer.

I increased the send rate from 1 per second to 10 per second, and added a lot of extra serial output along with error capturing and reporting. I deliberately did not add any start synchronisation to the transfers; it is simply a game of ping pong between the 2 Pico's. The sender sends the buffer size followed by the buffer. The receiver checks that it has received what is expected and sends back its own buffer in the same way, which is a reversed copy of the buffer sent to it. The sender then checks that the response from the receiver is as expected. After 100 milliseconds the process is repeated.
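
As a rough illustration of the exchange described above, here is a minimal host-side sketch of the framing: one length byte, then the payload, with the reply expected to be a reversed copy. This is not the code from the linked test harness; `frame_build`, `reverse_copy`, `frame_count_errors` and `MAX_PAYLOAD` are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 255u /* largest length that fits in the one-byte size prefix */

/* Build a frame: one length byte followed by the payload.
 * Returns the frame size, or 0 if the payload is too long
 * or the output buffer is too small. */
size_t frame_build(const uint8_t *payload, size_t len, uint8_t *out, size_t out_cap)
{
    if (len > MAX_PAYLOAD || out_cap < len + 1) {
        return 0;
    }
    out[0] = (uint8_t)len;
    memcpy(&out[1], payload, len);
    return len + 1;
}

/* Fill dst with a reversed copy of src (what the receiver sends back). */
void reverse_copy(const uint8_t *src, uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[len - 1 - i];
    }
}

/* Compare a response frame with the payload that was originally sent.
 * Returns -1 if the frame size or length prefix is wrong, otherwise the
 * number of payload bytes that differ from the expected reversed copy. */
int frame_count_errors(const uint8_t *sent, size_t sent_len,
                       const uint8_t *frame, size_t frame_len)
{
    if (frame_len != sent_len + 1 || frame[0] != (uint8_t)sent_len) {
        return -1;
    }
    int errors = 0;
    for (size_t i = 0; i < sent_len; i++) {
        if (frame[1 + i] != sent[sent_len - 1 - i]) {
            errors++;
        }
    }
    return errors;
}
```

With no start synchronisation, a single dropped byte shows up here as a wrong frame size (return value -1) rather than as one mismatched payload byte.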

The code is here: https://github.com/chrisckc/TestHarness-UART-Pico-SDK In order to properly view the serial output you will need to use a proper terminal emulator, ie. one that supports ANSI control characters. I use iTerm2 on MacOS with a command to launch screen against the usb tty.

The issues:

There are 2 issues:

Issue 1, phantom byte at startup:

The receiver always sees an extraneous interrupt after startup, well before any data is actually sent to it. Inside the interrupt, uart_is_readable(uart0) returns true and then uart_getc(uart0) returns a null byte. This happens regardless of baud rate or whether the FIFO is enabled or not. This results in the receiver initially reporting an empty page being received before data is actually sent to it. Data only starts to be sent to the receiver 1 second after it is ready and waiting to receive it (refer to the delay code at the start of main for both the receiver and sender).

Issue 2, random bytes missing:

The sender is unable to reliably receive a response from the receiver at baud rates above 115200, the higher the baud rate the higher the error rate.

At 230400 baud the error rate is around 1.1%

At 460800 baud the error rate is around 95%

At 921600 baud the error rate is 100%

The error seems to be related in some way to the use of USB serial while data is being received on UART0. If I disable the part of the test code that outputs some results from the UART0 send operation while the UART0 receive operation is in progress, the error rate drops significantly, although not to zero at every supported baud rate.

For UART0 receive without any simultaneous USB serial output:

At 230400 baud the error rate is 0%

At 460800 baud the error rate is around 0%

At 921600 baud the error rate is 32%

For some reason these errors are only seen on the sender when it is receiving its response from the receiver. I have not seen these errors on the receiver, which is using the same code to receive data and receiving the same amount.

The only issue I have seen on the receiver is Issue 1, the phantom byte at the start. Strangely, this does not seem to occur on the sender when it receives its first response from the receiver. The difference is that the sender sends data out on UART0 before it checks whether data has been received. I will need to make some modifications to investigate that aspect further.

Notes:

The above tests were conducted with the FIFO disabled as in the uart_advanced example.

With the FIFO enabled, issue 1 still occurs at all baud rates.

With the FIFO enabled, the error rate for issue 2 is zero for all of the above scenarios and baud rates. This solves the issue but is unexpected and should not be a requirement?

Having the FIFO disabled can be desirable or essential for certain use cases, and there is nothing in the documentation that mentions any limits on the supported baud rates with the FIFO disabled?

Issue 2 Scope traces:

Here is a scope trace for Issue 2 at 460800 baud. The yellow trace is the sender's TX line, purple is the sender's RX line, and blue is a debug low pulse from the sender triggered for the duration of the on_uart_rx() IRQ. [Image: 460800_baud_missing_irq] As can be seen, on the right hand side there is a missing blue pulse, indicating a missing IRQ.

Another trace for Issue 2: [Image: 460800_baud_missing_irq_2] As can be seen, near the start one of the blue pulses is shifted to the right, and about 2/3 of the way along there is a missing pulse.

Trace for Issue 2 with FIFO enabled: it can be seen here that the IRQ is firing every 4 bytes instead of after every byte as in the above traces, suggesting that 4 bytes is either the FIFO size or maybe just how full it is before the IRQ fires. [Image: 921600 baud FIFO enabled] I don't know why the first IRQ takes much longer than the rest. The only parts of the code which could cause this are uart_is_readable(UART_ID) or uart_getc(UART_ID). I have also seen this happen with the FIFO disabled on some occasions.
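
On the question of why the IRQ fires every 4 bytes: a likely explanation, stated here as an assumption rather than something confirmed in this thread, is that the RP2040 UART is a PL011 with 32-entry FIFOs and that the receive interrupt trigger is left at its lowest level (1/8 full). The sketch below only does the arithmetic for PL011-style trigger levels; `rx_irq_threshold_bytes` and `UART_FIFO_DEPTH` are hypothetical names.

```c
#include <stdint.h>

#define UART_FIFO_DEPTH 32u /* assumed depth of the RP2040 UART FIFOs */

/* Number of bytes in the RX FIFO at which the receive interrupt asserts,
 * for a PL011-style trigger level select value (0..4).
 * Returns 0 for values outside that range. */
unsigned rx_irq_threshold_bytes(unsigned rxiflsel)
{
    /* Trigger levels in eighths of the FIFO: 1/8, 1/4, 1/2, 3/4, 7/8. */
    static const uint8_t eighths[] = { 1, 2, 4, 6, 7 };

    if (rxiflsel >= sizeof eighths) {
        return 0;
    }
    return UART_FIFO_DEPTH * eighths[rxiflsel] / 8u;
}
```

Under those assumptions the lowest level gives 32 / 8 = 4 bytes, which would match the spacing seen on the scope, so 4 would be the trigger level rather than the FIFO size.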

Issue 1 Scope traces:

The yellow trace is the receiver's RX line, purple is the receiver's TX line, and blue is a debug low pulse from the receiver triggered for the duration of the on_uart_rx() IRQ. The trace starts from shortly before the debug pins are configured during startup and before the UART is configured. The phantom IRQ can be seen here just after 4ms from DEBUG_PIN2 being configured as output high during startup. [Image: phantom start byte 2] It can be seen that there is no data on the RX or TX lines and the IRQ was still fired. It's also strange that it is fired around 3.6ms after UART0 is configured.

Further Notes: If I disconnect the receiver's RX line the issue still occurs. If I tie the RX line to ground via a 10k resistor the issue still occurs. However, if I tie the RX line to 3.3V via the 10k resistor the issue goes away; I don't see the phantom IRQ.

lurch commented 1 year ago

Does #1125 help with your first issue?

chrisckc commented 1 year ago

Does #1125 help with your first issue?

I modified the code such that the uart_init is after the pins are configured:

    // Set the TX and RX pins by using the function select on the GPIO
    // See datasheet for more information on function select
    gpio_set_function(UART_TX_PIN, GPIO_FUNC_UART);
    gpio_set_function(UART_RX_PIN, GPIO_FUNC_UART);

    // Set up our UART with a basic baud rate.
    uart_init(UART_ID, 2400);

This made no difference. Note that I was unable to add the following line, which I saw in the commit, as I am not using pico_stdio_uart, just hardware_uart: stdio_set_driver_enabled(&stdio_uart, true);

chrisckc commented 1 year ago

@lurch Earle Philhower (@earlephilhower) has fixed Issue 1, which also occurs in Arduino-Pico, by throwing away UART bytes that were received with errors. The issue is linked in my previous post.
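
To make "throwing away bytes with errors" concrete, here is a minimal sketch of the idea. It is not the Arduino-Pico code. It assumes the PL011 data register layout, where bits 7:0 hold the byte and bits 11:8 hold the framing, parity, break and overrun flags, and that the raw word would be read on hardware from `uart_get_hw(uart0)->dr` rather than through `uart_getc()`. The `DR_*` names and `dr_accept` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Error flags in bits 11:8 of a PL011-style data register word. */
#define DR_FE (1u << 8)  /* framing error */
#define DR_PE (1u << 9)  /* parity error */
#define DR_BE (1u << 10) /* break error: line held low for a whole frame */
#define DR_OE (1u << 11) /* overrun: an earlier byte was lost, this one is valid */

/* Flags that mean the byte itself cannot be trusted. */
#define DR_DROP_MASK (DR_FE | DR_PE | DR_BE)

/* Examine one raw data register word. Returns true and stores the byte if it
 * was received cleanly; returns false (leaving *out untouched) if a framing,
 * parity or break flag says the byte should be discarded. */
bool dr_accept(uint32_t dr, uint8_t *out)
{
    if (dr & DR_DROP_MASK) {
        return false;
    }
    *out = (uint8_t)(dr & 0xFFu);
    return true;
}
```

An RX line that floats or sits low at startup would look like a break to the UART, which typically arrives as a null byte with the break and framing flags set. That would be consistent with the phantom byte disappearing when RX is pulled up to 3.3V.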

andygpz11 commented 1 year ago

Hi Chris, you commented:

With the FIFO enabled, the error rate for issue 2 is zero for all of the above scenarios and baud rates, this solves the issue but is unexpected and should not be a requirement?

We think this is not unexpected, and we would expect to use the FIFO at high Baud rates. The example you started with was written to demonstrate a (deliberately) elemental approach to driving the UART.

At the 230 kBaud you start seeing issues with, a byte takes only 43 us to be transmitted. It appears that system interrupt latency occasionally exceeds that value and hence characters are lost. Since you are also using the USB interface the system will sometimes be busy with that. At the higher rates the interrupt latency required drops proportionately. This is why most UARTs provide some sort of FIFO: they are hard to reliably service if the required interrupt latency is only one character time at the Baud rates one would be likely to use. I think we might add a comment to the example code to point out that there will be ‘lost character’ type issues at high Baud rates and that use of the FIFO is (generally) recommended.
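
The 43 us figure follows from 10 bits per character (1 start, 8 data, 1 stop, assuming 8N1 framing) divided by the baud rate. A small sketch of that arithmetic, with `uart_char_time_ns` as a hypothetical helper name:

```c
#include <stdint.h>

/* Time on the wire for one UART character, in nanoseconds (rounded down).
 * bits_per_char counts start, data, parity and stop bits (10 for 8N1).
 * Returns 0 if baud is 0. */
uint32_t uart_char_time_ns(uint32_t baud, uint32_t bits_per_char)
{
    if (baud == 0) {
        return 0;
    }
    return (uint32_t)(((uint64_t)bits_per_char * 1000000000ull) / baud);
}
```

For 8N1 this gives about 86.8 us at 115200 baud, 43.4 us at 230400, 21.7 us at 460800 and 10.85 us at 921600. With the FIFO disabled, that is roughly the time available to read each byte before the next one overruns it.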

lurch commented 1 year ago

See also #719

chrisckc commented 1 year ago

Hi Andy, from what I have seen, the interrupt latency is around 1uS for most of the occurrences, each IRQ occurs 1uS after each byte is received, as observed on the scope. This is well below even the 10.8uS byte transmission time at 921600 baud. I would expect this to vary when other interrupts need to be serviced at the same time, such as when using USB Serial. What I am seeing is that, when there is nothing else going on other than looping code waiting for interrupts to fire, some interrupts are delayed by more than 10.8uS. I have an option in the test code on the sender, DEBUG_SERIAL_OUTPUT_DURING_UART_RECEIVE (false), to control this. I have rechecked the code: with this option disabled, there is no USB Serial usage after the sender has transmitted its buffer. It then just sets 2 GPIO pins and keeps looping, waiting for data to arrive back from the receiver by repeatedly checking the uartDataReady flag and checking the time so it knows when to send the next buffer. None of those operations should interfere with interrupt handling? The receiver takes around 13uS to respond to the sender with its own buffer; by that time the sender is just in its looping state.

Here is a scope trace showing what a missing byte looks like; the location of the missing byte matches what is reported by the debugging output. The purple trace is the RX line of the sender and the blue trace is the debug output pin triggered at the start of the IRQ. [Image: ScreenImg-56] It can be seen that the IRQ has been delayed beyond the 10.8uS byte transmission time. Apart from that instance and another where the IRQ is delayed by around 7uS, the rest of the IRQ's for the 255 byte transmission are very consistent.

It is this inconsistent behaviour that I find concerning. I first noticed it in I2C communications, so I abandoned that and tried SPI, where I found issues even as low as 1MHz. Now I have decided to resort to UART and found issues there too. Could this be a wider issue with unreliable interrupt latency?

In this case I can enable the FIFO and the problems appear to go away; however, it limits the ability of the Pico to take action on or respond to UART communications quickly.

chrisckc commented 1 year ago

This shows another strange behaviour: the start of every receive shows delays inside the IRQ. This is 921600 baud with the FIFO disabled. I have also measured the time between the end of TX (yellow) and the start of RX (purple). [Image: ScreenImg-57] The only parts of the code which could cause this delay are uart_is_readable(UART_ID) or uart_getc(UART_ID). The delays are not long enough to cause missing data, as nothing is reported for this byte position in the output, but it doesn't make sense as to why the delays are there.

chrisckc commented 1 year ago

This shows the end of the RX operation. As can be seen, there has been no overall shifting of the IRQ's; the latency is still 1uS: [Image: ScreenImg-58]

chrisckc commented 1 year ago

See also #719

That would be useful, along with the other issues linked over there as well.

andygpz11 commented 1 year ago

Hi Chris, I have tried to “roll-up” comments I’m replying to from your various posts.

I have seen, the interrupt latency is around 1uS for most of the occurrences, each IRQ occurs 1uS after each byte is received, as observed on the scope.

Agreed. It looks like 1 us is the “minimum” time and will be the case when the CPU is not already engaged in another ISR. I think your characters are getting dropped because there is other interrupt activity going on...
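
The relationship between service latency and dropped characters can be sketched with a deliberately simplified model. It assumes bytes arrive back to back, each byte is read a given latency after it arrives, reads happen in order, and a byte is lost if the receive buffer (1 entry with the FIFO disabled) is still full when it completes. Real hardware with the FIFO enabled interrupts at a fill threshold rather than per byte, so this is an illustration only. `count_lost_bytes` and `MAX_DEPTH` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPTH 32u /* largest buffer depth this model supports */

/* Count bytes lost when byte i finishes arriving at (i + 1) * char_ns and,
 * if buffered, is read latency_ns[i] later. depth is the number of bytes the
 * receiver can hold (1 models the FIFO being disabled).
 * Returns n if depth is 0 or larger than MAX_DEPTH. */
size_t count_lost_bytes(const uint32_t *latency_ns, size_t n,
                        uint32_t char_ns, size_t depth)
{
    uint64_t read_at[MAX_DEPTH]; /* read-out times of buffered bytes */
    size_t head = 0, pending = 0, lost = 0;
    uint64_t last_read = 0;

    if (depth == 0 || depth > MAX_DEPTH) {
        return n;
    }
    for (size_t i = 0; i < n; i++) {
        uint64_t arrival = (uint64_t)(i + 1) * char_ns;

        /* Retire bytes that have been read by the time this one arrives. */
        while (pending > 0 && read_at[head] <= arrival) {
            head = (head + 1) % depth;
            pending--;
        }
        if (pending == depth) {
            lost++; /* buffer still full: this byte overruns */
            continue;
        }
        uint64_t t = arrival + latency_ns[i];
        if (t < last_read) {
            t = last_read; /* bytes are read in order */
        }
        last_read = t;
        read_at[(head + pending) % depth] = t;
        pending++;
    }
    return lost;
}
```

For example, at 921600 baud (`char_ns` of 10850) with latencies of 1000, 1000, 12000, 1000 and 1000 ns, the model loses one byte at depth 1 and none at depth 4: a single late read costs a character only when there is nowhere to hold the next one.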

In this case I can enable the FIFO and the problems appear to go away, however it limits the ability of the Pico to take action on or respond to UART communications quickly.

I am really not sure this is an “issue”, it’s more of a law of Physics resulting from the way you are currently expecting things to work.

Using high Baud rates with ‘per-character’ interrupts (your I2C and SPI examples also generate ‘per-character’ interrupts) is generally not done in common place system design.

If you really must have this there are a number of possible mitigations (and more so on the RP2040 than many parts):

I’m not sure that is functionally a whole lot different from using a FIFO in terms of max processing latency but you shouldn’t drop anything and it will save you making the CPU busy just moving stuff about and so is to be recommended in any case.

I think a USB Device is polled by the Host and so that can make the system busy even when there is no [apparent] traffic to be transferred across the interface. I suspect this is the underlying cause of the busy periods you observe.

Core 1 can sit in a tight loop and poll the UART, process the data as required then provide the data you need totally shrink-wrapped for your host. They can communicate both via shared RAM and/or the [very lightweight] FIFO mechanism.
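
One common way to do the shared RAM hand-off is a single-producer, single-consumer ring buffer, with the polling core pushing and the other core popping. The sketch below is generic C11, not SDK code, and assumes the toolchain provides `<stdatomic.h>` with lock-free loads and stores for `unsigned`. `ring_t`, `ring_push`, `ring_pop` and `RING_SIZE` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u /* must be a power of two */

typedef struct {
    uint8_t data[RING_SIZE];
    atomic_uint head; /* next slot to write; only the producer stores to it */
    atomic_uint tail; /* next slot to read; only the consumer stores to it */
} ring_t;

/* Producer side (e.g. the core polling the UART). Returns false if full. */
bool ring_push(ring_t *r, uint8_t byte)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE) {
        return false;
    }
    r->data[head % RING_SIZE] = byte;
    /* Publish the byte only after it has been written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (e.g. the core running the application). Returns false if empty. */
bool ring_pop(ring_t *r, uint8_t *out)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail) {
        return false;
    }
    *out = r->data[tail % RING_SIZE];
    /* Free the slot only after the byte has been read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed, and the polling core never waits on the consumer unless the ring is full.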

If small enough, you can also locate all the Core 1 code in RAM, which would decouple it from contending with the Core 0 fetches from XIP Flash (not that I have ever seen issues resulting from that in the hard real-time bit-bashing I’ve done).

HTH.

BRs ... Andrew

chrisckc commented 1 year ago

Agreed. It looks like 1 us is the “minimum” time and will be the case when the CPU is not already engaged in another ISR. I think your characters are getting dropped because there is other interrupt activity going on...

I think a USB Device is polled by the Host and so that can make the system busy even when there is no [apparent] traffic to be transferred across the interface. I suspect this is the underlying cause of the busy periods you observe.

I'm not sure what the other ISR's could be; I am not inadvertently using any others in my test code AFAIK. I am seeing errors building up regardless of whether the USB cable is connected or not. If I start it up on batteries, leave it running for a while and then plug in the USB cable, I see that a great many errors have already accumulated. In the absence of other ISR's I would expect a single ISR to be reliable, with a consistent latency.

Using high Baud rates with ‘per-character’ interrupts (your I2C and SPI examples also generate ‘per-character’ interrupts) is generally not done in common place system design.

I originally found the I2C issue while doing something commonplace: sending data from one device to another. This requires per-character interrupts on the slave side, unless there is another way of doing it that I am not aware of? The SPI example is not using any interrupts, well not in my test code anyway; I have not looked at what the SDK is doing underneath during spi_write_read_blocking.

After having issues with both I2C and SPI, I thought I would try the UART's, deliberately leaving the FIFO's disabled at first. The purpose of the test code is to try and evaluate the performance in non-ideal and non-mainstream use cases. If reliable UART communication can't be achieved with the FIFO disabled, this needs to be made clear in the documentation as well as in the example as you have mentioned. Issue 1 still needs to be resolved in the SDK, the phantom byte at startup.

  • Have you / can you use the second M0+ core for the really hard real-time scheduling work?

I was originally using the second core with I2C, but ditched that until I had got to the bottom of the issues.

Thanks for the various suggestions and mitigation ideas, I am new to the Pico really, this is the first time I have put it to some more serious use.

Encountering these issues so far has not filled me with confidence. The first thing I did when I encountered the I2C issue was swap out the slave Pico for a Teensy4.0, and I had no issues receiving data up to the 1MHz I had attempted. With hindsight, it would have been better to try something else ARM based with an equivalent clock speed.

To be more comfortable with the Pico it would be good to know what other ISR's are going on (when there is no USB serial output during data reception and the cable is also unplugged) that would be causing these random interrupt delays. It's not like they are small delays, they are over an order of magnitude longer on some occasions. Also why does it sometimes also take an order of magnitude longer to either check for bytes or read a value from the UART at the start of data reception?

Thanks, Chris

sdbbs commented 1 year ago

Sorry to barge in on this, I just wanted to add a data point related to the title "random bytes missing"

So, I'm not using any UART TX interrupt, and at a certain point in my code I did this:

      volatile uint8_t uart_tx_cnt;
      for(uart_tx_cnt = 0; uart_tx_cnt < bytes_to_send; uart_tx_cnt++) {
        uart_putc_raw(MY_UART_ID, my_buffer[uart_tx_cnt]);
      }

As far as I know, this does not use interrupts - yet in my case, bytes_to_send is 24 - but the receiver persistently reports 22 bytes received; and observing with UART decoder on scope, also that decoder sees only 22 bytes on wire.

And the weird thing is like the first 18 bytes or so are correct (as seen on scope decoder), then one or two that appear random, then one or two that are also correct. I simply don't get this. Even if this was a "delayed system interrupt", it should have delayed the transmission, not corrupt and lose bytes.

Then I thought, I'm going to try this instead:

uart_write_blocking(MY_UART_ID, my_buffer, bytes_to_send);

.... exact same thing?! bytes_to_send = 24 bytes sent - 22 bytes observed on wire?

Wish I had the time to try making a minimal example for this, but I don't ... EDIT: here https://github.com/raspberrypi/pico-sdk/issues/1274