Safe API to recover from overrun condition when using DMA serial Rx

inodentry commented 4 years ago

EDIT: the issue has been determined to be an overrun.

I am trying to get two pills (a blackpill sending data and a bluepill receiving data) to communicate with each other using serial on USART1.

As an initial test to get it working, I loaded a program onto the sender blackpill that does a DMA Tx every second, writing a small buffer of 4 consecutive bytes, increasing every time. This works fine; it keeps sending data.

I am trying to receive those 4-byte transfers using DMA on the bluepill. I am printing the values using semihosting (with hprintln) to see what was received. The first transfer is successful, but then it stops receiving more bytes after the first byte of the second transfer.

The USART is configured identically on both boards:

let (mut dma_tx, mut dma_rx) = {
    use stm32f1xx_hal::serial::{Serial, Config};

    let tx_pin = gpioa.pa9.into_alternate_push_pull(&mut gpioa.crh);
    let rx_pin = gpioa.pa10;

    let (tx, rx) = Serial::usart1(
        device.USART1,
        (tx_pin, rx_pin),
        &mut afio.mapr,
        Config::default().baudrate(19_200.bps()),
        clocks,
        &mut rcc.apb2,
    ).split();

    let tx_buf = cortex_m::singleton!(: [u8; 4] = [0; 4]).unwrap();
    let rx_buf = cortex_m::singleton!(: [u8; 4] = [0; 4]).unwrap();

    (Some((tx_buf, tx.with_dma(dma.4))), Some((rx_buf, rx.with_dma(dma.5))))
};

Here is the receiving code:

let mut xfer = None;
loop {
    // no transfer started yet; start new transfer
    if let Some((buf, rx)) = dma_rx.take() {
        xfer = Some(rx.read(buf));
    }

    // transfer in progress
    if let Some(xf) = xfer.take() {
        if xf.is_done() {
            // transfer done; print the data and start new transfer
            let (buf, rx) = xf.wait();
            let buf2 = [buf[0], buf[1], buf[2], buf[3]];
            hprintln!("f {:?}", buf2);
            xfer = Some(rx.read(buf));
        } else {
            // transfer not done; print the partial data
            let buf = xf.peek();
            hprintln!("p {:?}", buf);
            xfer = Some(xf);
        }
    }
}

Here is the output I am getting:

...
p []
f [72, 73, 74, 75]
p [76]
p [76]
p [76]
...

(no more bytes are received and it keeps repeating that forever)

It consistently glitches after the first byte of the second transfer.

However, I suspect this is affected by the delay due to the slow debug print. And indeed, if I start the next read operation before printing the values, I can get a few more successful transfers.

With this simple modification to the code, just swapping the two lines:

/* ... */
        if xf.is_done() {
            // transfer done; print the data and start new transfer
            let (buf, rx) = xf.wait();
            let buf2 = [buf[0], buf[1], buf[2], buf[3]];
            xfer = Some(rx.read(buf)); // start new read before printing
            hprintln!("f {:?}", buf2); // print after starting new read
        } else {
/* ... */

I can now get a few more successful transfers (sometimes 3, sometimes 4), but it still glitches out after that:

p []
f [12, 13, 14, 15]
f [16, 17, 18, 19]
f [20, 21, 22, 23]
p [24]
p [24]
p [24]
p [24]

And here is another run:

p []
p []
f [112, 113, 114, 115]
f [116, 117, 118, 119]
f [120, 121, 122, 123]
f [124, 125, 126, 127]
p [128]
p [128]

Note that the other board has been kept running all along and it just keeps sending the successive numbers in the sequence. I keep reloading the receiving program, and every time, it starts OK, receives the first transfer(s), and then glitches out.

~This kind of inconsistent behaviour smells of unsoundness to me. Perhaps it is a bug in the DMA implementation?~ EDIT: the issue has been determined to be an overrun.

thalesfragoso commented 4 years ago

You're probably getting an overrun error, i.e. you're receiving at least two bytes while the DMA is off. Semihosting could definitely be the problem. Are you compiling with --release? If so, what does your release profile look like?

inodentry commented 4 years ago

Yes, I am.

[profile.release]
lto = "fat"
codegen-units = 1
opt-level = "z"
incremental = false
panic = "abort"

inodentry commented 4 years ago

OK, your information about overruns is helpful. It seems that is indeed what was happening: the debug print is quite slow (it takes over a second), and data comes in 4 bytes at a time (which is more than 2) every second, so all of them might arrive while the print is ongoing.
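
To make that failure mode concrete, here is a minimal host-side sketch of the receiver. It is a toy model, not HAL or PAC code: `UsartModel`, `receive`, `dma_read` and `simulate_stall` are names made up for illustration. It assumes, as the logs above suggest, that once ORE is set no further bytes reach the DMA until the flag is cleared, and that the DMA reads only DR, so it never clears ORE on its own.

```rust
/// Toy model of a USART receiver: one data register plus the RXNE and ORE flags.
#[derive(Default)]
struct UsartModel {
    dr: u8,
    rxne: bool, // DR holds a byte nobody has read yet
    ore: bool,  // a byte arrived while DR was still full
}

impl UsartModel {
    /// A byte finishes arriving on the wire.
    fn receive(&mut self, byte: u8) {
        if self.rxne || self.ore {
            // Assumption: the byte is dropped and ORE stays latched until cleared.
            self.ore = true;
        } else {
            self.dr = byte;
            self.rxne = true;
        }
    }

    /// What the DMA channel does while enabled: read DR, never SR.
    fn dma_read(&mut self) -> Option<u8> {
        if self.rxne {
            self.rxne = false;
            Some(self.dr)
        } else {
            None
        }
    }
}

/// Replays the scenario from the logs: returns the bytes collected by the
/// first and second transfers, plus the final state of the ORE flag.
fn simulate_stall() -> (Vec<u8>, Vec<u8>, bool) {
    let mut usart = UsartModel::default();

    // Transfer 1: DMA is on, so every byte is collected as it arrives.
    let mut first = Vec::new();
    for byte in 72..76 {
        usart.receive(byte);
        first.extend(usart.dma_read());
    }

    // Slow print: DMA is off while the next four bytes arrive.
    for byte in 76..80 {
        usart.receive(byte);
    }

    // Transfer 2: DMA is back on. It drains the stale byte, then nothing more.
    let mut second = Vec::new();
    second.extend(usart.dma_read());
    for byte in 80..84 {
        usart.receive(byte);
        second.extend(usart.dma_read());
    }

    (first, second, usart.ore)
}
```

In this model, `simulate_stall()` collects all four bytes of the first transfer, only the byte 76 for the second, and leaves ORE set, which is the same shape as the `f [72, 73, 74, 75]` / `p [76]` output above.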

I changed the print to only print the first byte of the received buffer, so that it happens faster, and now it is no longer glitching. EDIT: it actually still glitched after a minute or two of running.

Could you give me some info about how to recover from such an overrun situation? I'd like to be able to keep receiving data, even if I lose some data / the current transfer.

It probably won't be a problem when I deploy my project, because there will be no debug prints (obviously), but I don't want to have the risk of my communications getting stalled like this.

Also, perhaps we should close the issue, since this is not a bug like I thought.

thalesfragoso commented 4 years ago

You can check if you're getting an overrun with something like this:

use stm32f1xx_hal::pac::USART1;

let usart = unsafe { &*USART1::ptr() };
let overrun = usart.sr.read().ore().bit_is_set();
hprintln!("overrun ? {:?}", overrun);

// The bit is cleared by a read to SR followed by a read to DR.
// You might need to reset the DMA to get it going again, but not sure.

let _ = usart.dr.read();

This isn't a proper solution though, just a test.

inodentry commented 4 years ago

Hmm.. so the API of this crate does not provide any clean mechanism to clear the overrun. That's awful. So much for safe rust abstractions...

I feel uncomfortable poking the registers with unsafe code, when I am also going to be using the safe DMA API as well, because it seems like I could easily violate the invariants / the state that the abstraction expects the hardware to be in.

thalesfragoso commented 4 years ago

Err, there is nothing unsafe/unsound about getting an overrun. It seems that the HAL automatically clears it when not using DMA. It's a bit harder to detect it when using the DMA though; the user would need to detect it somehow and clear it.

Right now the HAL doesn't provide a convenience method for clearing the error, but it should be easy to PR in, I guess no one had this problem before.
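
As a rough idea of what such a convenience method could do, here is a host-side sketch of the clear sequence described earlier (a read of SR followed by a read of DR). Everything in it is hypothetical: `MockUsart` stands in for the real registers, and `clear_overrun` is not an existing HAL function. A real implementation would go through the PAC register accessors instead and might also need to restart the DMA.

```rust
/// Host-side stand-in for the USART registers (hypothetical, not the PAC types).
#[derive(Default)]
struct MockUsart {
    ore: bool,
    sr_read_saw_ore: bool, // SR was read while ORE was set
    dr: u8,
}

impl MockUsart {
    /// Models reading SR: reports ORE and arms the clear sequence if it is set.
    fn read_sr_ore(&mut self) -> bool {
        self.sr_read_saw_ore = self.ore;
        self.ore
    }

    /// Models reading DR: completes the clear sequence only if SR was read first.
    fn read_dr(&mut self) -> u8 {
        if self.sr_read_saw_ore {
            self.ore = false;
            self.sr_read_saw_ore = false;
        }
        self.dr
    }
}

/// Hypothetical safe helper: clears a pending overrun and reports whether
/// there was one, so the caller can decide to discard the current transfer.
fn clear_overrun(usart: &mut MockUsart) -> bool {
    let overrun = usart.read_sr_ore();
    if overrun {
        let _stale = usart.read_dr(); // discard whatever was left in DR
    }
    overrun
}
```

Returning a `bool` is just one possible design: the caller learns that data was lost and can restart the read, without touching any register directly.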

However, as a piece of advice: getting this error with DMA is a bit of a design flaw. Try using a circular transfer instead; it's more suitable for a continuous stream of data, since the DMA stays on all the time.
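
To illustrate why a circular transfer avoids the stall, here is a simplified host-side model of a two-half buffer that the DMA keeps filling while the CPU reads whichever half is not being written. It does not use the HAL's actual circular API: `CircModel`, `Half`, `dma_write`, `readable_half`, `peek` and `simulate_circular` are illustrative names, and the half tracking is far cruder than what real hardware flags provide.

```rust
/// Which half of the double buffer the CPU may read.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Half {
    First,
    Second,
}

/// Toy model of a circular DMA transfer into a two-half buffer.
struct CircModel {
    buf: [[u8; 4]; 2],
    pos: usize, // next write index in 0..8, wraps around
}

impl CircModel {
    fn new() -> Self {
        CircModel { buf: [[0; 4]; 2], pos: 0 }
    }

    /// The DMA stores one byte and wraps around; it is never switched off.
    fn dma_write(&mut self, byte: u8) {
        self.buf[self.pos / 4][self.pos % 4] = byte;
        self.pos = (self.pos + 1) % 8;
    }

    /// The half the DMA is not currently writing into.
    fn readable_half(&self) -> Half {
        if self.pos < 4 { Half::Second } else { Half::First }
    }

    /// Copy out the readable half.
    fn peek(&self) -> [u8; 4] {
        match self.readable_half() {
            Half::First => self.buf[0],
            Half::Second => self.buf[1],
        }
    }
}

/// Three packets arrive; the CPU is "busy printing" during the second one.
fn simulate_circular() -> Vec<[u8; 4]> {
    let mut dma = CircModel::new();
    let mut seen = Vec::new();
    let packets = [[72u8, 73, 74, 75], [76, 77, 78, 79], [80, 81, 82, 83]];
    for (i, packet) in packets.iter().enumerate() {
        for &byte in packet {
            dma.dma_write(byte);
        }
        if i != 1 {
            seen.push(dma.peek()); // the CPU only gets around to reading now
        }
    }
    seen
}
```

In this model the slow reader never sees the middle packet, but reception does not stall: the third packet is received normally because the DMA was never turned off.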

inodentry commented 4 years ago

Yeah, I get it. I assumed unsoundness before, because I didn't understand how DMA works and that I can get overruns like this. But now I do; I even edited the issue title to not be misleading ;)

Yes, it would be nice to have a safe API to deal with the overrun when using DMA. I don't understand how to add it though, so I can't make a PR myself.

Thanks for the useful information and pointing me in the right direction. I will look at how to do "circular transfer", it sounds like what I might need. I see there is an example in the repo.

stm32-rs / stm32f1xx-hal

Safe API to recover from overrun condition when using DMA serial Rx #272