Open arg08 opened 3 years ago
Documentation issue aside, it doesn't sound like your scenario should be affected by memory barriers (unless you are reading the registers directly without using the provided SDK structs/methods and have omitted volatile
) - please post the failing code...
Note that i would normally loop testing the FIFO directly fullness and not the IRQ status, so curious if there is an issue that needs investigation there, and given that you say dmb
fixes it that could either be a timing problem, or as I say that you skipped the volatile
and the compiler re-ordered the memory accesses
It's all using SDK headers, and doesn't behave like a volatile problem. I disassembled the compiler output and it looked plausible. It works fine with the DMB instruction in there, or with a putchar() to debug.
My application starts from an empty FIFO, loads 4 bytes 'blind', then enables the interrupt to load the rest, with the interrupt handler disabling the interrupt when it runs out of stuff to send. The test case is trying to transmit 10 bytes total: without the DMB (or it works with a putchar() in the same place!), only 7 bytes come out of the PIO and the FDEBUG register tells me I've overrun the FIFO. The PIO code is outputting the data to hardware significantly slower than the ARM can load it.
The first 5 bytes coming out are correct, and three are then lost from the remaining 5.
What I believe is happening is that I load the first 4 bytes and the first is immediately auto-popped by the PIO so the FIFO isn't full and the interrupt immediately fires; it then loads the 5th byte into the FIFO successfully, fails to detect FIFO full and loads the 6th byte that is lost, then exits the interrupt. Some time later interrupt fires again with one empty slot in the FIFO; handler loads byte 7 correctly and again loses byte 8; interrupt again, load byte 9 and lose byte 10 and it's all over.
I will re-test in a minute using your suggestion of using FSTAT instead of INTS0.
Here's an abridged version of the code - I can give you the whole app if you want, but it won't run without hardware.
`
#define SM_NO 0
static inline void __not_in_flash_func(tx_fifo_int)(PIO pio, EcoWkSpace *ws)
{
uint8_t b;
if (various stuff) b = buffer[somewhere];
else
{
b = something else;
// This is the last byte so disable interrupts.
hw_clear_bits(&pio->inte0,
(PIO_IRQ0_INTE_SM0_TXNFULL_BITS << SM_NO));
}
// Put the byte in the FIFO.
pio->txf[SM_NO] = b;
}
static void __not_in_flash_func(pio_irq0_handler)(void)
{
EcoWkSpace *ws = &workspace0;
PIO pio = pio0;
while (pio->ints0 & (PIO_IRQ0_INTS_SM0_RXNEMPTY_BITS << SM_NO))
{
uint32_t w = pio->rxf[SM_NO];
.... do various stuff not involving hardware. This all works OK
}
while (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << SM_NO))
{
tx_fifo_int(pio, ws);
__dmb(); // Ensure write to the Tx FIFO have taken effect
// before we sample the status again.
}
}
`
OK, I tried it with polling FSTAT instead of INTS0 and it worked without the DMB, but this isn't really meaningful since the loop test now has to get much more complex (the original relies on the interrupts getting disable when there's no more data to load, while this now needs to be tested explicitly) - in effect the DMB has got replaced by other instructions testing for data available.
So it's either a genuine memory barrier issue (write buffer allowing the register read to overtake the FIFO write), or perhaps a latency within the PIO logic that takes a few cycles to update the interrupt status after a FIFO write.
#if 0
while (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << SM_NO))
{
tx_fifo_int(pio, ws);
__dmb(); // Ensure write to the Tx FIFO have taken effect
// before we sample the status again.
}
#else
if (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << SM_NO))
{
while (((pio->fstat & (1 << PIO_FSTAT_TXFULL_LSB)) == 0) && ws->bleft)
{
tx_fifo_int(pio, ws);
}
}
#endif
https://gist.github.com/kilograham/0d8a996d1539958d9a93518f5e9e08ea
I was concerned, so made a test... seems to work fine for me - see where it differs from what you are doing
I think the difference is you don't have the pio->txf[SM_NO] = b; right at the foot of your loops (and you haven't taken advantage of the optimisation that if you disable the interrupt you don't then need to test the 'remain' variable again).
I'll pull your test program and try to cook up an irq_handler_D() that fails.
OK, here you go.
diff --git a/pio/irq_test/irq_test.c b/pio/irq_test/irq_test.c
index 8545b78..2143e5a 100644
--- a/pio/irq_test/irq_test.c
+++ b/pio/irq_test/irq_test.c
@@ -24,6 +24,7 @@ enum type {
IRQ_A,
IRQ_B,
IRQ_C,
+ IRQ_D,
LAST_TYPE,
};
@@ -61,6 +62,16 @@ static void __not_in_flash_func(irq_handler_C)(void) {
remaining = r;
}
+static void __not_in_flash_func(irq_handler_D)(void) {
+ int r = remaining;
+ while (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << sm)) {
+ if (r) r--;
+ else hw_clear_bits(&pio->inte0,PIO_IRQ0_INTE_SM0_TXNFULL_BITS << sm);
+ pio->txf[sm] = DELAY;
+ }
+ remaining = r;
+}
+
int __not_in_flash_func(main)() {
stdio_init_all();
uint offset = pio_add_program(pio, &irq_test_program);
@@ -112,6 +123,16 @@ int __not_in_flash_func(main)() {
irq_set_exclusive_handler(PIO0_IRQ_0, irq_handler_C);
irq_set_enabled(PIO0_IRQ_0, true);
break;
+ case IRQ_D:
+ puts("IRQ only checking irq status");
+ for (int i = 0; i < MIN(COUNT, 4); i++) {
+ pio_sm_put_blocking(pio, sm, DELAY);
+ remaining--;
+ }
+ pio0->inte0 |= 1u << (PIO_IRQ0_INTE_SM0_TXNFULL_LSB + sm);
+ irq_set_exclusive_handler(PIO0_IRQ_0, irq_handler_D);
+ irq_set_enabled(PIO0_IRQ_0, true);
+ break;
default:
panic_unsupported();
}
@@ -139,6 +160,10 @@ int __not_in_flash_func(main)() {
irq_remove_handler(PIO0_IRQ_0, irq_handler_C);
irq_set_enabled(PIO0_IRQ_0, false);
break;
+ case IRQ_D:
+ irq_remove_handler(PIO0_IRQ_0, irq_handler_D);
+ irq_set_enabled(PIO0_IRQ_0, false);
+ break;
default:
break;
}
Gives output:
Blocking push
Loops while remaining: 0
Output count: 20
IRQ with ints loop
Loops while remaining: 528
Output count: 20
IRQ with single ints
Loops while remaining: 579
Output count: 20
IRQ with fifo_full_check loop
Loops while remaining: 532
Output count: 20
IRQ only checking irq status
Loops while remaining: 241
Output count: 13
But add the __dmb() and it works:
+static void __not_in_flash_func(irq_handler_D)(void) {
+ int r = remaining;
+ while (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << sm)) {
+ if (r) r--;
+ else hw_clear_bits(&pio->inte0,PIO_IRQ0_INTE_SM0_TXNFULL_BITS << sm);
+ pio->txf[sm] = DELAY;
+ __dmb();
+ }
+ remaining = r;
+}
Well, actually it gives 21 since my last word is with remaining == 0 rather than remaining == 1 as for the other four cases, but anyhow it gives the expected result.
Actually,
asm volatile (" nop");
also works, though that doesn't really shine any more light on the issue..
yes for clarity i did
static void __not_in_flash_func(irq_handler_D)(void) {
int r = remaining;
while (pio->ints0 & (PIO_IRQ0_INTS_SM0_TXNFULL_BITS << sm)) {
if (r) {
r--;
} else {
hw_clear_bits(&pio->inte0, PIO_IRQ0_INTE_SM0_TXNFULL_BITS << sm);
break;
}
pio->txf[sm] = DELAY;
__asm__ volatile ("nop");
}
remaining = r;
}
Note that the pio_sm_is_tx_fifo_full
if used reduces to a single read and tst
instruction just like the other.
Still it seems that you are correct, that the hardware has an unexpected 1 cycle delay here - will have someone look at the chip
Yes, the reason for doing it the way I did was not any perceived inefficiency in pio_sm_is_tx_fifo_full(), but because testing the interrupt status combines the test for fifo full with the loop completion due to the interrupts turned off. Your version just above is equally efficient because you do it with break but don't then output the last byte - in the general case that means you are going to take an extra interrupt after the end of the loop just to do the enable.
Of course it will often be convenient to put the pio->txf[sm] = x; further up the loop and then the problem doesn't arise, but as it happened in my case the last byte needed special treatment and this way was most efficient.
Still, all of this is micro-optimisation at a level that doesn't really matter and I have no problem with being told "don't do that", it would be nice to know exactly what "that" is though.
@Wren6991 needs to explain why/if this happens and document accordingly
it would be nice to know exactly what "that" is though.
There is a 2-cycle delay from the raw status (which you can observe through FSTAT) to the INTS field. The M0+ can do back-to-back nonsequential load/stores in 2 cycles, but a single nop
will space them far enough apart to hide the delay hazard. 2 cycles come from:
INTR
INTS
is then assigned a registered version of (INTR & INTE) | INTF
So INTS
will eventually reflect changes to FSTAT
but not with the same timing, and is only really useful for "is this interrupt asserted"
Clearly a bit of a trap, so it needs documenting and we need to take a closer look at this in future.
Note this should only be visible on PIO, since all other peripherals with this type of interrupt controller are accessed over an APB bridge, which lengthens the bus cycle.
I think the documentation could use some extra information about the need for memory barriers: so far as I can see, there's currently neither explanatory material in the SDK, nor the raw information in the datasheet to figure it out for yourself.
As an example, I just got caught out where I had an interrupt handler for the PIO Tx FIFO, which looped loading bytes into the FIFO until the interrupt status (IRQ0_INTS) cleared - this overflowed the FIFO because the status read apparently overtook the FIFO write. Evidently there's a write buffer in there somewhere. Easily cured with a DMB instruction, but I couldn't find this write buffer documented anywhere.
Also, there's the mysterious __mem_fence_acquire() where the SDK documentation seems to assume you know what it does, and drilling down into the SDK source suggests it's just a compatibility veneer for a facility defined elsewhere - yet Google doesn't give any hits apart from the Pico SDK itself.