Compilation of w600 port III

rkompass commented 1 year ago

We continue here the discussions "Compilation of w600 port" and "Compilation of w600 port II" (now closed) on developing and debugging the w60x port in the MP branch w60x.

robert-hh commented 7 months ago

Just for fun: Adding a us timestamp to Pin.irq() and feeding the device with an external 10ms pulse, I get the following result with a modified timertest on a W600. That's what could be expected:

i_now: 403
tdif[0:3]: [-10045, -3566, 1]
Deviations (us) lowest: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Deviations (us) highest: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Using ticks_us() in the handler I get:

i_now: 403
tdif[0:3]: [-9927, 2104073, -10]
Deviations (us) lowest: [-11, -11, -11, -11, -11, -11, -11, -11, -11, -11]
Deviations (us) highest: [10, 10, 10, 10, 10, 11, 11, 11, 12, 13]

rkompass commented 7 months ago

This is nice! Do you have the MP code?

robert-hh commented 7 months ago

Just updated.

rkompass commented 7 months ago

I mean, how to read this timestamp from the MP interpreter.

Perfect solution for a non-blocking HC-SR04 reading, for example!

Thinking it over: Could the timestamp be guarded against pin value bounce, so that a flag is set and the timestamp cannot be updated by later same interrupts, until the ISR reads it?

Of course: Give me a little finger -> I want the whole Hand :-)

robert-hh commented 7 months ago

The timestamp can be read with pin.timestamp(), where pin is the pin object, which is also supplied to the ISR as argument. Blocking: possible, but maybe not as useful as it might look at first glance. To make it robust you must then read the value. Otherwise the timestamp is that of the last recognized interrupt. The interrupt is cleared at the end of the ISR. So any retrigger happening while the ISR handler is active will be lost anyhow.

b.t.w. As part of the last push I added the tick_hz argument to Timer, to make it compatible. Not that I expect this argument to be needed.

rkompass commented 7 months ago

I see (our posts overlapped): there is a Pin.timestamp() method.

Did you measure how many bytes that cost? I was so eager that I have already deleted the old build.

robert-hh commented 7 months ago

Did you measure how many bytes that cost?

64 Bytes.

robert-hh commented 7 months ago

I think I'll change that and add timestamp to the irq object, like flags and trigger.

rkompass commented 7 months ago

Blocking: possible, but maybe not as useful as it might look at first glance. To make it robust you must then read the value. Otherwise the timestamp is that of the last recognized interrupt. The interrupt is cleared at the end of the ISR. So any retrigger happening while the ISR handler is active will be lost anyhow.

I see (read the ISR handler code): It works with hard interrupts and there is no chance that the timestamp is overwritten before the MP callback has read it. If the callback doesn't read it, of course, it might be overwritten, as the interrupts are restored after the callback has finished. Is that correct?

Now a question would be what the true latency of the hard interrupt timestamp is. Can be measured with continuous polling against ticks_ns().

This is great in general, as it is easily extended to the other ports. As for the perfect naming (as this Pin method will be available everywhere): How about:

pin.timestamp_us()
pin.int_timestamp_us()
pin.lastint_ts_us()

?

rkompass commented 7 months ago

I think I'll change that and add timestamp to the irq object, like flags and trigger.

Now I'm curious what you mean (general idea clear, but the detail..?). Certainly a way to get a safe first timestamp even with soft interrupts. Which we exclusively have on some architectures like the esp8266, for example.

robert-hh commented 7 months ago

Now I'm curious what you mean

Just a style change, but not possible until the mainline code is changed. The mp_irq class itself is defined in shared/mpirq.*. So for now the call to retrieve the timestamp has to stay at the UART object.

I considered names like these, but I do not like long names. And there are no better synonyms. The best of the alternatives seemed time_tag. Maybe irq_tick_us.

rkompass commented 7 months ago

p.irq_tick_us() seems fine. p.timestamp() would be best if it were not missing the unit, which I consider important. So if p.timestamp_us() is too long, p.irq_tick_us() is second best (it is just 1 character shorter). p.irq_t_us()?

It would be nice, of course, if the timestamp would be available for all interrupts.

robert-hh commented 7 months ago

We can go for timestamp_us.

rkompass commented 7 months ago

A constant value may be subtracted from the timestamp, as the time measurement in the ISR takes a deterministic time (only dependent on CPU frequency).

rkompass commented 7 months ago

# version to test the new timestamp method

from machine import Pin, Timer
from time import ticks_us, ticks_diff
from array import array
from sys import platform

if platform == 'rp2':
    led = Pin("LED", Pin.OUT, value=1)
elif platform == 'w600':
    led = Pin(0, Pin.OUT, value=1)    # LED on Winner w600

tim = Timer(-1)                 # Software timer

N_Runs = const(1003)            # 5 seconds total
Period_ms = 5                   # no flicker at 200 Hz (not recognizable)

i_int, i_now = 0, 0
t_start, t_stop = 0, 0
tdif = array('I',(0 for _ in range(N_Runs)))

@micropython.viper
def pin_start(t):
    global t_start
    ledp = ptr32(0x40010C00)
    t_start = ticks_us()
    ledp[0] = 0   # all pin values of bank A = low, was: led(0)

@micropython.viper
def pin_isr(p):
    global t_stop, i_int
    # t_stop = ticks_us()
    t_stop = p.timestamp()
    led(1)
    i = int(i_int)
    if i < N_Runs:
        tdif[i] = ticks_diff(t_stop, t_start)
    i += 1
    i_int = i

led.irq(trigger=Pin.IRQ_FALLING, handler=pin_isr, hard=True)
print('Pin {}: Set up hard interrupt.'.format(led))

tim.init(period=5, mode=Timer.PERIODIC, callback=pin_start)

try:
    while i_now < N_Runs:
        if i_int > i_now:
            if i_now % 50 == 49:  # prevent longer USB inactivity (problematic on mimxrt)
                print('.', end='')
            i_now += 1
finally:
    tim.deinit()
    print('\nRuns:', i_now, 'tdif[0:3]:', tdif[0:3])
    tdif = list(tdif)
    tdif[0:2] = []
    tdif.sort()
    print('Deviations (us) lowest:', tdif[0:10])
    print('Deviations (us) highest:', tdif[-11:-1])

gives

Runs: 1003 tdif[0:3]: array('I', [32, 20, 16])
Deviations (us) lowest: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Deviations (us) highest: [24, 24, 24, 24, 24, 24, 24, 24, 24, 24]

and the

t_start = ticks_us()
ledp[0] = 0

certainly takes a few us,

And in the handler: parent->mp_irq_timestamp = mp_hal_ticks_us(); may be put before parent->mp_irq_flags = tls_get_gpio_irq_status(parent->id); ?

optimizing ....

rkompass commented 7 months ago

My optimization:

static void machine_pin_isr_handler(void *arg) {
    mp_uint_t timestamp = mp_hal_ticks_us();
    mp_irq_obj_t *self = arg;
    machine_pin_obj_t *parent = MP_OBJ_TO_PTR(self->parent);
    if (self->handler != mp_const_none) {
        parent->mp_irq_flags = tls_get_gpio_irq_status(parent->id);
        parent->mp_irq_timestamp = timestamp;

The above test now gives

Runs: 1003 tdif[0:3]: array('I', [26, 19, 16])
Deviations (us) lowest: [15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
Deviations (us) highest: [20, 20, 20, 20, 20, 20, 20, 20, 20, 20]

Jitter is down to 5 us. What is the constant value of the interrupt -> ISR (including mp_hal_ticks_us()) ???

robert-hh commented 7 months ago

Taking the timestamp as first action is a good suggestion.

A constant value may be subtracted from the timestamp, as the time measurement in the ISR takes a deterministic time (only dependent on CPU frequency).

I do not know the lowest value of the interrupt response time. I could toggle a pin in the IRQ handler by direct port access and measure with an oscilloscope. Then we get a value close to it. I can do that. It may however be much longer, e.g. if a higher priority interrupt is active or the IRQ handler code is not cached in the code execution cache. Then it has to be loaded from the serial SPI flash, which takes a while.

I renamed pin.timestamp to pin.timestamp_us, as suggested.

I do not know how you triggered the Pin IRQ. I used an external signal generator, which by chance matched well. One can as well use the chip's PWM. Then there is no clock difference.

rkompass commented 7 months ago

Good morning,

my understanding of C is that

static void machine_pin_isr_handler(void *arg) {
..
    mp_irq_obj_t *self = arg;

can be changed to static void machine_pin_isr_handler(void *self) {

rkompass commented 7 months ago

In Platform/Drivers/gpio/wm_gpio.c

void GPIOA_IRQHandler(void)
{
    u8  i     = 0;
    u8  found = 0;
    u32 reg   = 0;

    reg = tls_reg_read32(HR_GPIO_MIS);

    for (i = 0; i <= WM_IO_PA_15; i++)
    {
        if (reg & BIT(i))
        {
            found = 1;
            break;
        }
    }
    if (found)
    {
        if (NULL != gpio_context[i].callback)
            gpio_context[i].callback(gpio_context[i].arg);
    }
    return;
}

is the first place where the timestamp could be taken.

That is, before the loop, whose run time depends on the pin's number within the bank.

robert-hh commented 7 months ago

Using:

static void machine_pin_isr_handler(void *arg) {
..
    mp_irq_obj_t *self = arg;

will not generate extra code, because the compiler is clever enough to just take the value of arg for self without making an assignment. BUT, when defining self as void *, something like self->parent or self->handler is not possible.

About moving the timestamp reading to the low level IRQ handler: that does not look right. Taking the value is one thing, but it has to be forwarded to the MP IRQ handler. So either a) the arg argument has to be opened, given the machine_pin_obj_t type and timestamp has to be assigned to the respective element, or b) timestamp has to be added as a second argument to the callback. Option a) is bad, because you have to mangle the SDK name space with the MicroPython namespace. Option b) also breaks the API convention. It's easier just to tell people that for fastest response they should use PA0 as pin for IRQ.

rkompass commented 7 months ago

Anyway, the same program with Pin(15) instead of Pin(0) for the led leads to

Deviations (us) lowest: [17, 17, 17, 17, 17, 17, 17, 17, 17, 17]
Deviations (us) highest: [22, 22, 22, 22, 22, 22, 22, 22, 22, 22]

so the looping in the interrupt handler takes up to 2 us.

Yes, things are not perfect. Do we have to accept that, rather than making the port more complicated by fussing around in the SDK? Or should we optimize the loop at least?

I assume BIT(i) is 1<<i.
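The dependence on the pin number can be reproduced with a small host-side model of the SDK loop. This is only a sketch: it takes the guess BIT(i) == 1<<i from the comment above as given, and first_pending_linear is a hypothetical name that does not exist in the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: BIT(i) expands to (1u << i), as guessed in the comment above. */
#define BIT(i) (1u << (i))

/* Host-side model of the SDK loop: returns the index of the first pending
 * bit among bits 0..last_pin and stores the number of loop iterations in
 * *steps. Returns -1 when no bit in that range is set. */
int first_pending_linear(uint32_t reg, int last_pin, int *steps)
{
    assert(last_pin >= 0 && last_pin < 32);
    *steps = 0;
    for (int i = 0; i <= last_pin; i++) {
        (*steps)++;
        if (reg & BIT(i)) {
            return i;
        }
    }
    return -1;
}
```

In this model pin 0 is found after one iteration and pin 15 after sixteen, which fits the roughly 2 us difference measured between Pin(0) and Pin(15).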

robert-hh commented 7 months ago

I made a few measurements by toggling a pin in the IRQ handler:

At PA0, the delay between the external trigger slope and the echo signal is between 1 and 2.2µs, with 1.2µs as the average. Taking the timestamp takes between 2.3 and 11 µs.

At PB18, the delay between the external trigger slope and the echo signal is between 6 and 18µs, with 6.3µs as the average. Taking the timestamp takes between 2.3 and 11 µs.

I could not measure the time it takes to toggle the pin. The code is too fast for the bus logic to follow. So it's far below 1 µs. The yellow trace below is the trace of the toggled pin with persistence enabled.

Using PA0

irq_internal_latency_ticks_timing_PA0

Using PB18

irq_internal_latency_ticks_timing_PB18

rkompass commented 7 months ago

I assume BIT(i) is 1<<i. This and the loop in GPIOA_IRQHandler(void) is the "nondeterministic" part of the timestamp latency, which now might easily be adjusted. <-- I thought.

Taking the timestamp takes between 2.3 and 11 µs.

Oh, that's puzzling. A much bigger source of jitter than one would expect. How can that be? There must be other interrupts with higher priority interfering (the tick??).

Do we agree that it makes sense to sort this out? That the timestamp_us() should be there, as there is plenty of use to it, especially if it's quite precise?

rkompass commented 7 months ago

if (reg & 0xffff0000) i+= 16;
if (reg & 0xff00ff00) i+=  8;
if (reg & 0xf0f0f0f0) i+=  4;
if (reg & 0xcccccccc) i+=  2;
if (reg & 0xaaaaaaaa) i+=  1;

or, perhaps more deterministic (constant number of operations):

i += (reg & 0xaaaaaaaa) ? 1 : 0;
i += (reg & 0xcccccccc) ? 2 : 0;
i += (reg & 0xf0f0f0f0) ? 4 : 0;
i += (reg & 0xff00ff00) ? 8 : 0;
i += (reg & 0xffff0000) ? 16 : 0;

comes into mind.

but that doesn't solve

Taking the timestamp takes between 2.3 and 11 µs.

robert-hh commented 7 months ago

Oh, that's puzzling. A much bigger source of jitter than one would expect. How can that be?

ticks_us() is not just reading a register. It shares the counter with the Watchdog timer, which runs at 40MHz. Code below. The division by 40 creates the large jitter. This could be avoided if the WDT could run at 1 MHz. Then taking the timestamp would need less than 600 ns.

uint64_t mp_hal_ticks_us64(void) {
    return (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) / ticks_per_us;
}
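Since 40 = 8 * 5, the division by 40 can be split into a shift by 3 and a division by 5, which leaves only the factor 5 as the expensive part. On a 32-bit core without a 64-bit divide instruction, a 64-bit division is typically a library call. A minimal host-side sketch, assuming the 40 MHz counter clock stated above; the function names are hypothetical and not part of the port:

```c
#include <assert.h>
#include <stdint.h>

#define TICKS_PER_US 40u   /* assumption: 40 MHz counter clock, as stated above */

/* Reference: the straightforward 64-bit division. */
uint64_t ticks_to_us_div(uint64_t ticks)
{
    return ticks / TICKS_PER_US;
}

/* Same result in two steps: 40 = 8 * 5, so shift by 3 first, then divide by 5.
 * floor(floor(t / 8) / 5) == floor(t / 40) holds for unsigned integers. */
uint64_t ticks_to_us_split(uint64_t ticks)
{
    return (ticks >> 3) / 5u;
}

/* Host-side check that both versions agree for one input. */
void check_split(uint64_t ticks)
{
    assert(ticks_to_us_split(ticks) == ticks_to_us_div(ticks));
}
```

The split does not remove the division by itself; it only reduces the problem to a fast division by 5.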

rkompass commented 7 months ago

So the division is the culprit! How about running the WDT at a power of two and use a shift instead of divide. How about 32 MHz?

robert-hh commented 7 months ago

Seems like the WDT can only run at the undivided APB clock. No prescaler. As an alternative one of the hardware timers can be used for ticks_us(). That would reduce the number of available hard timers for machine.Timer to 4.

rkompass commented 7 months ago

There must be something fast for the division by 5. Division by 8 is a shift.

Meanwhile I changed:

void GPIOA_IRQHandler(void)
{
    u32 reg   = tls_reg_read32(HR_GPIO_MIS) & 0xffff;
    u8  i = (reg & 0xaaaa) ? 1 : 0;
    i += (reg & 0xcccc) ? 2 : 0;
    i += (reg & 0xf0f0) ? 4 : 0;
    i += (reg & 0xff00) ? 8 : 0;

    if (reg)
        if (NULL != gpio_context[i].callback)
            gpio_context[i].callback(gpio_context[i].arg);
    return;
}

void GPIOB_IRQHandler(void)
{
    u32 reg  = tls_reg_read32(HR_GPIO_MIS + TLS_IO_AB_OFFSET);
    u8  i    = WM_IO_PB_00;
    i += (reg & 0xaaaaaaaa) ?  1 : 0;
    i += (reg & 0xcccccccc) ?  2 : 0;
    i += (reg & 0xf0f0f0f0) ?  4 : 0;
    i += (reg & 0xff00ff00) ?  8 : 0;
    i += (reg & 0xffff0000) ? 16 : 0;

    if (reg)
        if (NULL != gpio_context[i].callback)
            gpio_context[i].callback(gpio_context[i].arg);
    return;
}

and will try that with the above last MP code.

rkompass commented 7 months ago

Divide by 5: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 2 Recipe found here.

rkompass commented 7 months ago

Now with Pin(15):

Runs: 1003 tdif[0:3]: array('I', [27, 18, 16])
Deviations (us) lowest: [15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
Deviations (us) highest: [20, 20, 20, 20, 20, 21, 21, 21, 21, 21]

and with Pin(0):

Runs: 1003 tdif[0:3]: array('I', [27, 18, 16])
Deviations (us) lowest: [15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
Deviations (us) highest: [20, 20, 20, 20, 20, 20, 20, 20, 20, 21]

as well.

That's better.

rkompass commented 7 months ago

// Test of divide by 5, from: 
// https://embeddedgurus.com/stack-overflow/2009/06/division-of-integers-by-constants/

#include <stdio.h>
#include <stdint.h>

int main () {

    int k = 0;
    for (int i=0; i<20000000; i++)
        if (i/5 != (uint32_t)(((uint64_t)i * 0xcccdu) >> 16) >> 2) {
            printf ("%3d: %3d %3d\n", i, i/5, (uint32_t)(((uint64_t)i * 0xcccdu) >> 16) >> 2);
            k += 1;
            if (k>10)
                break;
        }
    return 0;
}

does not work perfectly:

./divideby5 
262144: 52428 52429
262149: 52429 52430
262154: 52430 52431
262159: 52431 52432
262164: 52432 52433
262169: 52433 52434
262174: 52434 52435
262179: 52435 52436
262184: 52436 52437
262189: 52437 52438
262194: 52438 52439

must think about it....
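The mismatches have a simple arithmetic cause: 0xCCCD equals (2^18 + 1) / 5, so the recipe overestimates x / 5 by x / (5 * 2^18). For inputs below 2^18 that excess never crosses an integer boundary, so the recipe is exact for 16-bit values; the first failure is at x = 2^18 = 262144, as in the output above. A host-side sketch to check this (both function names are hypothetical):

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* The posted recipe: multiply by 0xCCCD (= (2^18 + 1) / 5) and shift by 18.
 * The product is formed in 64 bits so that it cannot overflow. */
uint32_t div5_cccd(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCDu) >> 18);
}

/* Smallest x below limit for which the recipe disagrees with x / 5.
 * Returns limit if there is no mismatch below it. */
uint32_t first_mismatch(uint32_t limit)
{
    for (uint32_t x = 0; x < limit; x++) {
        if (div5_cccd(x) != x / 5u) {
            return x;
        }
    }
    return limit;
}
```

Calling first_mismatch(0x10000u) finds no mismatch in the 16-bit range, while first_mismatch(300000u) stops at 262144.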

robert-hh commented 7 months ago

Divide by 5: (((uint32_t)A * (uint32_t)0xCCCD) >> 16) >> 2

With the changed division getting ticks_us() takes between 599ns and 925ns, mean 682 ns. Only drawback: it's a fixed number now.

rkompass commented 7 months ago

There are only two versions of CPU clock: 80 MHz and 40 MHz. Same for the WDT?

The above division formula is not correct yet! But I'll find a correct version later.

robert-hh commented 7 months ago

Back at the desk: I confirm that with the above changes to the GPIO_IRQx handler the latency does not vary any more with the port number. For the B port, I see 5.6 to 6.8 µs, with an average of 5.86. For the A port it's slightly less. The time for taking the ticks_us() value is now almost precisely 1µs. So the total average latency is 6-7 µs with rare events of 10µs.

As a rough figure, the µs times look plausible. The error you see would probably result in a value sometimes being skipped or advanced by 2. The APB clock does not change when the CPU freq is changed. So 40 can reliably be assumed.

irq_internal_latency_ticks_timing_PA0_opt

rkompass commented 7 months ago

Nice result!

Perhaps we should use volatile in the GPIOA_IRQHandler (and same with B) so that the ? : blocks cannot be jumped over.

inline uint32_t div5(uint32_t x) {
    return (x*0xcccccccduL) >> 34;
}

gives a nice division by 5, which is correct for all 32 bit values (up to input x == 0xffffffffuL).

I tested with:

#include <stdio.h>
#include <stdint.h>

inline uint32_t div5(uint32_t x) {
    return (x*0xcccccccduL) >> 34;
}

int main () {

    int k = 0;            // 4294967296 == 2^32
    for (uint32_t i = 0; i < 4294967295; i++)
        if (i/5 != div5(i)) {
            printf ("%3u: %3u %3u\n", i, i/5, div5(i));
            k += 1;
            if (k>10)
                break;
        }
    return 0;
}

that now can be composed to give the 64 bit division.

The time for taking the ticks_us() value is now almost precisely 1µs.

What did you do to achieve that? Did you already solve the division by 40 problem?

robert-hh commented 7 months ago

Did you already solve the division by 40 problem?

I took the first suggestion for the test.

return (x*0xcccccccduL) >> 34;

That may be precise, but is not acceptable. The shift at the end must not be larger than 22, otherwise the time range for ticks_ms() is limited. ticks_ms() is calculated by dividing ticks_us() by 1000. That way the us and ms number are synchronous.

rkompass commented 7 months ago

The larger shift will be acceptable because it shifts a 64-bit intermediate value back to 32 bit. But the division in mp_hal_ticks_us64() is a division of a 64 bit value. So there is still to consider how to extend the above way to 64 bits.

Meanwhile an almost cosmetic improvement:

void GPIOB_IRQHandler(void)
{
    u32 reg  = tls_reg_read32(HR_GPIO_MIS + TLS_IO_AB_OFFSET);
    u8 i = (reg & 0xaaaaaaaa) ?  WM_IO_PB_01 : WM_IO_PB_00;
    i += (reg & 0xcccccccc) ?  2 : 0;
    i += (reg & 0xf0f0f0f0) ?  4 : 0;
    i += (reg & 0xff00ff00) ?  8 : 0;
    i += (reg & 0xffff0000) ? 16 : 0;

    if (reg)
        if (NULL != gpio_context[i].callback)
            gpio_context[i].callback(gpio_context[i].arg);
    return;
}

robert-hh commented 7 months ago

return (x*0xcccccccduL) >> 34;

That is acceptable for ticks_us(). Since it returns 30 significant bits, which is the range of ticks_ms(). But if that is divided by 1000 for ticks_ms(), it returns only 20 bits. So either a straight division by 40_000 is used for ticks_ms(), or a similar other multiply & shift method.

The call to ticks_us() now takes ~725 ns.

b.t.w.: mp_hal_ticks_us64 is only used inside mphalport.c

Edit: The output values of ticks_us() are indeed not reasonable, when using the above formula for it.
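One possible explanation for the unreasonable values, not confirmed in the thread: the host test presumably ran where unsigned long is 64 bits wide, while on a 32-bit ARM target it is usually 32 bits. There the product x*0xcccccccduL keeps only its low 32 bits, and a shift by 34 on a 32-bit operand is undefined in C. A minimal sketch that forces a 64-bit product regardless of the target (function names are hypothetical, not taken from the port):

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* Exact x / 5 for every 32-bit x. The cast forces a 64-bit product,
 * independent of how wide 'unsigned long' is on the target. */
uint32_t div5_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}

/* Model of what a 32-bit 'unsigned long' would keep of the product:
 * only the low 32 bits. The bits that the shift by 34 needs are gone. */
uint32_t low32_of_product(uint32_t x)
{
    return (uint32_t)((uint64_t)x * 0xCCCCCCCDu);
}
```

For x = 5 the full product is 2^34 + 1, so the correct quotient 1 sits entirely above bit 33, and low32_of_product(5) retains just the value 1 from the bottom bit.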

rkompass commented 7 months ago

that now can be composed to give the 64 bit division.

This was not yet the case. The above code only demonstrates a (hopefully faster) correct division by 5 in 32 bits. The division in 64 bits has yet to be done and then there will be no loss of precision in the end.

Meanwhile it occurred to me that the changed GPIOA/B_IRQHandler code gives a wrong result if two interrupts happen concurrently, which may sometimes be the case. The addition of the positions then gives a wrong i, and with NULL == gpio_context[i].callback the callback is not executed.

I have now something almost as good in mind, that will give a correct result, even in case there are simultaneous interrupts..
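The failure mode is easy to reproduce on a host. Each mask test asks whether any pending bit has a given index bit set, so with several pending bits the result is the bitwise OR of their indices. A sketch of the 16-bit variant (index_by_masks is a hypothetical name for the computation in the modified GPIOA handler):

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* Mask-and-add index computation for a 16-bit pending register.
 * Correct only if exactly one bit of reg is set. */
unsigned index_by_masks(uint32_t reg)
{
    unsigned i = (reg & 0xaaaau) ? 1u : 0u;
    i += (reg & 0xccccu) ? 2u : 0u;
    i += (reg & 0xf0f0u) ? 4u : 0u;
    i += (reg & 0xff00u) ? 8u : 0u;
    return i;
}
```

With pins 1 and 2 pending at the same time, the function returns 3, so the handler would look up the callback of a pin that did not trigger at all.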

robert-hh commented 7 months ago

We can still use a timer channel for ticks_us(). That will return a value in almost 0 time, leaving the WDT counter for the watchdog.

rkompass commented 7 months ago

Perhaps really not a bad idea.

Meanwhile I have this:

uint32_t div5_old(uint32_t x) {
    return (x*0xcccccccduL) >> 34;
}

uint32_t div5(uint32_t t) {
    uint32_t x = t & 0xffff;    // lower 16 bits
    uint32_t y = t >> 16;
    uint32_t xc = x * 0xccccu;
    uint32_t yc = y * 0x3333u;
    uint32_t r = (xc+x) >> 16;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 16;
    r += yc;
    return r;
}

uint64_t div5_new(uint64_t t) {
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc+x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

The first two functions give identical results; the third is an analogous extension to 64 bits. Just as the second avoids a 64-bit intermediate result, the third avoids a 128-bit intermediate result. The second takes ~3 times as long as the first, so the third is probably not optimal either. But you may try it for the division by 5. So

uint64_t mp_hal_ticks_us64(void) {
    uint64_t t = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3;  // divide by 8
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc+x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

should work, but is it faster?

Do we have uint128_t on the platform? Another idea: Brute force multiplying with 0xcccccccccccccccd (to get a possibly 128 bit intermediate) probably is not the right way. Perhaps doing a step-wise multiplication with usage of the remainder is better. No need to hurry. Can you time/benchmark the latter function as you did with the original?
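The step-wise idea with use of the remainder can be sketched as follows. It relies on 2^32 = 5 * 858993459 + 1, so a 64-bit value can be divided by 5 with two exact 32-bit divisions by 5 and no 128-bit intermediate. This is an untimed sketch under those assumptions, not the port's code; all function names are hypothetical, and whether it beats the version above on the W600 would have to be measured.

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* Exact x / 5 for 32-bit x via multiply-and-shift with a 64-bit product. */
uint32_t div5_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}

/* t / 5 for 64-bit t using only 32-bit divisions by 5.
 * With t = hi * 2^32 + lo and 2^32 = 5 * 858993459 + 1:
 *   t = 5 * (qh * 2^32 + rh * 858993459 + ql) + (rh + rl)
 * where qh, rh = hi / 5, hi % 5 and ql, rl = lo / 5, lo % 5. */
uint64_t div5_u64(uint64_t t)
{
    uint32_t hi = (uint32_t)(t >> 32);
    uint32_t lo = (uint32_t)t;
    uint32_t qh = div5_u32(hi);
    uint32_t rh = hi - 5u * qh;               /* remainder 0..4 */
    uint32_t ql = div5_u32(lo);
    uint32_t rl = lo - 5u * ql;               /* remainder 0..4 */
    uint64_t q = ((uint64_t)qh << 32) + (uint64_t)rh * 858993459u + ql;
    return q + ((rh + rl >= 5u) ? 1u : 0u);   /* rh + rl is at most 8 */
}

/* Counter ticks at an assumed 40 MHz to microseconds: /8 by shift, then /5. */
uint64_t ticks_to_us(uint64_t ticks)
{
    return div5_u64(ticks >> 3);
}
```

A host check such as assert(div5_u64(t) == t / 5) for values around multiples of 2^32 is a quick way to gain confidence before timing it on the target.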

rkompass commented 7 months ago

void GPIOA_IRQHandler(void)
{
    u32 reg   = tls_reg_read32(HR_GPIO_MIS) & 0xffff;
    u8  i = 0;
    if (!(reg & 0x00ff)) {
        reg >>= 8; i += 8; }
    if (!(reg & 0x0f)) {
        reg >>= 4; i += 4; }
    if (!(reg & 0x3)) {
        reg >>= 2; i += 2; }
    if (!(reg & 0x1)) {
        reg >>= 1; i += 1; }

    if (reg && NULL != gpio_context[i].callback)
        gpio_context[i].callback(gpio_context[i].arg);

    return;
}

void GPIOB_IRQHandler(void)
{
    u32 reg  = tls_reg_read32(HR_GPIO_MIS + TLS_IO_AB_OFFSET);
    u8 i = WM_IO_PB_00;
    if (!(reg & 0xffff)) {
        reg >>= 16; i += 16; }
    if (!(reg & 0x00ff)) {
        reg >>= 8; i += 8; }
    if (!(reg & 0x0f)) {
        reg >>= 4; i += 4; }
    if (!(reg & 0x3)) {
        reg >>= 2; i += 2; }
    if (!(reg & 0x1)) {
        reg >>= 1; i += 1; }

    if (reg && NULL != gpio_context[i].callback)
        gpio_context[i].callback(gpio_context[i].arg);

    return;
}

should also work with simultaneous interrupts.

rkompass commented 7 months ago

Better: Simplified above code:

void GPIOA_IRQHandler(void)
{
    u32 reg   = tls_reg_read32(HR_GPIO_MIS) & 0xffff;
    u8  i = 0;
    if (reg & 0xff00) {
        reg >>= 8; i += 8; }
    if (reg & 0x00f0) {
        reg >>= 4; i += 4; }
    if (reg & 0x000c) {
        reg >>= 2; i += 2; }
    if (reg & 0x0002) {
        reg >>= 1; i += 1; }

    if (reg && NULL != gpio_context[i].callback)
        gpio_context[i].callback(gpio_context[i].arg);

    return;
}

void GPIOB_IRQHandler(void)
{
    u32 reg  = tls_reg_read32(HR_GPIO_MIS + TLS_IO_AB_OFFSET);
    u8 i = WM_IO_PB_00;
    if (reg & 0xffff0000) {
        reg >>= 16; i += 16; }
    if (reg & 0xff00) {
        reg >>= 8; i += 8; }
    if (reg & 0x00f0) {
        reg >>= 4; i += 4; }
    if (reg & 0x000c) {
        reg >>= 2; i += 2; }
    if (reg & 0x0002) {
        reg >>= 1; i += 1; }

    if (reg && NULL != gpio_context[i].callback)
        gpio_context[i].callback(gpio_context[i].arg);

    return;
}
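A side note on the two variants: both always land on a bit that is really set, but they choose differently when several interrupts are pending. The first pair of handlers skips empty lower halves and so picks the lowest pending pin, while the simplified pair moves into any non-empty upper half and so picks the highest one. A host-side sketch of the 16-bit case (function names are hypothetical; the return value 16 for an empty register is an addition for this sketch):

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* Variant 1: skip empty lower halves, so the lowest set bit wins.
 * Returns 16 when no bit of the 16-bit register is set. */
unsigned lowest_pending16(uint32_t reg)
{
    unsigned i = 0;
    reg &= 0xffffu;
    if (reg == 0) return 16;
    if (!(reg & 0x00ffu)) { reg >>= 8; i += 8; }
    if (!(reg & 0x000fu)) { reg >>= 4; i += 4; }
    if (!(reg & 0x0003u)) { reg >>= 2; i += 2; }
    if (!(reg & 0x0001u)) { i += 1; }
    return i;
}

/* Variant 2: move into the upper half whenever it is non-empty,
 * so the highest set bit wins. Returns 16 for an empty register. */
unsigned highest_pending16(uint32_t reg)
{
    unsigned i = 0;
    reg &= 0xffffu;
    if (reg == 0) return 16;
    if (reg & 0xff00u) { reg >>= 8; i += 8; }
    if (reg & 0x00f0u) { reg >>= 4; i += 4; }
    if (reg & 0x000cu) { reg >>= 2; i += 2; }
    if (reg & 0x0002u) { i += 1; }
    return i;
}
```

With pins 2 and 8 pending (reg = 0x0104), the first variant serves pin 2 and the second serves pin 8. For a single pending pin both agree.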

This and the improved mp_hal_ticks_us64() (from my earlier comment above) give a substantial improvement:

Now the script yields:

Runs: 1003 tdif[0:3]: array('I', [18, 6, 6])
Deviations (us) lowest: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Deviations (us) highest: [7, 7, 7, 7, 7, 7, 7, 7, 7, 7]

which is substantially faster and at the same time the variation is almost gone.

For the background: This was the MP program in which a timer periodically (every 5 ms) triggers an interrupt that changes a pin, e.g. the LED (the pin_start callback), which in turn triggers a hard pin interrupt (the pin_isr):

@micropython.viper
def pin_start(t):
    global t_start
    ledp = ptr32(0x40010C00)
    t_start = ticks_us()
    ledp[0] = 0   # all pin values of bank A = low, was: led(0)

@micropython.viper
def pin_isr(p):
    global t_stop, i_int
    # t_stop = ticks_us()
    t_stop = p.timestamp()
    led(1)
    i = int(i_int)
    if i < N_Runs:
        tdif[i] = ticks_diff(t_stop, t_start)
    i += 1
    i_int = i

I'm looking forward to seeing confirmation from your measurements. The question is now, can we subtract a constant amount of us from Pin.timestamp_us()? I suppose a value like 3 would be quite correct?

The division by 40 in mp_hal_ticks_us64() might still seem a bit awkward, but I'm quite sure it is correct. It is tested for all 32 bit values. And the analogy to the 32-bit case should hold. I will perhaps review it later, simplify / speed-up a bit and test more thoroughly.

robert-hh commented 7 months ago

should work, but is it faster?

The execution time of the third version is about 1.22µs. Much better than expected. Maybe the overhead of the pin toggling is large, which extends the measured time.

Do we have uint128_t on the platform?

I've never seen it.

rkompass commented 7 months ago

I get this with active network connection:

Runs: 1003 tdif[0:3]: array('I', [18, 6, 8])
Deviations (us) lowest: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
Deviations (us) highest: [10, 11, 11, 11, 11, 11, 11, 11, 12, 12]

That would still allow for ultrasonic distance measurement with millimeter precision.

robert-hh commented 7 months ago

The times I get:
Execution time of ticks_us(): 1.22µs
Latency of echo pulse to trigger: about 5.5 µs, with rare 10µs events, never less than 4.8 µs.
Tested with PA0, PB15 and PB16.

So you subtract like 5µs. Or just ignore it. If it's about taking time differences, it does not matter.

rkompass commented 7 months ago

So let's subtract 4. Then there is always a little positive latency, as expected. And someone naive can still mix the timestamp with ordinary ticks_us values without disappointment.

How about other interrupts? Can the same be done for the hard timer interrupt?

robert-hh commented 7 months ago

I'm running now a simple test for ticks_us() being monotonic:

import time
then = 0
while True:
    now = time.ticks_us()
    if now < then:
        print(time.time(), "roll over", then, now)
    then = now

It should roll over every 1074 seconds. So far it looks good. I'll let it run for the next hours, while I have family business. What do you expect for the hard timer interrupt? There is no external event that can be timed. You can rely on the hard timer to trigger at the set times. The variations you see are caused by IRQ handler response jitter. That will not go away, and unlike the time difference you have seen with Pin IRQ it has both positive and negative values. These should add up to 0.
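The 1074 seconds follow from a tick counter that wraps at 2^30 µs (about 1073.74 s). The sketch below models wrap-aware tick arithmetic in the style of time.ticks_diff() and time.ticks_add(); the period of 2^30 is an assumption read off the "then" values just below 1073741824 reported later in this thread, and the C functions are illustrative, not the port's implementation.

```c
#include <assert.h>   /* for host-side checks */
#include <stdint.h>

/* Assumption: the tick counter wraps at 2^30. */
#define TICKS_PERIOD (1u << 30)

/* Wrap-aware signed difference end - start. */
int32_t ticks_diff(uint32_t end, uint32_t start)
{
    uint32_t d = (end - start + TICKS_PERIOD / 2) & (TICKS_PERIOD - 1);
    return (int32_t)d - (int32_t)(TICKS_PERIOD / 2);
}

/* Wrap-aware offset, e.g. to subtract a constant latency from a timestamp. */
uint32_t ticks_add(uint32_t t, int32_t delta)
{
    return (t + (uint32_t)delta) & (TICKS_PERIOD - 1);
}
```

With then = 1073741817 and now = 13, ticks_diff(now, then) gives 20 across the wrap. Subtracting a constant latency of 4 from a timestamp of 2 with ticks_add(2, -4) wraps to 1073741822 instead of going negative, which is why a fixed correction is safe as long as only differences are evaluated.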

robert-hh commented 7 months ago

The updated code is uploaded. There is no subtraction of x in the C code. That should be left to the Python code. When just looking at time differences of signals and signal slopes, it does not matter anyhow.

robert-hh commented 7 months ago

There was an irregularity in the above test of ticks_us(). It happened at about 4 AM.

time()  time() diff  then        now        ticks() diff
1075                 1073741817  13         20
2150    1075         1073741813  10         21
3225    1075         1073741815  11         20
4300    1075         1073741808  5          21
5375    1075         1073741804  1          21
................
24721   1074         1073741808  5          21
25796   1075         1073741804  1          21
26870   1074         1073741803  0          21
27944   1074         1073741810  7          21
28052   108          214748358   107374196  966367662
29019   967          1073741818  15         21
30093   1074         1073741816  13         21
31167   1074         1073741812  9          21
32242   1075         1073741820  17         21
33316   1074         1073741794  3          33
34390   1074         1073741812  9          21
35465   1075         1073741810  7          21
36539   1074         1073741817  14         21
37613   1074         1073741804  1          21

The two short periods add up to 1075 again. Not sure what happened. I will continue the test.

robert-hh / micropython

Compilation of w600 port III #20