robert-hh / micropython

MicroPython - a lean and efficient Python implementation for microcontrollers and constrained systems
https://micropython.org

Compilation of w600 port III #20

Closed rkompass closed 7 months ago

rkompass commented 1 year ago

We continue here the discussions Compilation of w600 port and Compilation of w600 port II (both now closed) on the development and debugging of the w60x port in the MP branch w60x.

rkompass commented 7 months ago

The algorithm to divide by 5 is from the attached source.

It is a multiplication of t with 0xcccccccccccccccd and then a shift by 66 bits to the right.

The values are in table 10-2.

They are also swiftly computed with this program (also from the source):

import sys

def magicgu(nmax, d):
    nc = (nmax + 1)//d*d - 1
    nbits = len(bin(nmax)) - 2
    for p in range(0, 2*nbits + 1):
        if 2**p > nc * (d - 1 - (2**p - 1) % d):
            m = (2**p + d - 1 - (2**p - 1) % d)//d
            return (m, p)
    print("Can't find p, something is wrong.")
    sys.exit(1)

bits = 32
d = 5
t = magicgu(2**bits - 1, d)
print('div_by:', d, 'in:', bits, 'bits:', hex(t[0]), t[1])
bits = 64
d = 5
t = magicgu(2**bits - 1, d)
print('div_by:', d, 'in:', bits, 'bits:', hex(t[0]), t[1])
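
As a quick sanity check (a sketch of mine, not from the attached source), the 64-bit constant pair can be verified directly against integer division, since Python's big ints model the wide product:

```python
# Sanity check (a sketch, not from the attached source): the 64-bit pair
# (m, p) = (0xcccccccccccccccd, 66) must satisfy (t * m) >> p == t // 5
# for every unsigned 64-bit t.
m, p = 0xCCCCCCCCCCCCCCCD, 66
assert m == (2**p + 4) // 5               # m is ceil(2**66 / 5)
for t in (0, 1, 4, 5, 6, 2**32 - 1, 2**32, 2**63, 2**64 - 1):
    assert (t * m) >> p == t // 5, t
print("ok")
```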

One source of overflow might be the implementation of the multiplication, which is done in parts due to the limited int width. We have no carry bits, so we have to take care.

description.zip

robert-hh commented 7 months ago

Could be a glitch in RAM as well. I used your PC test code and tested it already for numbers up to 8 * 2**32. No issue. For ticks_us() the upper 32 bits could be limited. For ticks_ms() the ordinary division is used. The same test should be executed on the board to see if there is a problem with the libs.

rkompass commented 7 months ago

For the code:

Description of division by 5 through multiplication

Goal: perform (t * 0xcccccccccccccccd) >> 66

t (unsigned 64 bit) is cut into y and x (both 32 bit, but represented in 64 bits):

uint64_t x = t & 0xffffffffull; // lower 32 bits
uint64_t y = t >> 32;           // upper 32 bits

Then the multiplication is arranged as follows:

 ------------------------------------------------------------------
  - - - 32 - - - | - - - 32 - - - | - - - 32 - - - | - - - 32 - - -
 ------------------------------------------------------------------
                                   --------------------------------
                                  |       x  *  0xcccccccd         |
                                   --------------------------------
                  ---------------------------------                 
                 |       x  *  0xcccccccc          |
                  ---------------------------------
                  ---------------------------------
                 |       y  *  0xcccccccd          |
                  ---------------------------------
 ---------------------------------
|       y  *  0xcccccccc          |
 ---------------------------------

We reduce the 4 multiplications to 2 like this and reorder a bit:

 ------------------------------------------------------------------
  - - - 32 - - - | - - - 32 - - - | - - - 32 - - - | - - - 32 - - -
 ------------------------------------------------------------------
                                   --------------------------------
                                  |       x  *  0xcccccccc         |
                                   --------------------------------
                                                    ---------------
                                                   |       x       |
                                                    ---------------
                  ---------------------------------                 
                 |       x  *  0xcccccccc          |
                  ---------------------------------
                                   ----------------
                                  |       y        |
                                   ----------------
                  ---------------------------------
                 |       y  *  0xcccccccc          |
                  ---------------------------------
 ---------------------------------
|       y  *  0xcccccccc          |
 ---------------------------------

Computation starts with the two multiplications:

uint64_t xc = x * 0xccccccccull;
uint64_t yc = y * 0x33333333ull; // this is (y * 0xccccccccull) >> 2

The result is set up with uint64_t r = (xc + x) >> 32;, then updated with r += xc + y;, then shifted 2 bits to the right (r >>= 2;), and the already right-shifted yc is added (r += yc;). Then r is shifted to the right by 32 (r >>= 32;) and the 2-bit right-shifted yc is added again (r += yc;).

The total amount of right-shift is 66. The early right-shift by 2 should prevent an overflow of r.

I will test for overflow (in case I made a mistake in the above logic) with some high values of x and y later. If that gives correct results, then the algorithm should be OK.
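
The step sequence above can be mirrored in Python and checked against plain integer division (a sketch of the same arithmetic, not the port's code; `div5` is an illustrative name):

```python
# Python model of the 32-bit-part division sketched above (an illustration,
# not the C implementation itself). Intermediates are masked to 64 bits to
# mimic uint64_t wrap-around.
M64 = (1 << 64) - 1

def div5(t):
    x = t & 0xFFFFFFFF            # lower 32 bits of t
    y = t >> 32                   # upper 32 bits of t
    xc = (x * 0xCCCCCCCC) & M64
    yc = (y * 0x33333333) & M64   # == (y * 0xCCCCCCCC) >> 2, exactly
    r = ((xc + x) & M64) >> 32
    r = (r + xc + y) & M64
    r >>= 2
    r = (r + yc) & M64
    r >>= 32
    r = (r + yc) & M64
    return r

# Spot checks, including values near the 32- and 64-bit boundaries:
for t in (0, 5, 6, 7, 9, 10, 2**32 - 1, 2**32, 2**33, 2**64 - 5, 2**64 - 1):
    assert div5(t) == t // 5, t
print("ok")
```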

robert-hh commented 7 months ago

It happened again:

Abs. Time   time()  time() diff     then    now         ticks() diff
[07:27:36]  1074                1073741810  7           21
[07:45:30]  2149    1075        1073741810  7           21
[08:03:24]  3223    1074        1073741822  19          21
[08:21:17]  4298    1075        1073741821  17          20
[08:39:11]  5373    1075        1073741820  17          21
[08:57:05]  6447    1074        1073741809  6           21
[09:14:58]  7522    1075        1073741817  14          21
[09:16:46]  7630    108         214748362   107374201   966367663
[09:32:52]  8597    967         1073741820  17          21

Interestingly the now values differ only by 5 and the time diff values are identical.

rkompass commented 7 months ago

Can we see from these values what went into the division? Or: are the then or now values the result of the division? (Then we would have only a few candidates for the dividend.)

Can you post your script?

robert-hh commented 7 months ago

The now value is the result of the division, and it adds up to 7 * 2^30 for the seven cycles before. The time 7650 is exactly 7 1/8 steps.

robert-hh commented 7 months ago

There is a general timing problem in this line:

    uint64_t t = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3;  // divide by 8

Since the register increments fast, it can happen that ticks_total and the value taken from the register do not match. This will lead to a drastic error when, at that moment, the register rolls over from its minimum value to 0xffffffff. I wonder if that may be a problem for each 40-count segment.

edit: I just started the divide test on a w600. Will take a day (at least).

rkompass commented 7 months ago

That is something I was curious about too. So the ticks value t would then be about 0xffffffff too small. And about 0xffffffff//40 = 107374182 would be the deviation. Which you observed these two times quite precisely!
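
The arithmetic behind that figure, as a quick sketch:

```python
# Expected jump when ticks_total and the register reading straddle a
# roll-over: one full 32-bit span of 40 MHz ticks, expressed in µs.
print(0xFFFFFFFF // 40)   # → 107374182
```

The logged jump of 107374201 in the table above differs from this by only 19 µs.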

To prove this, an older binary running your test should exhibit the same flaw.

From the above I still can only guess what kind of test you ran.

rkompass commented 7 months ago
uint64_t mp_hal_ticks_us64(void) {
    uint64_t tt = ticks_total;
    uint64_t t = (tt + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3;  // divide by 8
    // we do now a fast division by 5 by performing (t * 0xcccccccccccccccdull) >> 66  in 32bit parts
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;              // higher 32 bits of t
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc+x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    if (ticks_total > tt)
        r += 107374182;   // assuming 40 MHz tick: 0xffffffff // 40 = 107374182
    return r;
}

would be a weak remedy; it assumes that the ticks interrupt was serviced during the division process. One should also query the ticks interrupt pending bit somewhere. If we can be sure that the ticks (WDT) interrupt also interrupts this calculation, then this should be fine.

rkompass commented 7 months ago

I tested the division algorithm with very high values in x and y and it is correct (no missing carry / overflow).

#include <stdio.h>
#include <stdint.h>

uint64_t div5_new(uint64_t t) {
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

int main(void) {
    int k = 0;                                                    // 4294967296 == 2^32
    for (uint64_t x = 4294867295ull; x < 4294967296ull; x++)      // from ~100_000 below to max
        for (uint64_t y = 4294867295ull; y < 4294967296ull; y++) {
            uint64_t i = y << 32 | x;
            if (i / 5 != div5_new(i)) {
                printf("%llu: %llu %llu\n", (long long unsigned)i,
                       (long long unsigned)(i / 5), (long long unsigned)div5_new(i));
                k += 1;
                if (k > 10)
                    break;
            }
        }
    return 0;
}

robert-hh commented 7 months ago

And about 0xffffffff//40 = 107374182 would be the deviation. Which you observed these two times quite precisely!

Then probably this is the reason. There is a common method to avoid it, and I implemented that already for the SAMD and MIMXRT ports, but it further slows down the code. I'll try it.

rkompass commented 7 months ago

Searching for that code I find that in the SAMD code this function is called mp_hal_ticks_us_64(), while here it is mp_hal_ticks_us64().

rkompass commented 7 months ago

One should also query the ticks interrupt pending bit somewhere.

In SAMD you did exactly that.

rkompass commented 7 months ago

Meanwhile I am exploring different methods of distance measurement with the HC-SR04 a bit, to evaluate our new timestamp.

robert-hh commented 7 months ago

In SAMD you did exactly that.

I tried that here, but it does not seem to work. It only extends the time for ticks_us() from 1 µs to 3 µs. Disabling IRQs using the provided functions is slow. The WDT allows querying the interrupt pending bit, but it does not allow disabling its IRQ temporarily. And with the RTOS, simply disabling all interrupts does not seem to work either. Maybe the WDT is simply not affected by enable/disable, which would be reasonable.

robert-hh commented 7 months ago

I just started the divide test on a w600. Will take a day (at least).

That test looks good, showing no problems with the divide mechanism compiled for a W600.

rkompass commented 7 months ago

If we are able to distinguish between a pending/occurring WDT update interrupt and a previous one, then we can use that.

Given that the WDT update interrupt occurs every 107 seconds, the seconds value (of hour, minute, second) of the RTC could be stored in a variable during this interrupt. Immediately after the WDT counter reload (which we recognize by the low register count), the RTC seconds freshly read by ticks_us() should either be close to or after the stored value, indicating that the WDT update interrupt has already happened, or not, indicating that the interrupt is yet to occur (or occurs during the ticks_us() computation). The latter may lead to an error, but can be checked by reading the WDT-stored RTC seconds value twice, at the entry of the ticks_us() computation and at its exit. If they differ, the ticks_total value has to be read again.

This only makes sense if the RTC seconds value can be read fast.

rkompass commented 7 months ago

I noted: the function tls_watchdog_stop_cal_elapsed_time() is programmed incorrectly as shipped in Platform/Drivers/watchdog/wm_watchdog.c.

As for the last idea:

This only makes sense if the RTC seconds value can be read fast.

In Platform/Drivers/rtc/wm_rtc.c we have:

void tls_get_rtc(struct tm *tblock)
{
    int ctrl1 = 0;
    int ctrl2 = 0;

    ctrl1 = tls_reg_read32(HR_PMU_RTC_CTRL1);
    ctrl2 = tls_reg_read32(HR_PMU_RTC_CTRL2);
    tblock->tm_year = ((int)((int)ctrl2 & 0x00007f00) >> 8);
    tblock->tm_mon  = (ctrl2 & 0x0000000f);
    tblock->tm_mday = (ctrl1 & 0x1f000000) >> 24;
    tblock->tm_hour = (ctrl1 & 0x001f0000) >> 16;
    tblock->tm_min  = (ctrl1 & 0x00003f00) >>  8;
    tblock->tm_sec  =  ctrl1 & 0x0000003f;
}

So to get the seconds it would suffice to do something like int wdt_int_rtc_secs = tls_reg_read32(HR_PMU_RTC_CTRL1) & 0x0000003f; within the WDT update interrupt, storing the result in a global variable wdt_int_rtc_secs.

This is fast. The same read may then be done in mp_hal_ticks_us64() for comparison, and we check the state of the WDT update interrupt as just outlined.

What do you think? I may make a code suggestion for that later.

robert-hh commented 7 months ago

What do you think? I may make a code suggestion for that later.

I don't yet get the mechanism you suggest. Meanwhile, I tried another variant of the IRQ disable/enable approach. It did not work either. More and more I'm leaning towards using a timer for ticks_us(), separating timer and watchdog for simplicity and clarity. Your work on div5() is great, but using a 40 MHz timer with an 80 MHz clock is hard to deal with. Even with the timer we have to deal with the synchronous update of µs-counter and overflow counter. But that's less critical. The timer interrupt should behave more reasonably, and there is more time to spare.

robert-hh commented 7 months ago

The timer cannot be used for that purpose, because the actual timer counter value cannot be read back. Too sad.

robert-hh commented 7 months ago

I'm trying something along these lines:

    uint64_t t1 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
    uint64_t t2 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
    uint64_t t;
    if ((t1 >> 16) != (t2 >> 16)) {
        t = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3;  // divide by 8
    } else {
        t = t1 >> 3;
    }

The problem can only happen if during reading the value of t an overflow happens, so the value of t is read twice. If these differ by a large amount, then the timer overflow happened during one of the reads, and a new read is done which is not subject to an overflow. If they are similar, then either of them is good.
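
The read-twice pattern can be sketched in Python (illustration only; `read_counter` is a fake source that injects one torn pair of readings around an overflow):

```python
# Sketch of the read-twice consistency check. The fake counter delivers a
# torn pair (the overflow hits between the first two reads), then a clean read.
reads = iter([0xFFFFFFF0, 0x1_0000_0010, 0x1_0000_0012])

def read_counter():
    return next(reads)

t1 = read_counter()
t2 = read_counter()
if (t1 >> 16) != (t2 >> 16):   # suspicious jump: overflow may have hit between reads
    t = read_counter() >> 3    # a third read cannot span the same overflow
else:
    t = t1 >> 3
print(hex(t))   # → 0x20000002
```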

rkompass commented 7 months ago

Your logic seems correct.

The problem can only happen if during reading the value of t an overflow happens.

Agreed, up to the consideration below. The above code might be slightly corrected, because ((t1 >> 16) != (t2 >> 16)) may trigger even when t1 and t2 are very close to each other, whenever they straddle a 16-bit boundary. OTOH the cost of reading a third time is not high compared to the division!

Certainly I was thinking too complicated, but I assumed that after the HR_WDG_CUR_VALUE reload the mp_hal_ticks_us64() function might still read a not-yet-updated ticks_total value, and that a second reading might still happen without the WDT interrupt having taken effect.

If that cannot happen, as the WDT interrupt is immediate, then your logic seems correct.

I did (paranoid, it seems):

uint64_t mp_hal_ticks_us64(void) {
    uint64_t tt = ticks_total;
    uint32_t rr = tls_reg_read32(HR_WDG_CUR_VALUE);
    if (rr > 0xfffdffff) {  // < 2.5 ms after WDT ticks reload, perhaps tt still not updated: check
        int abs_sec_dif = ABS((tls_reg_read32(HR_PMU_RTC_CTRL1) & 0x0000003f) - WDT_INT_RTC_SECS);
        if (abs_sec_dif == 0 || abs_sec_dif == 1 || abs_sec_dif == 59)  // update has happened
            tt = ticks_total;                   // it might have happened just now, so take new value
        else
            tt++;     // it had not happened yet when we took the first ticks_total, so correct that value
    }
    uint64_t t = (tt + (ticks_reload_value - rr)) >> 3;  // divide by 8
    // we do now a fast division by 5 by performing (t * 0xcccccccccccccccdull) >> 66  in 32bit parts
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;              // higher 32 bits of t
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc+x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

This assumes a macro ABS() for the absolute value is available. The WDT interrupt would store its seconds value in a global variable WDT_INT_RTC_SECS (to be added to its interrupt handler).

robert-hh commented 7 months ago

My logic may be correct, but the code optimizer has a different attitude: it eliminates the second call to tls_reg_read32(), and so the approach fails. Edit: Placing the register reading into a separate non-optimized function ensures that the register is read twice. But still the approach may fail, if the WDT interrupt is delayed.

rkompass commented 7 months ago

How about

volatile uint64_t t1 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
volatile uint64_t t2 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));

reference here.

But still the approach may fail, if the WDT interrupt is delayed.

I found here:

Processors | Cycles with zero wait state memory
-- | --
Cortex-M0 | 16
Cortex-M0+ | 15
Cortex-M3 | 12
Cortex-M4 | 12

Table 1: Interrupt latency of Cortex-M processors with zero wait state memory systems

So if the priority of the WDT interrupt is high enough, your approach might almost work. Putting the readings into extra function calls might help guarantee that the readings are more than 12 clock cycles apart. You studied the assembly; can you see how many clock cycles are involved?

rkompass commented 7 months ago

I would just try that and then do another many-hour test. I would also test it here, if you send the script.

robert-hh commented 7 months ago

Using volatile does not work. And yes, I looked into the generated assembly code. I have now placed the reading tls_reg_read32(HR_WDG_CUR_VALUE) into a separate non-optimized function, which creates quite a few instructions:

#pragma GCC push_options
#pragma GCC optimize ("O0")
static uint32_t ticks_read_value(void) {
    return ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE);
}
#pragma GCC pop_options

uint64_t mp_hal_ticks_us64(void) {
    uint32_t t1 = ticks_read_value();
    uint32_t t2 = ticks_read_value();
    uint64_t t;
    if ((t1 >> 16) != (t2 >> 16)) {
        t = (ticks_total + ticks_read_value()) >> 3;  // divide by 8
    } else {
        t = (ticks_total + t1) >> 3;
    }
    uint64_t x = t & 0xffffffffull;    // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

The Python test script is still:

import time
then = 0
while True:
    now = time.ticks_us()
    if now < then:
        print(time.time(), "roll over", then, now)
    then = now

It generates a message every ~17 minutes with the intended roll-over, with a small value as the last number.
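
The ~17-minute interval matches ticks_us() wrapping at 2**30 µs; a quick check:

```python
# MicroPython's ticks_us() wraps at 2**30 microseconds, which explains the
# roll-over message appearing roughly every 17 minutes.
period_min = 2**30 / 1_000_000 / 60
print(round(period_min, 1))   # → 17.9
```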

rkompass commented 7 months ago

I observed the error from last night's run but then realized I hadn't compiled the new mp_hal_ticks_us64() in before.

robert-hh commented 7 months ago

No success overnight. I had 4 faults; the fault rate is almost the same. Only the time for executing ticks_us() is increased from 1 µs to 2 µs. So I'll stop with this attempt. How many faults did you see?

In any case there is the code caching problem. If the code for the WDT IRQ handler has to be moved in from flash, the execution can be delayed by like 300µs. I could cover that by moving that function to IRAM, like I did with the bitstream function. Technically it works, but code execution from RAM is three times slower than from flash. For the WDT IRQ, that might not matter.

rkompass commented 7 months ago

Such a little nasty problem !

How many faults did you see?

Two, but that was still from the old version of mp_hal_ticks_us64(), as I had forgotten to make the changes and recompile! :-(

In any case there is the code caching problem. If the code for the WDT IRQ handler has to be moved in from flash, the execution can be delayed by like 300µs.

Then my assumption of the worst case WDT IRQ latency was probably the right approach.

My above solution with the RTC seconds then should work.

There is an

if (abs_sec_dif == 0 || abs_sec_dif == 1 || abs_sec_dif == 59)  // update has happened

where I hope it correctly represents all cases in which the stored seconds value WDT_INT_RTC_SECS differs by at most 1 from the just-read value.

Storing the seconds into the WDT_INT_RTC_SECS global variable would still have to be worked into the WDT IRQ code, and the availability of ABS() for the absolute value be checked.
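
The condition can be checked exhaustively over all pairs of RTC seconds values (a quick sketch, not meant for the firmware):

```python
# Exhaustive check: abs_sec_dif in (0, 1, 59) should hold exactly when two
# RTC seconds values are at most one second apart on the circular 0..59 scale.
for a in range(60):
    for b in range(60):
        circular = min((a - b) % 60, (b - a) % 60)   # circular distance
        flagged = abs(a - b) in (0, 1, 59)
        assert flagged == (circular <= 1), (a, b)
print("ok")
```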

robert-hh commented 7 months ago

b.t.w. In your code it must not be tt++ but tt += ticks_reload_value.

I started a new series with the WDG IRQ in RAM.

Edit: No improvement.

rkompass commented 7 months ago

In your code it must not be tt++ but tt += ticks_reload_value.

Yes.

robert-hh commented 7 months ago

I just made a strange observation. I wanted to test whether, in my logic comparing the two samples, the readings ever differ. I had expected that this happens rarely. But to no avail: it happens frequently, with times as short as 3 ms between two events. Still, the returned µs values look fine. This is the code for testing:

    if ((t1 >> 16) != (t2 >> 16)) {
        *(uint32_t *)0x40010c00 |= 2;
        t = get_wdg_counter_value() >> 3;  // divide by 8
        *(uint32_t *)0x40010c00 &= 0xfffd;
    } else {
        t = t1 >> 3;
    }

And this is the screenshot. The yellow trace shows the events, together with the -Width statistics.

wdg_reading_compare_fail

robert-hh commented 7 months ago

Another strange effect: just calling the function below to read a 64-bit value for t usually lasts 750 ns, which is already slow, but sometimes up to 11 µs.

uint64_t get_wdg_counter_value(void) {
    return wdg_ticks_total + (uint64_t)(wdg_ticks_reload_value - *(TLS_REG *)HR_WDG_CUR_VALUE);
}
rkompass commented 7 months ago

There are more sources of interrupts, it seems.

robert-hh commented 7 months ago

Like once a second the reading loop of ticks_us() stops for about 250µs. And like every 100ms for about 150µs.

rkompass commented 7 months ago

Are you aware that if ((t1 >> 16) != (t2 >> 16)) may become true whenever t1 and t2 cross the 16 bit boundary one after the other? This condition does not mean that t1 and t2 have to be more than 1<<16 ticks apart! This can happen every (1<<16) / 40_000_000 s = 0.0016384 s, i.e. every ~1.6 milliseconds. So in the Siglent screenshot with its 150 ms timebase there are about 90 chances for this to happen. I count 29 such events. Many of the events should lie on a grid with this 1.638 ms spacing.
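
A quick check of these numbers (sketch):

```python
# How often a free-running 40 MHz count crosses a 16-bit boundary:
boundary_period_s = (1 << 16) / 40_000_000
print(boundary_period_s)                  # → 0.0016384

# Chances for a false double-read mismatch in a 150 ms scope window:
print(round(0.150 / boundary_period_s))   # → 92
```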

rkompass commented 7 months ago

And recently the fact that we are running under FreeRTOS came to my mind. Is this preemptive multitasking? Realized by interrupts? I found this, but have not studied it.

robert-hh commented 7 months ago

Are you aware that if ((t1 >> 16) != (t2 >> 16)) may become true whenever t1 and t2 cross the 16 bit boundary one after the other?

That matches later states, where the minimum time distance was like 1.9 ms. That means that a third value is taken too often. Nevertheless, a board has been running the test for about 16 hours now without any unexpected roll-over. So it looks as if the mechanism works, at least for a silent board.

FreeRTOS came to my mind. Is this preemptive multitasking? Realized by interrupts?

Yes.

robert-hh commented 7 months ago

Instead of if ((t1 >> 16) != (t2 >> 16)) I could write the test as e.g. if ((t2 - t1) > 400). That still has the chance of a faulty double draw, but at a much lower frequency, and code-wise it's equivalent. Most of the time, getting t1 and t2 plus the test takes about 1.5 µs, sometimes 6.2 µs. The whole call to mp_hal_ticks_us64 takes 2 µs, sometimes 6.8 µs. That's with the actual reading of the WDG timer and the WDT IRQ function in RAM. Reading and comparing just the lower 32 bits of t1 and t2, the execution time can be reduced somewhat to 1.7 µs, rarely 6.5 µs, even more rarely 10.7 µs, and besides that 249 µs.

The 249µs events seem to be caused by code cache misses.

rkompass commented 7 months ago

Nevertheless, a board has been running the test for about 16 hours now without any unexpected roll-over. So it looks as if the mechanism works, at least for a silent board.

What code do you refer to here?

robert-hh commented 7 months ago

The last one pushed to the repository. That's the one under long term test. I simplified it a little bit now, saving a few instructions. But the method is the same.

rkompass commented 7 months ago

So it's the combination of moving the WDT IRQ to IRAM and the if ((t1 >> 16) != (t2 >> 16)) check! Then let's hope that it runs through.

The IRAM: I would have never been able to guess that as a remedy. I thought this was only a detail with the esp8266.

robert-hh commented 7 months ago

The IRAM: I would have never been able to guess that as a remedy. I thought this was only a detail with the esp8266.

Moving critical code to RAM is quite common in MicroPython. Some MCUs have special RAM sections for it, like the IRAM of the ESP8266, or the ITC RAM of the MIMXRT, which is configured to be at least 64k big and holds code like the gc or the MP runtime engine.

rkompass commented 7 months ago

I have now made some measurements with ultrasonic distance sensor HC-SR04:

The sensor is pointed at a book standing about 22 cm away. You have to know that the sensor has a cheap Chinese µC on it that polls for the echo signal and sometimes misses the first wave front. This leads to an error of about 25 µs with a 40 kHz signal (I measured ~22 µs). So you get clusters of measured values, like e.g. in this graph (from elsewhere, where the poster did not yet know the cause): hc_sr04_errors

This implies that for a useful analysis of the data you have to cluster the echo times first, which is what I have done.

The measurement uses code adapted from the recent discussion on GitHub: https://github.com/orgs/micropython/discussions/14155

The ISR is either:

    def echo_isr(self, pin):
        t_us = ticks_us()
        if self.echo():             # Rising edge means start of echo
            self.echo_start = t_us
        else:                        # Falling edge means end of echo
            self.echo_end = t_us

or

    def echo_isr(self, pin):
        t_us = pin.timestamp()
        if self.echo():             # Rising edge means start of echo
            self.echo_start = t_us
        else:                        # Falling edge means end of echo
            self.echo_end = t_us

The results:

############## using ticks_us() ##################

cluster: 0 n: 6  mean: 1249.667  stdev: 1.21106
cluster: 1 n: 328  mean: 1274.125  stdev: 3.020761
cluster: 2 n: 160  mean: 1295.0  stdev: 2.389705
outliers: 6

cluster: 0 n: 6  mean: 1250.333  stdev: 2.160247
cluster: 1 n: 326  mean: 1274.061  stdev: 3.263206
cluster: 2 n: 168  mean: 1295.369  stdev: 2.532564
outliers: 0

cluster: 0 n: 6  mean: 1250.167  stdev: 1.834848
cluster: 1 n: 277  mean: 1274.556  stdev: 2.986925
cluster: 2 n: 191  mean: 1295.728  stdev: 2.998106
cluster: 3 n: 5  mean: 1432.6  stdev: 1.140175
cluster: 4 n: 7  mean: 1442.571  stdev: 0.9759001
outliers: 14

cluster: 0 n: 192  mean: 1274.401  stdev: 3.229539
cluster: 1 n: 285  mean: 1295.572  stdev: 2.741032
cluster: 2 n: 13  mean: 1315.692  stdev: 1.974193
outliers: 10

cluster: 0 n: 5  mean: 1251.2  stdev: 2.04939
cluster: 1 n: 329  mean: 1274.115  stdev: 3.112529
cluster: 2 n: 159  mean: 1295.208  stdev: 2.897134
outliers: 7

cluster: 0 n: 5  mean: 1247.8  stdev: 1.643168
cluster: 1 n: 373  mean: 1274.257  stdev: 2.791728
cluster: 2 n: 114  mean: 1294.842  stdev: 3.501146
cluster: 3 n: 5  mean: 1314.0  stdev: 2.645751
outliers: 3

cluster: 0 n: 6  mean: 1250.0  stdev: 1.67332
cluster: 1 n: 393  mean: 1274.135  stdev: 3.05806
cluster: 2 n: 95  mean: 1295.0  stdev: 3.319831
outliers: 6

cluster: 0 n: 7  mean: 1250.857  stdev: 2.478479
cluster: 1 n: 361  mean: 1274.08  stdev: 3.218281
cluster: 2 n: 132  mean: 1294.924  stdev: 2.977321
outliers: 0

############## using pin.timestamp_us() ##################

cluster: 0 n: 340  mean: 1281.409  stdev: 0.504182
cluster: 1 n: 157  mean: 1302.924  stdev: 0.9166652
outliers: 3

cluster: 0 n: 242  mean: 1281.401  stdev: 0.491082
cluster: 1 n: 258  mean: 1302.717  stdev: 0.4845679
outliers: 0

cluster: 0 n: 353  mean: 1281.564  stdev: 0.7127402
cluster: 1 n: 147  mean: 1302.81  stdev: 0.5146255
outliers: 0

cluster: 0 n: 290  mean: 1281.421  stdev: 0.501471
cluster: 1 n: 207  mean: 1302.831  stdev: 0.6195915
outliers: 3

cluster: 0 n: 436  mean: 1281.381  stdev: 0.5729496
cluster: 1 n: 61  mean: 1303.098  stdev: 1.179052
outliers: 3

cluster: 0 n: 445  mean: 1281.44  stdev: 0.5234835
cluster: 1 n: 52  mean: 1302.846  stdev: 0.7767355
outliers: 3

cluster: 0 n: 479  mean: 1281.39  stdev: 0.4883492
cluster: 1 n: 21  mean: 1303.333  stdev: 1.197219
outliers: 0

cluster: 0 n: 450  mean: 1281.504  stdev: 0.5553782
cluster: 1 n: 49  mean: 1303.122  stdev: 1.092258
outliers: 1

Much reduced variance: the standard deviation of the main cluster goes down from ~3 µs to ~0.5 µs. But a systematic change of values is also observed. This may be due to IRAM latency of the first (echo onset) ticks_us() call in the original ISR?

Also there are more clusters in the original ISR version, which now clearly appear to be artifacts. So: Big success so far.

With a larger number of measurements (which allows recognizing the clusters), a precision of distance to flat surfaces of < 0.2 mm should now be possible.

Of course I will have to repeat the measurements with the latest PRs.

robert-hh commented 7 months ago

That looks good. So it's about 3 µs between the two clusters. When I take note of the time it takes for getting ticks_us(), I have the majority of samples at about 2 µs, and a rare set at 6.8 µs. The test sampled ticks_us() every 28 µs. The period of the 6.8 µs events was about 100 ms. The 11 µs event occurred like once. The oscilloscope was set to infinite persistence, so even single events stay visible. There was once per second a 249 µs duration, which is not shown. So it might be that with the other software you get a second cluster at about 5 µs distance.

call_ticks_us64

Note: ticks test update: After 27 hours still no fault. I stopped it and loaded the slightly simplified version.

rkompass commented 7 months ago

So it's about 3 µs between the two clusters.

I'm afraid we are referring to different things. The clusters of echo pin durations (aka ultrasonic wavefront latencies) of the HC-SR04 were 21 µs apart, which is an HC-SR04-internal effect of its using about 48 kHz for the ultrasonic signal.

But the standard deviations within a single cluster went down from 3 µs to 0.5 µs when using pin.timestamp() instead of ticks_us() in the interrupt handler. The measurements then looked like 1281, 1281, 1282, 1281, 1281, 1282, 1282, ... pin.timestamp() entered each measurement twice, so the variance of its timing was doubled (variances of independent random contributions add up). The standard deviation is therefore (as the square root of the variance) a factor of sqrt(2) ≈ 1.4 higher. Calculating backwards, the standard deviation of a single pin.timestamp() latency must have been below 0.5 µs / sqrt(2) = 0.35 µs on average. Which is very good. But of course outliers had been filtered out beforehand in the clustering post-processing.
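
The back-of-envelope step, written out (a sketch):

```python
import math

# Two independent pin.timestamp() latencies enter each echo duration, so
# their variances add; the per-call standard deviation is the measured
# cluster stdev divided by sqrt(2).
cluster_stdev_us = 0.5
per_call_stdev_us = cluster_stdev_us / math.sqrt(2)
print(round(per_call_stdev_us, 2))   # → 0.35
```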

To be honest, I don't understand your above statements. What I suppose you did: You elicited an interrupt every 28 µs, then used testing code like

*(uint32_t *)0x40010c00 |= 2;
t = get_wdg_counter_value() >> 3;  // divide by 8
*(uint32_t *)0x40010c00 &= 0xfffd;

in the C interrupt handler and observed a latency of 1.9 µs? Is that correct?

The test sampled ticks_us() every 28 µs. The period of the 6.8µs was about 100ms. The 11µs event occurred like one.

I don't get it, but would like to :-) Can you elaborate a bit on it?

robert-hh commented 7 months ago

Way simpler: the test program I use for checking that calls to ticks_us() return properly increasing numbers calls time.ticks_us() about every 28 µs. In mp_hal_ticks_us64() I toggle a pin on and off to show the duration of mp_hal_ticks_us64(). That's what the oscilloscope screenshot shows. Most of the time (99.9%) the call takes ~2 µs, but like every 100 ms it takes about 6.8 µs, and like once a second about 250 µs. So when using timestamp(), these variations in the time to call ticks_us() could as well show up in your results.

Note: In the new firmware, timestamp() is now a method of the irq object. irq.timestamp(). For uart.irq() it returns None.

Note 2: Of the ~2 µs for mp_hal_ticks_us64, reading the 64 bit counter takes ~840ns each. So calculating the division by 40 takes about 300ns. That's great! It consists of only 18 assembly instructions.

rkompass commented 7 months ago

Thank you. Got it now.

So when using timestamp(), these variations in the time to call ticks_us() could as well show up in your results.

Probably. They were filtered out as outliers. I can print the outliers too in the next test runs.

I just studied the commit. And saw you made some more changes.

It's still called timestamp_us(), isn't it? (I used timestamp() because I hadn't updated the firmware yet.)

How do I call it? I suppose with p.timestamp_us() in the ISR, when the ISR is defined as def isr(p):? Is that correct?

That's clever, as I also had a bad feeling about it being a general pin method. That way another routine could wrongly consume the timestamp. Which is now no longer possible.

I think the timestamp is a generally very valuable addition. How to proceed? Trying timestamp with esp32 and rp2 here and testing with the HC-SR04 (which I would do)? I expect the same improvements. Then discussing it and, after getting feedback, doing general PRs? So much work for you, but a big part of it is done already. Although I wouldn't be surprised if the same issues with ticks_us() occurred on these platforms (at least the esp ones).

rkompass commented 7 months ago

Note 2: Of the ~2 µs for mp_hal_ticks_us64, reading the 64 bit counter takes ~840ns each. So calculating the division by 40 takes about 300ns. That's great! It consists of only 18 assembly instructions.

I take that as a mention of honour from a great programmer!

I noted that all the firmware is optimized with -Os throughout. I wonder if gcc would optimize the division likewise if -O2 were selected for these functions.

robert-hh commented 7 months ago

How do I call it? I suppose with p.timestamp_us() in the ISR, when the ISR is defined as def isr(p):? Is that correct?

In def isr(p):, p is the pin object. You get the timestamp with p.irq().timestamp_us(). Just calling p.irq() returns the irq object of the pin.

When moving the function get_wdg_counter_value() back from RAM to flash, the time for calling it is reduced to about 640 ns, and the total time for calling mp_hal_ticks_us64 drops from ~2 µs to about 1.6 µs. I'll let that run for a while. If that is stable as well, I'll keep it.