The algorithm to divide by 5 is from the attached source.
It is a multiplication of t with 0xcccccccccccccccd and then a shift by 66 bits to the right.
The values are in table 10-2.
They are also swiftly computed with this program (also from the source):
import sys

def magicgu(nmax, d):
    nc = (nmax + 1)//d*d - 1
    nbits = len(bin(nmax)) - 2
    for p in range(0, 2*nbits + 1):
        if 2**p > nc * (d - 1 - (2**p - 1) % d):
            m = (2**p + d - 1 - (2**p - 1) % d)//d
            return (m, p)
    print("Can't find p, something is wrong.")
    sys.exit(1)
bits = 32
d = 5
t = magicgu(2**bits - 1, d)
print('div_by:', d, 'in:', bits, 'bits:', hex(t[0]), t[1])

bits = 64
d = 5
t = magicgu(2**bits - 1, d)
print('div_by:', d, 'in:', bits, 'bits:', hex(t[0]), t[1])
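For a quick cross-check of the two (m, p) pairs this prints (0xcccccccd with shift 34 for 32 bits, 0xcccccccccccccccd with shift 66 for 64 bits), a small PC-side sketch like the following could be used; it is only an illustration and assumes a GCC/Clang-style compiler that provides __uint128_t as the wide reference:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // 32-bit case: m = 0xcccccccd, p = 34; exhaustive, takes a few seconds on a PC
    for (uint64_t t = 0; t <= 0xffffffffull; t++) {
        if (((t * 0xcccccccdull) >> 34) != t / 5) {
            printf("32-bit mismatch at %llu\n", (unsigned long long)t);
            return 1;
        }
    }
    // 64-bit case: m = 0xcccccccccccccccd, p = 66; sampled sparsely, __uint128_t as reference
    for (uint64_t k = 0; k < 100000000ull; k++) {
        uint64_t t = k * 184467440737ull; // spreads the samples over the 64-bit range
        if ((uint64_t)(((__uint128_t)t * 0xcccccccccccccccdull) >> 66) != t / 5) {
            printf("64-bit mismatch at %llu\n", (unsigned long long)t);
            return 1;
        }
    }
    printf("magic constants reproduce n/5 on all tested values\n");
    return 0;
}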
One source of overflow might be the implementation of the multiplication, which is done in parts due to the limited int width. We have no carry bits, so we have to take care.
It could be a glitch in RAM as well. I used your PC test code and already tested it for numbers up to 8 * 2**32. No issue. For ticks_us() the upper 32 bits could be limited. For ticks_ms() the ordinary division is used. The same test should be executed on the board to see if there is a problem with the libs.
For the code:
Goal: perform (t * 0xcccccccccccccccd) >> 66
t (unsigned 64 bit) is cut into y and x (both 32 bit, but represented in 64 bits):
uint64_t x = t & 0xffffffffull; // lower 32 bits
uint64_t y = t >> 32; // upper 32 bits
Then the multiplication is arranged as follows:
|----- 32 -----|----- 32 -----|----- 32 -----|----- 32 -----|
                              |       x * 0xcccccccd        |
               |       x * 0xcccccccc        |
               |       y * 0xcccccccd        |
|       y * 0xcccccccc        |
We reduce the 4 multiplications to 2 (using 0xcccccccd = 0xcccccccc + 1, so x * 0xcccccccd = x * 0xcccccccc + x and y * 0xcccccccd = y * 0xcccccccc + y, and the two remaining products of y are the same value used twice) and reorder a bit:
|----- 32 -----|----- 32 -----|----- 32 -----|----- 32 -----|
                              |       x * 0xcccccccc        |
                                             |      x       |
               |       x * 0xcccccccc        |
                              |      y       |
               |       y * 0xcccccccc        |
|       y * 0xcccccccc        |
Computation starts with the two multiplications:
uint64_t xc = x * 0xccccccccull;
uint64_t yc = y * 0x33333333ull; // this is (y * 0xccccccccull) >> 2
The result is set up with uint64_t r = (xc + x) >> 32;
then updated with r += xc + y;
then shifted 2 bits to the right: r >>= 2;
and the already 2-bit right-shifted yc is added: r += yc;
then r is shifted to the right by 32: r >>= 32;
and yc is added once more: r += yc;
The total amount of right-shift is 66. The early right-shift by 2 should prevent an overflow of r.
I will test overflow (in case I made a mistake with the above logic) with some high values of x and y later. If that gives correct results then the algorithm should be o.k..
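In the meantime, here is a quick spot check of the worst case x = y = 0xffffffff (a plain-C sketch, not the board code and not the exhaustive test mentioned above), which prints the largest intermediates; with yc pre-shifted and the early >> 2 they stay below 2^64, whereas adding the unshifted y * 0xcccccccc at that point would not fit:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 0xffffffffull, y = 0xffffffffull; // worst case for both 32-bit halves
    uint64_t xc = x * 0xccccccccull;               // max 0xcccccccb33333334
    uint64_t yc = y * 0x33333333ull;               // (y * 0xcccccccc) >> 2, computed without overflow
    uint64_t r = (xc + x) >> 32;                   // xc + x fits: max 0xcccccccc33333333
    r += xc + y;                                   // max 0xccccccccffffffff, still fits in 64 bits
    printf("r before >>2 : 0x%016llx\n", (unsigned long long)r);
    // adding the unshifted y * 0xcccccccc here instead would exceed 2^64
    r >>= 2;
    r += yc;
    printf("r before >>32: 0x%016llx\n", (unsigned long long)r);
    r >>= 32;
    r += yc;
    printf("result: 0x%016llx, expected 0xffffffffffffffff / 5 = 0x3333333333333333\n",
        (unsigned long long)r);
    return 0;
}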
It happened again:
Abs. Time | time() | time() diff | then | now | ticks() diff
-- | -- | -- | -- | -- | --
[07:27:36] | 1074 | | 1073741810 | 7 | 21
[07:45:30] | 2149 | 1075 | 1073741810 | 7 | 21
[08:03:24] | 3223 | 1074 | 1073741822 | 19 | 21
[08:21:17] | 4298 | 1075 | 1073741821 | 17 | 20
[08:39:11] | 5373 | 1075 | 1073741820 | 17 | 21
[08:57:05] | 6447 | 1074 | 1073741809 | 6 | 21
[09:14:58] | 7522 | 1075 | 1073741817 | 14 | 21
[09:16:46] | 7630 | 108 | 214748362 | 107374201 | 966367663
[09:32:52] | 8597 | 967 | 1073741820 | 17 | 21
Interestingly the now values differ only by 5 and the time diff values are identical.
Can we see from these values what went into the division? Or: are the then or now values the result of the division (then we would have only a few candidates for the dividend)?
Can you post your script?
The now value is the result of the division, and it adds up to 7 * 2^30 for the seven cycles before. The time 7650 is exactly 7 1/8 steps.
There is a general timing problem in this line:
uint64_t t = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3; // divide by 8
Since the register increments fast, it will happen that ticks_total and the value taken from the register do not match. This will lead to a drastic error when at that moment the register rolls over from its minimum value to 0xffffffff. I wonder if that may be a problem for each 40-count segment.
edit: I just started the divide test on a w600. Will take a day (at least).
That is something I was curious about too. So the ticks value t would then be about 0xffffffff too small. And about 0xffffffff//40 = 107374182 would be the deviation. Which you observed these two times quite precisely!
To prove this, an older binary running your test should exhibit the same flaw.
From the above I still can only guess what kind of test you ran.
uint64_t mp_hal_ticks_us64(void) {
    uint64_t tt = ticks_total;
    uint64_t t = (tt + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3; // divide by 8
    // we do now a fast division by 5 by performing (t * 0xcccccccccccccccdull) >> 66 in 32-bit parts
    uint64_t x = t & 0xffffffffull; // lower 32 bits
    uint64_t y = t >> 32;           // higher 32 bits of t
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    if (ticks_total > tt) {
        r += 107374182; // assuming 40MHz tick: 0xffffffff//40 = 107374182
    }
    return r;
}
would be a weak remedy that assumes the ticks interrupt was performed during the division process. One should also query the ticks interrupt pending bit somewhere. If we can be sure that the ticks (WDT) interrupt also interrupts this calculation, then this should be fine though.
I tested the division algorithm with very high values in x and y and it is correct (no missing carry / overflow).
#include <stdint.h>
#include <stdio.h>

uint64_t div5_new(uint64_t t) {
    uint64_t x = t & 0xffffffffull; // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}

int main(void) {
    int k = 0;
    for (uint64_t x = 4294867295ull; x < 4294967296ull; x++)       // from ~100_000 below 2^32 to max
        for (uint64_t y = 4294867295ull; y < 4294967296ull; y++) { // from ~100_000 below 2^32 to max
            uint64_t i = y << 32 | x;
            if (i / 5 != div5_new(i)) {
                printf("%3llu: %3llu %3llu\n",
                    (long long unsigned)i, (long long unsigned)i / 5, (long long unsigned)div5_new(i));
                k += 1;
                if (k > 10)
                    break;
            }
        }
    return 0;
}
And about 0xffffffff//40 = 107374182 would be the deviation. Which you observed these two times quite precisely!
Then probably this is the reason. There is a common method to avoid it, and I implemented that already for the SAMD and MIMXRT ports, but it further slows down the code. I'll try it.
Searching for that code I find that in the SAMD code this function is called mp_hal_ticks_us_64(), while here it is mp_hal_ticks_us64().
One should also query the ticks interrupt pending bit somewhere.
In SAMD you did exactly that.
Meanwhile I am exploring different methods for distance measurement with the HC-SR04 a bit, to evaluate our new timestamp.
In SAMD you did exactly that.
I tried that here, but it does not seem to work. It only extends the time for ticks_us() from 1µs to 3µs. Disabling IRQs with the provided functions is slow. The WDT allows querying the interrupt pending bit, but it does not allow disabling its IRQ temporarily. And with the RTOS, simply disabling all interrupts does not seem to work either. Maybe the WDT is simply not affected by enable/disable, which would be reasonable.
I just started the divide test on a w600. Will take a day (at least).
That test looks good, showing no problems with the divide mechanism compiled for a W600.
If we are able to distinguish between a pending/occurring WDT update interrupt and a previous one, then we can use that.
Given that the WDT update interrupt occurs every 107 seconds, the seconds value (of hour, minute, second) of the RTC could be stored in a variable during this interrupt. Immediately after the WDT counter reload (which we recognize by the low register count), the RTC seconds freshly read by ticks_us() should either be equal to or just after the stored value, indicating that the WDT update interrupt has already happened, or not, indicating that the interrupt is still about to occur (or occurs during the ticks_us() computation). The latter may lead to an error, but it can be checked by reading the WDT-stored RTC seconds value twice, at the entry of the ticks_us() computation and at its exit. If they differ, the ticks_total value has to be read again.
This only makes sense if the RTC seconds value can be read fast.
I noted: the function tls_watchdog_stop_cal_elapsed_time() is already wrongly programmed in Platform/Drivers/watchdog/wm_watchdog.c.
As for the last idea:
This only makes sense if the RTC seconds value can be read fast.
In Platform/Drivers/rtc/wm_rtc.c we have:
void tls_get_rtc(struct tm *tblock)
{
    int ctrl1 = 0;
    int ctrl2 = 0;
    ctrl1 = tls_reg_read32(HR_PMU_RTC_CTRL1);
    ctrl2 = tls_reg_read32(HR_PMU_RTC_CTRL2);
    tblock->tm_year = ((int)((int)ctrl2 & 0x00007f00) >> 8);
    tblock->tm_mon = (ctrl2 & 0x0000000f);
    tblock->tm_mday = (ctrl1 & 0x1f000000) >> 24;
    tblock->tm_hour = (ctrl1 & 0x001f0000) >> 16;
    tblock->tm_min = (ctrl1 & 0x00003f00) >> 8;
    tblock->tm_sec = ctrl1 & 0x0000003f;
}
So to get the seconds it would suffice to do something like
int wdt_int_rtc_secs = tls_reg_read32(HR_PMU_RTC_CTRL1) & 0x0000003f;
within the WDT update interrupt, storing into the global variable wdt_int_rtc_secs.
This is fast. Then this may be done in mp_hal_ticks_us64() too for comparison, and we check the state of the WDT update interrupt as just outlined.
What do you think? I may make a code suggestion for that later.
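A rough sketch of the interrupt-side part of that idea (untested; WDT_INT_RTC_SECS and the handler name wdg_irq_handler are placeholders rather than actual identifiers of the port, and the ticks_total bookkeeping line is assumed):

static volatile uint32_t WDT_INT_RTC_SECS; // RTC seconds captured in the WDT update interrupt

void wdg_irq_handler(void) {               // placeholder name for the existing WDT IRQ handler
    ticks_total += ticks_reload_value;     // existing bookkeeping of the handler (assumed)
    // capture the RTC seconds, same register and mask as in tls_get_rtc() above
    WDT_INT_RTC_SECS = tls_reg_read32(HR_PMU_RTC_CTRL1) & 0x0000003f;
    // ... clear the WDG interrupt flag etc. as the existing handler already does ...
}

mp_hal_ticks_us64() would then compare a fresh read of those RTC seconds against WDT_INT_RTC_SECS right after a counter reload, as outlined above.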
What do you think? I may make a code suggestion for that later.
I don't get the mechanism you suggest yet. Meanwhile, I tried another variant of the IRQ disable/enable approach. It did not work either. More and more I'm leaning towards using a timer for ticks_us(), separating timer and watchdog for simplicity and clarity. Your work on div5() is great, but using a 40MHz timer with an 80MHz clock is hard to deal with. Even with the timer we have to deal with the synchronous update of µs-counter and overflow counter. But that's less critical. The timer interrupt should behave more reasonably, and there would be more time to spare.
The timer cannot be used for that purpose, because the actual timer counter value cannot be read back. Too sad.
I'm trying something along these lines:
uint64_t t1 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
uint64_t t2 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
uint64_t t;
if ((t1 >> 16) != (t2 >> 16)) {
    t = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE))) >> 3; // divide by 8
} else {
    t = t1 >> 3;
}
The problem can only happen if during reading the value of t an overflow happens, so the value of t is read twice. If the two reads differ by a large amount, then the timer overflow happened during one of them, and a new read is done which is not subject to an overflow. If they are similar, then either of them is good.
Your logic seems correct.
The problem can only happen if during reading the value of t an overflow happens.
Agreed, up to the consideration below. The above code might be slightly corrected, because ((t1 >> 16) != (t2 >> 16)) may trigger even if t1 and t2 are very close to each other. OTOH the cost of reading a third time is not high compared to the division!
Certainly I was thinking too complicated, but I assumed that after the HR_WDG_CUR_VALUE reload the mp_hal_ticks_us64() function might still read a not-yet-updated ticks_total value, and that a second reading of it might still happen without the WDT interrupt having taken effect.
If that cannot happen, as the WDT interrupt is immediate, then your logic seems correct.
I did (paranoid, it seems):
uint64_t mp_hal_ticks_us64(void) {
    uint64_t tt = ticks_total;
    uint32_t rr = tls_reg_read32(HR_WDG_CUR_VALUE);
    if (rr > 0xfffdffff) { // < 2.5 ms after WDT ticks reload, perhaps tt still not updated: check
        int abs_sec_dif = ABS((tls_reg_read32(HR_PMU_RTC_CTRL1) & 0x0000003f) - WDT_INT_RTC_SECS);
        if (abs_sec_dif == 0 || abs_sec_dif == 1 || abs_sec_dif == 59) { // update has happened
            tt = ticks_total; // it might have happened just now, so take the new value
        } else {
            tt++; // it had not happened yet when we took the first ticks_total, so correct that value
        }
    }
    uint64_t t = (tt + (ticks_reload_value - rr)) >> 3; // divide by 8
    // we do now a fast division by 5 by performing (t * 0xcccccccccccccccdull) >> 66 in 32-bit parts
    uint64_t x = t & 0xffffffffull; // lower 32 bits
    uint64_t y = t >> 32;           // higher 32 bits of t
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}
assuming an ABS() macro for the absolute value is available. The WDT interrupt would store its seconds value in the global variable WDT_INT_RTC_SECS (to be added to its interrupt handler).
My logic may be correct, but the code optimizer has a different attitude. It eliminates the second call to tls_reg_read32(), and so the approach fails. Edit: Placing the register reading into a separate non-optimized function ensures that the register is read twice. But still the approach may fail, if the WDT interrupt is delayed.
How about
volatile uint64_t t1 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
volatile uint64_t t2 = (ticks_total + (ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE)));
reference here.
But still the approach may fail, if the WDT interrupt is delayed.
I found here (and happened to turn it into a Markdown table):
Processors | Cycles with zero wait state memory
-- | --
Cortex-M0 | 16
Cortex-M0+ | 15
Cortex-M3 | 12
Cortex-M4 | 12
Table 1: Interrupt latency of Cortex-M processors with zero wait state memory systems
So if the priority of the WDT interrupt is high enough, your approach might almost work. Putting the readings in extra function calls might help to guarantee that the readings are > 12 clock cycles apart. You studied the assembly and can see how many clock cycles are involved?
I would just try that and then do another many-hour test. I would also test, if you send the script.
Using volatile does not work. And yes, I looked into the generated assembly code. I have now placed the reading tls_reg_read32(HR_WDG_CUR_VALUE); into a separate non-optimized function, which creates quite a few instructions:
#pragma GCC push_options
#pragma GCC optimize ("O0")
static uint32_t ticks_read_value(void) {
    return ticks_reload_value - tls_reg_read32(HR_WDG_CUR_VALUE);
}
#pragma GCC pop_options

uint64_t mp_hal_ticks_us64(void) {
    uint32_t t1 = ticks_read_value();
    uint32_t t2 = ticks_read_value();
    uint64_t t;
    if ((t1 >> 16) != (t2 >> 16)) {
        t = (ticks_total + ticks_read_value()) >> 3; // divide by 8
    } else {
        t = (ticks_total + t1) >> 3;
    }
    uint64_t x = t & 0xffffffffull; // lower 32 bits
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}
The Python test script is still:
import time

then = 0
while True:
    now = time.ticks_us()
    if now < then:
        print(time.time(), "roll over", then, now)
    then = now
It generates a message every ~17 minutes with the intended roll-over with a small value as the last number.
I observed the error from last night's run, but then realized I hadn't compiled the new mp_hal_ticks_us64() in before.
No success over night. I had 4 faults. The fault rate is almost the same. Only the time for executing ticks_us() is increased from 1µs to 2µs. So I'll stop with this attempt. How many faults did you see?
In any case there is the code caching problem. If the code for the WDT IRQ handler has to be moved in from flash, the execution can be delayed by like 300µs. I could cover that by moving that function to IRAM, like I did with the bitstream function. Technically it works, but code execution from RAM is three times slower than from flash. For the WDT IRQ, that might not matter.
Such a nasty little problem!
How many faults did you see?
Two, but that was still from the old version of mp_hal_ticks_us64(), as I had forgotten to make the changes and recompile! :-(
In any case there is the code caching problem. If the code for the WDT IRQ handler has to be moved in from flash, the execution can be delayed by like 300µs.
Then my assumption of the worst case WDT IRQ latency was probably the right approach.
My above solution with the RTC seconds then should work.
There is an
if (abs_sec_dif == 0 || abs_sec_dif == 1 || abs_sec_dif == 59) // update has happened
where I hope it correctly represents all cases in which the stored seconds value WDT_INT_RTC_SECS differs by at most 1 from the just-read value.
Storage of the seconds into the WDT_INT_RTC_SECS global variable would still have to be worked into the WDT IRQ code, and the availability of ABS() for the absolute value would have to be checked.
B.t.w.: In your code it must not be tt++ but tt += ticks_reload_value.
I started a new series with the WDG IRQ in RAM.
Edit: No improvement.
In your code it must not be tt++ but tt += ticks_reload_value.
Yes.
I just made a strange observation. I wanted to test whether, in my logic comparing the two samples, they are ever different. I had expected this to happen rarely, but no: it happens frequently, with times as short as 3 ms between two events. Still, the returned µs values look fine. This is the code for testing:
if ((t1 >> 16) != (t2 >> 16)) {
    *(uint32_t *)0x40010c00 |= 2;
    t = get_wdg_counter_value() >> 3; // divide by 8
    *(uint32_t *)0x40010c00 &= 0xfffd;
} else {
    t = t1 >> 3;
}
And this is the screenshot. The yellow trace shows the events, together with the -Width statistics.
Another strange effect: just calling the function below to read a 64-bit value for t usually takes 750ns, which is already slow, but sometimes up to 11 µs.
uint64_t get_wdg_counter_value(void) {
    return wdg_ticks_total + (uint64_t)(wdg_ticks_reload_value - *(TLS_REG *)HR_WDG_CUR_VALUE);
}
There are more sources of interrupts, it seems.
Like once a second the reading loop of ticks_us() stops for about 250µs. And like every 100ms for about 150µs.
Are you aware that if ((t1 >> 16) != (t2 >> 16))
may become true whenever t1 and t2 cross the 16 bit boundary one after the other? This condition does not mean that t1 and t2 have to be > 1<<16 ticks apart!
This might happen every (1<<16) / 40_000_000 = 0.0016384 s, i.e. every 1.6 milliseconds.
So in the Siglent screenshot with timebase 150 ms, there are about 90 chances that this happens.
I count 29 such events. Many of the events should be in a grid of these 1.638 ms spacings.
And recently the fact that we are running under FreeRTOS came to my mind. Is this preemptive multitasking? Realized by interrupts? I found this, but have not studied it yet.
Are you aware that if ((t1 >> 16) != (t2 >> 16)) may become true whenever t1 and t2 cross the 16 bit boundary one after the other?
That matches later states, where the minimum time distance was like 1.9 ms. That means that a third value is taken too often. Nevertheless, a board has been running the test for about 16 hours now without any unexpected roll-over. So it looks as if the mechanism works, at least for a silent board.
Recently the fact that we are running under FreeRTOS came to my mind. Is this preemptive multitasking? Realized by interrupts?
Yes.
Instead of if ((t1 >> 16) != (t2 >> 16)) I could write the test as e.g. if ((t2 - t1) > 400). That still has the chance of an unneeded extra read, but at a much lower frequency. And code-wise it's equivalent. Most of the time getting t1 and t2 plus the test takes like 1.5µs, sometimes 6.2µs. The whole call to mp_hal_ticks_us64 takes 2µs, sometimes 6.8µs.
That's with the actual reading of the WDG timer and the WDT IRQ function in RAM.
Reading and comparing just the lower 32 bits of t1 and t2, the execution time can be reduced somewhat to 1.7µs, with 6.5µs rarely, 10.7µs even more rarely, and besides that 249µs.
The 249µs events seem to be caused by code cache misses.
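For illustration, a sketch of how those two changes could be combined (untested, and not necessarily the simplified version that went into the repository; the threshold of 400 counts, i.e. 10 µs at 40 MHz, is an arbitrary choice here, and ticks_read_value() is the non-optimized helper from the listing above):

uint64_t mp_hal_ticks_us64(void) {
    uint32_t t1 = ticks_read_value();
    uint32_t t2 = ticks_read_value();
    uint64_t t;
    if ((t2 - t1) > 400) {                           // reads far apart: a reload or IRQ hit in between
        t = (ticks_total + ticks_read_value()) >> 3; // take a fresh, consistent reading
    } else {
        t = (ticks_total + t1) >> 3;
    }
    // fast division by 5 as before ((t * 0xcccccccccccccccd) >> 66 in 32-bit parts)
    uint64_t x = t & 0xffffffffull;
    uint64_t y = t >> 32;
    uint64_t xc = x * 0xccccccccull;
    uint64_t yc = y * 0x33333333ull;
    uint64_t r = (xc + x) >> 32;
    r += xc + y;
    r >>= 2;
    r += yc;
    r >>= 32;
    r += yc;
    return r;
}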
Nevertheless, a board has been running the test for about 16 hours now without any unexpected roll-over. So it looks as if the mechanism works, at least for a silent board.
What code do you refer to here?
The last one pushed to the repository. That's the one under long term test. I simplified it a little bit now, saving a few instructions. But the method is the same.
So it's the combination of moving the WDT IRQ to IRAM and the if ((t1 >> 16) != (t2 >> 16)) check! Then let's hope that it runs through.
The IRAM: I would never have been able to guess that as a remedy. I thought this was only a detail of the esp8266.
The IRAM: I would never have been able to guess that as a remedy. I thought this was only a detail of the esp8266.
Moving critical code to RAM is quite common in MicroPython. Some MCUs have special RAM sections for it, like the IRAM of the ESP8266, or the ITC RAM of the MIMXRT, which is configured to be at least 64k and holds code like the gc or the MP runtime engine.
I have now made some measurements with ultrasonic distance sensor HC-SR04:
The sensor is aimed at a book standing about 22 cm away. You have to know that the sensor has a cheap Chinese µC on it that polls for the echo signal and sometimes misses the first wave front. This leads to an error of about 25µs with a 40kHz signal (I measured ~22µs). So you get clusters of measured values, like e.g. in this graph (from elsewhere, where the poster did not know the source yet):
This implies that for a useful analysis of the data you have to cluster the echo times first. Which is what I have done.
The measurement uses code adapted from the recent [discussion on GitHub](https://github.com/orgs/micropython/discussions/14155).
The ISR is either:
def echo_isr(self, pin):
    t_us = ticks_us()
    if self.echo():  # Rising edge means start of echo
        self.echo_start = t_us
    else:  # Falling edge means end of echo
        self.echo_end = t_us
or
def echo_isr(self, pin):
    t_us = pin.timestamp()
    if self.echo():  # Rising edge means start of echo
        self.echo_start = t_us
    else:  # Falling edge means end of echo
        self.echo_end = t_us
The results:
############## using ticks_us() ##################
cluster: 0 n: 6 mean: 1249.667 stdev: 1.21106
cluster: 1 n: 328 mean: 1274.125 stdev: 3.020761
cluster: 2 n: 160 mean: 1295.0 stdev: 2.389705
outliers: 6
cluster: 0 n: 6 mean: 1250.333 stdev: 2.160247
cluster: 1 n: 326 mean: 1274.061 stdev: 3.263206
cluster: 2 n: 168 mean: 1295.369 stdev: 2.532564
outliers: 0
cluster: 0 n: 6 mean: 1250.167 stdev: 1.834848
cluster: 1 n: 277 mean: 1274.556 stdev: 2.986925
cluster: 2 n: 191 mean: 1295.728 stdev: 2.998106
cluster: 3 n: 5 mean: 1432.6 stdev: 1.140175
cluster: 4 n: 7 mean: 1442.571 stdev: 0.9759001
outliers: 14
cluster: 0 n: 192 mean: 1274.401 stdev: 3.229539
cluster: 1 n: 285 mean: 1295.572 stdev: 2.741032
cluster: 2 n: 13 mean: 1315.692 stdev: 1.974193
outliers: 10
cluster: 0 n: 5 mean: 1251.2 stdev: 2.04939
cluster: 1 n: 329 mean: 1274.115 stdev: 3.112529
cluster: 2 n: 159 mean: 1295.208 stdev: 2.897134
outliers: 7
cluster: 0 n: 5 mean: 1247.8 stdev: 1.643168
cluster: 1 n: 373 mean: 1274.257 stdev: 2.791728
cluster: 2 n: 114 mean: 1294.842 stdev: 3.501146
cluster: 3 n: 5 mean: 1314.0 stdev: 2.645751
outliers: 3
cluster: 0 n: 6 mean: 1250.0 stdev: 1.67332
cluster: 1 n: 393 mean: 1274.135 stdev: 3.05806
cluster: 2 n: 95 mean: 1295.0 stdev: 3.319831
outliers: 6
cluster: 0 n: 7 mean: 1250.857 stdev: 2.478479
cluster: 1 n: 361 mean: 1274.08 stdev: 3.218281
cluster: 2 n: 132 mean: 1294.924 stdev: 2.977321
outliers: 0
############## using pin.timestamp_us() ##################
cluster: 0 n: 340 mean: 1281.409 stdev: 0.504182
cluster: 1 n: 157 mean: 1302.924 stdev: 0.9166652
outliers: 3
cluster: 0 n: 242 mean: 1281.401 stdev: 0.491082
cluster: 1 n: 258 mean: 1302.717 stdev: 0.4845679
outliers: 0
cluster: 0 n: 353 mean: 1281.564 stdev: 0.7127402
cluster: 1 n: 147 mean: 1302.81 stdev: 0.5146255
outliers: 0
cluster: 0 n: 290 mean: 1281.421 stdev: 0.501471
cluster: 1 n: 207 mean: 1302.831 stdev: 0.6195915
outliers: 3
cluster: 0 n: 436 mean: 1281.381 stdev: 0.5729496
cluster: 1 n: 61 mean: 1303.098 stdev: 1.179052
outliers: 3
cluster: 0 n: 445 mean: 1281.44 stdev: 0.5234835
cluster: 1 n: 52 mean: 1302.846 stdev: 0.7767355
outliers: 3
cluster: 0 n: 479 mean: 1281.39 stdev: 0.4883492
cluster: 1 n: 21 mean: 1303.333 stdev: 1.197219
outliers: 0
cluster: 0 n: 450 mean: 1281.504 stdev: 0.5553782
cluster: 1 n: 49 mean: 1303.122 stdev: 1.092258
outliers: 1
Much reduced variance: the std deviation of the main cluster goes down from ~3µs to ~0.5µs. But also a systematic change of values is observed. This may be due to the IRAM latency of the first (echo onset) ticks_us() call in the original ISR?
Also there are more clusters in the original ISR version, which now clearly appear to be artifacts. So: big success so far.
With a larger number of measurements (which allows the clusters to be recognized), a precision of distance to flat surfaces of < 0.2mm should now be possible.
Of course I will have to repeat the measurements with the latest PRs.
That looks good. So it's about 3 µs between the two clusters. When I now take the time it takes for getting ticks_us(), I have the majority of samples at about 2µs, and a rare set at 6.8 µs. The test sampled ticks_us() every 28 µs. The period of the 6.8µs events was about 100ms. The 11µs event occurred like once. The oscilloscope was set to infinite persistence, so even single events stay visible. There was a 249µs duration once per second, which is not shown. So it might be that with the other software you get a second cluster at about 5 µs distance.
Note: ticks test update: after 27 hours still no fault. I stopped it and loaded the slightly simplified version.
So it's about 3 µs between the two clusters.
I'm afraid we refer to different things. The clusters of echo pin durations (aka ultrasonic wavefront latencies) of the HC-SR04 were 21µs apart, which is an HC-SR04-internal effect of its using about 48kHz for the ultrasonic signal.
But the standard deviation within a single cluster went down from 3 µs to 0.5 µs when using pin.timestamp() instead of ticks_us() in the interrupt handler. The measurements were then like 1281, 1281, 1282, 1281, 1281, 1282, 1282, ... The pin.timestamp() was used twice for the measurement, so the variance of its timing was doubled (variances of independent random errors add up); the standard deviation therefore (as it is the square root of the variance) is a factor 1.4 (sqrt(2)) higher. Reversing the calculation, one may say that the standard deviation of a single pin.timestamp() latency must have been below 0.5µs/sqrt(2) = 0.35 µs on average. Which is very good. But outliers were of course filtered out beforehand in the clustering post-processing.
To be honest, I don't understand your above statements. What I suppose you did: You elicited an interrupt every 28 µs, then used testing code like
*(uint32_t *)0x40010c00 |= 2;
t = get_wdg_counter_value() >> 3; // divide by 8
*(uint32_t *)0x40010c00 &= 0xfffd;
in the C interrupt handler and observed a latency of 1.9 µs? Is that correct?
The test sampled ticks_us() every 28 µs. The period of the 6.8µs was about 100ms. The 11µs event occurred like one.
I don't get it, but would like to :-) Can you elaborate a bit on it?
Way simpler: the test program I use for checking that the call to ticks_us() returns properly increasing numbers calls time.ticks_us() about every 28µs. In mp_hal_ticks_us() I toggle a pin on and off to show the duration of mp_hal_ticks_us64(). That's what the oscilloscope screenshot shows. Most of the time (99.9%) the call takes ~2µs, but like every 100ms it takes about 6.8µs, and like once a second about 250µs. So when using timestamp(), these variations in the time to call ticks_us() could as well show up in your results.
Note: In the new firmware, timestamp() is now a method of the irq object: irq.timestamp(). For uart.irq() it returns None.
Note 2: Of the ~2 µs for mp_hal_ticks_us64, reading the 64 bit counter takes ~840ns each. So calculating the division by 40 takes about 300ns. That's great! It consists of only 18 assembly instructions.
Thank you. Got it now.
So when using timestamp(), these variations in the time to call ticks_us() could as well show up in your results.
Probably. They were filtered out as outliers. I can print the outliers too in the next test runs.
I just studied the commit. And saw you made some more changes.
It's still called timestamp_us(), isn't it? (I used timestamp() because I hadn't updated the firmware yet.)
How do I call it? I suppose with p.timestamp_us() in the ISR, when the ISR is defined as def isr(p):? Is that correct?
That's clever, as I also had a bad feeling about it being a general pin method. That way another routine could consume the timestamp wrongly. Which is now not possible anymore.
I think the timestamp is a generally very valuable addition.
How to proceed?
Trying timestamp with esp32 and rp2 here and testing with the HC-SR04 (which I would do)? I expect the same improvements.
Then discussing it and, after getting feedback, doing general PRs?
So much work for you, but a big part of it is done already. Although I wouldn't be surprised if the same issues with ticks_us() occurred on these platforms (at least the esp ones).
Note 2: Of the ~2 µs for mp_hal_ticks_us64, reading the 64 bit counter takes ~840ns each. So calculating the division by 40 takes about 300ns. That's great! It consists of only 18 assembly instructions.
I take that as a mention of honour from a great programmer!
I noted that all the firmware is optimized with -Os throughout. I wonder if gcc would optimize the division likewise if -O2 was selected for these functions.
How do I call it? I suppose with p.timestamp_us() in the ISR, when the ISR is defined as def isr(p):? Is that correct?
In def isr(p):, p is the pin object. You get the timestamp with p.irq().timestamp_us(). Just calling p.irq() returns the irq object of the pin.
When moving the function get_wdg_counter_value() back from RAM to flash, the time for calling it is reduced to about 640ns, and the total time for calling mp_hal_ticks_us64 drops from ~2µs to about 1.6µs. I'll let that run for a while. If that is stable as well, I'll keep it.
We continue here the discussions Compilation of w600 port and Compilation of w600 port II (closed now) on development/debugging of the w60x port in the MP branch w60x.