ESP32 micros() bug - Githubissues

peufeu2 commented 2 years ago

On ESP32, micros() takes about 150 cycles.

I'm using 80MHz CPU clock to avoid wasting power,

delayMicroseconds() calls micros() at least twice. This means delayMicroseconds() takes at least 330 cycles, corresponding to a delay of more than 4µs... even if the requested delay was 1µs or 2µs or 3µs. For longer delays, the actual delay will be a bit longer than what was requested. At 80MHz, delayMicroseconds() is about 2µs longer than requested.

In other words, delayMicroseconds() does not work, which is a problem for bit-banging a timing-critical protocol such as onewire... and I found this out because it was causing an impressive number of read errors on a rather long string of DS18B20 (about one error in 100 reads). I'm not using OneWireNg, I'm using Paul Stoffregen's OneWire library, modified to use an active pullup to drive my cables. So this may or may not apply to OneWireNg, but since your library also uses delayMicroseconds(), I thought you might want to know about this.

I switched to a more accurate delayMicroseconds() implementation based on a timer, and the number of read errors dropped to zero.

pstolarz commented 2 years ago

There are several notes here:

Current (2.0.2) implementation of delayMicroseconds() for all types of ESPs (classic, S2, C3) also uses internal timer to track timings in a similar way as in your variant (cores/esp32/esp32-hal-misc.c):

void ARDUINO_ISR_ATTR delayMicroseconds(uint32_t us)
{
uint64_t m = (uint64_t)esp_timer_get_time();
if(us){
    uint64_t e = (m + us);
    if(m > e){ //overflow
        while((uint64_t)esp_timer_get_time() > e){
            NOP();
        }
    }
    while((uint64_t)esp_timer_get_time() < e){
        NOP();
    }
}
}

so, I'm not sure this issue is relevant anymore

EDIT: I've probably got your point here. You replaced esp_timer_get_time() by fastmicros() into separate, IRAMed routine with direct timers registers implementation, reaching better time accuracy. You may probably get even better results by making fastmicros() inlined and storing vars in HW regs.

I'm aware delayMicroseconds() is not a perfect routine especially for small delays. I observed 1-2us discrepancies also for other platforms. But the good thing here is, the 1-wire protocol is quite tolerant in regards of timings and usually such difference is not a big deal. The only exception is the over-drive (OD) mode, which is roughly about 10 times faster than the standard mode. In this mode perfect timings is required to communicate with over-driven slaves. This is a reason I added touch1Overdrive() as separate method for OneWireNg_BitBang class to handle touch-1 activity in OD mode which is most timing critical activity on the 1-wire bus. So (as a conclusion) I haven't got feedbacks related to 1-wire timing inaccuracies causing invalid reads for OneWireNg library, while (as far as I know) the lib is used by its users on many different (sometimes sophisticated) setups.
I'm not a big fan of fixing problems of external libs (as ESP cores) in dependent libs like this one. Sometimes it's unavoidable but in this case it's not - see the explanation above. I'd prefer to have good, working external API I'm using in my lib, rather than patching the API in my lib, since for some reason it doesn't work as expected.

uzi18 commented 2 years ago

@pstolarz according to https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/esp_timer.html#_CPPv418esp_timer_get_timev return type is int64

Discussion about delays on esp32 is here: https://github.com/espressif/arduino-esp32/issues/1335

pstolarz commented 2 years ago

@uzi18 Yes, I know this thread. But I've also got a feedback the problem described there didn't appear on OneWireNg. See: https://github.com/SignalK/SensESP/issues/377 https://github.com/SignalK/SensESP/pull/399

uzi18 commented 2 years ago

@pstolarz you can attach here logic analyser files/screens that shows condition of this lib.

peufeu2 commented 2 years ago

I'm aware delayMicroseconds() is not a perfect routine especially for small delays. I observed 1-2us discrepancies also for other platforms.

I got problems because I run ESP32 at 80MHz instead of the usual 240, and at this frequency the slow 64 bit division in esp_timer_get_time() means delayMicroseconds() has a minimum delay of about 4µs. Since the time window to acquire signals on reads is about 15µs, that caused trouble. At 240MHz it would probably work fine.

pstolarz commented 2 years ago

The problem is not with the delay nor frequency. 80MHz is quite high freq in the MCUs world and the 15us time window is not an issue either. For example in over-driven touch-1 activity the window is 2us max from starting pulse to bus sampling and this lib works quite stable in this condition on 8-bit 16MHz AVR platform.

The issue I see is the OneWire lib itself, especially ESP32's low level GPIO handling (mainly GPIO modes switching).

peufeu2 commented 2 years ago

The issue is the delay and the frequency, because the default implementation of delayMicroseconds() on ESP32 takes at least 330 cycles. So if you run the cpu at 240MHz, it's not that noticeable, if you request 2µs delay you will get quite close to 2µs. If you run it at a lower clock like 80MHz, then if you request 2µs you get a bit more than 4µs and the timings are wrong. So you get this weird situation where onewire works at 240MHz CPU clock, but not at 80MHz...

The low level GPIO code is also quite bad.

pstolarz commented 2 years ago

The following screens present write-1 activity on 240MHz, 80MHz, 26MHz, 13MHz (all frequencies work w/o errors observed).

Prerequisites:

DallasTemperature example. Single DS18B20 sensor.
Standard delayMicroseconds() to work with microsec delays.
Board: ESP32 dev-kit.

Standard mode write-1 bit-banged as:

Low state: 5us, observed state on the LA varies from 5.5us to 6.5us regardless of frequency
Bus sampling at 8us staring from low-to-high slope (can't be observed on LA)
Trailing high state after sampling 56us (not-protected). 8+56 = 64us high min. Observed on the LA varies from 72us to 74.5us

F_CPU: 240MHz f_240_s

F_CPU: 80MHz f_80_s

F_CPU: 26MHz (26MHz XTAL) f_26_s

F_CPU: 13MHz (26MHz XTAL) f_13_s

Conclusions

As stated above - delayMicroseconds() is not perfect but for 1-wire bit-banging is enough (at least in the standard mode).
What you saying regarding 330 cycles simply doesn't make sense. In case of 13MHz, 330 cycles lasts 25us.

EDIT: I haven't checked if the software configured frequencies were currently set on the CPU. I don't have an ability to do this. But the purpose of the examination was other - regardless of the freq set (240 or 80MHz) the delays work fine in regard of 1-wire bit-banging.

peufeu2 commented 2 years ago

Thanks! Now, that's really interesting. This ESP32 has been logging plenty of stuff, including onewire read errors, so I knew something was wrong, but didn't know why. You've given me the motivation to look deeper into this, and I found what the problem was.

Anyway. I'll start from the end, with the version that works.

I used the profiler I just wrote for something else to profile the onewire read function. It makes histograms of time intervals, which means I can also plot the histogram of level transition times on the onewire bus, kinda using the ESP32 as a logic analyzer.

Next plot is a read bit. At t=0, it pulls the wire low, wait for 9µs, then releases the pin and lets the pullup do its job. Either the wire will be pulled up, or the DS18B20 will pull it down and then release it. So it counts the number of CPU cycles from t=0 to the time when the wire rises to logic level 1 and makes a histogram with that.

The hump on the left is the 0-to-1 transition time when DS18B20 wants to respond with 1. It doesn't pull the wire down, so the wire rises immediately when the ESP releases it, so we have a cluster of transition around 10µs. The hump on the right is when DS18B20 holds the wire low because it's responding with a 0, so we have a cluster of transitions when it lets go around 30µs.

onewire transition delay multiDelay

That corresponds to datasheet timings perfectly. I put the sampling point at t=18µs, and it did about a million reads since that with zero errors.

Next plot is with delayMicroseconds(). CPU runs at 80MHz. It drives the wire low at t=0, then delayMicroseconds(6), releases the wire, delayMicroseconds(6), and samples. The "0-1 transition read bit 1" looks to be after sampling, because it's not measured during the delay.

This works... barely.

The reason I was getting lots of errors was because I was using active pull-up, which needs another call to delayMicroseconds().

That's what the next plot shows: it drives the wire low at t=0, then delayMicroseconds(6), then activates the pull-up, delayMicroseconds(2), then releases the wire, delayMicroseconds(4), and samples. And because delayMicroseconds takes so long and is so inaccurate, having three calls was enough to sometimes pull the sampling point into the 0-1 transition that occurs around 30µs, thus reading the bit incorrectly.

So now I know why I had to use off-spec timings, with a sampling interval that looked too short: that's because delayMicroseconds() sometimes takes way too long, so I was compensating for that. That also explains why you're not getting many errors and I was: with only two calls to delayMicroseconds(), the timing error is not high enough to produce lots of errors.

So I removed the active pull-up. I had used it because I thought the errors came from signal integrity issues, which was a mistake, because the errors in fact came from inaccurate delayMicroseconds(), so it didn't actually help at all! It works fine without the active pullup, and with the new accurate delay function, timing margins are huge.

That doesn't explain why it works at 13MHz on your hardware though. I tried using a lower frequency than 80MHz, but my ESP triggered the watchdog, so I didn't investigate further. I suspect you're not really getting 13MHz. You can check by printing the difference between two calls to xthal_get_ccount() which counts cpu cycles, over a 1 second delay for example.

pstolarz commented 2 years ago

Thanks for the info and I'm glad I could help (at least as a motivator to look at the issue once again). If time allows I'll try to look at this 13MHz case and check its accuracy (I'm a little suspicious if it's really set to this freq).

pstolarz commented 2 years ago

Indeed there was an issue with frequiencies.

With the following test code:

unsigned esp_int_level[portNUM_PROCESSORS];
#define timeCriticalEnter() \
    esp_int_level[xPortGetCoreID()] = portSET_INTERRUPT_MASK_FROM_ISR()
#define timeCriticalExit() \
    portCLEAR_INTERRUPT_MASK_FROM_ISR(esp_int_level[xPortGetCoreID()])

void setup() {
    unsigned c1, c2;
    unsigned long m1, m2;

    Serial.begin(115200);

    timeCriticalEnter();

    m1 = micros();
    c1 = xthal_get_ccount();

    delayMicroseconds(1000);

    c2 = xthal_get_ccount();
    m2 = micros();

    timeCriticalExit();

    Serial.println(c2 - c1);
    Serial.println(m2 - m1);
}

void loop() {
    delay(1000);
}

I'm confirming that: 240, 80, 40 MHz F_CPU are set as configured.

26, 13 MHz and 20, 10 MHz are a little bit weird: I'm able to set 26 and 13 MHz OR 20, 10 MHz but not both of these pairs. In other words, when I set 26 MHz I'm also able to set 13 but not 20 nor 10 (in this case they are both set to 160MHz event though configured differently). And vice versa - when I set 20MHz I'm also able to set 10 but not 26 nor 13.

This explains the result from LA provided earlier. 13 MHz was in fact measured on 160 MHz.

Regarding to delayMicroseconds() issue. I agree the routine is burden with the inaccuracy issue. My lib is able to work up to 40MHz w/o issues with delays using this routine.

When I change the delay routine to:

inline void delayMicroseconds(unsigned us) {
    for (register unsigned stop = xthal_get_ccount() + us*(F_CPU / 1000000);
        xthal_get_ccount() < stop; );
}

I'm able to work on all freqs up to 10MHz. But I'm not happy with timings observed on such low freq. In this case the CPU can perform max 10 cycles per us, therefore the timings are disturbed by func calls, returns, etc. In such case only optimized assembler code, with minimal overhead could provide satisfactory results. So the decent results end at 26-20 MHz in this case. I think it's quite acceptable though.

peufeu2 commented 2 years ago

I think ESP32 PLL can only work with a limited set of division/multiplication ratios so, depending on the frequency of the oscillator on your board, some cpu clock frequencies will fit the allowed dividers and some won't. That would explain why you're not getting the frequency you asked for.

I made a delay class using the cpu cycle counter too (link). It grabs the cycle counter at the beginning of the function when the wire is driven low, then all delays are referred to this initial value (link to code). So I didn't call it "delay", I called it "waitUntil" since it waits until a duration has elapsed from this first reference time.

That avoids accumulating the error in the first delay, plus the time to execute other instructions in the function, so the sampling time on reads is referred to the time when the wire goes low, which is what the sensor at the other end of the wire cares about.

The function should be IRAM_ATTR, because if it executes from flash you will get random delays.

Also I had trouble with the interrupt disabling code above, but this one worked:

#define timeCriticalEnter() {portMUX_TYPE mux = portMUX_INITIALIZER_UNLOCKED;portENTER_CRITICAL(&mux);
#define timeCriticalExit() portEXIT_CRITICAL(&mux);}

pstolarz commented 2 years ago

I think ESP32 PLL can only work with a limited set of division/multiplication ratios so, depending on the frequency of the oscillator on your board, some cpu clock frequencies will fit the allowed dividers and some won't. That would explain why you're not getting the frequency you asked for.

20, 10 MHz (40 MHz XTAL) use other oscillator that 26, 13 MHz (26 MHz XTAL). Interesting thing is, it looks like I'm able to use one of them only, and the choice is random. Sometimes works the 1st one, sometimes 2nd.

That avoids accumulating the error in the first delay, plus the time to execute other instructions in the function, so the sampling time on reads is referred to the time when the wire goes low, which is what the sensor at the other end of the wire cares about.

I may not be a bad idea. In my case I always use delay with combination on setBus() routine. Combining them together to have setBusWithDelay() for example could solve the problem with virt. function calls via vtables which I use in the lib. The only problem is the lib is multiplatform and I need to solve the problem globally. I don't have a comfort to work with ESP32 only.

pstolarz commented 2 years ago

Also I had trouble with the interrupt disabling code above, but this one worked:

Interesting... portENTER_CRITICAL() and portEXIT_CRITICAL() don't disable interrupts. These are spinlocks used mainly in the SMP environment.

portSET_INTERRUPT_MASK_FROM_ISR() and portCLEAR_INTERRUPT_MASK_FROM_ISR() disable interrupts on local core therefore shall be more effective. 1 core handling bit-banging may be locked at time-critical activity while other still working normally (e.g. hanling WiFi messages) wit its interrupts enabled. I found these routines working in OD mode when if disabled the lib doesn't fulfill time requirements.

pstolarz / OneWireNg

ESP32 micros() bug #38