meshtastic / firmware

Meshtastic device firmware
https://meshtastic.org
GNU General Public License v3.0
3.3k stars 799 forks source link

[Bug]: Powerloss during early boot phase kills/resets device configuration on ESP32 devices #4555

Open dm5tt opened 2 weeks ago

dm5tt commented 2 weeks ago

Category

Other

Hardware

Other

Firmware Version

v2.5.0.ab7de7f and 2.4.x

Description

Device under test

Heltect HT-CT62 but other ESP32 device seem to be affected too. See Meshtastic Discourse.

I only modified the pin configuration a bit - no other functional changes.

Affected versions

2.4.x and v2.5.0.ab7de7f

People wrote that that last working version was 2.3.15. Didn't test this yet.

How to reproduce

  1. Set a configuration (Router, LoRa, etc.)
  2. Restart fully operational device by switching the device off/on
  3. Toggle power again during early init phase (power off/on)

The device resets itself to a default configuration after ~2-3 attempts.

Is this bug relevant

Yes. When using solar driven stations we quite often can run into a brown out situation when the battery isn't fully yet charged and/or sun radiation isn't yet able to fully power the station.

Relevant log output

ESP-ROM:esp32c3-api1-20210207
Build:Feb  7 2021
rst:0x1 (POWERON),boot:0xc (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fcd5810,len:0x438
load:0x403cc710,len:0x90c
load:0x403ce710,len:0x2624
entry 0x403cc710
E (318) esp_core_dump_flash: No core dump partition found!
E (318) esp_core_dump_flash: No core dump partition found!
[   319][D][esp32-hal-cpu.c:244] setCpuFrequencyMhz(): PLL: 480 / 6 = 80 Mhz, APB: 80000000 Hz
��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
DEBUG | ??:??:?? 0 Filesystem files (16384/1048576 Bytes):
DEBUG | ??:??:?? 0  /prefs/channels.proto (57 Bytes)
DEBUG | ??:??:?? 0  /prefs/config.proto (181 Bytes)
DEBUG | ??:??:?? 0  /prefs/db.proto.tmp (0 Bytes)
DEBUG | ??:??:?? 0  /prefs/module.proto (95 Bytes)
[   423][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=18 scl=19 freq=100000
DEBUG | ??:??:?? 0 Using analog input 1 for battery level
INFO  | ??:??:?? 0 ADCmod: ADC characterization based on Two Point values stored in eFuse
INFO  | ??:??:?? 0 Scanning for i2c devices...
DEBUG | ??:??:?? 0 Scanning for I2C devices on port 1
INFO  | ??:??:?? 0 No I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:53,2.5.0.d420433c
DEBUG | ??:??:?? 0 Total heap: 231364
DEBUG | ??:??:?? 0 Free heap: 199860
DEBUG | ??:??:?? 0 Total PSRAM: 0
DEBUG | ??:??:?? 0 Free PSRAM: 0
DEBUG | ??:??:?? 0 NVS: UsedEntries 137, FreeEntries 493, AllEntries 630, NameSpaces 7
DEBUG | ??:??:?? 0 Setup Preferences in Flash Storage
DEBUG | ??:??:?? 0 Number of Device Reboots: 19
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x105 (ESP_ERR_NOT_FOUND) at 0x4203701b
file: "src/platform/esp32/BleOta.cpp" line 16
func: static const esp_partition_t* BleOta::findEspOtaAppPartition()
expression: esp_ota_get_partition_description(part, &app_desc)
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x102 (ESP_ERR_INVALID_ARG) at 0x4203b043
file: "src/platform/esp32/BleOta.cpp" line 30
func: static String BleOta::getOtaAppVersion()
expression: esp_ota_get_partition_description(part, &app_desc)
DEBUG | ??:??:?? 0 No OTA firmware available
INFO  | ??:??:?? 0 Initializing NodeDB
[   585][E][vfs_api.cpp:105] open(): /littlefs/prefs/db.proto does not exist, no permits for creation
ERROR | ??:??:?? 0 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 0 Devicestate 0 is old, discarding
INFO  | ??:??:?? 0 Performing factory reset!
DEBUG | ??:??:?? 0 Deleting /prefs/channels.proto
DEBUG | ??:??:?? 0 Deleting /prefs/config.proto
DEBUG | ??:??:?? 0 Deleting /prefs/db.proto.tmp
DEBUG | ??:??:?? 0 Deleting /prefs/module.proto
DEBUG | ??:??:?? 0 Removing /prefs
[   736][E][vfs_api.cpp:105] open(): /littlefs/static/rangetest.csv does not exist, no permits for creation
INFO  | ??:??:?? 0 Installing default DeviceState
DEBUG | ??:??:?? 0 Initial packet id 1800775807
DEBUG | ??:??:?? 0 Partially randomized packet id 613883008
DEBUG | ??:??:?? 0 Using nodenum 0xbdc17f48 
INFO  | ??:??:?? 0 Installing default LocalConfig
INFO  | ??:??:?? 0 Setting default channel and radio preferences!
DEBUG | ??:??:?? 0 Expanding short PSK #1
INFO  | ??:??:?? 0 Wanted region 0, using UNSET
INFO  | ??:??:?? 0 Installing default ModuleConfig
INFO  | ??:??:?? 0 Installing default ChannelFile
INFO  | ??:??:?? 0 Saving /prefs/config.proto
[   882][E][vfs_api.cpp:105] open(): /littlefs/prefs/config.proto does not exist, no permits for creation
INFO  | ??:??:?? 0 Saving /prefs/module.proto
[   940][E][vfs_api.cpp:105] open(): /littlefs/prefs/module.proto does not exist, no permits for creation
INFO  | ??:??:?? 0 Saving /prefs/channels.proto
[  1006][E][vfs_api.cpp:105] open(): /littlefs/prefs/channels.proto does not exist, no permits for creation
[  1043][E][vfs_api.cpp:182] remove(): /prefs/db.proto does not exists or is directory
INFO  | ??:??:?? 1 Saving /prefs/db.proto
INFO  | ??:??:?? 1 Loading /prefs/config.proto
INFO  | ??:??:?? 1 Loaded /prefs/config.proto successfully
INFO  | ??:??:?? 1 Loaded saved config version 23
INFO  | ??:??:?? 1 Loading /prefs/module.proto
INFO  | ??:??:?? 1 Loaded /prefs/module.proto successfully
INFO  | ??:??:?? 1 Loaded saved moduleConfig version 23
INFO  | ??:??:?? 1 Loading /prefs/channels.proto
INFO  | ??:??:?? 1 Loaded /prefs/channels.proto successfully
INFO  | ??:??:?? 1 Loaded saved channelFile version 23
[  1204][E][vfs_api.cpp:105] open(): /littlefs/oem/oem.proto does not exist, no permits for creation
ERROR | ??:??:?? 1 Could not open / read /oem/oem.proto
DEBUG | ??:??:?? 1 cleanupMeshDB purged 0 entries
DEBUG | ??:??:?? 1 Using nodenum 0xbdc17f48 
INFO  | ??:??:?? 1 Adding node to database with 1 nodes and 180768 bytes free!
DEBUG | ??:??:?? 1 Old Pubkey: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
INFO  | ??:??:?? 1 Generating new PKI keys
DEBUG | ??:??:?? 1 Generating Curve25519 key pair...
DEBUG | ??:??:?? 1 New Pubkey: ef de 4e 80 79 b8 49 d6 3c 66 60 9a d2 2f 40 e6 ff a9 5a e9 db 27 74 dc 0c 43 b6 81 ff 7d 8b 06 
DEBUG | ??:??:?? 1 Number of Device Reboots: 19
INFO  | ??:??:?? 1 Setting default channel and radio preferences!
DEBUG | ??:??:?? 1 Expanding short PSK #1
INFO  | ??:??:?? 1 Wanted region 0, using UNSET
INFO  | ??:??:?? 1 Saving /prefs/config.proto
INFO  | ??:??:?? 1 Saving /prefs/channels.proto
INFO  | ??:??:?? 1 Saving /prefs/db.proto
DEBUG | ??:??:?? 1 SPI.begin(SCK=10, MISO=6, MOSI=7, NSS=8)
DEBUG | ??:??:?? 1 Set Timezone to GMT0
DEBUG | ??:??:?? 1 Read RTC time as 1
DEBUG | ??:??:?? 1 NeighborInfoModule is disabled
DEBUG | ??:??:?? 1 Rescanning for I2C keyboard
DEBUG | ??:??:?? 1 Scanning for I2C devices on port 1
DEBUG | ??:??:?? 1 Scanning address 0x1f
DEBUG | ??:??:?? 1 Scanning address 0x55
DEBUG | ??:??:?? 1 Scanning address 0x5f
INFO  | ??:??:?? 1 External Notification Module Disabled
DEBUG | ??:??:?? 1 SX126xInterface(cs=8, irq=3, rst=5, busy=4)
DEBUG | ??:??:?? 1 SX126X_DIO3_TCXO_VOLTAGE defined, using DIO3 as TCXO reference voltage at 1.800000 V
INFO  | ??:??:?? 1 Starting meshradio init...
DEBUG | ??:??:?? 1 (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=0, time 231 ms
DEBUG | ??:??:?? 1 (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=253, time 2115 ms
INFO  | ??:??:?? 1 Radio freq=906.875, config.lora.frequency_offset=0.000
INFO  | ??:??:?? 2 Set radio: region=UNSET, name=LongFast, config=0, ch=19, power=30
INFO  | ??:??:?? 2 Radio myRegion->freqStart -> myRegion->freqEnd: 902.000000 -> 928.000000 (26.000000 MHz)
INFO  | ??:??:?? 2 Radio myRegion->numChannels: 104 x 250.000kHz
INFO  | ??:??:?? 2 Radio channel_num: 20
INFO  | ??:??:?? 2 Radio frequency: 906.875000
INFO  | ??:??:?? 2 Slot time: 77 msec
INFO  | ??:??:?? 2 Set radio: final power level=22
INFO  | ??:??:?? 2 SX126x init result 0
INFO  | ??:??:?? 2 Frequency set to 906.875000
INFO  | ??:??:?? 2 Bandwidth set to 250.000000
INFO  | ??:??:?? 2 Power output set to 22
DEBUG | ??:??:?? 2 Current limit set to 140.000000
DEBUG | ??:??:?? 2 Current limit set result 0
DEBUG | ??:??:?? 2 Setting DIO2 as RF switch
DEBUG | ??:??:?? 2 SX126X_RXEN not defined, defaulting to RADIOLIB_NC
DEBUG | ??:??:?? 2 SX126X_TXEN not defined, defaulting to RADIOLIB_NC
DEBUG | ??:??:?? 2 Using MCU pin -1 as RXEN and pin -1 as TXEN to control RF switching
INFO  | ??:??:?? 2 Set RX gain to boosted mode; result: 0
E (3923) gpio: gpio_isr_handler_remove(482): GPIO isr service is not installed, call gpio_install_isr_service() first
INFO  | ??:??:?? 2 SX1262 Radio init succeeded, using SX1262 radio
INFO  | ??:??:?? 2 Not using WIFI
DEBUG | ??:??:?? 2 (bw=250, sf=11, cr=4/5) packet symLen=8 ms, payloadSize=237, time 2000 ms
DEBUG | ??:??:?? 2 LoRA bitrate = 118.500000 bytes / sec
INFO  | ??:??:?? 2 PowerFSM init, USB power=1
DEBUG | ??:??:?? 2 Enter state: BOOT
DEBUG | ??:??:?? 2 [Power] Battery: usbPower=0, isCharging=0, batMv=3100, batPct=0
INFO  | ??:??:?? 2 [RangeTestModule] Range Test Module - Disabled
meshtastic-bot commented 2 weeks ago

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/heltec-v3-random-factory-resets/13769/16

dm5tt commented 2 weeks ago

Played around a bit more.

Seems like when the ESP32 is running into a power loss during the early phase of the first file system operations it will reset the configuration:

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
DEBUG | ??:??:?? 0 Filesystem files (16384/1048576 Bytes):
DEBUG | ??:??:?? 0  /prefs/channels.proto (57 Bytes)
DEBUG | ??:??:?? 0  /prefs/config.proto (173 Bytes)
DEBUG | ??:??:?? 0  /prefs/db.proto.tmp (365 Bytes)
DEBUG | ??:??:?? 0  /prefs/module.proto (95 Bytes)
[   473][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=18 scl=19 freq=100000
DEBUG | ??:??:?? 0 Using analog input 1 for battery level
INFO  | ??:??:?? 0 ADCmod: ADC characterization based on Two Point values stored in eFuse
INFO  | ??:??:?? 0 Scanning for i2c devices...
DEBUG | ??:??:?? 0 Scanning for I2C devices on port 1
INFO  | ??:??:?? 0 No I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:53,2.5.0.d420433c
DEBUG | ??:??:?? 0 Total heap: 231364
DEBUG | ??:??:?? 0 Free heap: 199860
DEBUG | ??:??:?? 0 Total PSRAM: 0
DEBUG | ??:??:?? 0 Free PSRAM: 0
DEBUG | ??:??:?? 0 NVS: UsedEntries 137, FreeEntries 493, AllEntries 630, NameSpaces 7
DEBUG | ??:??:?? 0 Setup Preferences in Flash Storage
DEBUG | ??:??:?? 0 Number of Device Reboots: 26
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x105 (ESP_ERR_NOT_FOUND) at 0x4203701b
file: "src/platform/esp32/BleOta.cpp" line 16
func: static const esp_partition_t* BleOta::findEspOtaAppPartition()
expression: esp_ota_get_partition_description(part, &app_desc)
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x102 (ESP_ERR_INVALID_ARG) at 0x4203b043
file: "src/platform/esp32/BleOta.cpp" line 30
func: static String BleOta::getOtaAppVersion()
expression: esp_ota_get_partition_description(part, &app_desc)

!Power-loss roughly happened here!

DEBUG | ??:??:?? 0 No OTA firmware available
INFO  | ??:??:?? 0 Initializing NodeDB
[   637][E][vfs_api.cpp:105] open(): /littlefs/prefs/db.proto does not exist, no permits for creation
ERROR | ??:??:?? 0 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 0 Devicestate 0 is old, discarding
INFO  | ??:??:?? 0 Performing factory reset!
DEBUG | ??:??:?? 0 Deleting /prefs/channels.proto
panaceya commented 2 weeks ago

as i see, on every boot looking in internal db, found self, and generate new src/mesh/NodeDB.cpp#L622-L627 Log example:

WARN  | ??:??:?? 1 NOTE! Our desired nodenum 0x51e8f344 is invalid or in use, so trying for 0x41412e1
dm5tt commented 2 weeks ago

I just tested this rather ugly hack:

NodeDB.cpp, loadProto(...):

 uint32_t retry = 0;

    while(state == LoadFileResult::OTHER_FAILURE && retry < 5)
    {
    if (f) {
        LOG_INFO("Loading %s\n", filename);
        pb_istream_t stream = {&readcb, &f, protoSize};

        memset(dest_struct, 0, objSize);
        if (!pb_decode(&stream, fields, dest_struct)) {
            LOG_ERROR("Error: can't decode protobuf %s\n", PB_GET_ERROR(&stream));
            state = LoadFileResult::DECODE_FAILED;
        } else {
            LOG_INFO("Loaded %s successfully\n", filename);
            state = LoadFileResult::LOAD_SUCCESS;
        }
        f.close();
        break;
    } else {
        LOG_ERROR("Could not open / read %s, attempt %u\n", filename, retry);
        retry++;
        delay(500);
    }

Basically giving it a few retries to read the files. The trick is to be slow enough so we can be 100% sure that we are leaving the short duration where the ESP32 is dying away because of power loss. Most likely it will fail if the devices are hovers around long enough.

geeksville commented 2 weeks ago

Really great investingation @dm5tt! I think you've found two problems.

I'm finishing up some heltec tracker power improvements today (and possibly tomorrow). But I'll eagerly work on these tue or weds.

raxpww commented 2 weeks ago

The most glaring problem is the problem of the controller freezing when powered by the sun. The effect is not known to everyone and it is mostly encountered by designers of weather stations or small repeaters. When the night time of the day or cloudy weeks comes, the moment comes when the power source exhausts its resource and the power supply decreases so much that it begins to be insufficient for stable operation, and when the sun appears, the power supply voltage begins to increase slowly (not in a jump), however, the controller does not go to working mode To eliminate this effect, there are power supervisors... But this is a hardware problem caused by ill-conceived Chinese circuitry. But software has the ability to hang for various reasons. And when the device is physically nearby, the problem is small - I clicked RESET or switched the power and it's on the air again. But with remote use, especially in street conditions, in conditions of the impossibility of repeated physical access, for example to the roof, the problem comes to the fore. The solution methods are as old as the world and are known since the time of operation of external access points, weather stations and repeaters

  1. https://ur8lv.com/1621786922
  2. https://ur8lv.com/1621788126
JumpingCatRabbit commented 2 weeks ago

When the night time of the day or cloudy weeks comes, the moment comes when the power source exhausts its resource and the power supply decreases so much that it begins to be insufficient for stable operation, and when the sun appears, the power supply voltage begins to increase slowly (not in a jump), however, the controller does not go to working mode To eliminate this effect, there are power supervisors...

The ESP32 has an integrate brown-out detection for this. And it can be set high enough (~2.9-3.2V) so that it can reset the entire device to a safe reset state while still being fully functional before coming even close to dangerous areas where the SPI flash starts dying.

raxpww commented 2 weeks ago

When the night time of the day or cloudy weeks comes, the moment comes when the power source exhausts its resource and the power supply decreases so much that it begins to be insufficient for stable operation, and when the sun appears, the power supply voltage begins to increase slowly (not in a jump), however, the controller does not go to working mode To eliminate this effect, there are power supervisors...

The ESP32 has an integrate brown-out detection for this. And it can be set high enough (~2.9-3.2V) so that it can reset the entire device to a safe reset state while still being fully functional before coming even close to dangerous areas where the SPI flash starts dying.

I know about brown-out detection. It has false positives. To increase reliability and avoid cyclic reboots in my projects, I disable brown-out detection altogether

(((uint32_t volatile)ETS_UNCACHED_ADDR((DR_REG_RTCCNTL_BASE + 0xd4)))) = 0

And judging by this study https://www.esp32.com/viewtopic.php?t=38178 - the peripheral low voltage reset device is practically useless. An external hardware supervisor (3-pin case) will provide reliability and a reset guarantee than the crooked implementation in the ESP32 itself

geeksville commented 2 weeks ago

btw - I'm starting work on investigating/fixing this one tomorrow. will be a few days before a PR or resolution tho.

S5NC commented 2 weeks ago

ESP32C3?

geeksville commented 1 week ago

Alas - I ran out of time for meshtastic futzing before leaving on a long bike trip (with no computer/minimum cell-phone access).

I won't be able to work on this until I return to my desk (on Oct 1, but realistically need a few days to complete a move - so should be back at meshtastic about Oct 7). If someone wants to work on it before then then great - but if still open when I return I'll continue my investigation. Have a nice September ya'll!

dm5tt commented 1 week ago

I'm trying to force the Brownout-detection right now:

REG_SET_BIT(RTC_CNTL_BROWN_OUT_REG, RTC_CNTL_BROWN_OUT_ENA); // Enable brownout
WRITE_PERI_REG(RTC_CNTL_BROWN_OUT_REG, 7);                   // Set threshold to max (3.0V)

But as the device in placed on the roof there are no live logs from it. Just rough uptime values into my InfluxDB.

S5NC commented 1 day ago

@dm5tt any luck?

dm5tt commented 1 day ago

Not Yet.

As I'm using pretty big batteries (3000...4000mAh) this doesn't happen that often to me. The patch above (RTC_CNTL_BROWN_OUT_REG) is integrated into my Fork but not properly tested yet.

Lets wait for geeksville to return. Then we maybe can start a coordinated effort hunting this issue down.

meshtastic-bot commented 21 hours ago

This issue has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/three-rak4631-lose-their-minds/11928/8

dm5tt commented 21 hours ago

Looks like the RAK4631 have a similar problem.