[Bug]: ESP32 flash issue using device Lora32 T3_V1.6.1 with FW 2.1.10, 2.1.15, 2.1.17

tmarkson commented 1 year ago

Hardware

T-Lora v2 1.6

Firmware Version

2.1.10; 2.1.15; 2.1.17

Description

I have multiple nodes of the same hardware that experience a panic on the ESP.

The issue is intermittent -- sometimes the device can go days without this happening. The tell tale is the green LED on board stays on and the nodes that still have an LCD screen installed are stuck on splash screen. At the time of this particular issue capture, there 2 other nodes with the same channel setup broadcasting normal, idle traffic. That is to say the environment doesn't have other nodes sending packets at a high interval.

Please note this happens on multiple devices so I tried to rule out a bad ESP with corrupt flash memory. I've seen the issue happen while nodes are powered on battery and from benchtop power too.

I have another node of same hardware as this issue; that has been running as Router role for months now without issue. That device is using v2.0.23. I also own some Tbeam devices, which have not exhibited this behavior.

I mention firmware v2.1.10 here but I've seen this behavior on 2.1.15 and 2.1.17 too; this is the first time I had a serial log connected to capture the issue. Unfortunately, I don't have the log section to coincide with the first reboot cycle. I have setup the same node that produced the error today, updated to firmware v2.1.17, with serial output connected to a Linux PC that's saving to disk; hope that captures it eventually.

Full log

Here is full log showing multiple reboots at roughly every 2 seconds. uart_log_output.txt

A screenshot of nRF Connect Power Profiler showing the power consumption around time of issue

issue_boot_loop_lora32_fw2-1-10_stripped_nrf_power

Any thoughts are appreciated!

Relevant log output

...
INFO  | ??:??:?? 0 Saving /prefs/db.proto
Guru Meditation Error: Core  1 panic'ed (Interrupt wdt timeout on CPU1).

Core  1 register dump:
PC      : 0x40081a71  PS      : 0x00050035  A0      : 0x40085df4  A1      : 0x3ffc000c
A2      : 0x00000002  A3      : 0x3ffbe62c  A4      : 0x8008486c  A5      : 0x4009698a
...

garthvh commented 1 year ago

If the splash screen stays on it is likely the device is boot looping in the background.

tmarkson commented 1 year ago

If the splash screen stays on it is likely the device is boot looping in the background.

I think so too; the power draw and UART log also indicate a boot loop

rpsainio commented 1 year ago

your console output seems to be partial and hopefully there is a Backtrace after CPU register dump. So, can you add that one if available. It would be also interesting to see some 10 lines before guru meditation. I would like to see if this one is back from the past: https://github.com/meshtastic/firmware/issues/2084

tmarkson commented 1 year ago

your console output seems to be partial and hopefully there is a Backtrace after CPU register dump. So, can you add that one if available. It would be also interesting to see some 10 lines before guru meditation. I would like to see if this one is back from the past: #2084

You can see all the text that I had in the full log This shows multiple boots including backtrace!

rpsainio commented 1 year ago

Sorry, my bad - I was looking for attachments as I use ServiceNow in my work.. No, I think it looks different, however backtrace needs to be analysed

tmarkson commented 1 year ago

No worries --and I agree the backtrace of this issue in comparison with #2084 is different.

For this issue, yes, some analysis is need by someone more expert on ESP32 flash space.

I suspect something is happening in runtime to the registers that hold db.proto. Then during boot the device has an issue saving to that that space and panics.

I have recently recalled a handful of Lora32 nodes to my bench since they were deployed and failing intermittently with this issue. Now with them in hand, I am trying to reproduce the issue with firmware v2.1.17 by logging serial data to disk this time, instead of console with limited scrollback (amateur hour over here..) Of course I will post any updates soon as

rpsainio commented 1 year ago

Well, since one of my tlora v2... goes quite fast in reboot loop with 2.1.17 - I will compile 2.1.18 after I have kicked out in PowerFSM.cpp one line that lits up LED for 100ms and the line waiting for 100ms - let's see what happens

rpsainio commented 1 year ago

Final question to distinguish #2084 from yours: do you have power.is_power_saving true

tmarkson commented 1 year ago

Final question to distinguish #2084 from yours: do you have power.is_power_saving true

That value is False. However the node role is Router, which I think this implies some power saving actions despite power.is_power_saving being False.

tmarkson commented 1 year ago

The system did capture the first panic.

I trimmed the serial log to keep it short and remove personal info. Download it here. lora32_boot_loop_minicom_usb0_clean.txt

Below is a snippet where you can see the first panic point:

DEBUG | 16:45:30 396 [Router] rx_snr found. hop_limit:2 rx_snr:8.750000
DEBUG | 16:45:30 396 [Router] rx_snr found in packet. As a router, setting tx delay:294
DEBUG | 16:45:30 396 [Router] Delivering rx packet (id=0x612c5bfc fr=0x3c to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=67 rxtime=1688402730 rxSNR=8.75 rxRSSI=-85)
DEBUG | 16:45:30 396 [Router] Update DB node 0x25a7753c, rx_time=1688402730, channel=0
DEBUG | 16:45:30 396 [Router] Forwarding to phone (id=0x612c5bfc fr=0x3c to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=67 rxtime=1688402730 rxSNR=8.75 rxRSSI=-85)
DEBUG | 16:45:30 396 [Router] Module 'routing' considered
Guru Meditation Error: Core  1 panic'ed (Interrupt wdt timeout on CPU1). 

Core  1 register dump:
PC      : 0x40081aa0  PS      : 0x00050035  A0      : 0x40085e2c  A1      : 0x3ffc000c  
A2      : 0x3ff44000  A3      : 0x3ffbe62c  A4      : 0x3ffd37bc  A5      : 0x40096e12  
A6      : 0x00000001  A7      : 0x00000000  A8      : 0x00000001  A9      : 0xffffff00  
A10     : 0x00000000  A11     : 0x00000020  A12     : 0x00000001  A13     : 0x00000004  
A14     : 0x3ffdc694  A15     : 0x80000001  SAR     : 0x0000001f  EXCCAUSE: 0x00000006  
EXCVADDR: 0x00000000  LBEG    : 0x4009ac54  LEND    : 0x4009ac62  LCOUNT  : 0x00000001  
Core  1 was running in ISR context:
EPC1    : 0x4014373f  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x00000000

Backtrace: 0x40081a9d:0x3ffc000c |<-CORRUPTED

garthvh commented 1 year ago

Back trace is the important part https://maximeborges.github.io/esp-stacktrace-decoder/

garthvh commented 1 year ago

There is no reason to set power.is_power_saving it is managed by the role.

tmarkson commented 1 year ago

Thanks for the tip @garthvh

Below is the output from the stacktrace decoder you mentioned

Any more clues about where to look specifically?

0x40081aa0: gpio_ll_get_intr_status_high at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/gpio_ll.h:247
0x40081aa0: gpio_intr_service at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/driver/gpio.c:439
0x4009ac54: cpu_ll_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:124
0x4009ac54: cpu_hal_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/cpu_hal.c:52
0x4009ac62: cpu_ll_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:130
0x4009ac62: cpu_hal_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/cpu_hal.c:52
0x4014373f: uart_hal_write_txfifo at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/uart_hal_iram.c:35
0x40081a9d: gpio_ll_get_intr_status at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/gpio_ll.h:235
0x40081a9d: gpio_intr_service at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/driver/gpio.c:430
0x4021f45e: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:839
0x4021f45b: cpu_ll_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:183
0x4021f45b: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:837
0x401414cd: esp_vApplicationIdleHook at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/freertos_hooks.c:63
0x4009836c: prvIdleTask at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/tasks.c:3987

garthvh commented 7 months ago

All of those seem to be in the esp32 sdk

tmarkson commented 7 months ago

It's true. Could it be missing part of the trace?

I only ask because I have started to see the same freeze behavior recently on Heltec V3 devices since upgrading to v2.2.14. Whereas I did not see this issue on Heltec v3 hardware using any v2.1 firmware versions. Below is picture of a device after some hours of it not reporting on a mesh. Fortunately it freezes on the splash screen with firmware version! That's helpful. IMG_20231218_212456_HDR

Maybe it's my mesh? I'd like to track it down but I'm lost for next step. Any guidance or suggestions would be appreciated!

tmarkson commented 4 months ago

TLDR: assuming this issue is due to multiple routers in close proximity.

I've been able to reproduce the original issue behavior ie. node freeze with every device I've tried, with v1 and v2 firmware:

Tlora32 1.1.6
Heltec V3
RAK4631 nrf52 core

For whatever reason the nodes will hang in ROUTER roles (see earlier comments.) I'm guessing messages are stomping on each other, ie. the random timing slots are aligning with other routers in range.

But it seems this issue is an edge case so I will close this thread!

meshtastic / firmware