Closed tmarkson closed 4 months ago
If the splash screen stays on it is likely the device is boot looping in the background.
I think so too; the power draw and UART log also indicate a boot loop
Your console output seems to be partial, and hopefully there is a Backtrace after the CPU register dump. Can you add that, if available? It would also be interesting to see some 10 lines before the Guru Meditation. I would like to see if this one is back from the past: https://github.com/meshtastic/firmware/issues/2084
You can see all the text that I had in the full log. It shows multiple boots, including a backtrace!
Sorry, my bad - I was looking for attachments, as I use ServiceNow at work. No, I think it looks different; however, the backtrace needs to be analysed.
No worries, and I agree the backtrace of this issue differs from the one in #2084.
For this issue, yes, some analysis is needed by someone more expert on ESP32 flash space. I suspect something happens at runtime to the registers that hold db.proto; then during boot the device has an issue saving to that space and panics.
I have recently recalled a handful of Lora32 nodes to my bench; they were deployed and failing intermittently with this issue. Now with them in hand, I am trying to reproduce the issue with firmware v2.1.17, this time logging serial data to disk instead of to a console with limited scrollback (amateur hour over here..). Of course I will post any updates as soon as I have them.
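For anyone else capturing serial output to disk, here is a minimal sketch of one way to do it with timestamps per line. This is not from the thread: it assumes pyserial is installed, the node enumerates as /dev/ttyUSB0 at 115200 baud, and the file name is arbitrary.

```python
# Hedged sketch: timestamped serial-to-disk logger for a Meshtastic node.
# Assumptions: pyserial installed, device on /dev/ttyUSB0 at 115200 baud.
import datetime


def stamp(line: str) -> str:
    """Prefix a decoded serial line with a wall-clock timestamp."""
    ts = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"[{ts}] {line.rstrip()}"


def log_serial(port: str = "/dev/ttyUSB0", baud: int = 115200,
               path: str = "lora32_serial.log") -> None:
    import serial  # pyserial; imported here so stamp() stays testable offline
    with serial.Serial(port, baud, timeout=1) as dev, open(path, "a") as out:
        while True:
            raw = dev.readline()
            if raw:
                out.write(stamp(raw.decode("utf-8", errors="replace")) + "\n")
                out.flush()  # flush every line so a reboot/crash loses nothing

# log_serial()  # uncomment to run against real hardware
```

Flushing after every line matters here: with a boot loop every ~2 seconds, buffered output can lose exactly the lines leading up to the panic.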
Well, since one of my tlora v2... goes into a reboot loop quite fast with 2.1.17, I will compile 2.1.18 after removing from PowerFSM.cpp the line that lights up the LED for 100ms and the line that waits 100ms - let's see what happens.
Final question to distinguish #2084 from yours: do you have power.is_power_saving set to true?
That value is False.
However, the node role is Router, which I think implies some power-saving actions despite power.is_power_saving being False.
The system did capture the first panic.
I trimmed the serial log to keep it short and to remove personal info. Download it here: lora32_boot_loop_minicom_usb0_clean.txt
Below is a snippet where you can see the first panic point:
DEBUG | 16:45:30 396 [Router] rx_snr found. hop_limit:2 rx_snr:8.750000
DEBUG | 16:45:30 396 [Router] rx_snr found in packet. As a router, setting tx delay:294
DEBUG | 16:45:30 396 [Router] Delivering rx packet (id=0x612c5bfc fr=0x3c to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=67 rxtime=1688402730 rxSNR=8.75 rxRSSI=-85)
DEBUG | 16:45:30 396 [Router] Update DB node 0x25a7753c, rx_time=1688402730, channel=0
DEBUG | 16:45:30 396 [Router] Forwarding to phone (id=0x612c5bfc fr=0x3c to=0xff, WantAck=0, HopLim=3 Ch=0x0 Portnum=67 rxtime=1688402730 rxSNR=8.75 rxRSSI=-85)
DEBUG | 16:45:30 396 [Router] Module 'routing' considered
Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1).
Core 1 register dump:
PC : 0x40081aa0 PS : 0x00050035 A0 : 0x40085e2c A1 : 0x3ffc000c
A2 : 0x3ff44000 A3 : 0x3ffbe62c A4 : 0x3ffd37bc A5 : 0x40096e12
A6 : 0x00000001 A7 : 0x00000000 A8 : 0x00000001 A9 : 0xffffff00
A10 : 0x00000000 A11 : 0x00000020 A12 : 0x00000001 A13 : 0x00000004
A14 : 0x3ffdc694 A15 : 0x80000001 SAR : 0x0000001f EXCCAUSE: 0x00000006
EXCVADDR: 0x00000000 LBEG : 0x4009ac54 LEND : 0x4009ac62 LCOUNT : 0x00000001
Core 1 was running in ISR context:
EPC1 : 0x4014373f EPC2 : 0x00000000 EPC3 : 0x00000000 EPC4 : 0x00000000
Backtrace: 0x40081a9d:0x3ffc000c |<-CORRUPTED
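As an aside (a hedged helper, not from the thread), the program-counter addresses in a `Backtrace:` line like the one above can be pulled out with a short script before pasting them into a decoder, or into `xtensa-esp32-elf-addr2line -pfiaC -e firmware.elf <addrs>` against the matching ELF. Each `PC:SP` pair carries the PC first; only the PCs are decoded.

```python
# Hedged sketch: extract PC addresses from an ESP32 panic backtrace line.
import re


def backtrace_pcs(log: str) -> list:
    """Return the PC (first) address of each PC:SP pair on the Backtrace line."""
    m = re.search(r"Backtrace:((?:\s+0x[0-9a-f]+:0x[0-9a-f]+)+)", log)
    if not m:
        return []
    return [pair.split(":")[0] for pair in m.group(1).split()]


sample = "Backtrace: 0x40081a9d:0x3ffc000c |<-CORRUPTED"
print(backtrace_pcs(sample))  # ['0x40081a9d']
```

Note the `|<-CORRUPTED` marker in this dump: a single-frame, corrupted backtrace is exactly the case where decoding the raw register addresses (PC, EPC1) individually can still give a hint.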
The backtrace is the important part: https://maximeborges.github.io/esp-stacktrace-decoder/
There is no reason to set power.is_power_saving; it is managed by the role.
Thanks for the tip @garthvh
Below is the output from the stacktrace decoder you mentioned.
Any more clues about where to look specifically?
0x40081aa0: gpio_ll_get_intr_status_high at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/gpio_ll.h:247
0x40081aa0: gpio_intr_service at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/driver/gpio.c:439
0x4009ac54: cpu_ll_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:124
0x4009ac54: cpu_hal_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/cpu_hal.c:52
0x4009ac62: cpu_ll_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:130
0x4009ac62: cpu_hal_set_watchpoint at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/cpu_hal.c:52
0x4014373f: uart_hal_write_txfifo at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/uart_hal_iram.c:35
0x40081a9d: gpio_ll_get_intr_status at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/gpio_ll.h:235
0x40081a9d: gpio_intr_service at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/driver/gpio.c:430
0x4021f45e: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:839
0x4021f45b: cpu_ll_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/hal/esp32/include/hal/cpu_ll.h:183
0x4021f45b: esp_pm_impl_waiti at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_pm/pm_impl.c:837
0x401414cd: esp_vApplicationIdleHook at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/freertos_hooks.c:63
0x4009836c: prvIdleTask at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/tasks.c:3987
All of those seem to be in the ESP32 SDK.
It's true. Could part of the trace be missing?
I only ask because I have recently started to see the same freeze behavior on Heltec V3 devices since upgrading to v2.2.14, whereas I did not see this issue on Heltec V3 hardware with any v2.1 firmware version.
Below is a picture of a device after some hours of not reporting on the mesh. Fortunately, it freezes on the splash screen showing the firmware version! That's helpful.
Maybe it's my mesh? I'd like to track it down, but I'm at a loss for the next step. Any guidance or suggestions would be appreciated!
TL;DR: I'm assuming this issue is due to multiple routers in close proximity.
I've been able to reproduce the original issue behavior, i.e. a node freeze, with every device I've tried, on both v1 and v2 firmware.
For whatever reason the nodes will hang in the Router role (see earlier comments). I'm guessing messages are stomping on each other, i.e. the random timing slots are aligning with those of other routers in range.
But since this issue seems to be an edge case, I will close this thread!
Category
Other
Hardware
T-Lora v2 1.6
Firmware Version
2.1.10; 2.1.15; 2.1.17
Description
I have multiple nodes of the same hardware that experience a panic on the ESP32.
The issue is intermittent; sometimes a device can go days without it happening. The telltale sign is that the green LED on board stays on, and the nodes that still have an LCD screen installed are stuck on the splash screen. At the time of this particular capture, there were 2 other nodes with the same channel setup broadcasting normal, idle traffic. That is to say, the environment doesn't have other nodes sending packets at a high interval.
Please note this happens on multiple devices, which I believe rules out a single bad ESP with corrupt flash memory. I've seen the issue happen while nodes are powered from battery and from benchtop power.
I have another node of the same hardware as in this issue that has been running in the Router role for months now without issue; that device is using v2.0.23. I also own some Tbeam devices, which have not exhibited this behavior.
I mention firmware v2.1.10 here, but I've seen this behavior on 2.1.15 and 2.1.17 too; this is the first time I had a serial log connected to capture the issue. Unfortunately, I don't have the log section that coincides with the first reboot cycle. I have set up the same node that produced the error today, updated to firmware v2.1.17, with serial output connected to a Linux PC saving to disk; hopefully that captures it eventually.
Full log
Here is the full log showing multiple reboots, roughly every 2 seconds: uart_log_output.txt
A screenshot of the nRF Connect Power Profiler showing the power consumption around the time of the issue:
Any thoughts are appreciated!
Relevant log output