Closed mgrimace closed 3 months ago
Thanks for reporting this. I have decoded the stack trace and will log it here for reference as we decide how to fix...
Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x40229092: WiFiServer::accept() at ??:?
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021f13d: memory_ticker() at ??:?
0x40231ee4: run_scheduled_functions() at ??:?
0x4023473c: loop_end at ??:?
0x40234777: loop_wrapper() at core_esp8266_main.cpp:?
0x401007ad: cont_wrapper at ??:?
and...
Exception Cause: 4 [Level1Interrupt: Level-1 interrupt as indicated by set level-1 bits in the INTERRUPT register]
0x400005cb: ?? ??:0
0x4000cbe9: ?? ??:0
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021e621: printMessageLog(Print&) at ??:?
0x4021e740: crashCallback() at ??:?
0x4022dd85: custom_crash_callback at ??:?
0x40234964: cut_here at core_esp8266_postmortem.cpp:?
0x40234be9: postmortem_report at core_esp8266_postmortem.cpp:?
0x4010639d: ets_timer_disarm at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x4000cbe9: ?? ??:0
0x402821f0: chip_v6_unset_chanfreq at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x4026de8b: user_uart_wait_tx_fifo_empty at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x40105e06: wdt_feed at ??:?
0x40105de8: wdt_feed at ??:?
0x4025f23c: umm_info at ??:?
0x4024eaa8: ip4_output_if_opt_src at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1765
0x400005e1: ?? ??:0
0x40247f95: mem_malloc at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:210
0x4025f23c: umm_info at ??:?
0x4025f1ce: umm_info at ??:?
0x4000cbe8: ?? ??:0
0x40234548: esp_suspend at ??:?
0x40228871: ClientContext::wait_until_acked(int) at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x40101368: free at ??:?
0x40231b55: Print::printf(char const*, ...) at ??:?
0x4025f1ce: umm_info at ??:?
0x40229092: WiFiServer::accept() at ??:?
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021f13d: memory_ticker() at ??:?
0x40231ee4: run_scheduled_functions() at ??:?
Just to document my thoughts here, @jgstroud. This is somewhat disconcerting. The crash is inside the getHeapStats()
function which is called in two different places... both of which show up in these two stack dumps.
The second, longer, stack dump shows it inside a call to getHeapStats()
that is called during the handling of a crash. Ugh. The quick fix here is simply to remove that call and report the most recently gathered stats that take place in the once-a-second timer... so the data included in a crash log is potentially up to one second old.
But the first, shorter, stack dump shows the crash occurring from within getHeapStats()
called from that very once-a-second timer. Ugh again. In fact, the second crash was caused by the first crash!!! (crash in heap stats called from the memory_ticker()
, causes our crashCallback()
function to be called which calls getHeapStats()
again which of course crashes again.
So, we can avoid the second crash by removing the call from crash callback, but what caused the first one? Humm.
@mgrimace I would like you to monitor please and see if this occurs again.
David.
Of course, I’ll update if it does happen again - as it is no further crashes on my end.
Both my RATGDO's crashed at the same time with the same error. The cause was my powering off my Apple TV as I rearranged my electrical connections (we have seen this before and may be something we just have to live with) But that the getHeapStats()
call crashed inside crash logging is bad. So I have removed that call (commented out) in the crash handling.
But, @jgstroud , I think we should revert back to just tracking available free memory and not tracking heap fragmentation. We've not seen any evidence that fragmentation is an issue, and getHeapStats()
is failing for unknown reasons.
David.
I'm good with removing the heap stats.
I've tried to reproduce the Apple hub reboot crash while capturing all network traffic and console logs. Unfortunately, I was never able to forcefully reproduce it. I may need to put back my continuous packet capture so I can catch the failure.
Reverting heap stats is part of PR #220
Ok, had an additional crash, near as I can tell it was 8/6/2024, 9:30:15 PM based on ‘last reboot’ and uptime counter (I never manually rebooted it):
Crash information recovered from EEPROM
Crash # 1 at 625338699 ms
Restart reason: 2
Exception (3):
epc1=0x4010110d epc2=0x00000000 epc3=0x00000000 excvaddr=0x400195b9 depc=0x00000000
>>>stack>>>
ctx: sys
sp: 3ffffdd0 end: 3fffffb0
3ffffdd0: 00004d18 000009a3 3ffe8f1c 3ffef640
3ffffde0: 0000001a 3ffefcd0 00000020 4010139c
3ffffdf0: 000000d4 3ffefcd0 000000d4 4010160a
3ffffe00: 40255806 3ffefcd0 00000002 40259e10
3ffffe10: 00000002 40259db7 00000002 40258f0c
3ffffe20: 40258f35 3ffffed0 3ffefcd0 0000001a
3ffffe30: 40256994 3ffffed0 3ffefb78 3ffef540
3ffffe40: 3ffeda10 3ffffed0 3ffffed0 00000001
3ffffe50: 4c2d5054 5f4b4e49 00546f49 00000000
3ffffe60: 00000000 0000001f 40100854 3ffeba90
3ffffe70: 40252ced 3fffc228 4010639d 4000050c
3ffffe80: ffffffd4 3ffef68c 3ffeda20 3ffefcd0
3ffffe90: 3ffeec70 00000034 00000000 4025768f
3ffffea0: 00000000 3fff2b44 ffffffd4 00000000
3ffffeb0: 00000000 3ffefcd0 3fff20fc 0000000f
3ffffec0: 0000006f 00000001 7a76e73f 00000000
3ffffed0: 00000000 10310606 00640100 0000004a
3ffffee0: 3ffeda34 00000163 3ffeda4e 3ffeda28
3ffffef0: 3ffeda54 3ffeda34 3ffeda41 3ffeda63
3fffff00: 00000000 3ffedb7b 3ffeda69 3ffedb40
3fffff10: 3ffeda8a 3ffedaa6 00000000 00000000
3fffff20: 00000000 00000000 00000034 00000000
3fffff30: 3fff2dec 402570fe 3ffeec70 3fff2b44
3fffff40: 00000000 3ffefcd0 3ffeec70 3ffeda10
3fffff50: 3ffeda10 0000018b 00000000 00000034
3fffff60: 00000000 3ffeda1a 4026ae9f 3ffeec70
3fffff70: 3ffeda04 3fffdcc0 3ffeb2e8 3ffeb2e8
3fffff80: 00000080 3ffeec70 00000000 3fffdab0
3fffff90: 4026a763 3fffdab0 00000000 402344da
3fffffa0: 3ffeb2e8 40000f49 ffffff01 40000f49
<<<stack<<<
EEPROM space available: 0x01fb bytes
Flash CRC OK
Firmware Version: 1.6.1
Flash CRC: 0xFD684C66
Flash Length: 605440
Free heap: 20096
Max malloc size: 16792
Fragmentation pct: 16
sion established
>>> [625325746] HomeKit: Free heap: 19936
>>> [625325792] HomeKit: [Client 1073687236] Get Accessories
>>> [625325847] RATGDO: get active: 1
>>> [625325849] RATGDO: get current door state: 1
>>> [625325850] RATGDO: get target door state: 1
>>> [625325893] RATGDO: get obstruction: 0
>>> [625325894] RATGDO: get current lock state: 0
>>> [625325895] RATGDO: get target lock state: 0
>>> [625325906] RATGDO: get light state: Off
>>> [625325945] HomeKit: [Client 1073687236] Get Characteristics
>>> [625326018] HomeKit: [Client 1073687236] Get Characteristics
>>> [625326114] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326150] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326223] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326261] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326319] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326357] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326423] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326459] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326511] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326619] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326971] HomeKit: [Client 1073687236] Update Characteristics
>>> [625327046] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327150] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327223] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327284] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327351] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327428] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327762] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327856] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327951] HomeKit: [Client 1073687236] Get Characteristics
>>> [625328001] HomeKit: [Client 1073687236] Get Characteristics
>>> [625328072] HomeKit: [Client 1073687236] Get Characteristics
And to clarify the crash, as far as I can tell I had no real practical negative effect in my daily usage, just checked the web UI and noticed it.
I'm currently of the opinion, and I think @dkerr64 agrees, that as long as the device recovers gracefully, we probably won't fix these issues given the rarity of the events. However, we do still appreciate the logs, and they may become helpful if we do decide we need to fix it.
I'm currently of the opinion, and I think @dkerr64 agrees, that as long as the device recovers gracefully, we probably won't fix these issues given the rarity of the events. However, we do still appreciate the logs, and they may become helpful if we do decide we need to fix it.
Of course, that makes perfect sense. I’ll keep providing them (if there’s more) simply as data, and FWIW my intention isn’t for a ‘fix’ but to share and contribute however I can.
Looks to be in the WiFi stack, decodes below. But yes, we always seem to recover so occasional crashes looks like something we just have to live with.
Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]
0x4010110d: umm_malloc_core at umm_malloc.cpp:?
0x400195b9: ?? ??:0
0x4010139c: malloc at ??:?
0x4010160a: calloc at ??:?
0x40255806: ieee80211_regdomain_update at ??:?
0x40259e10: ieee80211_setup_ratetable at ??:?
0x40259db7: ieee80211_phy_init at ??:?
0x40258f0c: cnx_update_bss_more at ??:?
0x40258f35: cnx_update_bss_more at ??:?
0x40256994: scan_parse_beacon at ??:?
0x40100854: ets_post at ??:?
0x40252ced: ieee80211_parse_beacon at ??:?
0x4010639d: ets_timer_disarm at ??:?
0x4000050c: ?? ??:0
0x4025768f: sta_input at ??:?
0x402570fe: sta_input at ??:?
0x4026ae9f: pp_tx_idle_timeout at ??:?
0x4026a763: ppPeocessRxPktHdr at ??:?
0x402344da: loop_task(ETSEventTag*) at core_esp8266_main.cpp:?
0x40000f49: ?? ??:0
0x40000f49: ?? ??:0
removed calls to getHeapStats in v1.7
Hello, 2 crashes after update from 1.50 -> 1.61. Discovered after no response from siri/home app. Appears working as expected now. Hopefully these are helpful: