ratgdo / homekit-ratgdo

A native HomeKit implementation of a Security+ 2.0 garage door controller based on ratgdo hardware
https://ratgdo.github.io/homekit-ratgdo/
GNU General Public License v3.0
215 stars 21 forks source link

Crashlog 1.6.1 #218

Closed mgrimace closed 3 months ago

mgrimace commented 3 months ago

Hello, 2 crashes after update from 1.50 -> 1.61. Discovered after no response from siri/home app. Appears working as expected now. Hopefully these are helpful:

Crash information recovered from EEPROM
Crash # 1 at 92523221 ms
Restart reason: 2
Exception (3):
epc1=0x4025f23c epc2=0x00000000 epc3=0x00000000 excvaddr=0x4000cbe9 depc=0x00000000
>>>stack>>>

ctx: cont

sp: 3fff1fe0 end: 3fff20a0
3fff1fe0: 38323235 202c3633 65726622 00654865 
3fff1ff0: 00000020 3fff2000 00000008 006d4264 
3fff2000: 00000000 40229092 3fff568c 3ffe8ec0 
3fff2010: 3ffe8eb8 3ffe8eb8 3fff0c0d 40230bb8 
3fff2020: 000000e5 3ffe983c 3fff2030 00000001 
3fff2030: 3fff0c0d 3ffe8eb8 3ffe8ec0 4021f13d 
3fff2040: 3fff0f70 3fff0f6c 00000001 40231ee4 
3fff2050: 3fff660c 00000000 00000020 3fff20fc 
3fff2060: 00000000 00000000 00000000 3fff20fc 
3fff2070: 3fffdad0 00000000 3fff20d0 4023473c 
3fff2080: 3fffdad0 00000000 3fff20d0 40234777 
3fff2090: feefeffe feefeffe 3fffdab0 401007ad 
<<<stack<<<
Crash # 2 at 92524562 ms
Restart reason: 2
Exception (4):
epc1=0x400005cb epc2=0x00000000 epc3=0x00000000 excvaddr=0x4000cbe9 depc=0x00000000
>>>stack>>>

ctx: cont

sp: 3fff1d60 end: 3fff20a0
3fff1d60: 00000100 ffffffff 3fffc718 3fff2e44 
3fff1d70: 00000023 00000000 3fff4008 3fff2e44 
3fff1d80: 00000002 3ffe9d4f 3fff4008 3ffe8ec0 
3fff1d90: 3ffe8eb8 3fff0c0d 3fff0c0d 40230bb8 
3fff1da0: 00000001 00000000 3fff40bc 3ffe8ec0 
3fff1db0: 3ffe8eb8 3fff0c0d 3fff0988 4021e621 
3fff1dc0: 3fff20a0 3fff6015 3fff0988 3ffe8f06 
3fff1dd0: 3fff20a0 3fff6015 3fff0988 4021e740 
3fff1de0: 3fff1e40 00000583 3fff29b0 4022dd85 
3fff1df0: 0583cad5 058300e5 00000000 00ff0000 
3fff1e00: 3fff1fe0 3fff20a0 00000000 00ff0000 
3fff1e10: 40234964 5ffffe00 0000000f 3fff1e40 
3fff1e20: 3fff20a0 3fff226c 3fff1fe0 40234be9 
3fff1e30: 00000000 24a00000 3fffc228 4010639d 
3fff1e40: 00000002 00000003 4025f23c 00000000 
3fff1e50: 00000000 4000cbe9 00000000 4000cbe9 
3fff1e60: 3fff10a0 00000000 00000000 fffffffd 
3fff1e70: 402821f0 4025f23c 00000000 00000000 
3fff1e80: 00000000 4000cbe9 00000000 4026de8b 
3fff1e90: 00000002 00000003 4025f23c 00000000 
3fff1ea0: 00000000 4000cbe9 00000000 40105e06 
3fff1eb0: 40105de8 3fffc100 00000000 00000003 
3fff1ec0: 00000003 4025f23c 3fff6a58 4024eaa8 
3fff1ed0: 400005e1 00000218 00000000 40247f95 
3fff1ee0: 4025f23c 0000003f 00000010 00000006 
3fff1ef0: 4025f1ce 00000036 4000cbe8 00003436 
3fff1f00: 00003436 00003435 00002936 3fff8238 
3fff1f10: 000000a0 3fffc6fc 000dc6b0 3ffe8f1c 
3fff1f20: 00000000 00000000 0001a1b0 00000003 
3fff1f30: 0583c956 00000001 40234548 00000001 
3fff1f40: 0583c956 0000034b 3fff5264 40228871 
3fff1f50: 0000034b 0000012c 2b439581 2915053a 
3fff1f60: 45270b05 ce692143 d6a571d9 00000002 
3fff1f70: 00004b38 00000967 3ffe8f1c 40101012 
3fff1f80: 000000e5 00000000 00000008 3fff0a04 
3fff1f90: 3fff2000 00000020 3fff7574 40101368 
3fff1fa0: 3fff2000 00000008 3fff7574 40231b55 
3fff1fb0: 6e657665 6d203a74 61737365 720a6567 
3fff1fc0: 79727465 3531203a 0a303030 61746164 
3fff1fd0: 4025f1ce 54707522 22656d69 3239203a 
3fff1fe0: 38323235 202c3633 65726622 00654865 
3fff1ff0: 00000020 3fff2000 00000008 006d4264 
3fff2000: 00000000 40229092 3fff568c 3ffe8ec0 
3fff2010: 3ffe8eb8 3ffe8eb8 3fff0c0d 40230bb8 
3fff2020: 000000e5 3ffe983c 3fff2030 00000001 
3fff2030: 3fff0c0d 3ffe8eb8 3ffe8ec0 4021f13d 
3fff2040: 3fff0f70 3fff0f6c 00000001 40231ee4 
3fff2050: 3fff660c 00000000 00000000 
Incomplete stack trace saved!
<<<stack<<<
No more EEPROM space available to save crash information!
dkerr64 commented 3 months ago

Thanks for reporting this. I have decoded the stack trace and will log it here for reference as we decide how to fix...

Exception Cause: 3  [LoadStoreError: Processor internal physical address or data error during load or store]

0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x40229092: WiFiServer::accept() at ??:?
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021f13d: memory_ticker() at ??:?
0x40231ee4: run_scheduled_functions() at ??:?
0x4023473c: loop_end at ??:?
0x40234777: loop_wrapper() at core_esp8266_main.cpp:?
0x401007ad: cont_wrapper at ??:?

and...

Exception Cause: 4  [Level1Interrupt: Level-1 interrupt as indicated by set level-1 bits in the INTERRUPT register]

0x400005cb: ?? ??:0
0x4000cbe9: ?? ??:0
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021e621: printMessageLog(Print&) at ??:?
0x4021e740: crashCallback() at ??:?
0x4022dd85: custom_crash_callback at ??:?
0x40234964: cut_here at core_esp8266_postmortem.cpp:?
0x40234be9: postmortem_report at core_esp8266_postmortem.cpp:?
0x4010639d: ets_timer_disarm at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x4000cbe9: ?? ??:0
0x402821f0: chip_v6_unset_chanfreq at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x4026de8b: user_uart_wait_tx_fifo_empty at ??:?
0x4025f23c: umm_info at ??:?
0x4000cbe9: ?? ??:0
0x40105e06: wdt_feed at ??:?
0x40105de8: wdt_feed at ??:?
0x4025f23c: umm_info at ??:?
0x4024eaa8: ip4_output_if_opt_src at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1765
0x400005e1: ?? ??:0
0x40247f95: mem_malloc at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:210
0x4025f23c: umm_info at ??:?
0x4025f1ce: umm_info at ??:?
0x4000cbe8: ?? ??:0
0x40234548: esp_suspend at ??:?
0x40228871: ClientContext::wait_until_acked(int) at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x40101368: free at ??:?
0x40231b55: Print::printf(char const*, ...) at ??:?
0x4025f1ce: umm_info at ??:?
0x40229092: WiFiServer::accept() at ??:?
0x40230bb8: EspClass::getHeapStats(unsigned int*, unsigned int*, unsigned char*) at ??:?
0x4021f13d: memory_ticker() at ??:?
0x40231ee4: run_scheduled_functions() at ??:?
dkerr64 commented 3 months ago

Just to document my thoughts here, @jgstroud. This is somewhat disconcerting. The crash is inside the getHeapStats() function which is called in two different places... both of which show up in these two stack dumps.

The second, longer, stack dump shows it inside a call to getHeapStats() that is called during the handling of a crash. Ugh. The quick fix here is simply to remove that call and report the most recently gathered stats that take place in the once-a-second timer... so the data included in a crash log is potentially up to one second old.

But the first, shorter, stack dump shows the crash occurring from within getHeapStats() called from that very once-a-second timer. Ugh again. In fact, the second crash was caused by the first crash!!! (crash in heap stats called from the memory_ticker(), causes our crashCallback() function to be called which calls getHeapStats() again which of course crashes again.

So, we can avoid the second crash by removing the call from crash callback, but what caused the first one? Humm.

@mgrimace I would like you to monitor please and see if this occurs again.

David.

mgrimace commented 3 months ago

Of course, I’ll update if it does happen again - as it is no further crashes on my end.

dkerr64 commented 3 months ago

Both my RATGDO's crashed at the same time with the same error. The cause was my powering off my Apple TV as I rearranged my electrical connections (we have seen this before and may be something we just have to live with) But that the getHeapStats() call crashed inside crash logging is bad. So I have removed that call (commented out) in the crash handling.

But, @jgstroud , I think we should revert back to just tracking available free memory and not tracking heap fragmentation. We've not seen any evidence that fragmentation is an issue, and getHeapStats() is failing for unknown reasons.

David.

jgstroud commented 3 months ago

I'm good with removing the heap stats.

I've tried to reproduce the Apple hub reboot crash while capturing all network traffic and console logs. Unfortunately, I was never able to forcefully reproduce it. I may need to put back my continuous packet capture so I can catch the failure.

dkerr64 commented 3 months ago

Reverting heap stats is part of PR #220

mgrimace commented 3 months ago

Ok, had an additional crash, near as I can tell it was 8/6/2024, 9:30:15 PM based on ‘last reboot’ and uptime counter (I never manually rebooted it):

Crash information recovered from EEPROM
Crash # 1 at 625338699 ms
Restart reason: 2
Exception (3):
epc1=0x4010110d epc2=0x00000000 epc3=0x00000000 excvaddr=0x400195b9 depc=0x00000000
>>>stack>>>

ctx: sys

sp: 3ffffdd0 end: 3fffffb0
3ffffdd0: 00004d18 000009a3 3ffe8f1c 3ffef640 
3ffffde0: 0000001a 3ffefcd0 00000020 4010139c 
3ffffdf0: 000000d4 3ffefcd0 000000d4 4010160a 
3ffffe00: 40255806 3ffefcd0 00000002 40259e10 
3ffffe10: 00000002 40259db7 00000002 40258f0c 
3ffffe20: 40258f35 3ffffed0 3ffefcd0 0000001a 
3ffffe30: 40256994 3ffffed0 3ffefb78 3ffef540 
3ffffe40: 3ffeda10 3ffffed0 3ffffed0 00000001 
3ffffe50: 4c2d5054 5f4b4e49 00546f49 00000000 
3ffffe60: 00000000 0000001f 40100854 3ffeba90 
3ffffe70: 40252ced 3fffc228 4010639d 4000050c 
3ffffe80: ffffffd4 3ffef68c 3ffeda20 3ffefcd0 
3ffffe90: 3ffeec70 00000034 00000000 4025768f 
3ffffea0: 00000000 3fff2b44 ffffffd4 00000000 
3ffffeb0: 00000000 3ffefcd0 3fff20fc 0000000f 
3ffffec0: 0000006f 00000001 7a76e73f 00000000 
3ffffed0: 00000000 10310606 00640100 0000004a 
3ffffee0: 3ffeda34 00000163 3ffeda4e 3ffeda28 
3ffffef0: 3ffeda54 3ffeda34 3ffeda41 3ffeda63 
3fffff00: 00000000 3ffedb7b 3ffeda69 3ffedb40 
3fffff10: 3ffeda8a 3ffedaa6 00000000 00000000 
3fffff20: 00000000 00000000 00000034 00000000 
3fffff30: 3fff2dec 402570fe 3ffeec70 3fff2b44 
3fffff40: 00000000 3ffefcd0 3ffeec70 3ffeda10 
3fffff50: 3ffeda10 0000018b 00000000 00000034 
3fffff60: 00000000 3ffeda1a 4026ae9f 3ffeec70 
3fffff70: 3ffeda04 3fffdcc0 3ffeb2e8 3ffeb2e8 
3fffff80: 00000080 3ffeec70 00000000 3fffdab0 
3fffff90: 4026a763 3fffdab0 00000000 402344da 
3fffffa0: 3ffeb2e8 40000f49 ffffff01 40000f49 
<<<stack<<<

EEPROM space available: 0x01fb bytes

Flash CRC OK
Firmware Version: 1.6.1
Flash CRC: 0xFD684C66
Flash Length: 605440
Free heap: 20096
Max malloc size: 16792
Fragmentation pct: 16

sion established
>>> [625325746] HomeKit: Free heap: 19936
>>> [625325792] HomeKit: [Client 1073687236] Get Accessories
>>> [625325847] RATGDO: get active: 1
>>> [625325849] RATGDO: get current door state: 1
>>> [625325850] RATGDO: get target door state: 1
>>> [625325893] RATGDO: get obstruction: 0
>>> [625325894] RATGDO: get current lock state: 0
>>> [625325895] RATGDO: get target lock state: 0
>>> [625325906] RATGDO: get light state: Off
>>> [625325945] HomeKit: [Client 1073687236] Get Characteristics
>>> [625326018] HomeKit: [Client 1073687236] Get Characteristics
>>> [625326114] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326150] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326223] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326261] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326319] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326357] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326423] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326459] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326511] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326619] HomeKit: [Client 1073687236] Update Characteristics
>>> [625326971] HomeKit: [Client 1073687236] Update Characteristics
>>> [625327046] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327150] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327223] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327284] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327351] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327428] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327762] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327856] HomeKit: [Client 1073687236] Get Characteristics
>>> [625327951] HomeKit: [Client 1073687236] Get Characteristics
>>> [625328001] HomeKit: [Client 1073687236] Get Characteristics
>>> [625328072] HomeKit: [Client 1073687236] Get Characteristics
mgrimace commented 3 months ago

And to clarify the crash, as far as I can tell I had no real practical negative effect in my daily usage, just checked the web UI and noticed it.

jgstroud commented 3 months ago

I'm currently of the opinion, and I think @dkerr64 agrees, that as long as the device recovers gracefully, we probably won't fix these issues given the rarity of the events. However, we do still appreciate the logs, and they may become helpful if we do decide we need to fix it.

mgrimace commented 3 months ago

I'm currently of the opinion, and I think @dkerr64 agrees, that as long as the device recovers gracefully, we probably won't fix these issues given the rarity of the events. However, we do still appreciate the logs, and they may become helpful if we do decide we need to fix it.

Of course, that makes perfect sense. I’ll keep providing them (if there’s more) simply as data, and FWIW my intention isn’t for a ‘fix’ but to share and contribute however I can.

dkerr64 commented 3 months ago

Looks to be in the WiFi stack, decodes below. But yes, we always seem to recover so occasional crashes looks like something we just have to live with.

Exception Cause: 3  [LoadStoreError: Processor internal physical address or data error during load or store]

0x4010110d: umm_malloc_core at umm_malloc.cpp:?
0x400195b9: ?? ??:0
0x4010139c: malloc at ??:?
0x4010160a: calloc at ??:?
0x40255806: ieee80211_regdomain_update at ??:?
0x40259e10: ieee80211_setup_ratetable at ??:?
0x40259db7: ieee80211_phy_init at ??:?
0x40258f0c: cnx_update_bss_more at ??:?
0x40258f35: cnx_update_bss_more at ??:?
0x40256994: scan_parse_beacon at ??:?
0x40100854: ets_post at ??:?
0x40252ced: ieee80211_parse_beacon at ??:?
0x4010639d: ets_timer_disarm at ??:?
0x4000050c: ?? ??:0
0x4025768f: sta_input at ??:?
0x402570fe: sta_input at ??:?
0x4026ae9f: pp_tx_idle_timeout at ??:?
0x4026a763: ppPeocessRxPktHdr at ??:?
0x402344da: loop_task(ETSEventTag*) at core_esp8266_main.cpp:?
0x40000f49: ?? ??:0
0x40000f49: ?? ??:0
dkerr64 commented 3 months ago

removed calls to getHeapStats in v1.7