Closed dkerr64 closed 3 weeks ago
I've definitely seen mine crash when my HomePod reboots. one time when I was determined to fix it, right after it happened, I went and started a packet capture and then rebooted the home pod over and over and failed to reproduce. It will only happen when you aren't looking for it. Something like a watched pot never boils.... "A watched ratgdo never crashes"
Heisenberg's Uncertainty Principle :-)
Adding stack decode...
Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]
0x4010110d: umm_malloc_core at umm_malloc.cpp:?
0x4000b2e1: ?? ??:0
0x40101394: malloc at ??:?
0x4024797d: netif_do_set_ipaddr at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/netif.c:475
0x40247f17: pbuf_copy_partial_pbuf at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c:1024
0x4024a973: tcp_write at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/tcp_out.c:718
0x40228541: ClientContext::_consume(unsigned int) at ??:?
0x40229715: client_send_encrypted_(_client_context_t*, unsigned char*, unsigned int) at ??:?
0x40229864: client_decrypt_(_client_context_t*, unsigned char*, unsigned int, unsigned char*, unsigned int*) at ??:?
0x40102864: pp_post at ??:?
0x4010014c: std::function<void (void const*)>::operator()(void const*) const at ??:?
0x40277e19: system_get_sdk_version at ??:?
0x402299e0: client_send_P(_client_context_t*, char const*) at ??:?
0x40229cb6: send_json_response(_client_context_t*, int, unsigned char*, unsigned int) at ??:?
0x40229c22: send_tlv_error_response(_client_context_t*, int, TLVError) at ??:?
0x4022bc64: homekit_server_close_client(homekit_server_t*, _client_context_t*) at ??:?
0x4022d5a4: arduino_homekit_setup at ??:?
0x40204c32: http_parser_execute at ??:?
Adding in crash reported by @donavanbecker ...
Crash information recovered from EEPROM
Crash # 1 at 540949935 ms
Restart reason: 2
Exception (3):
epc1=0x4010110d epc2=0x00000000 epc3=0x00000000 excvaddr=0x400180e9 depc=0x00000000
>>>stack>>>
ctx: cont
sp: 3fff1d20 end: 3fff2080
3fff1d20: 3fff77e4 0000016a 00000100 3fff1e30
3fff1d30: 3fff6d8c 00000000 00000020 40101394
3fff1d40: 3fff3ed4 000000ff 3fff718c 402296b4
3fff1d50: 401033ef 3ffeec80 00000005 40102864
3fff1d60: 00000000 00000000 00000000 401035cc
3fff1d70: 401033ef 3ffeebe0 3fff1da0 3fff1d90
3fff1d80: 0000014c 3fff77e4 3ffe8f10 40101012
3fff1d90: 40277d31 00000000 00000000 3fff1e30
3fff1da0: 3fff77e4 0000016b 3fff6d8c 402298f8
3fff1db0: 00000000 0000016b 3fff76fc 40229bce
3fff1dc0: 50545448 312e312f 30303220 0d4b4f20
3fff1dd0: 6e6f430a 746e6574 7079542d 61203a65
3fff1de0: 696c7070 69746163 702f6e6f 69726961
3fff1df0: 742b676e 0d38766c 6e6f430a 746e6574
3fff1e00: 6e654c2d 3a687467 0d642520 6e6f430a
3fff1e10: 7463656e 3a6e6f69 65656b20 6c612d70
3fff1e20: 0d657669 000a0d0a 3fff76fc 40229b3a
3fff1e30: 000000e4 ffffffff ffffffff ffffffff
3fff1e40: 3fff6d8c 3fff1e30 000000e4 00000068
3fff1e50: 3fff1e01 f43e8d6a 3fff6d8c 3fff7684
3fff1e60: 3fff5bf4 3fff1f64 3fff6d8c 4022bb7c
3fff1e70: 00000002 38334435 36324333 4632362d
3fff1e80: 34342d46 422d3338 2d453944 36423844
3fff1e90: 44363234 32364242 31f60300 bc7f12f2
3fff1ea0: 29a1594e 54967124 66fdbcbe f54ab707
3fff1eb0: 55406a4c afdc65c7 00000053 00000000
3fff1ec0: 00000000 00000000 00000000 00000000
3fff1ed0: 00000000 00000000 00000000 00000000
3fff1ee0: 00000000 00000000 00000000 00000000
3fff1ef0: 00000000 00000000 00000000 00000001
3fff1f00: f231f603 4ebc7f12 2429a159 be549671
3fff1f10: 0766fdbc 4cf54ab7 c755406a 53afdc65
3fff1f20: 0b1b88a1 adb94d0a 74d5cc83 f705c40f
3fff1f30: 8b90b790 be85524d 80541051 1a30788b
3fff1f40: 00000004 00000000 00000219 00000000
3fff1f50: 00000000 00000000 00000000 00000000
3fff1f60: 00000020 00000010 00000000 00000000
3fff1f70: 01f5d706 c8ee1a8c 00000000 00000000
3fff1f80: 3fff693a 00000006 00000020 00000000
3fff1f90: 3fff693f 3fff718c 3fff6d8c 4022d4bc
3fff1fa0: 3fff693f 0000008c 3fff71c4 40204c32
3fff1fb0: 3fff68b4 3fff0e74 00000000 00000000
3fff1fc0: 00000000 00000000 00000000 00000000
3fff1fd0: 3fff6940 00000001 3fff68b4 3fff7215
3fff1fe0: 00000021 00000030 3fff691f 00000000
3fff1ff0: 3fffdad0 0000009e 00000020 3fff0eac
3fff2000: 3fff68b4 3fff68b4 3fff6d8c 4022be35
3fff2010: 0000008c 00000000 3fff5410 4021d254
3fff2020: 0000008c 3fff0c58 3fff0c1c 3fff20dc
3fff2030: 3fffdad0 3fff2930 3fff20b0 3fff20dc
3fff2040: 3fffdad0 3fff441c 3fff718c 4022c199
3fff2050: 3fffdad0 00000000 3fff20b0 4021eadb
3fff2060: 00000000 00000000 00000001 40234168
3fff2070: feefeffe feefeffe 3fffdab0 401007ad
<<<stack<<<
EEPROM space available: 0x007b bytes
Flash CRC OK
Firmware Version: 1.6.0
TGDO: get target door state: 0
>>> [540590377] RATGDO: get light state: On
>>> [540628843] HomeKit: [Client 1073703340] Get Characteristics
>>> [540689374] HomeKit: [Client 1073703340] Get Characteristics
>>> [540749665] HomeKit: [Client 1073703340] Get Characteristics
>>> [540809787] HomeKit: [Client 1073703340] Get Characteristics
>>> [540839818] HomeKit: [Client 1073703340] Get Characteristics
>>> [540839842] RATGDO: get light state: On
>>> [540849669] HomeKit: [Client 1073703340] Get Characteristics
>>> [540849788] RATGDO: get current door state: 0
>>> [540850990] HomeKit: [Client 1073703340] Get Characteristics
>>> [540851000] RATGDO: get light state: On
>>> [540854537] HomeKit: [Client 1073703340] Get Characteristics
>>> [540854548] RATGDO: get current door state: 0
>>> [540870670] HomeKit: [Client 1073703340] Get Characteristics
>>> [540871933] RATGDO: reader completed packet
>>> [540871934] RATGDO: DECODED 0002388B 000000306E511006 42608181
>>> [540871935] RATGDO: PACKET(0x511006 @ 0x2388B) Status - Status: [DoorState Open, Parity 0x8, Obs 1, Lock 0, Light 1]
>>> [540871944] RATGDO: tgt 0 curr 0
>>> [540919461] HomeKit: [Client 1073703340] Get Characteristics
>>> [540919470] RATGDO: get light state: On
>>> [540929431] HomeKit: [Client 1073703340] Get Characteristics
>>> [540929442] RATGDO: get current door state: 0
>>> [540930586] HomeKit: [Client 1073703340] Get Characteristics
>>> [540930603] RATGDO: get light state: On
>>> [540930710] HomeKit: [Client 1073703340] Get Characteristics
>>> [540935424] HomeKit: [Client 1073703340] Get Characteristics
>>> [540935443] RATGDO: get current door state: 0
>>> [540936594] HomeKit: [Client 1073703340] Get Characteristics
>>> [540936611] RATGDO: get light state: On
>>> [540946437] HomeKit: [Client 1073703340] Get Characteristics
>>> [540946454] RATGDO: get current door state: 0
>>> [540947869] HomeKit: [Client 1073703340] Get Characteristics
>>> [540947925] RATGDO: get light state: On
>>> [540949657] HomeKit: [Client 1073703340] List Pairings
which decodes to...
Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]
0x4010110d: umm_malloc_core at umm_malloc.cpp:?
0x400180e9: ?? ??:0
0x40101394: malloc at ??:?
0x402296b4: client_send_encrypted_(_client_context_t*, unsigned char*, unsigned int) at ??:?
0x401033ef: rcReachRetryLimit at ??:?
0x40102864: pp_post at ??:?
0x401035cc: rcReachRetryLimit at ??:?
0x401033ef: rcReachRetryLimit at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x40277d31: system_get_sdk_version at ??:?
0x402298f8: client_send(_client_context_t*, unsigned char*, unsigned int) at ??:?
0x40229bce: send_tlv_response(_client_context_t*, tlv_values_t*) at ??:?
0x40229b3a: send_tlv_response(_client_context_t*, tlv_values_t*) at ??:?
0x4022bb7c: homekit_server_on_pairings(_client_context_t*, unsigned char const*, unsigned int) at ??:?
0x4022d4bc: homekit_server_on_message_complete(http_parser*) at ??:?
0x40204c32: http_parser_execute at ??:?
0x4022be35: homekit_client_process(_client_context_t*) at ??:?
0x4021d254: comms_loop() at ??:?
0x4022c199: homekit_server_process(homekit_server_t*) at ??:?
0x4021eadb: loop at ??:?
0x40234168: loop_wrapper() at core_esp8266_main.cpp:?
0x401007ad: cont_wrapper at ??:?
we had stormy weather today and I got a bunch of notifications about my HomePod coming and going offline and on again. ultimately a bunch of devices were no response including the ratgdo. had to power cycle the HomePod and everything came back. During that time the HomePod was coming and going, but ratgdo did crash, and its a very similar crash to some of the ones reported in discord
Crashdump: https://gist.github.com/jgstroud/081f5e0ae711776cd4e8e1ffc565a375
Recording another similar crash...
Messages logged immediately before last system crash...
Server started at: Fri, 27 Sep 2024 05:59:28 GMT
Server crashed at: Sun, 13 Oct 2024 22:50:17 GMT
Server upTime: 16 days 16 hrs 50 mins 48 secs
Crash information recovered from EEPROM
Crash # 1 at 1443047816 ms
Restart reason: 2
Exception (3):
epc1=0x4010110d epc2=0x00000000 epc3=0x00000000 excvaddr=0x40014381 depc=0x00000000
>>>stack>>>
ctx: cont
sp: 3fff24e0 end: 3fff2840
3fff24e0: 3fff87e4 0000016a 00000100 3fff25f0
3fff24f0: 3fff7a64 00000000 00000020 40101394
3fff2500: 3fff472c 000000ff 3fff7e64 4022b280
3fff2510: 0000014f 00409916 45a1cac0 4ab101c0
3fff2520: 00000000 00000000 00000000 000006ac
3fff2530: 3fff472c 00000032 3fff2560 3fff2550
3fff2540: 0000014c 3fff87e4 3ffe8fdc 40101012
3fff2550: 4027cad6 00000080 00000000 3fff25f0
3fff2560: 3fff87e4 0000016b 3fff7a64 4022b4c4
3fff2570: 00000000 0000016b 3fff86fc 4022b79a
3fff2580: 50545448 312e312f 30303220 0d4b4f20
3fff2590: 6e6f430a 746e6574 7079542d 61203a65
3fff25a0: 696c7070 69746163 702f6e6f 69726961
3fff25b0: 742b676e 0d38766c 6e6f430a 746e6574
3fff25c0: 6e654c2d 3a687467 0d642520 6e6f430a
3fff25d0: 7463656e 3a6e6f69 65656b20 6c612d70
3fff25e0: 0d657669 000a0d0a 3fff86fc 4022b706
3fff25f0: 000000e4 ffffffff ffffffff ffffffff
3fff2600: 3fff7a64 3fff25f0 000000e4 00000068
3fff2610: 3fff2601 cba492ee 3fff7a64 3fff7664
3fff2620: 3fff328c 3fff2724 3fff7a64 4022d74c
3fff2630: 00000002 42463832 42453641 3945342d
3fff2640: 39342d41 392d3243 2d343335 36313934
3fff2650: 42333133 43384333 26569500 d9127420
3fff2660: 02d167b9 7ce40d86 6195152c b43aeb9c
3fff2670: 86120c3f 642987e3 00000049 00000000
3fff2680: 00000000 00000000 00000000 00000000
3fff2690: 00000000 00000000 00000000 00000000
3fff26a0: 00000000 00000000 00000000 00000000
3fff26b0: 00000000 00000000 00000000 00000001
3fff26c0: 20265695 b9d91274 8602d167 2c7ce40d
3fff26d0: 9c619515 3fb43aeb e386120c 49642987
3fff26e0: 72d1929a f67f2e48 acf20995 51875654
3fff26f0: be632c1f 85e2cba6 9800f356 b668562c
3fff2700: 00000004 00000000 00003dbb 00000000
3fff2710: 00000000 00000000 00000000 00000000
3fff2720: 00000020 00000010 00000000 00000000
3fff2730: 5763224e 7c03a07c 00000000 00000000
3fff2740: 3fff7428 00000006 00000020 00000000
3fff2750: 3fff742d 3fff7e64 3fff7a64 4022f08c
3fff2760: 3fff742d 0000008a 3fff7e9c 40204d36
3fff2770: 3fff73a4 3fff1634 00000000 00000000
3fff2780: 00000000 00000000 00000000 00000000
3fff2790: 3fff742e 00000001 3fff73a4 3fff7eed
3fff27a0: 00000021 00000030 3fff740d 00000000
3fff27b0: 3fffdad0 0000009c 00000020 3fff166c
3fff27c0: 3fff73a4 3fff73a4 3fff7a64 4022da05
3fff27d0: 0000008a 00000000 3fff5c60 4021d4a0
3fff27e0: 0000008a 00000000 3fff1414 3fff289c
3fff27f0: 3fffdad0 3fff30f4 3fff2870 3fff289c
3fff2800: 3fffdad0 3fff4d3c 3fff7e64 4022dd69
3fff2810: 3fffdad0 00000000 3fff2870 4021f053
3fff2820: 00000000 00000000 00000001 40236040
3fff2830: feefeffe feefeffe 3fffdab0 401007ad
<<<stack<<<
EEPROM space available: 0x007b bytes
Flash CRC OK
Server time (secs): 1728859817
Server uptime (ms): 1443048930
Firmware version: 1.7.1
Flash CRC: 0xC3762338
Flash length: 625088
Free heap: 16928
Minimum heap: 12144
9116] Get Characteristics
[2024-10-13T22:43:55.172Z] RATGDO: get current door state: 1
[2024-10-13T22:43:55.172Z] RATGDO: get target door state: 1
[2024-10-13T22:43:55.173Z] RATGDO: get obstruction: 0
[2024-10-13T22:43:55.174Z] RATGDO: get light state: Off
[2024-10-13T22:44:01.453Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:44:20.370Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:44:20.550Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:45:07.144Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:46:07.171Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:46:42.258Z] RATGDO: reader completed packet
[2024-10-13T22:46:42.259Z] RATGDO: DECODED 0001B305 000000C098575006 40609281
[2024-10-13T22:46:42.260Z] RATGDO: PACKET(0x575006 @ 0x1B305) Status - Status: [DoorState Closed, Parity 0x9, Obs 1, Lock 0, Light 0]
[2024-10-13T22:46:42.269Z] RATGDO: tgt 1 curr 1
[2024-10-13T22:47:07.944Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:47:53.627Z] RATGDO: reader completed packet
[2024-10-13T22:47:53.627Z] RATGDO: DECODED 0001B306 000000C098575006 0006209D
[2024-10-13T22:47:53.628Z] RATGDO: PACKET(0x575006 @ 0x1B306) UNKNOWN - Unknown: [09D]
[2024-10-13T22:47:53.633Z] RATGDO: Support for UNKNOWN packet unimplemented. Ignoring.
[2024-10-13T22:48:09.710Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:49:11.124Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:49:29.548Z] HomeKit: [Client 1073689116] Update Characteristics
[2024-10-13T22:49:29.565Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:49:29.581Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:49:36.633Z] RATGDO: reader completed packet
[2024-10-13T22:49:36.634Z] RATGDO: DECODED 0001B307 000000C098575006 0008C09D
[2024-10-13T22:49:36.635Z] RATGDO: PACKET(0x575006 @ 0x1B307) UNKNOWN - Unknown: [09D]
[2024-10-13T22:49:36.640Z] RATGDO: Support for UNKNOWN packet unimplemented. Ignoring.
[2024-10-13T22:50:13.750Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:50:14.420Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:50:14.447Z] HomeKit: [Client 1073689116] Get Characteristics
[2024-10-13T22:50:15.607Z] HomeKit: [Client 1073689116] List Pairings
Decodes to...
106-elf-addr2line ~/homekit-ratgdo-v1.7.1.elf ~/crash-1013.log
Exception Cause: 3 [LoadStoreError: Processor internal physical address or data error during load or store]
0x4010110d: umm_malloc_core at umm_malloc.cpp:?
0x40014381: ?? ??:0
0x40101394: malloc at ??:?
0x4022b280: client_send_encrypted_(_client_context_t*, unsigned char*, unsigned int) at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x4027cad6: system_get_sdk_version at ??:?
0x4022b4c4: client_send(_client_context_t*, unsigned char*, unsigned int) at ??:?
0x4022b79a: send_tlv_response(_client_context_t*, tlv_values_t*) at ??:?
0x4022b706: send_tlv_response(_client_context_t*, tlv_values_t*) at ??:?
0x4022d74c: homekit_server_on_pairings(_client_context_t*, unsigned char const*, unsigned int) at ??:?
0x4022f08c: homekit_server_on_message_complete(http_parser*) at ??:?
0x40204d36: http_parser_execute at ??:?
0x4022da05: homekit_client_process(_client_context_t*) at ??:?
0x4021d4a0: comms_loop() at ??:?
0x4022dd69: homekit_server_process(homekit_server_t*) at ??:?
0x4021f053: loop at ??:?
0x40236040: loop_wrapper() at core_esp8266_main.cpp:?
0x401007ad: cont_wrapper at ??:?
Consistent theme here is that last log message is "List Pairings" and that the crash appears to trace back to client_send_encrypted_()
I have found in that code a malloc() for > 1KB of memory, which is where I suspect we crash... it could fail if the heap has become fragmented?
As we are not multi-threaded, I'm wondering if we should move that out and allocate a block of memory at initialization.
Possibly fixed in PR #242 by allocating the encrypted data buffer at initialization and starting the HomeKit server with IRAM heap.
Believe fixed in release 1.8.0
Both my ratgdo's crashed at the same time after 17 days running. One of them subsequently crashed a second time 3 mins 28 seconds after its first crash, overwriting the crash log from the first crash. But I believe the first (simultaneous) crashes had the same cause.
The easy one is the 2nd crash... it the MDNSResponder crash we're familiar with.
The first crash is harder. What was I doing at the time? Well I just returned from two weeks away and found that the UPS my AppleTV was connected to had entered some sort of error state. As part of recovering from that I unplugged the AppleTV, and ~5 minutes later plugged it back in. As far as I can tell this triggered some activity on both ratgdo's that caused a crash.
Captured log is...