ratgdo / homekit-ratgdo

A native HomeKit implementation of a Security+ 2.0 garage door controller based on ratgdo hardware
https://ratgdo.github.io/homekit-ratgdo/
GNU General Public License v3.0
213 stars 20 forks source link

Multiple WiFi creation errors lead to MDNSResponder crash and memory leak #223

Closed nyjklein closed 1 week ago

nyjklein commented 2 months ago

crashlog.txt crashlog3.txt crashlog4.txt I get seemingly random infrequent (every day or two) that look like the attached. From what I can tell, something happens to the WiFi connection and in the process of trying to recover RATGDO runs out of heap space and crashes. I'm at a loss as to what to try. My network is a non-mesh dedicated 2.4GHz network with its own SSID and other than these crashes seem stable in every other way.

dkerr64 commented 2 months ago

Thank you for reporting this. Decoding the stack traces...

Exception Cause: 0  [Illegal instruction]

0x402342d8: operator new(unsigned int) at ??:?
0x40239f4a: esp8266::MDNSImplementation::MDNSResponder::_readRRAnswer(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRAnswer*&) at ??:?
0x402373d9: esp8266::MDNSImplementation::MDNSResponder::_parseQuery(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_MsgHeader const&) at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x40100854: ets_post at ??:?
0x40259e7b: ieee80211_setup_ratetable at ??:?
0x4010286c: pp_post at ??:?
0x40105b03: lmacRxDone at ??:?
0x40258f0c: cnx_update_bss_more at ??:?
0x401033f7: rcReachRetryLimit at ??:?
0x401035d4: rcReachRetryLimit at ??:?
0x40103a8e: wDev_ProcessFiq at ??:?
0x40239a65: esp8266::MDNSImplementation::MDNSResponder::_udpReadBuffer(unsigned char*, unsigned int) at ??:?
0x40239a65: esp8266::MDNSImplementation::MDNSResponder::_udpReadBuffer(unsigned char*, unsigned int) at ??:?
0x40239d78: esp8266::MDNSImplementation::MDNSResponder::_udpRead16(unsigned short&) at ??:?
0x4023a0e1: esp8266::MDNSImplementation::MDNSResponder::_readMDNSMsgHeader(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_MsgHeader&) at ??:?
0x40237745: esp8266::MDNSImplementation::MDNSResponder::_parseMessage() at ??:?
0x402483c6: pbuf_free_LWIP2 at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c:786
 (inlined by) pbuf_free_LWIP2 at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c:725
0x40237d5a: esp8266::MDNSImplementation::MDNSResponder::_process(bool) at ??:?
Exception Cause: 0  [Illegal instruction]

0x40233254: String::changeBuffer(unsigned int) at ??:?
0x40238f8a: esp8266::MDNSImplementation::MDNSResponder::stcMDNSService::releaseService() at ??:?
0x40236419: esp8266::MDNSImplementation::MDNSResponder::addService(char const*, char const*, char const*, unsigned short) at ??:?
0x40103888: wDev_ProcessFiq at ??:?
0x40102b20: RC_GetBlockAckTime at ??:?
0x40103d42: wDev_AppendRxBlocks at ??:?
0x40101192: umm_malloc_core at umm_malloc.cpp:?
0x40101874: pm_rtc_clock_cali at ??:?
0x40100878: esp_schedule at ??:?
0x40251338: hostap_input at ??:?
0x40102b20: RC_GetBlockAckTime at ??:?
0x40105db7: wdt_feed at ??:?
0x401036ab: rcReachRetryLimit at ??:?
0x40103888: wDev_ProcessFiq at ??:?
0x4010664d: os_printf_plus at ??:?
0x40103d42: wDev_AppendRxBlocks at ??:?
0x40238aa5: esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRDomain::clear() at ??:?
0x40238aa5: esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRDomain::clear() at ??:?
0x40238db8: esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRAnswerTXT::stcMDNS_RRAnswerTXT(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRHeader const&, unsigned int) at ??:?
0x40239121: esp8266::MDNSImplementation::MDNSResponder::stcMDNSServiceQuery::stcAnswer::releaseTxts() at ??:?
0x40236785: _ZNK7esp826618MDNSImplementation13MDNSResponder20_replyMaskForServiceERKNS1_16stcMDNS_RRHeaderERKNS1_14stcMDNSServiceEPb$part$0 at LEAmDNS_Control.cpp:?
0x40247456: esp2glue_netif_set_up1down0 at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c:492
0x40236d9a: esp8266::MDNSImplementation::MDNSResponder::_processSRVAnswer(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRAnswerSRV const*, bool&) at ??:?
Exception Cause: 0  [Illegal instruction]

0x402342d8: operator new(unsigned int) at ??:?
0x40239f4a: esp8266::MDNSImplementation::MDNSResponder::_readRRAnswer(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRAnswer*&) at ??:?
0x402373d9: esp8266::MDNSImplementation::MDNSResponder::_parseQuery(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_MsgHeader const&) at ??:?
0x4010286c: pp_post at ??:?
0x40105b03: lmacRxDone at ??:?
0x40100854: ets_post at ??:?
0x401033f7: rcReachRetryLimit at ??:?
0x4010286c: pp_post at ??:?
0x401035d4: rcReachRetryLimit at ??:?
0x40101012: umm_free_core at umm_malloc.cpp:?
0x40100854: ets_post at ??:?
0x40259e7b: ieee80211_setup_ratetable at ??:?
0x4010286c: pp_post at ??:?
0x40105b03: lmacRxDone at ??:?
0x40258f0c: cnx_update_bss_more at ??:?
0x401033f7: rcReachRetryLimit at ??:?
0x401035d4: rcReachRetryLimit at ??:?
0x40103a8e: wDev_ProcessFiq at ??:?
0x40239a65: esp8266::MDNSImplementation::MDNSResponder::_udpReadBuffer(unsigned char*, unsigned int) at ??:?
0x40239a65: esp8266::MDNSImplementation::MDNSResponder::_udpReadBuffer(unsigned char*, unsigned int) at ??:?
0x40239d78: esp8266::MDNSImplementation::MDNSResponder::_udpRead16(unsigned short&) at ??:?
0x4023a0e1: esp8266::MDNSImplementation::MDNSResponder::_readMDNSMsgHeader(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_MsgHeader&) at ??:?
0x40237745: esp8266::MDNSImplementation::MDNSResponder::_parseMessage() at ??:?
0x402483c6: pbuf_free_LWIP2 at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c:786
 (inlined by) pbuf_free_LWIP2 at /Users/jstroud/git/Arduino/tools/sdk/lwip2/builder/lwip2-src/src/core/pbuf.c:725
0x40237d5a: esp8266::MDNSImplementation::MDNSResponder::_process(bool) at ??:?
dkerr64 commented 2 months ago

All the crashes are inside the MDNSResponder code. This is a fairly common crash which we have not got to the bottom of. Usually it is infrequent caused by a lot of mDNS activity on the network. #211 and #215 look to be similar.

In your case it looks specific to problems in connecting to WiFi (sample from your last log below) so we need to look at how to make the code more tolerant of failed WiFi connection. And there looks to be a memory leak in the WiFi code when it fails to connect.

Flash CRC OK
Firmware Version: 1.6.1
Flash CRC: 0xFD684C66
Flash Length: 605440
Free heap: 2696
Max malloc size: 536
Fragmentation pct: 72

on error, IP address is not set
>>> [124727615] HomeKit: wifiClient creation error, IP address is not set
>>> [124799561] HomeKit: wifiClient creation error, IP address is not set
>>> [124871656] HomeKit: wifiClient creation error, IP address is not set
>>> [124904879] RATGDO: Minimum free heap dropped to 3552
>>> [124943702] HomeKit: wifiClient creation error, IP address is not set
>>> [125015622] HomeKit: wifiClient creation error, IP address is not set
>>> [125081968] HomeKit: wifiClient creation error, IP address is not set
>>> [125154064] HomeKit: wifiClient creation error, IP address is not set
>>> [125165137] RATGDO: Minimum free heap dropped to 3016
>>> [125165137] RATGDO: Maximum malloc block size dropped to 536 (72% fragmented)
>>> [125220434] HomeKit: wifiClient creation error, IP address is not set
>>> [125281380] HomeKit: wifiClient creation error, IP address is not set
>>> [125342301] RATGDO: reader completed packet
>>> [125342302] RATGDO: DECODED  00000000 0000000000000000 00000000
>>> [125342302] RATGDO: PACKET(0x0 @ 0x0) UNKNOWN - Unknown: [000]
>>> [125342307] RATGDO: Support for UNKNOWN packet unimplemented. Ignoring.
>>> [125342314] HomeKit: wifiClient creation error, IP address is not set
>>> [125408722] HomeKit: wifiClient creation error, IP address is not set
>>> [125475168] HomeKit: wifiClient creation error, IP address is not set
>>> [125541563] HomeKit: wifiClient creation error, IP address is not set
>>> [125608009] HomeKit: wifiClient creation error, IP address is not set
>>> [125674505] HomeKit: wifiClient creation error, IP address is not set
>>> [125746575] HomeKit: wifiClient creation error, IP address is not set
>>> [125763174] RATGDO: Minimum free heap dropped to 2704
>>> [125763174] RATGDO: Maximum malloc block size dropped to 280 (77% fragmented)
>>> [125824045] HomeKit: wifiClient creation error, IP address is not set
>>> [125901515] HomeKit: wifiClient creation error, IP address is not set
>>> [125973411] HomeKit: wifiClient creation error, IP address is not set
nyjklein commented 2 months ago

Not sure if this is much help, but here's my router log from the time of the last crash.

2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C hostapd: eth10: STA c8:c9:a3:18:fd:6b IEEE 802.11: disassociated
2024-08-10 02:32:55 User.Notice 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(645): eth10: Deauth_ind C8:C9:A3:18:FD:6B, status: 0, reason: Deauthenticated because sending station is leaving (or has left) IBSS or ESS (3), rssi:-68
2024-08-10 02:32:55 User.Notice 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(662): eth10: Disassoc C8:C9:A3:18:FD:6B, status: 0, reason: Disassociated because sending station is leaving (or has left) BSS (8), rssi:-68
2024-08-10 02:32:55 User.Notice 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(685): eth10: Auth C8:C9:A3:18:FD:6B, status: Successful (0), rssi:-68
2024-08-10 02:32:55 User.Notice 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(722): eth10: Assoc C8:C9:A3:18:FD:6B, status: Successful (0), rssi:-68
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C hostapd: eth10: STA c8:c9:a3:18:fd:6b IEEE 802.11: disassociated
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C hostapd: eth10: STA c8:c9:a3:18:fd:6b IEEE 802.11: associated
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C hostapd: eth10: STA c8:c9:a3:18:fd:6b RADIUS: starting accounting session DF65575E61C70DDE
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C hostapd: eth10: STA c8:c9:a3:18:fd:6b WPA: pairwise key handshake completed (RSN)
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPDISCOVER(br0) c8:c9:a3:18:fd:6b 
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPOFFER(br0) 192.168.23.224 c8:c9:a3:18:fd:6b 
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPREQUEST(br0) 192.168.23.224 c8:c9:a3:18:fd:6b 
2024-08-10 02:32:55 Daemon.Info 192.168.23.1    Aug 10 02:32:55 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPACK(br0) 192.168.23.224 c8:c9:a3:18:fd:6b ESP-18FD6B
2024-08-10 02:32:58 User.Notice 192.168.23.1    Aug 10 02:32:58 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(685): eth9: Auth 7E:FB:A1:0D:35:15, status: Successful (0), rssi:-81
2024-08-10 02:32:58 Daemon.Info 192.168.23.1    Aug 10 02:32:58 GT-AX1000-9CAAFA6-C hostapd: eth9: STA 7e:fb:a1:0d:35:15 IEEE 802.11: associated (aid 3)
2024-08-10 02:32:58 User.Notice 192.168.23.1    Aug 10 02:32:58 GT-AX1000-9CAAFA6-C wlceventd: wlceventd_proc_event(695): eth9: ReAssoc 7E:FB:A1:0D:35:15, status: Successful (0), rssi:-81
2024-08-10 02:32:58 Daemon.Info 192.168.23.1    Aug 10 02:32:58 GT-AX1000-9CAAFA6-C hostapd: eth9: STA 7e:fb:a1:0d:35:15 RADIUS: starting accounting session C808CACEC32081D9
2024-08-10 02:32:58 Daemon.Info 192.168.23.1    Aug 10 02:32:58 GT-AX1000-9CAAFA6-C hostapd: eth9: STA 7e:fb:a1:0d:35:15 WPA: pairwise key handshake completed (RSN)
2024-08-10 02:32:59 Daemon.Info 192.168.23.1    Aug 10 02:32:59 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPREQUEST(br0) 192.168.23.224 c8:c9:a3:18:fd:6b 
2024-08-10 02:32:59 Daemon.Info 192.168.23.1    Aug 10 02:32:59 GT-AX1000-9CAAFA6-C dnsmasq-dhcp[30588]: DHCPACK(br0) 192.168.23.224 c8:c9:a3:18:fd:6b 
nyjklein commented 2 months ago

Just dropping this one here because it's slightly different than the rest with exception 29 and restart 2: crashlog6.txt

dkerr64 commented 2 months ago

@nyjklein thank you for the new log. The stack dump decodes to...

Exception Cause: 29  [StoreProhibited: A store referenced a page mapped with an attribute that does not permit stores]

0x4000df64: ?? ??:0
0x40259f43: ieee80211_setup_ratetable at ??:?
0x40259ed8: ieee80211_setup_ratetable at ??:?
0x40259e7f: ieee80211_phy_init at ??:?
0x40258fd4: cnx_update_bss_more at ??:?
0x40258ffd: cnx_update_bss_more at ??:?
0x40256a5c: scan_parse_beacon at ??:?
0x40252db5: ieee80211_parse_beacon at ??:?
0x40257757: sta_input at ??:?
0x40000f58: ?? ??:0
0x402571c6: sta_input at ??:?
0x4026abbf: pp_tx_idle_timeout at ??:?
0x4026a483: ppPeocessRxPktHdr at ??:?
0x4026e9f7: ets_snprintf at ??:?
0x40000f49: ?? ??:0
0x40000f49: ?? ??:0

But I think the root cause is the same. Multiple WiFi creation errors leading to out-of-memory. Is there anything going on with your network that may cause these errors?

>>> [89775503] HomeKit: wifiClient creation error, IP address is not set
>>> [89775504] RATGDO: Minimum free heap dropped to 6488
>>> [89947367] HomeKit: wifiClient creation error, IP address is not set
>>> [89947368] RATGDO: Minimum free heap dropped to 6040
>>> [90119006] HomeKit: wifiClient creation error, IP address is not set
>>> [90119007] RATGDO: Minimum free heap dropped to 5240
>>> [90290545] HomeKit: wifiClient creation error, IP address is not set
>>> [90290546] RATGDO: Minimum free heap dropped to 4920
>>> [90462059] HomeKit: wifiClient creation error, IP address is not set
>>> [90462060] RATGDO: Minimum free heap dropped to 3408
>>> [90633600] HomeKit: wifiClient creation error, IP address is not set
nyjklein commented 2 months ago

Nothing I'm aware of and this one device seems to be the only one that has issues. Everything else is extremally stable. That's why it's so frustrating with the crashes happening so randomly even in the middle of the night with clearly no other activity.

nyjklein commented 2 months ago

Let me ask another question here. Here's what my heap and stack look like after running for about 25 hours: freeHeap: 18664 | minHeap: 14904 | minStack 1184

In all of the crashes I've seen, the heap is always well on its way in its downward spiral (well below 6000). So, it appears to me whatever event precepted the crash began before any messages captured in the crashlog. Any way to preserve a longer history of these messages?

jgstroud commented 2 months ago

You can capture the kids over the network, but that might not be helpful in your case since your network connection died. We are limited by ram and flash wear, so it's not easy to save a longer history on the device. Your best option is to capture the logs using a usb cable which may be hard to do depending on your setup.

dkerr64 commented 2 months ago

@nyjklein I could do a build that increases the space that holds the log messages... This would be a special build for you that enables the IRAM heap... which gives us a much larger RAM area for log messages. I could do 10KB which is 5x normal.

Some folks have h/w which did not work well with IRAM heap enabled (hence why not enabled by default). But assuming you don't run into that this would give you a version to test that hopefully has large enough message buffer

nyjklein commented 2 months ago

@dkerr64 Thanks. Before we try that I disconnected the device, erased it and re-flashed it. I positioned it slightly differently (away from the LED bulb in the operner, just in case). Let's let it run a while and see if that makes any difference.

nyjklein commented 2 months ago

So I've now been running for over three days without interruption. I'm not sure if it was the erase and reload or relocating the device about a foot away from the DG opener and light (or a combo). But it seems MUCH better now.

Ratgdo 2024-08-31 at 12 15 10 PM

dkerr64 commented 2 months ago

Thanks for the report. Do let us know how it goes.

dkerr64 commented 2 weeks ago

Possible fix in PR #242 as we use IRAM heap to make more memory available.

dkerr64 commented 1 week ago

Believe to be fixed in release 1.8.0