Closed dkerr64 closed 8 months ago
Thanks for this. I think it's definitely needed. I had started doing the same thing a while back and just never merged any of it. Have you looked at all at updating to the latest wolfssl library?
No, I have not thought of doing any more updates, but if you know of some worthwhile then I think we should consider it.
What I did is fork from the original, then merge in the changes from the previous fork we were using (@mrthiti), then merge in from @JKoss2 (had to resolve some conflicts) and then added my test for IP address not being set... and bumped version to 1.5. I preserved all commit history "just-in-case" which has made it look somewhat ugly. I chose to merge in from @JKoss2 because all I did was sort all the forks (260+) by recent activity, to look at what others were doing. I felt like @Jkoss2 has the most complete.
@dkerr64, how is this testing out so far?
@dkerr64, how is this testing out so far?
I am at 2 days 7 hours so a little early to know.
I have two ratgdo's running with this version of firmware. One crashed and rebooted itself at ~5 days 16 hours, the other at ~4 days 6 hours. Unfortunately I was not able to capture the log... when I got back to VScode to grab it, VScode demanded that I reload and that lost the history. But I strongly suspect the same as what I posted in #94, which looks like stack overflow.
Having said that... 4+ days is the longest it has run for me, and pairing was not lost, HomeKit still worked.
I'm going to have another go at capturing logs (which will take another 4-5 days).
@dkerr64 I just realized one of my devices rebooted sometime today. I just put a serial logger on it in hopes of catching something.
In the mean time I'm trying to break it by hammering it with network traffic. So far it's holding up surprisingly well.
I’m installed the V0.12 firmware and set my RATGDO to reboot every 48 hours. It has been working perfectly ever since. I have a full Ubiquiti network and the RATGDO is assigned to a single AP.
Caught the reboot. If the stack is really 4K, then it didn't overflow here as this is just over 2k. But I think the pointers being passed to write() in the client_sendencrypted() function have probably been corrupted based on the unaligned address error.
--------------- CUT HERE FOR EXCEPTION DECODER ---------------
Exception (9): epc1=0x40228d83 epc2=0x00000000 epc3=0x00000000 excvaddr=0x624f220a depc=0x00000000
LoadStoreAlignmentCause: Load or store to an unaligned address epc1=0x40228d83 in write(_client_context_t, unsigned char, int) at ??:?
stack>>>
ctx: cont
sp: 3ffff7b0 end: 3fffffd0 offset: 0150
3ffff900: 3ffffd64 00000002 00000000 00004bc6
3ffff910: 00000000 00000001 3fff55b4 00000000
3ffff920: 3fff55b4 3fff59b4 00000206 40228f54
3ffff930: 3ffff942 3ffffb48 47293ce6 00000002
3ffff940: 8fb20206 d05283e7 90b4622f 2f1645eb
3ffff950: 2cd2caa1 34d795a4 0e8a747f bcceef21
3ffff960: bc3763a6 76ff1b41 c3a2978b c5f9d72b
3ffff970: fb35bbd5 c6882e4e 2a979a79 185dabc9
3ffff980: 60793010 9b85e22d d4ae9457 a534b950
3ffff990: 3f15bc55 da8ab414 1fb4caaa 91b8daf5
3ffff9a0: 99d46429 3aa57a11 a3af219b 08da5fda
3ffff9b0: c8909efb 2bffb2d2 127ca575 62a83476
3ffff9c0: 6ca78103 3adaf509 c4341275 2c166c43
3ffff9d0: 0b128b4c 80273654 8280d29b 0acc86ff
3ffff9e0: 7237b9b7 88a211e7 4e848a90 8785b4f4
3ffff9f0: 4fb2ec3c d3cb680a 0fea1b56 a1205af9
3ffffa00: 64064cc7 944f8333 7e0e2475 be430c41
3ffffa10: e73cb6dd a7777c67 0a697e9e 2393b565
3ffffa20: 2847552b 6bd16113 2d550cab 3b4d4dc6
3ffffa30: d3418ad0 b548d68a 91cd4711 2e0376c9
3ffffa40: c76374a1 7633ae59 541d7afc 9ee4432e
3ffffa50: f56ea4c8 702f11ae d04058d2 93282341
3ffffa60: 76682f80 844d5803 61749494 9310fcc3
3ffffa70: e17beff4 ef992dcc e3a88c42 291df931
3ffffa80: fe4f9a2f 32427f9b 25e4c3da 11e302e5
3ffffa90: 12317568 634180c9 09cec86c 1d874e06
3ffffaa0: e446a915 2457a862 c7be86b7 1b565456
3ffffab0: 6f744961 0a189226 ce7af10e 5629a06c
3ffffac0: 69495201 97f6e092 d2b7e735 6bfe0f7d
3ffffad0: 40b74ebb c01c5baa 5f23001e 659cdee4
3ffffae0: 494ce0ad 03d29936 a6d92944 7479bdeb
3ffffaf0: c3ceb001 45427734 4fd30ff0 08d5e887
3ffffb00: 4d329d0c 7a37f963 3659f7a1 10d0ed74
3ffffb10: 34ab8612 07ae15d3 9db17b38 06422efa
3ffffb20: b8ee624d 6a50ea30 b18308e3 f6708196
3ffffb30: 0ad2f3ef 9ba9dd24 ef955016 48fa73fe
3ffffb40: 39034f4e 72ffca7e f4ec9171 7e13aeec
3ffffb50: 72df0cb5 618a9093 00000010 00000000
3ffffb60: 00000001 00000000 0000000a 3ffe8368
3ffffb70: 00000000 80000000 00000000 0000001b
3ffffb80: 00000800 00000001 00000800 3ffe8368
3ffffb90: 00000000 00000000 3ffffd00 40241e34
3ffffba0: 3ffffc90 3ffffc93 00000000 00000000
3ffffbb0: 00000000 00000002 4023d1cd 3ffe8368
3ffffbc0: 00000003 00000000 3ffffd00 40241e34
3ffffbd0: 40241d70 3ffffc90 00000003 00000000
3ffffbe0: 00000001 3ffffcd3 00000000 3ffffd00
3ffffbf0: 3ffe9fab 00000000 3ffffc50 4023d2fc
3ffffc00: 00000003 3ffe8368 3ffffd40 3ffe8368
3ffffc10: 00000002 3ffe8368 3ffffd00 40241e34
3ffffc20: 00000000 3ffe9fab 00000002 00000000
3ffffc30: 3ffe9e84 3ffe8368 3ffffc90 3ffe9fad
3ffffc40: 3ffe9fab 3ffe8368 3ffffd00 40242031
3ffffc50: 00000000 ffffffff ffffffff 00000000
3ffffc60: 00000003 00000005 00302078 40242031
3ffffc70: 40241d70 3ffffd32 00000001 3ffea624
3ffffc80: 3ffea623 3ffe8368 3ffffd40 40242031
3ffffc90: 00666631 00007803 00000000 00000000
3ffffca0: 00000003 00000005 00000000 3ffe8368
3ffffcb0: 3ffffd90 3ffffd80 00000010 00000001
3ffffcc0: 3ffffd90 3ffffd80 0000000c 00000003
3ffffcd0: 00000002 3ffffde0 00000008 3ffea62a
3ffffce0: 3ffffe10 3ffffde0 00000000 3fff5d94
3ffffcf0: 000001ff 3ffe8368 00000207 4023da15
3ffffd00: 3fff55b9 3ffffe00 00000201 ffff0208
3ffffd10: 3fff55b4 00000206 00000003 4023f2cd
3ffffd20: 3fff5f92 3ffffe30 00000001 3ffffe30
3ffffd30: 3ffffe00 3ffe8368 00000001 4023f2cd
3ffffd40: 3fff5f93 3ffffe50 00000000 ffff0208
3ffffd50: 00005f93 00040000 00000000 00000000
3ffffd60: 00000216 3fff0206 3ffffd90 3ffffd80
3ffffd70: 00000206 3fff55b4 3ffffd40 3ffffd40
3ffffd80: 3ffffe50 3ffe8368 0000000c 3fff5d94
3ffffd90: 000001ff 00000005 3fff55b4 40228f98
3ffffda0: 3ffffe10 00000005 3fff55b4 4022a206
3ffffdb0: 3fff55b4 00000008 3fff51a4 4023f309
3ffffdc0: 3ffffe30 3ffffe00 00000008 3ffffe30
3ffffdd0: 3ffffe00 00000008 3fff51a4 4021bbc6
3ffffde0: 3fff55b4 00000008 3fff51a4 402054ee
3ffffdf0: 3ffffe30 3ffffe00 00000008 ffff0208
3ffffe00: 3ffffe50 00000008 3ffe9e84 00000001
3ffffe10: 3fff5f93 00000001 3fff51a4 0000001e
3ffffe20: 3ffea620 3ffe8700 3fff51a4 402057da
3ffffe30: 3ffffe50 3fff51a4 3ffffe50 0000001e
3ffffe40: 00000000 3ffe8700 3fff51a4 40205a04
3ffffe50: 3ffe9e84 3fff51a4 3ffffe70 00000011
3ffffe60: 3fff5f88 00000001 3fff51a4 40227ca5
3ffffe70: 3fff5f8a 00000001 3fff51a4 402057da
3ffffe80: 3ffea625 3fff51a4 3ffffea0 3ffe8634
3ffffe90: 3fff55b4 3ffe86d8 3fff51a4 3ffe8634
3ffffea0: 3fff55b4 3ffe86f0 3fff51a4 3ffe8634
3ffffeb0: 3fff55b4 3ffe86f0 3fff51a4 4022a5ef
3ffffec0: 3ffe8614 3ffe8700 0a5e6ab9 d44269f4
3ffffed0: 3fff55c4 00000002 00000000 00000000
3ffffee0: 3fff5008 3fff59b4 3fff55b4 4022bf79
3ffffef0: 3fff4fcc 0000004d 3fff59ec 40204b00
3fffff00: 3fff4fbc 3fff16f4 00000000 00000000
3fffff10: 00000000 00000000 00000000 00000000
3fffff20: 3fff5009 00000001 3fff5a3d 0000004d
3fffff30: 0000002b 00000030 3fff4fde 00000000
3fffff40: 3fffdad0 0000005f 00000020 3fff172c
3fffff50: 3fff4fbc 3fff4fbc 3fff55b4 40228d21
3fffff60: 0000004d 00000000 3fff4600 4021da60
3fffff70: 0000004d 00000000 00001388 3fff1920
3fffff80: 3fffdad0 3fff21ac 3fff18f4 3fff1920
3fffff90: 3fffdad0 3fff429c 3fff59b4 4022ac6d
3fffffa0: 3fffdad0 00000000 3fff18f4 4021e89f
3fffffb0: 00000000 00000000 00000001 402326d4
3fffffc0: feefeffe feefeffe 3fffdab0 401007d5
<<<stack<<<
0x40228f54 in client_sendencrypted(_client_context_t, unsigned char, unsigned int) at ??:? 0x40241e34 in ssputs_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:232 0x4023d1cd in _printf_i at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c:196 (discriminator 1) 0x40241e34 in ssputs_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:232 0x40241d70 in ssputs_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:179 0x4023d2fc in _printf_i at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf_i.c:246 0x40241e34 in ssputs_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:232 0x40242031 in _svfprintf_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:528 0x40242031 in _svfprintf_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:528 0x40241d70 in __ssputs_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:179 0x40242031 in _svfprintf_r at /workdir/repo/newlib/newlib/libc/stdio/nano-vfprintf.c:528 0x4023da15 in snprintf at /workdir/repo/newlib/newlib/libc/stdio/snprintf.c:87 (discriminator 4) 0x4023f2cd in _vsnprintf_r at /workdir/repo/newlib/newlib/libc/stdio/vsnprintf.c:71 (discriminator 4) 0x4023f2cd in _vsnprintf_r at /workdir/repo/newlib/newlib/libc/stdio/vsnprintf.c:71 (discriminator 4) 0x40228f98 in client_send(_client_context_t, unsigned char, unsigned int) at ??:? 0x4022a206 in client_send_chunk(unsigned char, unsigned int, void) at ??:? 0x4023f309 in vsnprintf at /workdir/repo/newlib/newlib/libc/stdio/vsnprintf.c:41 0x4021bbc6 in json_flush at ??:? 0x402054ee in json_write at ??:? 0x402057da in _json_number at ??:? 0x40205a04 in json_string at ??:? 0x40227ca5 in write_characteristic_json(json_stream, _client_context_t, _homekit_characteristic const, characteristic_format_t, homekit_value_t const) at ??:? 0x402057da in _json_number at ??:? 0x4022a5ef in homekit_server_on_get_accessories(_client_context_t) at ??:? 0x4022bf79 in homekit_server_on_message_complete(http_parser) at ??:? 0x40204b00 in http_parser_execute at ??:? 0x40228d21 in homekit_client_process(_client_context_t) at ??:? 0x4021da60 in comms_loop() at ??:? 0x4022ac6d in homekit_server_process(homekit_server_t) at ??:? 0x4021e89f in loop at ??:? 0x402326d4 in loop_wrapper() at core_esp8266_main.cpp:? 0x401007d5 in cont_wrapper at ??:?
--------------- CUT HERE FOR EXCEPTION DECODER ---------------
I don't know how Arduino sets the stack, but if this define is setting it to 2048, then I've overflowed
https://github.com/dkerr64/Arduino-HomeKit-ESP8266/blob/master/src/port.h#L45
I've captured a few more errors. Both look like this. It always looks like corrupted memory and the stack pointer is always the same value when it craters.
Exception (28): epc1=0x4023a40e epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000009 depc=0x00000000
LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads epc1=0x4023a40e in littlefs_impl::LittleFSImpl::rmdir(char const*) at ??:?
stack>>>
ctx: cont sp: 3ffff7b0 end: 3fffffd0 offset: 0150
if the stack pointer is 3ffff7b0 and the top of the stack is 3fffffd0 (probably actually 3fffffff) then some quick hex math...
3fffffd0 - 3ffff7b0 = 2080 (decimal).
If the stack size is 2K then yes we have exceeded it, and likely caused the crash.
What consumes stack? AFAIK it is only function parameters, local variables, return addresses. Allocated memory comes from the heap (which we have >20KB available). So... we can go looking for 1) large arrays/structs in the stack that could be malloc'd? 2) recursive/very deep function calls?, 3) a way to increase the stack size.
Again, As far as I can tell, the Arduino based firmware is all single threaded, so there is only the one stack. I put a check for stack size into the main loop and included it in my heartbeat messages to the client javascript... at the point I grabbed the stack size it was in the 100's bytes, not KBs. It looks like all cases where we crash, the function call hierarchy all trace back to the HomeKit server. So any stack overflow is within that.
Tell me I'm wrong... but client_send_encrypted
is allocating a rather large array on the stack that will easily blow things up? Here.
I think you are right, that data gets allocated on the stack. If we convert that to a malloc, I think it will go on the heap instead.
Tell me I'm wrong... but
client_send_encrypted
is allocating a rather large array on the stack that will easily blow things up? Here.
This is a great find. I converted to a malloc and I'm running a test now. This particular module Im testing on now has been crashing fairly frequently. So, we'll see how it goes. There are lots of arrays being created in the 16 and 32byte range. By themselves, those don't seem to be an issue, but there are quite a few being created fairly far down the call stack, so they start to add up.
I think there are quite a few functions that are only called in one location, such as this client_sendencrypted() function. We can inline those functions.
I think there are quite a few functions that are only called in one location, such as this client_sendencrypted() function. We can inline those functions.
I'm not sure that buys much savings in stack space... just parameters and return addresses. But as you say, it all adds up.
You are right it's not the low hanging fruit. Just thinking about how we can reduce stack usage. I'm always seeing stack at 2080 bytes which means are overflowing but also means we already have 1kB on the stack, which to me seems a bit high for this thing.
In any case I'm already seeing a big improvement moving that 1kB buffer. I have a secondary device I setup for debugging this issue. It crashes about every hour but is running about 4 hours with this change.
This is probably stating the obvious, but there is mounting evidence the problem is with the HK networking side. I can hammer the device with various packet types and even hammer the web server and nothing bad happens other than some dropped packets. However the big difference between my stable device and my unstable device is the good one is connected to a HomeKit hub and the other is not. So one is getting requests from just the home hub and the other is seeing traffic from multiple phones, laptops etc.
In the meantime, I have been googling to see if I can find out how the stack size can be increased. It is probably a tradeoff between stack and maximum heap. I've not found it yet.
But I agree, the HomeKit part, exceeding available stack size, is most likely where the problem is. I think it goodness to look for large local array/struct declarations and removing them. If the function can never be called in parallel (ie always the same thread, non-interruptible), then we could consider global declaration. If not then malloc() -- but lets be real careful to always free before exiting the function.
by-the-way... something else I noticed... HomeKit pairing takes a long time, during which there is a ~30 second period that the main loop is stuck inside HomeKit. I know this because my once-every-second heartbeat from the web server stops for about 30 seconds, meaning nothing else is being allowed to run. This could be in the encryption key generation, but I don't know.
Anyway... point of this is to suggest that we maybe look for opportunities inside the HomeKit code to add yield()... https://stackoverflow.com/questions/34497758/what-is-the-secret-of-the-arduino-yieldfunction
I sent you a PR to malloc the buffer. You could probably make it static instead of malloc, but this works. And yeah, I made sure to free it everywhere.
If you want to merge that or just fix it yourself a better way, then I'll release another build as I think this will definitely help improve stability.
Jumping in late, but why not make the stack bigger? If it was esp32 for sure would do thatI have had tasks that need 3k of stack But this is esp8266 not sure what kind of ram we gotSent from my iPhoneOn Mar 15, 2024, at 6:12 PM, jgstroud @.***> wrote: I sent you a PR to malloc the buffer. You could probably make it static instead of malloc, but this works. And yeah, I made sure to free it everywhere.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>
Mostly because I couldn't figure out how to do it! But even if we did change the stack size, we shouldn't be abusing it by pushing kB's at a time on it
Oh i saw a line highlighted that a define was 2048Well if the var is just temp I see no harm using stack… once the function exits the mem is freeSent from my iPhoneOn Mar 15, 2024, at 7:08 PM, jgstroud @.***> wrote: Mostly because I couldn't figure out how to do it! But even if we did change the stack size, we shouldn't be abusing it by pushing kB's at a time on it
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>
Well, no harm in putting it on the stack if you don't overflow the stack.
With this change, I ran clean for 6 hours, then hit 4 back to back crashes. The crash dumps no longer have the really deep stacks and crazy looking backtraces.
https://gist.github.com/jgstroud/e258b693f958b678b1d7c3ce7543b435 https://gist.github.com/jgstroud/eaa111a362f3550c954a943ce8e5d961 https://gist.github.com/jgstroud/1ce46cbe1c1d417180abd5410a52f893 https://gist.github.com/jgstroud/aad058be27b76acec804cc72cbd36165
Oh i saw a line highlighted that a define was 2048
Yes, but if you grep for SERVER_TASK_STACK it appears to never be used. So if we want to adjust stack size then I think it has to be done elsewhere.
@jgstroud I merged your PR into my HomeKit fork. I am now running this version on one of my ratgdo's and will likely update the other as well.
on another note, i switched to @dkerr64 PR version over 4 days ago... and no reboots yet my uptime > 4 days
i saw someone posted the standard 8266 arduino core uses 4096 as stack size then something about a concept of stack thunking
i saw someone posted the standard 8266 arduino core uses 4096 as stack size then something about a concept of stack thunking
yes, I found that as well... but I think it is overly complicated. Best first thing for us to do is reduce stack usage where we can.
do we have any consistency to which method we get the exception? is client_send_encrypted() using a large amount of stack, or being called recusivly?
Well, no harm in putting it on the stack if you don't overflow the stack.
With this change, I ran clean for 6 hours, then hit 4 back to back crashes. The crash dumps no longer have the really deep stacks and crazy looking backtraces.
https://gist.github.com/jgstroud/e258b693f958b678b1d7c3ce7543b435 https://gist.github.com/jgstroud/eaa111a362f3550c954a943ce8e5d961 https://gist.github.com/jgstroud/1ce46cbe1c1d417180abd5410a52f893 https://gist.github.com/jgstroud/aad058be27b76acec804cc72cbd36165
Looks to me that there are two distinct crashes here... fourth is dup of second, third is dup of first.
I thought I had fixed the first one, I saw this myself and reported it in https://github.com/ratgdo/homekit-ratgdo/issues/126 but I just closed that because I thought I fixed it in https://github.com/ratgdo/homekit-ratgdo/pull/127 here Can you check my logic... maybe I got it wrong.
The second one needs investigation.
do we have any consistency to which method we get the exception? is client_send_encrypted() using a large amount of stack, or being called recusivly?
No recursive calling that I can see... but yes, that function was allocating an array of 1KB size on the stack. @jgstroud just changed that to a alloc/free from heap and we're now testing with that.
ok, im still waiting for my 1st crash on your firmware from 4+ days ago :)
yeah, there are 2 distinctly different crashes here, but they both look like corrupt memory to me. I didn't realize that 2 were a dup of one you already reported.
I kinda think the stack trace may be misleading as I think the memory is corrupted from some other operation.
Well, here's another one. Just looks like more corrupt memory though. I added some debug code, so it's likely just moving the error around now.
https://gist.github.com/jgstroud/3577e93ee796e02611cb968088c9b356
@jgstroud I have reopened #126 as you seem to still have the problem. I suggest also that we move our discussion / comments over to it or another issue as this PR is now closed.
the one bad thing is when i turned off compile optimization, for more debug detail... it wasnt stable at all
build_type = debug
To try and fix various HomeKit stability problems I have created my own fork of the Arduino-HomeKit-ESP8266 server and merged in fixes others have implemented plus also a fix for issue #126
This PR updates homekit-ratgdo to pull the homekit server from my repository, plus also reports the available free heap space