Stability Improvements - Githubissues

jgstroud commented 8 months ago

See: https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_app_entry_noextra4k.cpp#L14 https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_main.cpp#L345

jgstroud commented 8 months ago

All of the other projects I see using wolfSSL are having to use this function.

Example: https://github.com/Yurik72/ESPHap/blob/master/examples/EspHap_GarageDoor/EspHap_GarageDoor.ino#L265

Also, wonder if we should look at this ESPHap library more.

dkerr64 commented 8 months ago

Trying this now. I observe that the freeHeap is down by about 4K... still above 20KB though, so fine.

jgstroud commented 8 months ago

I got 4 of these crashes last night: https://gist.github.com/jgstroud/e04ee6c156b808ee573b035e790be80d

They all look identical. The program counter is in the WiFi Keepalive, but the backtrace shows client_send_encrypted. I'm guessing this wifi keep alive is running on an interrupt and we are hitting some shared memory contention?

I may be crazy, but I feel like even though it's still crashing a lot, I feel like we're getting close. I started with crash dumps that made no sense and looked like a corrupted back trace. now they all look very similar and with sane looking stacks and back traces.

jgstroud commented 8 months ago

it looks like the ESPHap project was doing a lot of messing around with this Wifi Keepalive. One very notable difference is here: https://github.com/Yurik72/ESPHap/blob/master/arduino_homekit_server.cpp#L851 https://github.com/Yurik72/ESPHap/commit/8aa88111059993560a0d3da5104178753d60d578

They commented out this call which just so happens to come inside the client_send_encrypted.

dkerr64 commented 8 months ago

Your stack history shows that the keep alive was called from client_send_encrypted. So if it is not required then we should try removing (commenting out) as well.

jgstroud commented 8 months ago

Yeah. I started a test a few hours ago with the keep alive completely disabled. Not sure what consequence that will have but hopefully lead me a little closer to a proper solution.

dkerr64 commented 8 months ago

@jgstroud I have invited you to collaborate in my HomeKit server repository, so you can modify this without waiting for me to merge PRs.

dkerr64 commented 8 months ago

@jgstroud I rebased a branch I am working on to improve client-side error handling onto this branch and have opened a PR to you to merge.

jgstroud commented 8 months ago

@dkerr64. Cool, I'll review it. I managed to get a stable build. It ran overnight, which is a first. However, I had to disable the heartbeat in the web code. Which I don't think we really need but it also seems to break OTA by having that disabled. A better fix would be to understand the error, but for now disabling it has made a huge difference.

I saw a number of these: https://gist.github.com/jgstroud/9a9680483d4fd888db33fe47a0db334d

dkerr64 commented 8 months ago

I think we are getting closer. The crash in heartbeat is probably here

I propose removing the PSTR and _P and using plain old printf... subscription[i].client.printf("event: message\nretry: 15000\ndata: %s\n\n", json);

We do need some sort of heartbeat (I think) to keep the Server-Sent Events socket open. It is used "on demand" to push any changes (like door open) to the browser client.

jgstroud commented 8 months ago

Yeah I removed the PSTR part but haven't started testing it yet. Just letting it run a while longer with the heartbeat disabled

jgstroud commented 8 months ago

I'll start testing that version this afternoon.

jgstroud commented 8 months ago

@dkerr64 after turning the heartbeat back on, things are mostly ok. But I'm noticing that if I have the web ui open on my laptop and then I open it on my phone, they both update nicely, but if my laptop goes to sleep, then my phone will start updating very very slowly. Surely just getting lots of TCP retries. The free heap starts dropping by a lot, and I got it to crash like this. I suspect it ran out of memory. I really don't understand all of this logic, so I'm not sure how to fix it.

For now, I'll continue testing with only 1 client active.

jgstroud commented 8 months ago

@dkerr64 looks to me like the default timeout is set to 5 seconds and we have a new heartbeat every second. I think we should set the timeout to 500ms and if you get a timeout on the printf in the heartbeat function, then we should close the tcp connection. See this commit https://github.com/ratgdo/homekit-ratgdo/pull/129/commits/5ef96eaa6af5117eb418ef77adae10fdbebbd183

dkerr64 commented 8 months ago

Thanks, I am testing this version now.

jgstroud commented 8 months ago

@dkerr64 I just merged one more change into the HK library. I'm running stable now for 12+ hrs. This is the longest it has run other than the time I had the heartbeats disabled. If this run clean the rest of the day I'm going to go ahead and release.

dkerr64 commented 8 months ago

Excellent. My systems take 4+ days to crash so it will take a while for me to confirm, but I have confidence. I'm traveling this week so can't mess with it anyway!

The heartbeat is pretty much a no-op unless any browsers are open/actively connected.

There are a couple of other PSTR() in the web.cpp which maybe we should get rid of too? Much more infrequently called, but if we suspect problems accessing that block of memory then I think it safer to remove. I have just done that in my client-error-handling branch.

dkerr64 commented 8 months ago

I sent a PR that removes remaining PSTR() from web.cpp and fixes a clearInterval() problem in the javascript.

dkerr64 commented 8 months ago

Wait, don't merge that PR. I am seeing a problem similar to what you observed when multiple browser clients try to connect. Let me figure out what is going on.

dkerr64 commented 8 months ago

We have a race condition with the heartbeat... which exposes itself when browser client is on slow connection. Server allocates say.. /rest/events/1 to a browser. Now the browser needs to open up an EventSource channel to that URI. But if SSEheartbeat() runs before the client has connected, then we get a client not listening error.

I was running into a similar problem when stress testing client error handling, and was seeing what happens when the EventSource connection failed. That was one of the outstanding error cases I still needed to tackle.

I have to think how best to handle this.

Oh... and it's got nothing to do with last PR so you can merge it.

mitchjs commented 8 months ago

you guys been busy :) mine still hasnt rebooted, up 6 days! since #pr127 my friends did lockup and he had to physically power cycle, and it did loose HK pairing... he was running the main branch of course

dkerr64 commented 8 months ago

I've opened issue #130 to track this race condition. Lets discuss over there on how to fix.

jgstroud commented 8 months ago

@mitchjs it's getting a lot more stable especially if you pull this PR. I'm tempted to do a release, but I might want to wait on a resolution to #130 first as I think I have also seen this.

mitchjs commented 8 months ago

flashed mine, and sent to my friend who has 2 running...

jgstroud commented 8 months ago

@dkerr64 Great, I just merged that PR. I'm pretty sure I have run into this issue. How confident are you in these changes? I'll do a test build, but looks like I'm going to have a busy week and probably won't be able to spend much more time looking at this for a while, so I'm tempted to just merge everything andrelease an new image.

jgstroud commented 8 months ago

@dkerr64 what are your thoughts on releasing this as a 1.0 build and removing the pre-release designation?

dkerr64 commented 8 months ago

@jgstroud I've figured out how to solve the other problem in #130 when client fails to open EventSource connection. It's simple logic, so I'd like to implement that. Then lets pull the trigger on 1.0.

jgstroud commented 8 months ago

We fixed the network problems too :D


PING 192.168.8.205 (192.168.8.205) 56(84) bytes of data.
64 bytes from 192.168.8.205: icmp_seq=1 ttl=254 time=2.38 ms
64 bytes from 192.168.8.205: icmp_seq=2 ttl=254 time=4.31 ms
64 bytes from 192.168.8.205: icmp_seq=3 ttl=254 time=2.93 ms
64 bytes from 192.168.8.205: icmp_seq=4 ttl=254 time=2.11 ms
64 bytes from 192.168.8.205: icmp_seq=5 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=6 ttl=254 time=3.18 ms
64 bytes from 192.168.8.205: icmp_seq=7 ttl=254 time=1.89 ms
64 bytes from 192.168.8.205: icmp_seq=8 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=9 ttl=254 time=5.37 ms
64 bytes from 192.168.8.205: icmp_seq=10 ttl=254 time=2.83 ms
64 bytes from 192.168.8.205: icmp_seq=11 ttl=254 time=2.08 ms
64 bytes from 192.168.8.205: icmp_seq=12 ttl=254 time=1.87 ms
64 bytes from 192.168.8.205: icmp_seq=13 ttl=254 time=1.94 ms
64 bytes from 192.168.8.205: icmp_seq=14 ttl=254 time=2.67 ms
64 bytes from 192.168.8.205: icmp_seq=15 ttl=254 time=3.12 ms

jgstroud commented 8 months ago

@mitchjs @dkerr64 if you are running the PR, can you test un-pairing and re-pairing?

mitchjs commented 8 months ago

@jgstroud , i just unpaired... after the reboot, i paired it to my homekit (iPhone) and all seem ok

dkerr64 commented 8 months ago

I am away from home so am not able to test unpair/pair. But I have pushed a fix to the 2nd problem described in #130, I opened a PR on your branch.

While implementing the fix I discovered that I was setting the ticker for each connected client... so instead of the ticker running once per second, it was running once per second for each connected client, so could be 2 or 3. And then the ticker looped through all clients. I fixed it by having one global heartbeat timer, rather than one for each client. Maybe not the most elegant, but works.

I think we have dramatically improved things. So I vote for getting this out to folks to test. There will always be more improvements, but I think we have reached a significant milestone, so should push it out.

jgstroud commented 8 months ago

@dkerr64 I just fixed the wifi provisioning timeout errors! Ready for a release 1.0

ratgdo / homekit-ratgdo

Stability Improvements #129