Closed jgstroud closed 8 months ago
All of the other projects I see using wolfSSL are having to use this function.
Example: https://github.com/Yurik72/ESPHap/blob/master/examples/EspHap_GarageDoor/EspHap_GarageDoor.ino#L265
Also, wonder if we should look at this ESPHap library more.
Trying this now. I observe that the freeHeap is down by about 4K... still above 20KB though, so fine.
I got 4 of these crashes last night: https://gist.github.com/jgstroud/e04ee6c156b808ee573b035e790be80d
They all look identical. The program counter is in the WiFi Keepalive, but the backtrace shows client_send_encrypted. I'm guessing this wifi keep alive is running on an interrupt and we are hitting some shared memory contention?
I may be crazy, but I feel like even though it's still crashing a lot, I feel like we're getting close. I started with crash dumps that made no sense and looked like a corrupted back trace. now they all look very similar and with sane looking stacks and back traces.
it looks like the ESPHap project was doing a lot of messing around with this Wifi Keepalive. One very notable difference is here: https://github.com/Yurik72/ESPHap/blob/master/arduino_homekit_server.cpp#L851 https://github.com/Yurik72/ESPHap/commit/8aa88111059993560a0d3da5104178753d60d578
They commented out this call which just so happens to come inside the client_send_encrypted.
Your stack history shows that the keep alive was called from client_send_encrypted. So if it is not required then we should try removing (commenting out) as well.
Yeah. I started a test a few hours ago with the keep alive completely disabled. Not sure what consequence that will have but hopefully lead me a little closer to a proper solution.
@jgstroud I have invited you to collaborate in my HomeKit server repository, so you can modify this without waiting for me to merge PRs.
@jgstroud I rebased a branch I am working on to improve client-side error handling onto this branch and have opened a PR to you to merge.
@dkerr64. Cool, I'll review it. I managed to get a stable build. It ran overnight, which is a first. However, I had to disable the heartbeat in the web code. Which I don't think we really need but it also seems to break OTA by having that disabled. A better fix would be to understand the error, but for now disabling it has made a huge difference.
I saw a number of these: https://gist.github.com/jgstroud/9a9680483d4fd888db33fe47a0db334d
I think we are getting closer. The crash in heartbeat is probably here
I propose removing the PSTR and _P and using plain old printf...
subscription[i].client.printf("event: message\nretry: 15000\ndata: %s\n\n", json);
We do need some sort of heartbeat (I think) to keep the Server-Sent Events socket open. It is used "on demand" to push any changes (like door open) to the browser client.
Yeah I removed the PSTR part but haven't started testing it yet. Just letting it run a while longer with the heartbeat disabled
I'll start testing that version this afternoon.
@dkerr64 after turning the heartbeat back on, things are mostly ok. But I'm noticing that if I have the web ui open on my laptop and then I open it on my phone, they both update nicely, but if my laptop goes to sleep, then my phone will start updating very very slowly. Surely just getting lots of TCP retries. The free heap starts dropping by a lot, and I got it to crash like this. I suspect it ran out of memory. I really don't understand all of this logic, so I'm not sure how to fix it.
For now, I'll continue testing with only 1 client active.
@dkerr64 looks to me like the default timeout is set to 5 seconds and we have a new heartbeat every second. I think we should set the timeout to 500ms and if you get a timeout on the printf in the heartbeat function, then we should close the tcp connection. See this commit https://github.com/ratgdo/homekit-ratgdo/pull/129/commits/5ef96eaa6af5117eb418ef77adae10fdbebbd183
Thanks, I am testing this version now.
@dkerr64 I just merged one more change into the HK library. I'm running stable now for 12+ hrs. This is the longest it has run other than the time I had the heartbeats disabled. If this run clean the rest of the day I'm going to go ahead and release.
Excellent. My systems take 4+ days to crash so it will take a while for me to confirm, but I have confidence. I'm traveling this week so can't mess with it anyway!
The heartbeat is pretty much a no-op unless any browsers are open/actively connected.
There are a couple of other PSTR() in the web.cpp which maybe we should get rid of too? Much more infrequently called, but if we suspect problems accessing that block of memory then I think it safer to remove. I have just done that in my client-error-handling branch.
I sent a PR that removes remaining PSTR() from web.cpp and fixes a clearInterval() problem in the javascript.
Wait, don't merge that PR. I am seeing a problem similar to what you observed when multiple browser clients try to connect. Let me figure out what is going on.
We have a race condition with the heartbeat... which exposes itself when browser client is on slow connection. Server allocates say.. /rest/events/1
to a browser. Now the browser needs to open up an EventSource
channel to that URI. But if SSEheartbeat()
runs before the client has connected, then we get a client not listening error.
I was running into a similar problem when stress testing client error handling, and was seeing what happens when the EventSource connection failed. That was one of the outstanding error cases I still needed to tackle.
I have to think how best to handle this.
Oh... and it's got nothing to do with last PR so you can merge it.
you guys been busy :) mine still hasnt rebooted, up 6 days! since #pr127 my friends did lockup and he had to physically power cycle, and it did loose HK pairing... he was running the main branch of course
I've opened issue #130 to track this race condition. Lets discuss over there on how to fix.
@mitchjs it's getting a lot more stable especially if you pull this PR. I'm tempted to do a release, but I might want to wait on a resolution to #130 first as I think I have also seen this.
flashed mine, and sent to my friend who has 2 running...
@dkerr64 Great, I just merged that PR. I'm pretty sure I have run into this issue. How confident are you in these changes? I'll do a test build, but looks like I'm going to have a busy week and probably won't be able to spend much more time looking at this for a while, so I'm tempted to just merge everything andrelease an new image.
@dkerr64 what are your thoughts on releasing this as a 1.0 build and removing the pre-release designation?
@jgstroud I've figured out how to solve the other problem in #130 when client fails to open EventSource connection. It's simple logic, so I'd like to implement that. Then lets pull the trigger on 1.0.
We fixed the network problems too :D
PING 192.168.8.205 (192.168.8.205) 56(84) bytes of data.
64 bytes from 192.168.8.205: icmp_seq=1 ttl=254 time=2.38 ms
64 bytes from 192.168.8.205: icmp_seq=2 ttl=254 time=4.31 ms
64 bytes from 192.168.8.205: icmp_seq=3 ttl=254 time=2.93 ms
64 bytes from 192.168.8.205: icmp_seq=4 ttl=254 time=2.11 ms
64 bytes from 192.168.8.205: icmp_seq=5 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=6 ttl=254 time=3.18 ms
64 bytes from 192.168.8.205: icmp_seq=7 ttl=254 time=1.89 ms
64 bytes from 192.168.8.205: icmp_seq=8 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=9 ttl=254 time=5.37 ms
64 bytes from 192.168.8.205: icmp_seq=10 ttl=254 time=2.83 ms
64 bytes from 192.168.8.205: icmp_seq=11 ttl=254 time=2.08 ms
64 bytes from 192.168.8.205: icmp_seq=12 ttl=254 time=1.87 ms
64 bytes from 192.168.8.205: icmp_seq=13 ttl=254 time=1.94 ms
64 bytes from 192.168.8.205: icmp_seq=14 ttl=254 time=2.67 ms
64 bytes from 192.168.8.205: icmp_seq=15 ttl=254 time=3.12 ms
@mitchjs @dkerr64 if you are running the PR, can you test un-pairing and re-pairing?
@jgstroud , i just unpaired... after the reboot, i paired it to my homekit (iPhone) and all seem ok
I am away from home so am not able to test unpair/pair. But I have pushed a fix to the 2nd problem described in #130, I opened a PR on your branch.
While implementing the fix I discovered that I was setting the ticker for each connected client... so instead of the ticker running once per second, it was running once per second for each connected client, so could be 2 or 3. And then the ticker looped through all clients. I fixed it by having one global heartbeat timer, rather than one for each client. Maybe not the most elegant, but works.
I think we have dramatically improved things. So I vote for getting this out to folks to test. There will always be more improvements, but I think we have reached a significant milestone, so should push it out.
@dkerr64 I just fixed the wifi provisioning timeout errors! Ready for a release 1.0
See: https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_app_entry_noextra4k.cpp#L14 https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_main.cpp#L345