Closed dkerr64 closed 3 weeks ago
You sure you want to open this can of worms again by enabling the IRAM heap?
Having discovered that there is some impact from where the device is located, I want to ask themoneyish and pritchey over on Discord to try again, moving the device around to see if it makes a difference.
That said... I have consistently been seeing MDNSresponder OOM crashes. Often they occur immediately after a reboot, it crashes and reboots again and stabilizes. But then I found myself needing to un-pair and re-pair to HomeKit... in un-paired status the HomeKit server appears to need more memory, I was crashing every time I booted (still in MDNSresponder), it never got up to a stable state. Turning on IRAM heap immediately solved the problem, and I have not seen a OOM crash since (admittedly only a few days).
I think, for the vast majority of users, running with IRAM heap is going to be more stable.
I think we had already done the move the device around experiment. I wonder what's different during that reboot that's causing an OOM. I tried hitting it with way more mdns traffic than it could handle and I couldn't get it to crash. Something about doing that during a reboot. maybe having homekit initializing at the same time. I was thinking about adding a heap check down in the network stack and temporarily disabling RX when we get low on memory. However, if we can safely enable the IRAM, that's by far the easier solution.
One reason why I never implemented this was because I wasn't sure how to properly handle recovery when you can't connect to your wifi. Should it automatically go into AP mode? This leads to a lot of error conditions especially after power cycles. Instead I was thinking about a way to manually force it. I just pushed a commit where if you press the light button 5 times in a 3 second window it will automatically reboot into AP mode.
One reason why I never implemented this was because I wasn't sure how to properly handle recovery when you can't connect to your wifi. Should it automatically go into AP mode? This leads to a lot of error conditions especially after power cycles. Instead I was thinking about a way to manually force it. I just pushed a commit where if you press the light button 5 times in a 3 second window it will automatically reboot into AP mode.
That is a good idea. I wasn't sure what to do about this myself... I do not want to automatically send it into soft AP mode because as you state, there are times where WiFi may go offline that has nothing to do with ratgdo (power failures, etc.). Your solution is quite elegant.
I'll update the README file as well.
Before we merge this, there is one other change I want to make... and that is move all (or as much as possible) config settings into a single file. Loading them all from separate files is taking an awfully long time every reboot and I think I have come up with a way to store them all in one place.
Non-config data (like rolling code) would remain as it is in their own files.
I've deliberately avoided this whole patch because I didn't want to deal with all the possible failure modes. The idea to use the light switch just hit me. I think I might make it flash the lights couple times as well before it reboots as a sort of confirmation. I also need to add a similar fix to the sec+1 code.
Ok, I cleaned up the recovery logic some. I reused your TTC timer to flash the lights letting you know the command was accepted.
Is this still relevant? Seems like with all the logging changes you've made, this setting doesn't seem necessary anymore.
Is this still relevant? Seems like with all the logging changes you've made, this setting doesn't seem necessary anymore.
Yes, I think we can remove this.
I made significant change to how we save user configuration settings. There are 19 individual settings that have been stored in 19 separate files. This significantly slows down boot time. By saving them all in a single file boot time improves by approx 13 seconds... such that the device now completes initialization in about 5 seconds.
It did require extensive changes. But the result is actually much more readable code.
I made significant change to how we save user configuration settings. There are 19 individual settings that have been stored in 19 separate files. This significantly slows down boot time. By saving them all in a single file boot time improves by approx 13 seconds... such that the device now completes initialization in about 5 seconds.
It did require extensive changes. But the result is actually much more readable code.
Wd should, of course, do some thorough testing before we release this!
Also, @jgstroud I am thinking of changing the reboot countdown timer from 30 seconds to 15 seconds. My devices are consistently up and stable in less than 15 seconds now that config settings are all in one file.
Cool. I added syslog earlier today but wanted to wait for your settings patch. I'll merge it with your code later. I'm really glad you made the settings change. the code was getting messy and hard to read and modify.
1.7.1 seems pretty stable. We might want to remove the prerelease tag on that before we merge this.
1.7.1 seems pretty stable. We might want to remove the prerelease tag on that before we merge this.
I did that a few days ago.
Cool. I added syslog earlier today but wanted to wait for your settings patch. I'll merge it with your code later. I'm really glad you made the settings change. the code was getting messy and hard to read and modify.
Syslog is great, thanks.
For the settings, I think I over engineered it... I think use of std:map, std:any, std:string may be using way more memory than I would like. I think I'll change it to a simple struct which gives us control over memory use (potentially putting it into IRAM as well).
I really like the settings file, but the one thing I didn't like was trying to understand the first.second.second naming. That was confusing, but that's probably just me.
Oh I also added a comment on the code. You added code to migrate settings from old files to new. Should we remove those old files in the process to free up flash?
I fixed the broken dependency link in main branch for 1.7.1 which required that I rebase this branch onto the new main, and thus a forced-push to GitHub.
Oh I also added a comment on the code. You added code to migrate settings from old files to new. Should we remove those old files in the process to free up flash?
Yes makes sense. I'll do that as I change the code to get rid of the c++ classes which I think are using more memory than I would like.
I really like the settings file, but the one thing I didn't like was trying to understand the first.second.second naming. That was confusing, but that's probably just me.
It is a C++ pair data structure. In this case I have a pair within a pair (the second member of the first pair, is itself a pair). I am not a regular C++ programmer, so these concepts are new to me too.
I changed the user settings to use standard C language structure rather than C++ objects. Maybe not as elegant, but much easier to understand I think... and the memory usage is well defined.
I have also wrapped all the startup initialization in IRAM heap. This has made a huge difference in available free heap... which should cut down OOM exceptions significantly.
That said... we have somehow got ourself in a situation where we must build with IRAM heap. Without it, I get a OOM crash inside MDNS responder every boot, so it never starts up. I'm not sure what I've done that makes it such that we're using too much memory... last free heap message before crash says there is > 18KB available.
Reducing the message log buffer from 2KB to 1KB gets me booting again without IRAM heap, so I made that change. But I rather suspect that OOM exceptions may occur from time-to-time.
I'm hopeful that we can stick with IRAM heap enabled. It makes a huge difference.
That's probably due to the extra 3kB of memory usage that I pointed out.
Sorry for all the forced pushes, I wanted to cleanup my recent commit history.
Also... there is a problem with your code to add WiFi RSSI signal strength to the softAP page. If you have multiple access points broadcasting the same SSID then each have different signal strength. The problem is that we're not checking for the strongest signal so almost certainly overwriting the strongest with a later weaker one.
I have 3 access points and one of my networks reports as...
>>> [ 26774] RATGDO: Network: KIoT Ch:6 (-36dBm)
>>> [ 26823] RATGDO: Network: KIoT Ch:6 (-72dBm)
>>> [ 26837] RATGDO: Network: KIoT Ch:11 (-61dBm)
So before adding the 2nd or 3rd, we need to check the strength against what we already saved. Otherwise we're giving misleading information to the user.
Sorry for all the forced pushes, I wanted to cleanup my recent commit history.
Also... there is a problem with your code to add WiFi RSSI signal strength to the softAP page. If you have multiple access points broadcasting the same SSID then each have different signal strength. The problem is that we're not checking for the strongest signal so almost certainly overwriting the strongest with a later weaker one.
I have 3 access points and one of my networks reports as...
>>> [ 26774] RATGDO: Network: KIoT Ch:6 (-36dBm) >>> [ 26823] RATGDO: Network: KIoT Ch:6 (-72dBm) >>> [ 26837] RATGDO: Network: KIoT Ch:11 (-61dBm)
So before adding the 2nd or 3rd, we need to check the strength against what we already saved. Otherwise we're giving misleading information to the user.
So in wifi.cpp we need to test the new RSSI against what was already saved in the wifiNets C++ set before inserting the new one (which replaces the existing).
We can just revert that patch. It's not really necessary.
I'm away from my computer, but in the improv code we already have code that scans the wifi network and removes duplicates and only shows the strongest signal. We should just reuse that function.
I'm away from my computer, but in the improv code we already have code that scans the wifi network and removes duplicates and only shows the strongest signal. We should just reuse that function.
Not as simple as just using the same function as it has improv serial port responses embedded into the function.
I'm away from my computer, but in the improv code we already have code that scans the wifi network and removes duplicates and only shows the strongest signal. We should just reuse that function.
Not as simple as just using the same function as it has improv serial port responses embedded into the function.
But I found an easy fix.
I'm away from my computer, but in the improv code we already have code that scans the wifi network and removes duplicates and only shows the strongest signal. We should just reuse that function.
Not as simple as just using the same function as it has improv serial port responses embedded into the function.
But I found an easy fix.
I just sat down to my computer to fix it, but great. Glad you found an easy fix.
Sorry man. I really butchered your web code. Thanks for fixing it.
Is this still relevant? Seems like with all the logging changes you've made, this setting doesn't seem necessary anymore.
Yes, I think we can remove this.
I have now removed the logging to JavaScript console.
I'm ready to declare this ready for pre-release and wider testing. What do you think?
I like your change for the time in the syslog message. When I read the RFC, I throught filling the time parameter was mandatory in RFC5424 format and you couldn't send a NIL. but I tested your patch and it works fine.
I'm fine with doing a pre-release build.
@jgstroud just a heads up that I need to make use of IRAM heap more granular around HomeKit server. I un-paired my door to test initial setup and it fails. On investigation it looks like HomeKit uses a lot (and I mean a lot) of memory to initialize itself if it is not already paired. I had to reduce the size of my message log buffer all the way down and while that got it going, it was very slow to create the initial crypto keys.
There are two big malloc's in the server initialization which is all we really need to wrap in IRAM, so I am going to do that and I think that will get us going again.
In the meantime, don't un-pair !!!
Those initial pairing crypto operations are very slow and I spent a good deal of time on fixing those so you don't get timeouts. After you make the change, we need to do a bunch of unpair / repair and make sure we don't see the timeouts come back.
We're good now. I have tested un-pairing and pairing several times. Consistent memory usage and timing. It takes about 10-12 seconds to generate the crypto keys on boot if no HomeKit paired.
I've also added comments in the source file every place that IRAM is used so that any future maintainer knows to keep an eye on usage/allocation during startup process.
I have added an advanced mode to the soft AP page... which shows all available WiFi networks with the unique hardware BSSID of the access points. This allows us to lock the device to a specific AP. Appropriate warnings have been added to the readme, soft AP page and confirmation pop-up.
I tried the AP Locking and I assume it's working, but there is nothing in the logs to indicate it has been locked to an AP. I think we should fix that before merging.
I tried the AP Locking and I assume it's working, but there is nothing in the logs to indicate it has been locked to an AP. I think we should fix that before merging.
Yes, agree. I have coded this up including reporting it on the main web page so very visible. I just haven't committed and pushed yet as I want to take a screenshot to include in readme.
I tried the AP Locking and I assume it's working, but there is nothing in the logs to indicate it has been locked to an AP. I think we should fix that before merging.
Yes, agree. I have coded this up including reporting it on the main web page so very visible. I just haven't committed and pushed yet as I want to take a screenshot to include in readme.
Ok, great. Otherwise I have no issues.
Fixed the bug with empty hostname.
Also put a test for zero length accessory name into the arduino_homekit_server source. And discovered that, shockingly, all the initialization functions are declared with void returns... so while there is a bunch of testing for error conditions, the functions do not return true/false to indicate whether setup succeeded... so calling code cannot itself abort.
Changing those functions to return a bool would be good idea, but also fairly significant change, as implies also figuring out how to handle a failure. I'm not up to that right now.
I fixed the issue with the random lock notifications. Due to a race condition on boot. If homekit queries the status of the door before we receive our first status message then we have a bunch of default values set and those immediately change state when we get our first status response. I added a slight delay to starting homekit loop so we have a chance to populate the door state first.
I think I'm done for now. We can merge.
v1.8.0 (2024-10-??)
What's Changed
Known Issues