Closed clumsy-stefan closed 2 years ago
HW watchdogs are also what I have seen. (maybe due to performing WiFi scans in the background?)
The (reboot) command not working on the main page is probably related to having a password set? Still not a desired side effect => Will look into it.
Change FQDN to IP is also a side effect I had not anticipated. I added the redirect to help the setup page for initial access, but I guess it should only be done when you connect over the AP of the ESP and not the normal interface.
HW watchdogs are also what I have seen. (maybe due to performing WiFi scans in the background?)
I have not enabled the background scan in the advanced config... I'll do some more checks there
The (reboot) command not working on the main page is probably related to having a password set? Still not a desired side effect => Will look into it.
Yes, that's also what I found (see EDIT above, was after initial issue creation)...
Change FQDN to IP is also a side effect I had not anticipated. I added the redirect to help the setup page for initial access, but I guess it should only be done when you connect over the AP of the ESP and not the normal interface.
Yes, probably in AP mode and during setup that makes sense, afterwards it looks a bit nasty... even though it works...
thanks for looking into it..
On the exception reboots. Do you have some statistics on memory usage? Preferrably a trend while running, to see if there is some memory leak
Regarding this one:
If you have a password set, how did you do the reboot before?
I can change this part (RootPage.cpp)
if (!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN) {
if (!isLoggedIn()) { return; }
}
const bool loggedIn = isLoggedIn(false);
into:
if (!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN) {
if (!isLoggedIn()) { return; }
}
const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);
To get the old behavior.
mini-2 and mini-4 are running the new version.
Interestingly mini-17 and mini-18 below are also running the new version, but without the memory issues.
The only "difference" of these units are the tasks configures (same build, but different tasks defined).
Then I would like to know what the different plugins are... :)
At first sight the only "real" difference is the cron-plugin, which is defined, but not (anymore) activated...
const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);
Hmm... I have #define MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN false
(as per Custom-sample.h
which would mean it has the opposite behaviour than before.. That could explain it, I'll change it to true
so that the negation matches again...
const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);
Hmm... I have
#define MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN false
(as perCustom-sample.h
which would mean it has the opposite behaviour than before.. That could explain it, I'll change it totrue
so that the negation matches again...
That flag set to false is the default behavior.
So that should work as it did, so it must return on the first call to isLoggedIn()
, as it was.
Meybe you're experiencing 2 issues here, one related to the redirect?
2 and 4 both also have lots of plugins active, so no wonder they do show more free memory compared to 17 and 18. You also run pulse counter, which do show enormous high value for time. Maybe the latest code of that pulse counter is causing issues, although I have not been running that plugin myself.
Do you also have logging (perhaps to a syslog server) enabled? Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)
false was the default behaviour before the change, but now it a !false which means default behaviour changed to true?
But even then, obvisouly I'km logged in on the page (it asks when I first connect to the node), so I should be able to execute the reboot on the web page anyways...
Well I have to take another coffee to look at the root page issue handling commands... Let's first focus in the discussion on the reboots. You are also running the new VL53... plugin. Is there a correlation between new (or recently changed) plugins and increased memory usage?
2 and 4 both also have lots of plugins active, so no wonder they do show more free memory compared to 17 and 18.
yes, but they also did before, so lower memory is ok there, but it they never rebooted (uptimes >5days) before... also I have nodes with much more defined that do not reboot without that mentioned PR.
You also run pulse counter, which do show enormous high value for time.
Yep, but on these nodes nothing is connected to the pin's, was just for testing, On the other hand I have another node running with 2 fans connected running at high speed without issues or reboots...
Maybe the latest code of that pulse counter is causing issues, although I have not been running that plugin myself.
I did not expierence any issues after updating to the latest pulse-counter changes. The nodes with the fan's run for days without issues.
Do you also have logging (perhaps to a syslog server) enabled?
yes
Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)
Already changed some time ago everywehere and everything to IP to prevent DNS issues.
What log level to send to syslog? I've noticed generating logs and pushing them to syslog may also be tricky if the WiFi connection is supposed to be active but it isn't. The same reason DNS lookups fail and lead to crashes.
Well I have to take another coffee to look at the root page issue handling commands... Let's first focus in the discussion on the reboots. You are also running the new VL53... plugin. Is there a correlation between new (or recently changed) plugins and increased memory usage?
All nodes already run the new VL53L0X on all my nodes since the changes. especially also on a "productive" node that actively measures the distance and has an uptime of >3 days now... (mini-5 in the graphic above)..
void addToLog(byte loglevel, const __FlashStringHelper *str)
{
String copy;
if (copy.reserve(strlen_P((PGM_P)str))) {
copy = str;
addToLog(loglevel, copy.c_str());
}
}
This is what I did recently change in the (pending) ESPEasy-now mesh PR Maybe you can also try this on your flaky nodes?
What log level to send to syslog? I've noticed generating logs and pushing them to syslog may also be tricky if the WiFi connection is supposed to be active but it isn't. The same reason DNS lookups fail and lead to crashes.
On all nodes only ERROR
. So no difference in any of the nodes you see in the graphics above..
Interestnlgy the node rebooting the most with this build is mini-4, which does not have so much configured...
This is what I did recently change in the (pending) ESPEasy-now mesh PR Maybe you can also try this on your flaky nodes?
Sure, I'll try that. For sure that's an improvement, on hte other hand all my production nodes run a build from apr. 20st am without issues for 3 days now.
Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)
Already changed some time ago everywehere and everything to IP to prevent DNS issues.
My bad, I just realised that on 2 nodes I still had an FQDN for NTP server, changed that to IP, probably related (or not... )
Controller always uses (and used) IP...
it looks completely random to me... also with above change to addToLog()
I get reboots. Sometimes after 2 min. sometimes after 10...
I'm offline later on, but I think I'll reverse the commits one by one to find which one causes the issues... probably fastest and easiest way to go...
false was the default behaviour before the change, but now it a !false which means default behaviour changed to true?
But even then, obvisouly I'km logged in on the page (it asks when I first connect to the node), so I should be able to execute the reboot on the web page anyways...
Changing MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN
to true makes the reboot button work again (as expected?)...
Hmm not sure if this is expected if you set it to true. You do have credentials set?
On one I have set it on the other not...
One more thing just observerd: I was away for 2 hors now, and both nodes had an uptime of >2hours... clicking around the various tabs it took about 20sec. to make the reboot....
So somehow it UI related...
Changing
MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN
to true makes the reboot button work again (as expected?)...
Unfortunately not consistant... seemed to be luck... I just flashed another unit with the same build, and the reboot button just sends me to the main index page..
Do you have the "Periodical Scan WiFi" setting checked or unchecked? (Tools -> Advanced, bottom of the page)
unchecked everywhere...
It's really strange, as long as I don't access these units, they seem to work... for sure I didn't have this issue with versions prior to apr. 20st...
I'm testing now, even with some patches I made (which I will push soon) to prevent re-allocation as much as possible. I do notice the memory usage is quite a lot, so something must have changed recently and I cannot find it yet. So please also try to perform a binary search on commits to see what may be causing this.
I noticed you have one task running the CCS811. For that one I merged a rounding bug PR: https://github.com/letscontrolit/ESPEasy/pull/3513
So please also try to perform a binary search on commits to see what may be causing this.
Sure, I'll do.
I did a number of builds with at different points. I think it's within PR #3576, but can't say for sure. Unfortunately that PR has quite a lot of commits in it... So it'll take a bit of time to find the critical one.
I noticed you have one task running the CCS811. For that one I merged a rounding bug PR: #3513
Yes, but only on one of the test nodes, the ones without CCS811 enabled face the same issues..
One more thing, if I do a wget of one of the files (eg. for backup) I always seem to get the index page:
wget -q -O - --timeout=5 --tries=2 --http-password="******" --http-user="admin" http://wemos-mini-17.clumsy.ch/rules3.txt
returns the main page....
So I think the Authentication part has something to do with it....(same as with the unable to reboot)...
Unable to reboot has to do with the redirect. That part is already dealt with tin the PR linked here: https://github.com/letscontrolit/ESPEasy/pull/3598
I will also push some more code, would be great if you could also test that code as it tries to deal with memory allocation on low free memory and also with timeouts on UDP connections and DNS lookups.
Sure... I just ran a build on commit #e4bd63aff4be231f957b9c910fcc17343e8e91f8 which has significantly (2k) more free memory, but a lot of HW watchdogs... Uploaded around 18:40 to mini-2 and mini-4..
In the recent pushed commits I also changed serving the default template. It had a reserve(576) which was actually 3 bytes short of the actual needed size, so that would have been an allocation of twice that size. It now has a corrected default reserve size and keeps a static value with the largest seen size so it will at most perform such a re-alloc once (per run since boot)
In commits made after that one you mentioned ( https://github.com/letscontrolit/ESPEasy/commit/e4bd63aff4be231f957b9c910fcc17343e8e91f8 ) I do keep some data structures for a longer period:
I'm currently compiling 8165bff3ff08cbd3737a0a7f9671cc1d0988b51e which is between e407b1b2d4b41b0fc795e66890d94e3f0b700d19 (memory gone) and e4bd63aff4be231f957b9c910fcc17343e8e91f8 (memory still there)... so still in the middle of the binary search (takes some time to compile ;)...
I'm trying to pin down the indivual commit that creates the issue, then I'll check the new changes...
Just to be sure, what is for you "the issue" you're looking for? If it is the reboot command, then I can already tell you what it is... I was hoping to find a cause for the reboots and memory loss
I'm currently after the memory loss, which most likely are causing the reboots...
Hmm I just took a quick look at the size of the WiFi_AP_Candidate struct. That's roughly some 64 bytes (depending on SSID and password length). Since a recent change I'm now keeping scan results also in such a struct in a list. So if you're seeing like 30 APs on your node, then you already are close to 2kB of RAM usage.
Yep, that's for sure a memory hog... I will look into it to make it more efficient as there is for sure a lot of memory duplication.
Can you at least test the rest of my pending PR I linked?
Sure... I'm happy to test... but I only have (at max) two more commits to go to find the commit that causes the issue... so before starting all over (even if I think your guess is correct) I'll finish the binary search ;) now and test the mentioned PR's later... hope that's OK... I promise it'll be stil ltoday, I reserved all evening for it ;)
Sure :)
Ok, now I'm completely confused... I think I need a break or some coffeein.. I did half a dozen versions with different commits... and in the end again one version with the same commit as the first one that showed more memory... but that one also had lower memory... even though it should be the same (see table below)...
So I think I take a break and try yoru PR's. I think it's only #3598 as the others are already merged into mega?
Testbuilds:
T6: ok e4bd63aff4be231f957b9c910fcc17343e8e91f8
T7: nok e407b1b2d4b41b0fc795e66890d94e3f0b700d19
T8: nok 8165bff3ff08cbd3737a0a7f9671cc1d0988b51e
T9: nok b78b549456c7e3f8eaaf6eb3924b8823187af51f
T10: nok 704ba96451b91852be14cd51ad77794ebfd7913d
T11: nok 17e04b815ec087eb1715466f8954e7950609093b
T12: nok e4bd63aff4be231f957b9c910fcc17343e8e91f8
Hmm.. I think I just looked wrong.. T12 seems to be ok... also checking again the other versions..
the build with PR #3598 has low memory but at least seems to run stable (no reboots until now)... still testing though..
Just pushed a commit to my PR which -over time- will release the memory occupied by WiFi scan results. So if you don't have "periodical WiFi scan" enabled you will see all this memory be freed again after 5 minutes from boot (or reconnect)
ok, I'll try that one...
could it be that commit 17e04b815ec087eb1715466f8954e7950609093b could cause some issues? As said, I'm a bit confused currently...
at least still, PR #3598 runs stabel... even though it has low memory, so that's already an improvemnet!!
at least still, PR #3598 runs stabel... even though it has low memory, so that's already an improvemnet!!
It will re-gain some of the memory after 5 minutes.
I just pushed a quick and dirty fix to restore old behavior to allow reboot via HTTP. Have to think about how to fine tune access control and I'm afraid that may be quite some work to get it right.
Reboot is one thing... but dowloading the files via wget is not working either on nodes with passwd set.
Interestingly it works in the browser... do I need to send some cookies along with the http-auth when using wget?
Yep, authorization is done on a session and only one is kept. So if you logged into the browser and then tried some wget or curl call, you may have been logged out from the ESP in your browser.
that would be ok, but wget always returns 0B when getting a rules.txt file from a passwd protected node:
wget -q -O - --http-password="******" --http-user="admin" http://<FQDN>/rules1.txt
returns the rules on a node without authentication and empty on nodes with authentication...
Memory improvment is very marginal compared to what it was at commit e4bd63aff4be231f957b9c910fcc17343e8e91f8 mini-2 and mini-4 are running your latest PR with memory freeing (as mentioned above) all others run commit e4bd63aff4be231f957b9c910fcc17343e8e91f8
Your rules and controllers stil work fine with the PR? I also changed how rules events are being handed over to the queue, so just to make sure nothing got broken there....
Since a vew days (probably after merge of PR #3576) I face various strange issues, like:
Not sure if I'm the only one experiencing this or if it's an issue of the core, if so I'm happy to trace my local build, just wanted to feedback here in case of a wider issue..
EDIT: last point only seems to be an issue on password protected nodes..