Web Interface issues - Githubissues

clumsy-stefan commented 3 years ago

Since a vew days (probably after merge of PR #3576) I face various strange issues, like:

spontaneous reboots with HW watchdogs or exception (between 5min. and 5 hours), likely because of not enough memory
when connecting to a node it changes/redirects from FQDN to IP address for the requests
reboot from web IF does not work anymore (http://FQDN/?cmd=reboot). it just directly loads the main page.

Not sure if I'm the only one experiencing this or if it's an issue of the core, if so I'm happy to trace my local build, just wanted to feedback here in case of a wider issue..

EDIT: last point only seems to be an issue on password protected nodes..

TD-er commented 3 years ago

HW watchdogs are also what I have seen. (maybe due to performing WiFi scans in the background?)

The (reboot) command not working on the main page is probably related to having a password set? Still not a desired side effect => Will look into it.

Change FQDN to IP is also a side effect I had not anticipated. I added the redirect to help the setup page for initial access, but I guess it should only be done when you connect over the AP of the ESP and not the normal interface.

clumsy-stefan commented 3 years ago

HW watchdogs are also what I have seen. (maybe due to performing WiFi scans in the background?)

I have not enabled the background scan in the advanced config... I'll do some more checks there

The (reboot) command not working on the main page is probably related to having a password set? Still not a desired side effect => Will look into it.

Yes, that's also what I found (see EDIT above, was after initial issue creation)...

Change FQDN to IP is also a side effect I had not anticipated. I added the redirect to help the setup page for initial access, but I guess it should only be done when you connect over the AP of the ESP and not the normal interface.

Yes, probably in AP mode and during setup that makes sense, afterwards it looks a bit nasty... even though it works...

thanks for looking into it..

TD-er commented 3 years ago

On the exception reboots. Do you have some statistics on memory usage? Preferrably a trend while running, to see if there is some memory leak

TD-er commented 3 years ago

Regarding this one:

reboot from web IF does not work anymore (http://FQDN/?cmd=reboot). it just directly loads the main page.

If you have a password set, how did you do the reboot before?

I can change this part (RootPage.cpp)

  if (!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN) {
    if (!isLoggedIn()) { return; }
  }

  const bool loggedIn = isLoggedIn(false);

into:

  if (!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN) {
    if (!isLoggedIn()) { return; }
  }

  const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);

To get the old behavior.

clumsy-stefan commented 3 years ago

mini-2 and mini-4 are running the new version.

Interestingly mini-17 and mini-18 below are also running the new version, but without the memory issues.

The only "difference" of these units are the tasks configures (same build, but different tasks defined).

TD-er commented 3 years ago

Then I would like to know what the different plugins are... :)

clumsy-stefan commented 3 years ago

At first sight the only "real" difference is the cron-plugin, which is defined, but not (anymore) activated...

clumsy-stefan commented 3 years ago

  const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);

Hmm... I have #define MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN false (as per Custom-sample.h which would mean it has the opposite behaviour than before.. That could explain it, I'll change it to trueso that the negation matches again...

TD-er commented 3 years ago

  const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);
Hmm... I have #define MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN false (as per Custom-sample.h which would mean it has the opposite behaviour than before.. That could explain it, I'll change it to trueso that the negation matches again...

That flag set to false is the default behavior. So that should work as it did, so it must return on the first call to isLoggedIn(), as it was. Meybe you're experiencing 2 issues here, one related to the redirect?

TD-er commented 3 years ago

2 and 4 both also have lots of plugins active, so no wonder they do show more free memory compared to 17 and 18. You also run pulse counter, which do show enormous high value for time. Maybe the latest code of that pulse counter is causing issues, although I have not been running that plugin myself.

Do you also have logging (perhaps to a syslog server) enabled? Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)

clumsy-stefan commented 3 years ago

false was the default behaviour before the change, but now it a !false which means default behaviour changed to true?

But even then, obvisouly I'km logged in on the page (it asks when I first connect to the node), so I should be able to execute the reboot on the web page anyways...

TD-er commented 3 years ago

Well I have to take another coffee to look at the root page issue handling commands... Let's first focus in the discussion on the reboots. You are also running the new VL53... plugin. Is there a correlation between new (or recently changed) plugins and increased memory usage?

clumsy-stefan commented 3 years ago

2 and 4 both also have lots of plugins active, so no wonder they do show more free memory compared to 17 and 18.

yes, but they also did before, so lower memory is ok there, but it they never rebooted (uptimes >5days) before... also I have nodes with much more defined that do not reboot without that mentioned PR.

You also run pulse counter, which do show enormous high value for time.

Yep, but on these nodes nothing is connected to the pin's, was just for testing, On the other hand I have another node running with 2 fans connected running at high speed without issues or reboots...

Maybe the latest code of that pulse counter is causing issues, although I have not been running that plugin myself.

I did not expierence any issues after updating to the latest pulse-counter changes. The nodes with the fan's run for days without issues.

Do you also have logging (perhaps to a syslog server) enabled?

yes

Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)

Already changed some time ago everywehere and everything to IP to prevent DNS issues.

TD-er commented 3 years ago

What log level to send to syslog? I've noticed generating logs and pushing them to syslog may also be tricky if the WiFi connection is supposed to be active but it isn't. The same reason DNS lookups fail and lead to crashes.

clumsy-stefan commented 3 years ago

Well I have to take another coffee to look at the root page issue handling commands... Let's first focus in the discussion on the reboots. You are also running the new VL53... plugin. Is there a correlation between new (or recently changed) plugins and increased memory usage?

All nodes already run the new VL53L0X on all my nodes since the changes. especially also on a "productive" node that actively measures the distance and has an uptime of >3 days now... (mini-5 in the graphic above)..

TD-er commented 3 years ago

void addToLog(byte loglevel, const __FlashStringHelper *str)
{
  String copy;
  if (copy.reserve(strlen_P((PGM_P)str))) {
    copy = str;
    addToLog(loglevel, copy.c_str());
  }
}

This is what I did recently change in the (pending) ESPEasy-now mesh PR Maybe you can also try this on your flaky nodes?

clumsy-stefan commented 3 years ago

What log level to send to syslog? I've noticed generating logs and pushing them to syslog may also be tricky if the WiFi connection is supposed to be active but it isn't. The same reason DNS lookups fail and lead to crashes.

On all nodes only ERROR. So no difference in any of the nodes you see in the graphics above..

Interestnlgy the node rebooting the most with this build is mini-4, which does not have so much configured...

clumsy-stefan commented 3 years ago

This is what I did recently change in the (pending) ESPEasy-now mesh PR Maybe you can also try this on your flaky nodes?

Sure, I'll try that. For sure that's an improvement, on hte other hand all my production nodes run a build from apr. 20st am without issues for 3 days now.

clumsy-stefan commented 3 years ago

Do you call a controller by a domain name or an IP? (DNS resolve is flaky sometimes)

Already changed some time ago everywehere and everything to IP to prevent DNS issues.

My bad, I just realised that on 2 nodes I still had an FQDN for NTP server, changed that to IP, probably related (or not... )

Controller always uses (and used) IP...

clumsy-stefan commented 3 years ago

it looks completely random to me... also with above change to addToLog() I get reboots. Sometimes after 2 min. sometimes after 10...

I'm offline later on, but I think I'll reverse the commits one by one to find which one causes the issues... probably fastest and easiest way to go...

false was the default behaviour before the change, but now it a !false which means default behaviour changed to true?

But even then, obvisouly I'km logged in on the page (it asks when I first connect to the node), so I should be able to execute the reboot on the web page anyways...

Changing MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN to true makes the reboot button work again (as expected?)...

TD-er commented 3 years ago

Hmm not sure if this is expected if you set it to true. You do have credentials set?

clumsy-stefan commented 3 years ago

On one I have set it on the other not...

One more thing just observerd: I was away for 2 hors now, and both nodes had an uptime of >2hours... clicking around the various tabs it took about 20sec. to make the reboot....

So somehow it UI related...

clumsy-stefan commented 3 years ago

Changing MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN to true makes the reboot button work again (as expected?)...

Unfortunately not consistant... seemed to be luck... I just flashed another unit with the same build, and the reboot button just sends me to the main index page..

TD-er commented 3 years ago

Do you have the "Periodical Scan WiFi" setting checked or unchecked? (Tools -> Advanced, bottom of the page)

clumsy-stefan commented 3 years ago

unchecked everywhere...

It's really strange, as long as I don't access these units, they seem to work... for sure I didn't have this issue with versions prior to apr. 20st...

TD-er commented 3 years ago

I'm testing now, even with some patches I made (which I will push soon) to prevent re-allocation as much as possible. I do notice the memory usage is quite a lot, so something must have changed recently and I cannot find it yet. So please also try to perform a binary search on commits to see what may be causing this.

TD-er commented 3 years ago

I noticed you have one task running the CCS811. For that one I merged a rounding bug PR: https://github.com/letscontrolit/ESPEasy/pull/3513

clumsy-stefan commented 3 years ago

So please also try to perform a binary search on commits to see what may be causing this.

Sure, I'll do.

I did a number of builds with at different points. I think it's within PR #3576, but can't say for sure. Unfortunately that PR has quite a lot of commits in it... So it'll take a bit of time to find the critical one.

I noticed you have one task running the CCS811. For that one I merged a rounding bug PR: #3513

Yes, but only on one of the test nodes, the ones without CCS811 enabled face the same issues..

clumsy-stefan commented 3 years ago

One more thing, if I do a wget of one of the files (eg. for backup) I always seem to get the index page: wget -q -O - --timeout=5 --tries=2 --http-password="******" --http-user="admin" http://wemos-mini-17.clumsy.ch/rules3.txt

returns the main page....

So I think the Authentication part has something to do with it....(same as with the unable to reboot)...

TD-er commented 3 years ago

Unable to reboot has to do with the redirect. That part is already dealt with tin the PR linked here: https://github.com/letscontrolit/ESPEasy/pull/3598

I will also push some more code, would be great if you could also test that code as it tries to deal with memory allocation on low free memory and also with timeouts on UDP connections and DNS lookups.

clumsy-stefan commented 3 years ago

Sure... I just ran a build on commit #e4bd63aff4be231f957b9c910fcc17343e8e91f8 which has significantly (2k) more free memory, but a lot of HW watchdogs... Uploaded around 18:40 to mini-2 and mini-4..

TD-er commented 3 years ago

In the recent pushed commits I also changed serving the default template. It had a reserve(576) which was actually 3 bytes short of the actual needed size, so that would have been an allocation of twice that size. It now has a corrected default reserve size and keeps a static value with the largest seen size so it will at most perform such a re-alloc once (per run since boot)

TD-er commented 3 years ago

In commits made after that one you mentioned ( https://github.com/letscontrolit/ESPEasy/commit/e4bd63aff4be231f957b9c910fcc17343e8e91f8 ) I do keep some data structures for a longer period:

WiFiAPCandidates
(one more, I have to look through the code as I can't remember its name)

clumsy-stefan commented 3 years ago

I'm currently compiling 8165bff3ff08cbd3737a0a7f9671cc1d0988b51e which is between e407b1b2d4b41b0fc795e66890d94e3f0b700d19 (memory gone) and e4bd63aff4be231f957b9c910fcc17343e8e91f8 (memory still there)... so still in the middle of the binary search (takes some time to compile ;)...

I'm trying to pin down the indivual commit that creates the issue, then I'll check the new changes...

TD-er commented 3 years ago

Just to be sure, what is for you "the issue" you're looking for? If it is the reboot command, then I can already tell you what it is... I was hoping to find a cause for the reboots and memory loss

clumsy-stefan commented 3 years ago

I'm currently after the memory loss, which most likely are causing the reboots...

TD-er commented 3 years ago

Hmm I just took a quick look at the size of the WiFi_AP_Candidate struct. That's roughly some 64 bytes (depending on SSID and password length). Since a recent change I'm now keeping scan results also in such a struct in a list. So if you're seeing like 30 APs on your node, then you already are close to 2kB of RAM usage.

TD-er commented 3 years ago

Yep, that's for sure a memory hog... I will look into it to make it more efficient as there is for sure a lot of memory duplication.

Can you at least test the rest of my pending PR I linked?

clumsy-stefan commented 3 years ago

Sure... I'm happy to test... but I only have (at max) two more commits to go to find the commit that causes the issue... so before starting all over (even if I think your guess is correct) I'll finish the binary search ;) now and test the mentioned PR's later... hope that's OK... I promise it'll be stil ltoday, I reserved all evening for it ;)

TD-er commented 3 years ago

Sure :)

clumsy-stefan commented 3 years ago

Ok, now I'm completely confused... I think I need a break or some coffeein.. I did half a dozen versions with different commits... and in the end again one version with the same commit as the first one that showed more memory... but that one also had lower memory... even though it should be the same (see table below)...

So I think I take a break and try yoru PR's. I think it's only #3598 as the others are already merged into mega?

Testbuilds:

T6:     ok      e4bd63aff4be231f957b9c910fcc17343e8e91f8
T7:     nok     e407b1b2d4b41b0fc795e66890d94e3f0b700d19
T8:     nok     8165bff3ff08cbd3737a0a7f9671cc1d0988b51e
T9:     nok     b78b549456c7e3f8eaaf6eb3924b8823187af51f
T10:    nok     704ba96451b91852be14cd51ad77794ebfd7913d
T11:    nok     17e04b815ec087eb1715466f8954e7950609093b
T12:    nok     e4bd63aff4be231f957b9c910fcc17343e8e91f8

clumsy-stefan commented 3 years ago

Hmm.. I think I just looked wrong.. T12 seems to be ok... also checking again the other versions..

the build with PR #3598 has low memory but at least seems to run stable (no reboots until now)... still testing though..

TD-er commented 3 years ago

Just pushed a commit to my PR which -over time- will release the memory occupied by WiFi scan results. So if you don't have "periodical WiFi scan" enabled you will see all this memory be freed again after 5 minutes from boot (or reconnect)

clumsy-stefan commented 3 years ago

ok, I'll try that one...

could it be that commit 17e04b815ec087eb1715466f8954e7950609093b could cause some issues? As said, I'm a bit confused currently...

at least still, PR #3598 runs stabel... even though it has low memory, so that's already an improvemnet!!

TD-er commented 3 years ago

at least still, PR #3598 runs stabel... even though it has low memory, so that's already an improvemnet!!

It will re-gain some of the memory after 5 minutes.

TD-er commented 3 years ago

I just pushed a quick and dirty fix to restore old behavior to allow reboot via HTTP. Have to think about how to fine tune access control and I'm afraid that may be quite some work to get it right.

clumsy-stefan commented 3 years ago

Reboot is one thing... but dowloading the files via wget is not working either on nodes with passwd set.

Interestingly it works in the browser... do I need to send some cookies along with the http-auth when using wget?

TD-er commented 3 years ago

Yep, authorization is done on a session and only one is kept. So if you logged into the browser and then tried some wget or curl call, you may have been logged out from the ESP in your browser.

clumsy-stefan commented 3 years ago

that would be ok, but wget always returns 0B when getting a rules.txt file from a passwd protected node: wget -q -O - --http-password="******" --http-user="admin" http://<FQDN>/rules1.txt returns the rules on a node without authentication and empty on nodes with authentication...

Memory improvment is very marginal compared to what it was at commit e4bd63aff4be231f957b9c910fcc17343e8e91f8 mini-2 and mini-4 are running your latest PR with memory freeing (as mentioned above) all others run commit e4bd63aff4be231f957b9c910fcc17343e8e91f8

TD-er commented 3 years ago

Your rules and controllers stil work fine with the PR? I also changed how rules events are being handed over to the queue, so just to make sure nothing got broken there....

letscontrolit / ESPEasy

Web Interface issues #3597