Web Interface issues - Githubissues

clumsy-stefan commented 3 years ago

Since a vew days (probably after merge of PR #3576) I face various strange issues, like:

spontaneous reboots with HW watchdogs or exception (between 5min. and 5 hours), likely because of not enough memory
when connecting to a node it changes/redirects from FQDN to IP address for the requests
reboot from web IF does not work anymore (http://FQDN/?cmd=reboot). it just directly loads the main page.

Not sure if I'm the only one experiencing this or if it's an issue of the core, if so I'm happy to trace my local build, just wanted to feedback here in case of a wider issue..

EDIT: last point only seems to be an issue on password protected nodes..

TD-er commented 3 years ago

that would be ok, but wget always returns 0B when getting a rules.txt file from a passwd protected node:

Do you have MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN at its default (false) ?

clumsy-stefan commented 3 years ago

Your rules and controllers stil work fine with the PR? I also changed how rules events are being handed over to the queue, so just to make sure nothing got broken there....

Just did a quick check with some of the rules and they seem to work (not thoroughly tested though)..

Do you have MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN at its default (false) ?

Yes. But as already discussed, you check for negated MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN as opposite to the previous version where it was checked for false.... so this seems logical, that the default is now the opposite.

Original const bool loggedIn = isLoggedIn(false); has been changed to const bool loggedIn = isLoggedIn(!MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN);. So inverted logic here after change.

TD-er commented 3 years ago

Please make it as it is in the PR and do not #define MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN

clumsy-stefan commented 3 years ago

Sure I can, but if it's by default false the check will still return the opposite as the original code because of the ! in front oth the MAIN_PAGE_SHOW_BASIC_INFO_NOT_LOGGED_IN

TD-er commented 3 years ago

Nope, see: https://github.com/letscontrolit/ESPEasy/commit/bdc1975162bd572c8aa783b8e9abee20e5a46ca8#diff-b2966947e97c13bb8f37afaa5f7adb9cefdfb98895261b9e376f3cea94b1eff3

clumsy-stefan commented 3 years ago

Ok, sorry, I just looked at your comment here which was different, and not at the code..

So I did a new version with PR #3598 now and updated 4 test nodes... I think I'll give it a break, get some sleep and verify tomorrow how they run... Currently it looks good, except that the wget fetching of files as stated above doesn't work on nodes with PW set.

I can further test and report tomorrow then...

I think mainly the memory issue and probably why the links won't work anymore (eg. with wget), I assume similar issue than with reboot button.

Thanks for all the fixes up to now!!

TD-er commented 3 years ago

OK, so we can conclude the PR does fix more than it breaks, right? Then I will merge it so I can also merge the code into my ESPEasy-now branch which really needs some fixes too....

clumsy-stefan commented 3 years ago

OK, so we can conclude the PR does fix more than it breaks, right?

At least the reboots/WD/exceptions are gone, yes, and no new issues found (until now), so I guess it's an improvement... Memory is already low in mega, so this also slightly improves the situation (but does not completely fix it). But due to the additinal memory checks it's not that problematic anymore... so in short: yes, IMHO you should merge it...

Then I will merge it so I can also merge the code into my ESPEasy-now branch which really needs some fixes too....

ok... I still have ESPEasy-now as a project to look at on my todo list ;) but one after the other..

thanks again and good n8!

TD-er commented 3 years ago

The ESPEasy-now is still in flux, and therefore not yet merged. But it is being used already in a product, which really needs an update so I've also been working hard on that PR.

clumsy-stefan commented 3 years ago

The good news, no reboots, no WD, no exceptions anymore! The units ran stable. Interestingly two of the nodes show the lower memory footprint (mini-2 and -4). These units have number of tasks defined, the two other test units (mini-17 and -18) do not show degraded memory... Around 19:15 yesterday I had a (nonstable) but older build also on -2 and -4 therefore the available memory there is increased... Still a bit strange I think... It could hint, that one of the plugins cause this memory issue.... I'll do a few tries on where this could be...

TD-er commented 3 years ago

Hard to say what may have caused it. I now just tackled some checks on success of memory allocation and am actually throwing away WiFi scan results after they have not been seen for 5 minutes. Later I will make a change to the struct holding the scan information as it can be made smaller

TD-er commented 3 years ago

I merged another commit this morning to reduce the number of string allocations and one of them is in the webserver part to send chunked transfers. Can you also test that part? The UI just feels quite a bit faster and more responsive.

clumsy-stefan commented 3 years ago

Sure, doing a new version now. I guess the mega branch has all in it now, PR has been merged..

clumsy-stefan commented 3 years ago

Seems to run stable. Actually I think it's one of the most stable versions ever... Will test it on more nodes... It show a slight free memory increase, but still much lower than versions before Apr. 16th. But not on all nodes... The difference between Units 2/4 and 17/18 is (desides some tasks) that 2/4 see more SSID's, so that could explain some of the memory drain...

Besides the memory the only problem I'm currently experiencing is that fetching files via wget from a password protected node does not work anymore (see above)... but that's a minor point i think..

Other than this, as said, it seems a very good and stable version to me!!

TD-er commented 3 years ago

Good to know all of yesterdays changes really seem to be doing something :)

clumsy-stefan commented 3 years ago

That's what I'm used to see of your changes! ;)

Just did some quick tests, unfortunately all 4 test nodes see the same AP's (8-10), so that wouldn't explain the free mem difference. Double checked all settings and I can't really see a difference between 17/18 and 2/4 (except the tasks). But others (eg. 5) has a lot more tasks and rules defined than 2/4.... This makes it difficult to pinpoint a specific commit, as it seems it doesn't happen on all nodes...

TD-er commented 3 years ago

Not all tasks allocate the same amount of memory as task data. Also it may depend on the number of queued messages for a controller how much those queues take. If one task sends a lot more messages but is rate limited by the minimal send interval for example, the queues will be filled a lot more. For some controllers the queue elements are not that optimal, so each element may take a significant amount of RAM.

But now with the recent checks for success of memory allocation, the amount of free memory should play less of a role in the system stability and only how much more a node can handle :)

clumsy-stefan commented 3 years ago

Sure... but still strange as the tasks didn't change but memory did (and 2k is not litte)... I updated now one of the production units (nr. 5) which is one of the more loaded ones... let's see what memory does on that one...

clumsy-stefan commented 3 years ago

quick update: nr 5 is alive... at least ;) even though it lost about 2K memory (see graphic)... which is not a problem, obvisouly as it seems to manage it correctly without running out of memory.

secondly, I found that when I pass the --auth-no-challenge option to the wget command above fetching of the files works again, also with password protected units. Basically this tells wget to always send authentication, no matter if it receives a challenge or not... so it seems it doesn't get a (correct) challenge anymore from the node... but at least this is a good workaround for me to make my backup scripts work again...

So from my point of view we're in a stable situation again, everything works again, we just don't know where it "looses" the memory and why password protected node's don't behave the same on reboot an fetching files by wget..

Again, thanks for your help and fast response, really appreciated!

PS: one more thing: did you ever manage to create a "small enough" 1M binary lately (I didn't)?

TD-er commented 3 years ago

Nope, 1M OTA builds are probably only possible with the latest code if we also update the 2-step OTA image so we can use compressed images.

clumsy-stefan commented 3 years ago

Ok, thanks..

Interesting: I did a new build (no code changes!) and updated all nodes (19:35)... now if you look at the graphs. some nodes even have more memory free than before, other have less though.. .and even the test nodes have more again... strange..

EDIT: the pro-7 (most loaded node I have) has even more free memory than before! So that's great!!

clumsy-stefan commented 3 years ago

secondly, I found that when I pass the --auth-no-challenge option to the wget command above fetching of the files works again, also with password protected units. Basically this tells wget to always send authentication, no matter if it receives a challenge or not... so it seems it doesn't get a (correct) challenge anymore from the node... but at least this is a good workaround for me to make my backup scripts work again...

Is there duplicated code here? Login can be done either via Basic-Auth or via http://<FQDN>/login, is that really necessary? Wuldn't basic-auth be enough?

TD-er commented 3 years ago

Not sure... I will have a look at it later this weekend.

clumsy-stefan commented 3 years ago

@TD-er: this duplication probably would also reduce build size slightly... (in reference to #3655)

TD-er commented 3 years ago

Is this still an open issue? Now with the changes I merged today related to the login and rules handling. Now the memory usage should also be less when editing rules (or at least switching rules in the rules set selector)

Also the /login url is removed.

clumsy-stefan commented 3 years ago

Hmm.. can't really say, I just know thet with commit form yesterday (includein your core dev 3.0. branch) I get quite often HW watchdog (5min -15h). I only realised it now, so I need to find out what's going on. but I think unrelated to this...

TD-er commented 3 years ago

There is a reason the core dev 3.0 PR is not yet merged, as I also get a lot of strange issues with the latest core. Mainly that one build feels rock solid, change a string somewhere, make a new build and it is highly unstable.

clumsy-stefan commented 3 years ago

agree, the version I did some 3 days ago ran solid without any issues.... compared to the one form today... strange...

TD-er commented 3 years ago

I do have an idea where it may be coming from... One part of me really hopes it is the magical fix. On the other hand if it is, then we're kinda screwed.... I think it may be caused by aliasing. For example to allocate an array of char and then cast 4 elements in that array to a bigger type like float or int.

Problem is that it is next to impossible to find all occurences and I have a very strong feeling that some of the closed source parts are doing this

TD-er commented 3 years ago

Can you add this to the build flags? No idea where it must be placed in the Arduino IDE, but I guess you're the expert here ;)

-fno-strict-aliasing

clumsy-stefan commented 3 years ago

Did a new build with this flag about an hour ago. One of 5 units reboots about very 5min the other ones seem to be more stable... I'll keep them running and report later.

TD-er commented 3 years ago

Are they all using the same build? Also the same plugins/controllers active? Any idea on what may be a possible cause? (e.g. reboot reason, memory usage)

clumsy-stefan commented 3 years ago

Are they all using the same build?

More or less, some are 1M some 4M and some 16M nodes the 4M an 16M run one on 80MHz and one on 16Mhz each.

Also the same plugins/controllers active?

No, different

Any idea on what may be a possible cause? (e.g. reboot reason, memory usage)

It says Hardware Watchdog in Background Task. No Clue yet what it may cause.. Memory look ok, about the same between the stable build from 4 days ago and today.

TD-er commented 3 years ago

HW watchdog is most of the time related to timeouts. => Often WiFi stability. But it could also be trying to perform some infinite loop. This can be simply a programming error, but then it is strange it isn't happening on all nodes. Do you see some info on the last task being executed before the crash? Last week I also extended that debug string to be slightly more informative. (I hope)

It could however also be a bug caused by this aliasing I was talking about, but then I would really like to have some clue where to start looking.

clumsy-stefan commented 3 years ago

Wifi did not change and all other nodes wich run the same 3.0 dev branch (but from last week) run without issues. By now all of the updated nodes rebooted at least ones.

I can't see more info currently except HW watchdog. Last task is eitehr background task or timer 20ms.

SO I think it's most likely one of the changes from this week, but have no clue yet which...

I can dig further in to it tonight, need to leave now...

TD-er commented 3 years ago

What is the typical uptime before crashing? One of the changes last week was about randomizing NTP a bit more and computing the new "time wander" to get some idea of crystal stability.

Can you give these wander values (see sysinfo page)? They will be available after 3600 - 4000 secs. The value should be less than 0.01 msec/sec for a crystal to meet Espressif's specs.

tonhuisman commented 3 years ago

The value should be less than 0.01 msec/sec for a crystal to meet Espressif's specs.

So this isn't particularly good... 🤔

Screenshot - 14_08_2021 , 13_57_17

(Wemos D1 mini Pro clone (4M, without metal shield))

clumsy-stefan commented 3 years ago

I'd say between 5 and 3h. uptime currently. Only one of the 5 did not reboot yet (2h35m).

Time Wander is 0 on nodes with <1h uptime and 0.048 on nodes >1h (on the 5 test nodes).

TD-er commented 3 years ago

The value should be less than 0.01 msec/sec for a crystal to meet Espressif's specs.

So this isn't particularly good... 🤔

(Wemos D1 mini Pro clone (4M, without metal shield))

Please check it after 7 hours uptime (+ some margin) Then it will have based the wander on a longer interval.

clumsy-stefan commented 3 years ago

With the latest commit from yesterday (2b662b2cbc3d26cec96f11f5ab54f9a55b088a6c) the nodes seems to be stable again. NTP Offset after 20h between 0.008ms and 0.050ms (depending on the node)..

TD-er commented 3 years ago

Hmm That's strange to see that particular commit may fix stability on your builds as you build them in Arduino IDE and the commit you refer to has only to do with PlatformIO. So it just sounds like you unknowingly made a (more) clean build which solved your issue.

Can you see some correlation between a high(er) time wander value and WiFi stability?

clumsy-stefan commented 3 years ago

Hmm.. but in there are also the commit 0844b7, af8aa7, 26964f, dfd427, which do change some code I guess... could be one of these...

Currently I don't see any correlation between NTP and WiFi. Especially as I don't have any wifi stability issues at all.

I redid the build today and the units run fine until now... I'll keep it on my tests nodes and see how they behave tonight.

TD-er commented 3 years ago

There isn't correlation between WiFi and NTP... But I do use NTP to check the stability and frequency of the used crystal in the ESP. If the crystal frequency is off by more than 10 ppm, then the crystal accuracy is less than what's specified by Espressif. And the crystal frequency is related to how the WiFi performs.

So for now it is just some idea to test to see if there's any correlation between bad crystals and stability of a unit.

clumsy-stefan commented 3 years ago

The test units ran stable for over 20h now. only one rebooted due to exception... I'll watch further..

clumsy-stefan commented 3 years ago

Just FYI, latest commits form today seem to be rock stable again (your dev3.0 branch with mega merged in)...

TD-er commented 3 years ago

Today? The latest merge was 4 days ago.

Or do you mean some merges in the esp8266/Arduino repository?

clumsy-stefan commented 3 years ago

Sorry, I meant I made the build yesterday, latest commit was 8c7b49520cb59574a2dc237dbebaa866aae3415a from mega and I guess 2b662b2cbc3d26cec96f11f5ab54f9a55b088a6c from 3.0 dev branch...

I had no restart within the last 12 hours on ~30 nodes..

TD-er commented 3 years ago

And built with Arduino IDE I assume?

If so, did you set the no strict aliasing parameter yourself?

clumsy-stefan commented 2 years ago

I think this is way outdated.. closing it... if anything rises I'll reopen.

letscontrolit / ESPEasy

Web Interface issues #3597