letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

Low stability due to memory leak in latest ESP8266 custom firmware revisions #4726

Closed ghtester closed 4 months ago

ghtester commented 1 year ago

I am still facing low stability with recent ESP8266 custom ESP Easy firmware revisions. It's great that there are many new features and far more plugins than a couple of years ago, but when I compare the achievable Uptime parameter, an ESP node running the 2018 firmware release is able to work over 400 days without a reboot, while the current release unfortunately survives just several days. :-( I understand that for many use cases this is OK, but there are some applications where this issue causes serious trouble. I have upgraded several ESP nodes with different configurations to the latest two ESP Easy releases (custom Vagrant compilations from the mega-20230508 and mega-20230623 sources) and it looks like there's a significant memory leak which leads to a node crash within several days. A release a couple of months old does not have such an obvious issue (although it's not as stable as the 2018 release either). It's strange that nobody else has reported this issue so far, but I can see it clearly in my environment (ESP21 has an older firmware revision).

[screenshot: Free RAM and Uptime recording for several ESP nodes]

TD-er commented 1 year ago

I've noticed exactly the same lately and have been working on some tweaks, which I put in this PR: https://github.com/letscontrolit/ESPEasy/pull/4354

Can you perhaps test the code of that PR to see if it makes any improvement?

There is another issue I found which does cause a memory leak (not sure if it is the same one you're experiencing). Do you have the debug log enabled on the serial port? When I have it enabled, I see free memory drop by roughly 1k per day, which seems similar to the clear memory leak you're seeing.

ghtester commented 1 year ago

Thanks for the quick reply & confirmation, I'll give the mentioned PR a try. AFAIK I should have serial logging disabled on all nodes (can't check right now), but on one node I am using the Serial Proxy plugin (I don't know whether it causes trouble as well or not).

TD-er commented 1 year ago

In the latest builds I renamed it to "ESPEasy Serial Console" and I am now looking into that code to see if I may have messed up there.

It looks like the recently added IMPROV support may also cause some 'leaks', but I can't put my finger on where... So still looking for it. Anyway, you may also want to set #define FEATURE_IMPROV 0 to actually disable this feature, as it is probably enabled in all non-1M builds. But please test one thing at a time, otherwise we won't know what may have fixed it :)
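
For reference, a minimal sketch of how such an override could look in a self-compiled build's Custom.h (assuming the usual include-guard layout of Custom-sample.h; the exact place where the default value is set may differ per build configuration):

```cpp
// Custom.h (sketch) -- feature overrides for a self-compiled build.
#ifndef ESPEASY_CUSTOM_H
#define ESPEASY_CUSTOM_H

// Disable the IMPROV serial provisioning support to rule it out as a leak source.
// (Change only this one setting first, so any effect can be attributed to it.)
#define FEATURE_IMPROV 0

#endif
```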

ghtester commented 1 year ago

Unfortunately I was not able to do a custom compile with the PR:

```
vagrant@vagrant:~/GitHub/letscontrolit/ESPEasy/tools$ cd /home/vagrant/GitHub/letscontrolit/ESPEasy/tools
vagrant@vagrant:~/GitHub/letscontrolit/ESPEasy/tools$ ./build_ESPeasy.sh -p 4354
Summary:
  Pull Req     : 4354
  Git Tag      :
  Description  :
  SRC dir      : /home/vagrant/GitHub/letscontrolit/ESPEasy
  Python venv  : /home/vagrant/GitHub/letscontrolit/ESPEasy/venv/python3.10
  TmpDir ZIP   : /home/vagrant/GitHub/letscontrolit/ESPEasy/ESPEasy_collect_dist
  Build Docs   : 1
  Build ESP32  : 1
  Build ESP82xx: 1

  Use at your own risk!!

Continue in 5 seconds... Press CTRL-C to abort
./build_ESPeasy.sh: line 214: /home/vagrant/GitHub/letscontrolit/ESPEasy/venv/python3.10/bin/activate: No such file or directory
Could not activate Python virtual env: source /home/vagrant/GitHub/letscontrolit/ESPEasy/venv/python3.10/bin/activate
vagrant@vagrant:~/GitHub/letscontrolit/ESPEasy/tools$
```

The python3.10 folder is empty.

ghtester commented 1 year ago

OK, so I have somehow managed to compile with PR 4354 (I hope - this compilation needs both the Python 3.10 and 3.8 venv activations and it's necessary to do that manually). I quickly tested it on one ESP node (a couple of hours) but it does not seem to have a positive effect - the memory still appears to be leaking. The Info log level was set on the serial console. I disabled it (set to None) at the red-marked point, but to my surprise the free RAM immediately decreased.

[screenshot: Free RAM recording with the log-level change marked in red]

TD-er commented 1 year ago

Yep, I've been looking into this a lot today and still notice a decrease in memory. But I have not yet been able to pinpoint where it may leak.

ghtester commented 1 year ago

[screenshot: Free RAM and Uptime recording for several ESP nodes]

This is an updated recording of the Free RAM and Uptime variables on several ESP nodes. It's interesting that there's a memory leak also on an ESP node with earlier firmware (about a 7-month-old release), but with a different progression - see ESP21. It's also interesting that sometimes the RAM is partially recovered (without a reboot) - see ESP12.

TD-er commented 1 year ago

I have done a lot more tests here. The most effective change seems to be disabling FEATURE_MDNS:

```cpp
#define FEATURE_MDNS 0
```

ghtester commented 1 year ago

Thanks for testing & sharing the result. Yesterday I compiled the custom firmware with MDNS disabled + PR 4354 and upgraded three of my ESP nodes. I can confirm your result, the behavior of the upgraded nodes is now much better. A plus is that there's now a bit more free RAM at start. So hopefully the stability and achievable uptime will be far better.

[screenshots: Free RAM and Uptime recordings after the upgrade]

Thanks a lot for your great work!

TD-er commented 1 year ago

Thanks for confirming.

ghtester commented 1 year ago

Well, one of the three nodes crashed yesterday when the uptime was about 16 hours, despite free RAM being at its top level (but as I am monitoring this parameter only every 10 minutes at the moment, maybe that's not accurate and I should read it more often on this node). :-(
As there are very different configurations (different rules used) across the nodes, it will take much more time to identify the reason for the crash.

TD-er commented 1 year ago

What sensor(s) are used on that node? For example, I am now working on some fixes for the power sensor used in Sonoff POW modules. The one in the POW R1 did have a bug which could cause exception reboots, and the CSE7766 used in later modules may also have some issues I'm working on.

ghtester commented 1 year ago

AFAIK (to be confirmed later) there are 18B20 sensors, a PIR sensor connected to a GPIO (input switch), an IR sensor for receiving (it should not have been active at the time of the crash), and an LCD display connected through I2C. The rules also use PWM to control a fan speed through one GPIO. It's not a Sonoff device, so there's no power sensor.

Edit - the screenshot shows the devices / plugins used. There are 2 active controllers, Home Assistant (openHAB) MQTT and Generic HTTP. The last crash was due to an Exception, Last Action before Reboot: PLUGIN_READ: timer, id: 5

[screenshot: devices / plugins configured on the node]

ghtester commented 1 year ago

Another crash of the same node, at 19 hours of uptime... It's strange that this time huge changes in free RAM were visible. I'll keep struggling with the node settings.

[screenshot: Free RAM recording showing large fluctuations before the crash]

TD-er commented 1 year ago

I will try to see if I can merge the PR I have ready, to hopefully get fewer bursts in RAM allocation. Hopefully I can merge it tomorrow.

ghtester commented 1 year ago

Thanks a lot, but if you are talking about PR #4354, that is already included in my custom-compiled firmware.

TD-er commented 1 year ago

Only one task seems to be sending to the HTTP controller. I will have a look there to see if there is something strange going on. Is it possible that sending to that controller could trigger a chain of actions/events? For example, the ESP sends to a host, and the host replies by making an HTTP call to that ESP to set something?

ghtester commented 1 year ago

FYI, SendToHTTP is also used in rules (at fixed time intervals, AFAIK every 5 minutes). The actions related to SendToHTTP are one-direction only - just sending data to remote portals. There's no (backward) HTTP access to the ESP. Maybe the controller settings could play a role in this; in the deep past I was struggling with these settings and they remained a bit different from the defaults.

TD-er commented 1 year ago

And in the rules, do you use event calls there, or asyncevent?

ghtester commented 1 year ago

There are just On event calls in my rules; I didn't even know there was an AsyncEvent command available, so I have never used it.

Today the same node crashed again after about 24 hours. Yesterday I made one config change - I enabled the rules cache.

TD-er commented 1 year ago

But you're not calling another event from within an on...do...endon block?

ghtester commented 1 year ago

No, on this node there's no event call in the rules at all.

TD-er commented 1 year ago

What SDK are you using (see sysinfo page)?

I just tested again with SDK 3.1.2, and when using that one I get low free memory until I open a webpage again. Really strange...

ghtester commented 1 year ago

| Firmware | |
| --- | --- |
| Build | 20230711 - Mega |
| System Libraries | ESP82xx Core 2843a5ac, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support |
| Git Build | My Build: Jul 11 2023 21:18:12 |
| Plugin Count | 40 [IR][No Debug Log] |
| Build Origin | Self built |
| Build Time | Jul 11 2023 21:17:52 |
| Binary Filename | ESP_Easy_mega_20230711_custom_IR_ESP8266_4M1M |
| Build Platform | Linux-4.15.0-51-generic-x86_64-with-glibc2.27 |
| Git HEAD | HEAD_fe35af2 |

TD-er commented 1 year ago

And the IR task was disabled as you mentioned? So no need to look there?

ghtester commented 1 year ago

It's not disabled, but there was no IR signal active at the moment of the crash. I think the IR Receive plugin should be OK (btw. the current uptime is now 1.4 days). I have compiled it with the command map disabled, and I'm also using it on another node with the same firmware version, without a crash so far.

TD-er commented 1 year ago

Not disabled means it can still receive IR pulses. The IR receiver module does have some filtering to only accept 38 kHz (or whatever version you're using), but I know my old TV sometimes started blinking its stand-by LED the same way it does when you press a button on the remote. So there is some IR source (probably the sun :) ) that can sometimes still trigger a bit sequence in the IR receiver module. And I don't know the IR plugin source code very well, but I do know that PVS Studio has made some remarks in that code about uninitialized variables. Meaning anything can happen given enough random streams of bits.

TD-er commented 1 year ago

Do you also have HeatpumpIR included in your builds?

ghtester commented 1 year ago

Well, I think it's more likely that the reason for the crash is somewhere in SendToHTTP, maybe combined with occasional lost connections to the target server(s). Please note that data is sent to the HTTP servers every 60 s from a temperature sensor through the controller (to one target server) and every 5 minutes through the SendToHTTP command in rules (to the second target server). Also, this (crashing) node is the only one using the Generic HTTP controller and SendToHTTP commands. BTW, the uptime is now 1 day + almost 20 hours. As the crash also happened during the night, and does not usually happen when I am nearby and using the IR remote, I suppose IR is not currently the issue. AFAIK HeatpumpIR is not included, as I don't use such devices, so I believe it's excluded in Custom.h.

ghtester commented 1 year ago

FYI, this is a fresh log of Free RAM and Uptime for 2 nodes with the latest custom firmware. One node is so far without a crash; the other is the ESP11 node which crashed two times after the upgrade. There must be some random cause; I would say it could be connection instability (not WiFi, which seems to be working great, but the WAN connection from the AP to the Internet, which is sometimes temporarily interrupted). I'll try to use the Ping plugin to detect this issue and record it somehow locally. BTW, I can confirm HeatpumpIR is NOT used in my custom firmware(s); it's commented out in Custom.h: // #define USES_P088 // HeatpumpIR

[screenshot: Free RAM and Uptime log for the two nodes]

TD-er commented 1 year ago

I know the Espressif SDK sometimes may also cause crashes when a DNS lookup fails (a very old bug). Do you access hosts from ESPEasy via DNS or IP?
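
To illustrate the difference, here is a minimal Arduino-style sketch (not ESPEasy's internal code) that resolves the hostname explicitly before connecting, so a failed lookup is handled in user code rather than inside the connect call; the SSID, password, host and port are placeholders:

```cpp
#include <ESP8266WiFi.h>

// Hypothetical credentials and target host; replace with your own values.
const char*    kSsid = "my-ssid";
const char*    kPass = "my-pass";
const char*    kHost = "example.com";
const uint16_t kPort = 80;

// Resolve the hostname first, then connect using the numeric address,
// so a failed DNS lookup can be skipped gracefully instead of being
// handled deep inside the SDK's connect path.
bool connectByResolvedIp(WiFiClient& client) {
  IPAddress ip;
  if (WiFi.hostByName(kHost, ip) != 1) {   // hostByName() returns 1 on success
    return false;                          // DNS failed: skip this attempt
  }
  return client.connect(ip, kPort);
}

void setup() {
  WiFi.begin(kSsid, kPass);
  while (WiFi.status() != WL_CONNECTED) { delay(100); }

  WiFiClient client;
  if (connectByResolvedIp(client)) {
    // ... send the HTTP request here ...
    client.stop();
  }
}

void loop() {}
```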

ghtester commented 1 year ago

Yes, I am using DNS names, as the host IP addresses may change.

TD-er commented 1 year ago

OK, that might be a suspect for those crashes... Can you try, on one node, to only use IPs to address the other hosts? Or is that not possible due to vhosts in Apache, for example?

ghtester commented 1 year ago

I'm afraid I can't do that, the hosts are in the cloud and they work the way you outlined. I can't believe that old bug can't be fixed in the SDK... :-/

TD-er commented 1 year ago

The problem seems to be related to the timeout setting being ignored when starting to set up a connection. UDP seems to be more problematic with these issues compared to TCP.

ghtester commented 1 year ago

OK, then there's a question - could the timeout bug be triggered when the DNS IPs (set by DHCP) are: primary: the local IP (of the local AP); secondary: external (8.8.8.8), and the WAN connectivity from the AP to the Internet is temporarily lost?

TD-er commented 1 year ago

If there are multiple DNS entries, ESPEasy will pick one of them at random. So if one of those doesn't know about the requested hostname, or gives a timeout, then strange things might happen.

I only have my router as DNS for my ESP units.

ghtester commented 1 year ago

OK, thanks for the details, this could be an important thing to reconfigure. The current settings have a reason: there are some other local devices using them, and my provider's DNS sometimes does not work properly, so that's why there's a backup Google DNS. I'll consider changing this.

TD-er commented 1 year ago

Just create a local VLAN for your ESPs, use some layer-3 switch, then add a DNS forwarder pointing to a separate DNS setup and a separate DHCP pool. This can be done using a 25-euro Mikrotik access point and an 80-euro layer-3 switch supporting VLANs.

It isn't that hard and it can be useful to separate networks :)

I once wrote a review of such a layer-3 switch in which I also describe how to set this up, and give a quick overview of why you may want to use a layer-3 switch.

ghtester commented 1 year ago

Well, thanks a lot for the advice, but I wouldn't like to complicate my home network like that. ;-) What I might do instead is just add another AP (which I already have, so the cost is 0) dedicated to the ESP nodes, but I think even that shouldn't be necessary. The most wanted solution of course is to have all the SW bugs fixed... ;-) :-) I am sure the developers are going to do it sooner or later, and I'll be patient... ;-)

TD-er commented 1 year ago

> The most wanted solution of course is to have all the SW bugs fixed... ;-) :-) I am sure the developers are going to do it sooner or later, and I'll be patient... ;-)

Ah, sounds good, I will also wait for those to be fixed... Will save me a lot of time :)

TD-er commented 1 year ago

Could you perhaps also test with this commit included in your build? https://github.com/letscontrolit/ESPEasy/pull/4354/commits/392e6cc8956cb8dab7e1b5c83377f4a8890596cf

It isn't that exciting, as it just changes the compiler optimization flag from -Os to -O2. But hopefully it does "fix" some strange WiFi issues on ESP8266. I tried it here on several nodes in my network with different builds (normal/SonoffPOW/etc.) and it does 'feel' a bit more snappy. But that's not really a scientific way of testing.

So I hope (really, really hope) it will make builds a bit more predictable and reproducible regarding WiFi.

ghtester commented 1 year ago

Thanks for the hint, but currently I don't encounter any WiFi issues; all my ESP nodes are (re)connecting properly and quickly. So if this does not help with the suspected DNS issue / crash, I don't plan to test it (also, it's not easy for me to find and edit the proper file in the Vagrant environment).

TD-er commented 1 year ago

OK, fair enough..

ghtester commented 1 year ago

FYI - one of the three nodes with the latest firmware crashed 4 days ago.

[screenshot: Free RAM and Uptime recording including the crashed node]

TD-er commented 1 year ago

The cyan-colored line still shows a memory leak... Is that one running the latest code, with mDNS disabled?

ghtester commented 1 year ago

No, this node still has an older custom firmware, dated 20230527, for contrast. Also the orange (ESP21) node has an older firmware for comparison, compiled 20221108. Another three nodes have the latest code; only one of them is still running without a crash (after the upgrade), and it's not listed here. This surviving node is located at a different place, with a more stable connection to the Internet. At the first location I have changed the DHCP settings to provide only one (local) DNS server; I wonder whether that helps to achieve a better Uptime...

TD-er commented 1 year ago

I'm thinking about adding some wrapper surrounding DNS lookups, so we can pre-cache DNS lookups with our own controlled timeouts...
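
A rough sketch of what such a wrapper could look like, just to illustrate the pre-caching idea (this is not ESPEasy's actual implementation; the cache size and TTL are arbitrary assumptions, and the controlled-timeout part of the lookup itself is left out):

```cpp
#include <ESP8266WiFi.h>

// Tiny DNS cache: remember the last few successful lookups for a while, so a
// temporary DNS outage does not stall or break every new connection attempt.
struct DnsCacheEntry {
  String    host;
  IPAddress ip;
  uint32_t  resolvedAt = 0;   // millis() timestamp of the successful lookup
  bool      valid      = false;
};

const size_t   kCacheSize = 4;
const uint32_t kTtlMs     = 10UL * 60UL * 1000UL;  // assumed 10-minute TTL
DnsCacheEntry  dnsCache[kCacheSize];

bool resolveCached(const String& host, IPAddress& ip) {
  const uint32_t now = millis();

  // 1) Serve from the cache if a fresh entry exists.
  for (size_t i = 0; i < kCacheSize; ++i) {
    if (dnsCache[i].valid && dnsCache[i].host == host &&
        (now - dnsCache[i].resolvedAt) < kTtlMs) {
      ip = dnsCache[i].ip;
      return true;
    }
  }

  // 2) Otherwise do a real lookup (hostByName() returns 1 on success).
  if (WiFi.hostByName(host.c_str(), ip) != 1) {
    return false;   // lookup failed; caller can skip or retry later
  }

  // 3) Store the result in the stalest slot (unused slots have resolvedAt == 0).
  size_t oldest = 0;
  for (size_t i = 1; i < kCacheSize; ++i) {
    if (dnsCache[i].resolvedAt < dnsCache[oldest].resolvedAt) { oldest = i; }
  }
  dnsCache[oldest].host       = host;
  dnsCache[oldest].ip         = ip;
  dnsCache[oldest].resolvedAt = now;
  dnsCache[oldest].valid      = true;
  return true;
}
```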

ghtester commented 1 year ago

That would be nice. BTW, yesterday afternoon I had a temporary WAN outage; when it was restored, the two ESP nodes with the latest firmware could not reconnect to the remote MQTT controller (configured with a DNS name). The WiFi connectivity to the AP was not interrupted. Two other ESP nodes at the same location, with old firmware, which had the same MQTT controller configured with a fixed IP address, reconnected to the MQTT controller as soon as the WAN outage ended. I had to reboot the AP (an hour later), so all the ESP nodes lost WiFi connectivity, and then the connectivity to the MQTT controller was also restored on all ESP nodes.

TD-er commented 1 year ago

I don't know how long DNS requests are being cached. So maybe eventually the ESP would be able to resolve the DNS, if it is related to a timeout/TTL setting. Another cause might be that LWIP still considers the request pending for some reason, but that's just an idea popping up in my head as a possible cause. No grounds to prove this theory.