letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

Home Assistant (openHAB) MQTT controller is exhausting RAM during reconnect #2684

Closed ghtester closed 3 years ago

ghtester commented 5 years ago

I am testing a custom build (built with Vagrant) from the official 20190926 sources, with a plugin set customized in Custom.h.txt. The firmware details:

Build: | 20104 - Mega
System Libraries: | ESP82xx Core 2.6.0-dev, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
Git Build: | My Build: Sep 26 2019 08:20:00
Plugins: | 34 [Normal]
Build Md5: | e6d1ef3eb972c31b6659c853501e48
Md5 check: | passed.
Build Time: | Sep 26 2019 08:09:00
Binary Filename: | ESP_Easy_20190926_vagrant_custom_ESP8266_4M1M.bin

The ESP8266 node has weak WiFi reception (RSSI about -90 or even worse) and reconnects to the AP quite often. Without the MQTT controller active this is not an issue, but as soon as the MQTT controller is enabled and sensor data is sent to the MQTT broker, sooner or later the RAM is exhausted (regardless of the controller settings), the node stops sending data and finally reboots (usually with Exception 29). The current controller settings are here (I started with the defaults and tried changing everything, without any positive effect): MQTT

I am sending data from a BMP280 every 15 seconds. I also tried 2 different MQTT brokers (the one embedded in openHAB 2.5, which was unusable, and an external one, the CentOS 7 default) with the same issue.

ghtester commented 4 years ago

OK, so here's a quick report. Yesterday I installed the custom build:

Build: | 20104 - Mega
System Libraries: | ESP82xx Core 2_6_2, NONOS SDK 3.0.0-dev(c0f7b44), LWIP: 2.1.2 PUYA support
Git Build: | My Build: Dec 3 2019 19:35:18
Plugins: | 37 [Normal]
Build Md5: | bd4c957b666ac72afd627be7988c112
Md5 check: | passed.
Build Time: | Dec 3 2019 19:36:46
Binary Filename: | ESP_Easy_20191203_vagrant_custom_sdk3_ESP8266_4M1M.bin

The node rebooted with Exception (29) after about 18 hours. The last status just before the reboot:
65644649 : Info : WD : Uptime 1094 ConnectFailures 48 FreeMem 7024 WiFiStatus 3

The free RAM is now about 11100 (after the warm boot and 2 ConnectFailures).

TD-er commented 4 years ago

What was the last "Last Task:" entry on the sysinfo page?

ghtester commented 4 years ago

I don't know, currently there is:

Boot: | Manual reboot (4)
Reset Reason: | Exception
Last Task: | Task Device timer, id: 3

And the status info:
6304648 : Info : WD : Uptime 105 ConnectFailures 22 FreeMem 11152 WiFiStatus 3

Exception data from the serial console:

Exception (29):
epc1=0x402810f6 epc2=0x00000000 epc3=0x40100da4 excvaddr=0x00000000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3fff2650 end: 3fff2ae0 offset: 01a0
3fff27f0:  00000000 3fff5bc4 000f000f 0023906e
3fff2800:  382e3032 00000033 85ff2860 40266800
3fff2810:  382e3032 00000033 85ff2860 3fff5bac
3fff2820:  000f000f 00ff07de 3fff28c0 00000001
3fff2830:  3fff28cc 00000000 3fff2860 402668e1
3fff2840:  3fff07de 00000000 3fff29a0 00000001
3fff2850:  3fff07de 00000000 3fff29a0 4024643f
3fff2860:  382e3000 00000033 80100e4e f4395810
3fff2870:  0000000f 3ffe85e0 03e9c221 3ffeff00
3fff2880:  3fff146c 3ffe85e0 0000030f 40232cd2
3fff2890:  402030b8 00000000 00000000 40266ae7
3fff28a0:  00002820 00000003 3fff0710 402350a7
3fff28b0:  00000032 3fff291c 00000005 00000001
3fff28c0:  3fff77a4 000f001f 00000020 382e3032
3fff28d0:  00000033 8516001f 3fff986c 0015001f
3fff28e0:  002e3473 3fffaad4 3fff87ac 402668e1
3fff28f0:  3fff77a4 3fff29a0 00000001 40266900
3fff2900:  0000000c 3fff29a0 00000000 3fff29a0
3fff2910:  491e5b5d 00000003 0000002b 402245a1
3fff2920:  000f002f 802be9d0 3fff29a0 00000001
3fff2930:  3fff2960 3fff29a0 00000000 40243fdf
3fff2940:  402be9d0 00000000 00000000 00000480
3fff2950:  00000000 3fff146c 3fff29a0 402440b8
3fff2960:  646e6500 61746144 80ff2900 40266f7c
3fff2970:  491d18e0 491cee4f 3fff2a10 402668e1
3fff2980:  402be884 491cee4f 00000001 491d18c7
3fff2990:  3fff07de 3fff0fe0 3fff4608 40244296
3fff29a0:  3fff8700 000e000f 80ff29a8 3fff2900
3fff29b0:  3fff29b0 80ff29b0 3fff2900 3fff29b8
3fff29c0:  80ff29c0 3fff2900 3fff29c8 80ff29c8
3fff29d0:  3fff2900 3fff29d0 80ff29d8 00000000
3fff29e0:  00000000 00000000 00000000 00000000
3fff29f0:  00000000 00000000 03000300 00010c00
3fff2a00:  41a6aaab 00000000 00000000 00000000
3fff2a10:  00000009 3ffefdec 80000020 401017d7
3fff2a20:  491ced07 f277a2ae 0012b700 3fff2b28
3fff2a30:  3ffefdcc 3ffefdec 00000003 3fff0618
3fff2a40:  03ea36f5 3fff5f7e 00000003 3fff0618
3fff2a50:  491cec57 491cee4f 00000003 402442eb
3fff2a60:  491cec57 3ffefdac 30000003 402691da
3fff2a70:  491cec57 00000003 40254e70 40254ea2
3fff2a80:  00000027 03e9c1c5 0000031f 40222e92
3fff2a90:  00000002 00000000 00000000 3fff2b28
3fff2aa0:  3fffdad0 3ffe857a 00000000 40254f95
3fff2ab0:  3fffda00 00000000 80ff2ae8 401007f5
3fff2ac0:  3fffdad0 00000000 3fff2ae8 40268324
3fff2ad0:  feefeffe feefeffe 3ffe871c 40100739
<<<stack<<<

last failed alloc call: 4024201F(40)

ets Jan 8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 tail 8 chksum 0x2d csum 0x2d vbc204a9b ~ld

TD-er commented 4 years ago

last failed alloc call: 4024201F(40)

That's something we don't check for in our code (well, sometimes we do), so if that happens it is very likely things will crash. How long are the MQTT messages for this controller?

ghtester commented 4 years ago

I haven't watched the MQTT broker for a long time, but it should be just a few bytes as the path is short:

BTW, the current status with RSSI -80:
26194648 : Info : WD : Uptime 437 ConnectFailures 64 FreeMem 10656 WiFiStatus 3
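The WD status lines quoted so far in this thread share a fixed layout, so the free-memory trend can be pulled out of a saved log automatically. A minimal sketch in Python, assuming the line format stays exactly as quoted above; `parse_wd` and the regular expression are illustrative names, not part of ESPEasy:

```python
import re

# Layout assumed from the WD lines quoted in this thread.
WD_RE = re.compile(
    r"(?P<millis>\d+) : Info : WD : Uptime (?P<uptime>\d+) "
    r"ConnectFailures (?P<failures>\d+) FreeMem (?P<free>\d+) "
    r"WiFiStatus (?P<wifi>\d+)"
)


def parse_wd(log_text):
    """Return one dict of integers per WD status line found in log_text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in WD_RE.finditer(log_text)
    ]


# Three status lines copied from the comments above (different boots).
SAMPLE_LOG = """
65644649 : Info : WD : Uptime 1094 ConnectFailures 48 FreeMem 7024 WiFiStatus 3
6304648 : Info : WD : Uptime 105 ConnectFailures 22 FreeMem 11152 WiFiStatus 3
26194648 : Info : WD : Uptime 437 ConnectFailures 64 FreeMem 10656 WiFiStatus 3
"""

records = parse_wd(SAMPLE_LOG)
for rec in records:
    print(rec["uptime"], rec["failures"], rec["free"])
```

Plotting `free` against `failures` for a single boot would show whether memory drops with each failed connect, which is the pattern suspected here.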

TD-er commented 4 years ago

Home Assistant has the data formatted as JSON, right? That's not really a short way of sending the data.

ghtester commented 4 years ago

No, I am using openHAB; the controller subscribes to %sysname%/# and sends data in this form: %sysname%/%tskname%/%valname%

BTW, another crash a couple of minutes ago:

32404648 : Info : WD : Uptime 540 ConnectFailures 84 FreeMem 6064 WiFiStatus 3
32421139 : Info : EVENT: MQTTimport#Disconnected
32421261 : Error : IMPT : MQTT 037 Connection lost
32421412 : Error : IMPT : Failed to connect to MQTT broker - attempt 1
32422020 : Error : IMPT : Failed to connect to MQTT broker - attempt 2
32422629 : Error : IMPT : Failed to connect to MQTT broker - attempt 3
32423515 : Error : IMPT : Failed to connect to MQTT broker - attempt 1
32424123 : Error : IMPT : Failed to connect to MQTT broker - attempt 2
32424733 : Error : IMPT : Failed to connect to MQTT broker - attempt 3
32425541 : Error : MQTT : Connection lost, state: Disconnected
32425543 : Info : EVENT: MQTT#Disconnected
32425755 : Error : MQTT : Failed to connect to broker
32425842 : Info : BMP280 : Address: 0x76
32425844 : Info : BMP280 : Temperature: 24.54
32425846 : Info : BMP280 : Barometric Pressure: 1000.87
32425849 : Info : EVENT: BMX280#Temperature=24.54
32425929 : Info : EVENT: BMX280#=0.00
32426006 : Info : EVENT: BMX280#Pressure=1000.87

Exception (29):
epc1=0x402810f6 epc2=0x00000000 epc3=0x4010624b excvaddr=0x00000000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3fff2650 end: 3fff2ae0 offset: 01a0
3fff27f0:  00000000 3fff9d44 0018001f 0023906e
3fff2800:  352e3432 00000034 85ff2860 40266800
3fff2810:  352e3432 00000034 85ff2860 3fff770c
3fff2820:  0018001f 00ff07de 3fff28c0 00000003
3fff2830:  3fff28cc 00000000 3fff2860 402668e1
3fff2840:  3fff07de 00000000 3fff29a0 00000003
3fff2850:  3fff07de 00000000 3fff29a0 4024643f
3fff2860:  352e3400 ffff0034 80100e4e ff7ced91
3fff2870:  00000007 3ffe85e0 01eec865 401004ab
3fff2880:  3fff146c 3ffe85e0 000000ff 40232c74
3fff2890:  402030b8 00000000 00000000 40266ae7
3fff28a0:  000050c8 00000004 3fff0710 402350a7
3fff28b0:  00000032 3fff291c 00000005 00000004
3fff28c0:  3fff76e4 0018001f 00000020 352e3432
3fff28d0:  00000034 851e001f 3fff76bc 0016001f
3fff28e0:  002e3473 3fffb0b4 3fff8454 402668e1
3fff28f0:  3fff76e4 3fff29a0 00000003 40266900
3fff2900:  00000010 3fff29a0 00000002 3fff29a0
3fff2910:  8cbed396 00000003 0000002b 402245a1
3fff2920:  0017002f 802be9d0 3fff29a0 00000003
3fff2930:  3fff2960 3fff29a0 00000002 40243fdf
3fff2940:  402be9d0 00000000 00000000 00000480
3fff2950:  00000000 3fff146c 3fff29a0 402440b8
3fff2960:  646e6500 61746144 80ff2900 40266f7c
3fff2970:  8cbb307a 8cbb1174 3fff2a10 402668e1
3fff2980:  402be884 8cbb1174 00000001 8cbb3061
3fff2990:  3fff07de 3fff0ff0 3fff4608 40244296
3fff29a0:  3fff8400 000e000f 80ff29a8 3fff2900
3fff29b0:  3fff29b0 80ff29b0 3fff2900 3fff29b8
3fff29c0:  80ff29c0 3fff2900 3fff29c8 80ff29c8
3fff29d0:  3fff2900 3fff29d0 80ff29d8 00000000
3fff29e0:  00000000 00000000 00000000 00000000
3fff29f0:  00000000 00000000 03000400 00081000
3fff2a00:  41c47ae1 00000000 447a3ab3 00000000
3fff2a10:  00000009 3ffefdec 80000020 401017d7
3fff2a20:  8cbb103b 1e91a88a 00240600 3fff2b28
3fff2a30:  3ffefdcc 3ffefdec 00000004 3fff0618
3fff2a40:  01eef040 3fff5fc4 00000004 3fff0618
3fff2a50:  8cbb0f7f 8cbb1174 00000004 402442eb
3fff2a60:  8cbb0f7f 3ffefdac 30000004 402691da
3fff2a70:  8cbb0f7f 00000004 40254e70 40254ea2
3fff2a80:  00000027 01eeb5a8 00001256 40222e92
3fff2a90:  00000002 00000000 00000000 3fff2b28
3fff2aa0:  3fffdad0 3ffe857a 00000000 40254f95
3fff2ab0:  3fffda00 00000000 80ff2ae8 401007f5
3fff2ac0:  3fffdad0 00000000 3fff2ae8 40268324
3fff2ad0:  feefeffe feefeffe 3ffe871c 40100739
<<<stack<<<

last failed alloc call: 4024201F(40)

ets Jan 8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 tail 8 chksum 0x2d csum 0x2d vbc204a9b

From sysinfo:

Boot: | Manual reboot (5)
Reset Reason: | Exception
Last Task: | Task Device timer, id: 4
SW WD count: | 0
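Some words in the stack dumps are plain text rather than addresses. The ESP8266 is little-endian, so the words 352e3432 00000034 (at 3fff2800 in the second dump) spell 24.54, the temperature logged just before that crash, and 382e3032 00000033 in the first dump spell 20.83. A short Python sketch of that decoding; `words_to_text` is a hypothetical helper name used only for illustration:

```python
import struct


def words_to_text(words):
    """Decode 32-bit hex words as little-endian bytes, up to the first NUL."""
    raw = b"".join(struct.pack("<I", int(word, 16)) for word in words)
    raw = raw.split(b"\x00", 1)[0]  # C strings end at the first zero byte
    return raw.decode("ascii", errors="replace")


first_dump_text = words_to_text(["382e3032", "00000033"])
second_dump_text = words_to_text(["352e3432", "00000034"])
print(first_dump_text, second_dump_text)
```

This only shows that a formatted sensor value was on the stack when the exception hit. It does not by itself identify which allocation failed.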

TD-er commented 4 years ago

No, I am using openHAB; the controller subscribes to %sysname%/# and sends data in this form: %sysname%/%tskname%/%valname%

That's the topic, but how is the payload formatted? That's JSON, right?

ghtester commented 4 years ago

No, just plain characters AFAIK, at least judging by what I saw in the output of a test client on the broker's side, subscribed to the same topic. The data was only about 10 - 20 bytes as far as I can remember. I'll test that again... But it's a question of what exactly is there; maybe the test client shows "decoded" data. Still, the data size shown was like that, just a few bytes...

TD-er commented 4 years ago

Well I could also take a look at the source code, but right now I'm way too tired for it. So it is easier to speculate based on assumptions :)

ghtester commented 4 years ago

I understand what you are talking about... :-) So this is a fresh MQTT communication sample grabbed by mosquitto_sub on MQTT broker:

Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r1, m0, 'ESP03/status/LWT', ... (9 bytes))
Connected
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes))
23.73
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes))
998.97
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/Light/Lux', ... (4 bytes))
0.00
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes))
23.74
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes))
998.99
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/Light/Lux', ... (4 bytes))
0.00
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes))
23.73
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes))
998.98
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes))
23.74
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes))
998.98
Client mosq-dtOMQFh3O6Rixd0e5d sending PINGREQ
Client mosq-dtOMQFh3O6Rixd0e5d received PINGRESP
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/Light/Lux', ... (4 bytes))
0.00
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes))
23.75
Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes))
998.95
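To settle the payload-size question from a capture like this one, the byte counts can be collected per topic. A rough Python sketch, assuming the debug line layout shown in the sample above; `payload_sizes` is an illustrative name and the sample string holds only the first few entries:

```python
import re
from collections import defaultdict

# Matches the "received PUBLISH (...)" debug lines as they appear in the sample.
PUBLISH_RE = re.compile(
    r"received PUBLISH \(d\d+, q\d+, r\d+, m\d+, '([^']+)', \.\.\. \((\d+) bytes\)\)"
)


def payload_sizes(log_text):
    """Map each topic to the list of payload sizes (in bytes) seen for it."""
    sizes = defaultdict(list)
    for topic, size in PUBLISH_RE.findall(log_text):
        sizes[topic].append(int(size))
    return dict(sizes)


SAMPLE = (
    "Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r1, m0, 'ESP03/status/LWT', ... (9 bytes)) Connected "
    "Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes)) 23.73 "
    "Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Pressure', ... (6 bytes)) 998.97 "
    "Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/Light/Lux', ... (4 bytes)) 0.00 "
    "Client mosq-dtOMQFh3O6Rixd0e5d received PUBLISH (d0, q0, r0, m0, 'ESP03/BMX280/Temperature', ... (5 bytes)) 23.74"
)

sizes = payload_sizes(SAMPLE)
largest = max(max(values) for values in sizes.values())
print(sizes, largest)
```

In this sample the largest payload is the 9-byte LWT message, so the payloads are plain values of a few bytes rather than JSON.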

ghtester commented 3 years ago

It would be great if this issue could be fixed sometime... It looks even worse in the latest builds than before. As soon as the WiFi connection is not absolutely stable and there are disconnects / reconnects, the Home Assistant (openHAB) MQTT controller quickly exhausts the RAM (despite Max Queue Depth = 10, Allow Expire = enabled, De-duplicate = enabled and Full Queue Action = Delete Oldest), and an exception and reboot happen VERY soon and VERY often. Generic - MQTT Import is used, plus an Environment - DS18b20 device sending the temperature every 60 seconds, nothing really special.

Firmware

Build: | 20113 - Mega
System Libraries: | ESP82xx Core 2843a5ac, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
Git Build: | My Build: May 7 2021 20:50:54
Plugin Count: | 35
Build Origin: | Vagrant
Build Time: | May 7 2021 20:49:56
Binary Filename: | ESP_Easy_mega_20210507_custom_IR_ESP8266_4M1M
Build Platform: | Linux-4.15.0-51-generic-x86_64-with-glibc2.27
Git HEAD: | mega_52fb53a

Boot: Manual reboot (237)
Reset Reason: Exception

TD-er commented 3 years ago

How much memory do you usually have free? I thought it was working quite well these days, especially since I started using the move operator instead of copying data (copying requires twice the amount of memory). N.B. the use of the move operator may have been merged after the 7th of May, not 100% sure.

TD-er commented 3 years ago

Hmm it should have been present if you used the latest sources on the 7th of May: https://github.com/letscontrolit/ESPEasy/commit/3a11a19a49806718a2f40bde7ea697296ec82c85

ghtester commented 3 years ago

Well, it depends and it's not constant. In fact the behavior is the same on 3 different nodes with different plugins, and therefore with different free memory, but the mentioned node usually has over 10000 bytes free. When there are no disconnects it looks stable, but after 3-4 reconnects the memory is exhausted. I have some debug logs, but as soon as the WiFi connectivity drops, the logging is also interrupted (currently there is no way to log to serial). I suppose it should be reproducible. I don't know what happens during reconnects, whether it's an MQTT queue issue or the protocol itself (in the distant past I was testing with much more frequent MQTT commands, but currently it's one per 60s and it's crashing more often). Yes, I am aware of the memory management improvement, but in my case it did not help (perhaps the idle free RAM is better, but it drops more quickly during WiFi reconnects). Yes, the May 7 build was made from the sources current on that date. Thanks for looking into it... ;-)

TD-er commented 3 years ago

Can you maybe test the (small) changes made here: https://github.com/letscontrolit/ESPEasy/pull/3648

TD-er commented 3 years ago

Added some more potential fixes to the PubSubClient, based on a number of pending PRs of that library.

You can also download a testbuild here

ghtester commented 3 years ago

Thanks a lot, as I am using Vagrant to compile a Custom build with a preferred set of plugins, could you please let me know which parameters I should currently use to compile sources with necessary PR(s) not merged yet?

TD-er commented 3 years ago

You can perhaps login to the Vagrant environment and then run the build script in the tools dir (~/GitHub/letscontrolit/ESPEasy/tools) with the -p 3648 parameter to force fetching the PR.

ghtester commented 3 years ago

Yeah, thanks. I have already found the earlier instructions - should I still follow the remaining parts?

.......

Then it will checkout Git on that pull request. After that's loaded, you can kill the script and activate the Python virtual environment:

source ~/.venv/python3.8/bin/activate
pip install -U tzupdate
sudo tzupdate -p

then just run the build via:

cd /home/vagrant/GitHub/letscontrolit/ESPEasy/
pio run -e custom_ESP8266_4M1M

(in fact currently I should use custom_IR_ESP8266_4M1M)

.........

Edit - so I did what's described above and I have the custom build compiled - going to try & let you know.

ghtester commented 3 years ago

OK, so unfortunately after the upgrade the MQTT client was not connected (a red cross at main page):

MQTT Client Connected:

When I disabled the Home Assistant (openHAB) MQTT controller, refreshed the Main page, enabled the controller again and refreshed the Main page once more, the MQTT client switched to the Connected state, but after a reboot (invoked manually from the Tools page) it's disconnected again... :-/

Firmware

Build: | 20113 - Mega
System Libraries: | ESP82xx Core 2843a5ac, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
Git Build: | My Build: May 18 2021 21:55:18
Plugin Count: | 35
Build Origin: | Self built
Build Time: | May 18 2021 21:54:21
Binary Filename: | ESP_Easy_mega_20210518_custom_IR_ESP8266_4M1M
Build Platform: | Linux-4.15.0-51-generic-x86_64-with-glibc2.27
Git HEAD: | HEAD_cc02b25

TD-er commented 3 years ago

OK, will test it a bit more here. On my test node, it was connected just fine. If you could collect some logs on why it may not be connected, then please let me know.

TD-er commented 3 years ago

OK, can reproduce it, now having to fetch my own logs ;)

TD-er commented 3 years ago

Fixed it in the latest commit I just pushed to that PR.

ghtester commented 3 years ago

OK, thanks a lot, Gijs, so I'll try to recompile again with the same parameter -p 3648 and let you know as soon as I test it.

TD-er commented 3 years ago

I also made a 'normal' test build for another user here: https://www.letscontrolit.com/forum/viewtopic.php?p=52773#p52773 Not sure if that one suits your needs, but maybe it's easier to test?

ghtester commented 3 years ago

Thank you again. Unfortunately I can't use the 'normal' build anymore, as I am using both the IRRX and IRTX plugins plus as many other plugins as possible (I excluded only devices I don't plan to own). So my custom build is a bit specific...

It's already compiled and one node is upgraded. A quick test looked better regarding MQTT: no issue with MQTT (re)connection on boot. Then I moved the node to a place at the edge of WiFi reception. Several disconnects/reconnects were OK, but in the end the node could not reconnect to WiFi anymore, even when moved next to the AP where the signal was excellent... :-/ Even a RESET by button did not help; I had to unplug it from power and plug it in again, and then it reconnected quickly. So it needs more time for testing and serial logging... I'll keep you updated when I have some relevant information to share. Thanks again for your support!

TD-er commented 3 years ago

You're now testing 2 issues:

image

Make sure to also test with Restart WiFi Lost Conn. as that may turn off WiFi and restart WiFi before reconnecting. The same as if it does a warm reboot.

ghtester commented 3 years ago

Thank you, I have changed WiFi parameters to recommended settings and I'll see what happens. Unfortunately, after several disconnects/reconnects I have to say there's still something 'eating' free RAM:

Free RAM: 6560
Free Stack: 3664

It keeps decreasing and won't go back up (even when the WiFi signal changes from bad to excellent) to the values seen after a (re)boot:

Free RAM: 15368
Free Stack: 3664

TD-er commented 3 years ago

What is your controller setting regarding the queue management?

image

The Allow Expire option should clear the messages if it can't deliver them.

See documentation
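The queue options named in this thread (Max Queue Depth, Allow Expire, De-duplicate, Full Queue Action = Delete Oldest) are all meant to put a hard cap on how much memory queued messages can hold. The toy model below, in Python, illustrates that idea only. It is not ESPEasy's implementation; the class name, parameters and the age limit are assumptions made for the example:

```python
import time
from collections import deque


class BoundedQueue:
    """Toy model of a capped controller queue (not ESPEasy code)."""

    def __init__(self, max_depth=10, max_age_s=60.0, dedupe=True,
                 clock=time.monotonic):
        self.max_depth = max_depth
        self.max_age_s = max_age_s  # hypothetical expiry limit
        self.dedupe = dedupe
        self.clock = clock
        self.items = deque()  # entries are (enqueue_time, topic, payload)

    def expire(self):
        """Drop entries that have waited longer than max_age_s."""
        now = self.clock()
        while self.items and now - self.items[0][0] > self.max_age_s:
            self.items.popleft()

    def push(self, topic, payload):
        """Queue a message; return False if it was dropped as a duplicate."""
        self.expire()
        if self.dedupe and any(
            t == topic and p == payload for _, t, p in self.items
        ):
            return False
        if len(self.items) >= self.max_depth:
            self.items.popleft()  # full queue: delete the oldest entry
        self.items.append((self.clock(), topic, payload))
        return True


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
queue = BoundedQueue(max_depth=3, max_age_s=30.0, clock=lambda: now[0])
for i in range(5):
    queue.push("ESP03/BMX280/Temperature", f"23.7{i}")
    now[0] += 1.0

depth_after_burst = len(queue.items)
duplicate_accepted = queue.push("ESP03/BMX280/Temperature", "23.74")
now[0] = 100.0
queue.expire()
depth_after_expiry = len(queue.items)
print(depth_after_burst, duplicate_accepted, depth_after_expiry)
```

With a cap like this the queue itself cannot grow without bound, so if free RAM still keeps falling during reconnects, the loss is more likely somewhere outside the queue.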

TD-er commented 3 years ago

N.B. I tried very very hard to get the unit here to not reconnect by setting the WiFi settings as bad as possible and also covering WiFi antenna of the ESP. Also I'm kicking the node actively from the APs I use here, to force it to reconnect. It keeps reconnecting without issues, so I cannot reproduce it here.

ghtester commented 3 years ago

The current settings are these (in the past, changing everything had no effect):

image

Regarding the reconnects, it was strange that even a RESET did not help. I am testing by moving the node between bad and good signal. Now it looks OK, but it really needs more time to test. Usually WiFi reconnects are not a serious issue in my environment (they just sometimes take long, as there is a backup AP configured which is not always active), so the RAM exhaustion with MQTT is currently the worse problem.

BTW, on this test node there's currently just one device configured, Analog input - internal, which sends data to the MQTT controller every 10s.

TD-er commented 3 years ago

Can you also show what you use as WiFi settings (Tools -> Advanced, bottom of the page)

ghtester commented 3 years ago

image

The Force WiFi B/G and Restart WiFi Lost Conn checkboxes were unchecked when I reported the unsuccessful reconnect earlier; I changed them after your advice above. Maybe I'll disable the Periodical Scan WiFi option. I think I don't need it, and it could perhaps also consume some RAM, even if only temporarily.

TD-er commented 3 years ago

If you have lots of APs near you, it may indeed consume some more RAM. "Send with max TX power" will make it less likely your node will disconnect, thus harder to reproduce this behavior.

ghtester commented 3 years ago

Yes, that's exactly my case. I also know it could be a reason for WiFi reconnect issues, so from my perspective the WiFi works fine, especially now with the settings changed as you advised. Overall it now looks better, including MQTT. I don't have enough time for deeper tests right now, but the RAM is stable at the moment... I'll upgrade another 2 nodes in a different location in the evening and we'll see. Thanks a lot for your effort, great support and help! :-) I'll keep you updated on how it works; currently I would recommend merging the PR.

TD-er commented 3 years ago

OK, will merge the MQTT changes later. I still have to check it myself on ESP32, as the PR has some ESP32-specific changes too.

TD-er commented 3 years ago

Is this still an issue? I think it is no longer an issue, so I will close this now. Please let me know if it is still an issue, then I will re-open it.

ghtester commented 3 years ago

I hope it is OK now. I will test it as soon as some other fixes / feature upgrades that are important for my configuration are merged. Thanks a lot!

TD-er commented 3 years ago

Which fixes/features do you mean?

ghtester commented 3 years ago

I meant especially PR #3653. I also needed the P105 plugin for AHT sensors, which is not merged yet, and I was not sure how to compile the Custom build in Vagrant with several PRs that are not yet merged (maybe just repeating ./build_ESPeasy.sh -p XXXX and ./build_ESPeasy.sh -p YYYY is enough). As PR #3653 has already been merged, I compiled the Custom build yesterday and updated 3 ESP nodes with the latest custom firmware, and so far all three nodes are running without any issue. Thanks to all contributors for a great job!

TD-er commented 3 years ago

You can't checkout multiple branches at the same time, so that's indeed a bit tricky. But since https://github.com/letscontrolit/ESPEasy/pull/3653 was already merged, it should not be a problem for you to build based on a single PR. But to test all new stuff, the pending PR should still be rebased to the current 'mega' branch or else you're testing it against older code, if the pending PR was branched from an older commit in the 'mega' branch.

ghtester commented 3 years ago

Thanks for the explanation, I was afraid of that. But never mind, it's not a problem to always wait a bit, usually checkout of a single branch is enough. ;-)

ghtester commented 3 years ago

Unfortunately the primary node has already crashed... :-/

Local Time: | 2021-08-15 21:36:49
Time Source: | NTP
Time Wander: | 0.000 [msec/sec]
Uptime: | 0 days 0 hours 16 minutes
Load: | 23.97% (LC=727)
CPU Eco Mode: | false
Boot: | Exception (1)
Reset Reason: | Exception
Last Action before Reboot: | PLUGIN_READ: timer, id: 7
SW WD count: | 0
Memory
Free RAM: | 6048
Heap Max Free Block: | 2272
Heap Fragmentation: | 58%
Free Stack: | 3664

The last recorded event in log before crash: EspEasy: DS: SP: 68,0,55,5,7f,a5,a5,66,84,OK,5,31,189

ghtester commented 3 years ago

Another crash of the same node...

Local Time: | 2021-08-15 22:09:17
Time Source: | NTP
Time Wander: | 0.000 [msec/sec]
Uptime: | 0 days 0 hours 6 minutes
Load: | 25.16% (LC=557)
CPU Eco Mode: | false
Boot: | Exception (2)
Reset Reason: | Exception
Last Action before Reboot: | PLUGIN_READ: timer, id: 7
SW WD count: | 0
Memory
Free RAM: | 4880
Heap Max Free Block: | 1376
Heap Fragmentation: | 60%
Free Stack: | 3664

The last recorded event in log before crash: EspEasy: DS: SP: 86,0,55,5,7f,a5,a5,66,4e,OK,4,31,189

TD-er commented 3 years ago

Any idea what the plugin is that's crashing? You do have extremely low free memory.

Can you test the core 3.0.0 PR and then the "custom beta" build? This will add 16k of RAM in a second heap which I will then use for "runtime" data like web log, WiFi scan results, controller queue buffering, etc.

Also do you have a lot of WiFi networks near your nodes? What WiFi related settings do you use in the tools/advanced page (bottom of the page) ?

ghtester commented 3 years ago

Any idea what the plugin is that's crashing?

It's hard to say, the node is now crashing at short intervals... :-/ The first crash happened when the Uptime was 1497 and FreeMem (in log) 7744

You do have extremely low free memory.

After the upgrade there was significantly more free RAM (about 7000-9000 on web page so about 9000-11000 in log). I don't know what's eating RAM but maybe it's related to MQTT or HTTP controllers... Also it's strange it did not happen from the very start after upgrade but it may be related to bad WiFi signal level which is not stable.

Can you test the core 3.0.0 PR and then the "custom beta" build?

I am not sure if I am able to compile, at least not with the same plugins set as P105 is not merged yet (but it's not used on this node anyway).

This will add 16k of RAM in a second heap which I will then use for "runtime" data like web log, WiFi scan results, controller queue buffering, etc.

Also do you have a lot of WiFi networks near your nodes?

Yes, currently about 40 APs visible.

What WiFi related settings do you use in the tools/advanced page (bottom of the page) ?

image

TD-er commented 3 years ago

You've set the Extra WiFi scan loops to 3. This will cause the scan to take longer (if needed) and it will also show a lot more access points in the scanned list. However, this list will be cleared after 5 or 10 minutes, so if the unit remains connected for longer than that, it should not really matter that much. You have set the TX power to 16, but with a "G" network it can go higher if you like. I also advise selecting "Send with Max TX power" on units with poor WiFi stability.

If you don't need the not yet merged PR for P105, then I think you should try core 3.0.x PR for a test run. I will merge the latest changes into it.

TD-er commented 3 years ago

OK, merged the latest changes into the core 3.0.x PR and it is running on my test node. So please can you test a custom_beta_esp8266_.... build? See PR: https://github.com/letscontrolit/ESPEasy/pull/3680

ghtester commented 3 years ago

Thanks for the comments. Scan loops is set to 3 because of trouble with reconnecting in the past; I can try decreasing it later to see if it helps with the repeated crashes during the first ~10 minutes. Now the node works again without crashes; Uptime is 162 and FreeMem in the log is 10576 (without any configuration change).

I tried to compile with core 3.0.0.x, but it no longer worked for custom_beta_IR_ESP8266_4M1M, which I used earlier. IR is a must for me.

 default: /home/vagrant/.platformio/packages/framework-arduinoespressif8266*//cores/esp8266/Esp.cpp
    default: : No such file or directory
    default: Patching /home/vagrant/.platformio/packages/framework-arduinoespressif8266*/
    default: patch: **** Can't change to directory '/home/vagrant/.platformio/packages/framework-arduinoespressif8266*/'
    default:  : No such file or directory
    default: Copy /vagrant/Custom.h > /home/vagrant/GitHub/letscontrolit/ESPEasy/src/Custom.h
    default: Error: Unknown environment names 'custom_beta_IR_ESP8266_4M1M'. Valid names are 'test_B_ESP32_4M316k_ETH, test_A_ESP32_4M316k_lolin_d32_pro, test_C_ESP8266_4M1M_VCC, test_B_beta_ESP8266_4M1M, normal_WROOM02_2M256, custom_beta_ESP8266_4M1M, minimal_core_274_sdk3_ESP8285_1M_OTA_FHEM_HA, test_A_ESP32-wrover-kit_4M316k_ETH, test_C_ESP32_IRExt_4M316k, normal_alt_wifi_ESP8266_1M_VCC, test_C_beta_ESP8266_16M_LittleFS, test_D_ESP32-wrover-kit_4M316k_ETH, max_ESP32_16M1M_ETH, minimal_IRext_ESP8266_1M, test_B_ESP32_IRExt_4M316k, minimal_IRext_ESP8266_4M2M, normal_sdk3_ESP8266_1M, test_D_ESP32_4M316k_lolin_d32_pro, normal_ESP32_4M316k_ETH, normal_ESP8266_1M_VCC, normal_alt_wifi_ESP8266_1M, minimal_core_274_ESP8285_1M_OTA_Domoticz, minimal_core_274_ESP8266_1M_OTA_FHEM_HA, custom_IR_ESP32_4M316k, test_A_ESP32_IRExt_4M316k, test_B_ESP32-wrover-kit_4M316k_ETH, normal_ESP32_4M316k, test_C_ESP8266_4M1M, max_ESP32_16M8M_LittleFS, max_ESP32_16M1M, spec_debug_custom_IR_ESP8266_4M1M, normal_WROOM02_2M, spec_debug_custom_ESP32_4M316k, test_A_beta_ESP8266_4M1M, normal_IRext_no_rx_ESP8266_4M2M, normal_beta_ESP8266_16M_LittleFS, test_D_ESP32_4M316k_ETH, test_A_ESP8266_4M1M, hard_other_POW_ESP8285_1M, test_A_ESP8266_4M1M_VCC, test_B_alt_wifi_ESP8266_4M1M_VCC, custom_alt_wifi_ESP8266_1M, test_B_ESP8266_4M1M, test_C_ESP32_4M316k_ETH, test_C_ESP32_4M316k_lolin_d32_pro, test_C_alt_wifi_ESP8266_4M1M_VCC, custom_beta_ESP8266_1M, test_A_beta_ESP8266_16M_LittleFS, hard_LCtech_relay_x2_1M, max_ESP32_16M8M_LittleFS_ETH, test_C_beta_ESP8266_4M1M, test_D_beta_ESP8266_16M_LittleFS, minimal_IRext_ESP8266_4M1M, custom_sdk3_ESP8266_4M1M, test_A_ESP32_4M316k, hard_SONOFF_POW_4M1M, test_D_alt_wifi_ESP8266_4M1M_VCC, energy_ESP8266_4M1M, test_B_ESP32-wrover-kit_4M316k, minimal_core_274_sdk3_ESP8266_1M_OTA_Domoticz, test_C_ESP32_4M316k, custom_ESP32_4M316k, test_B_ESP32_4M316k_lolin_d32_pro, test_D_ESP32_4M316k, test_B_beta_ESP8266_16M_LittleFS, test_B_ESP8266_4M1M_VCC, normal_ESP8285_1M, 
spec_memanalyze_ESP8266, test_D_ESP8266_4M1M, spec_debug_beta_custom_ESP8266_4M1M, spec_debug_custom_ESP8266_4M1M, test_D_beta_ESP8266_4M1M, custom_ESP8266_1M, hard_Ventus_W266, normal_alt_wifi_ESP8266_4M1M, test_B_ESP32_4M316k, test_D_ESP32-wrover-kit_4M316k, normal_ESP8266_4M1M, custom_ESP8266_4M2M, test_A_ESP32-wrover-kit_4M316k, minimal_core_274_sdk3_ESP8266_1M_OTA_FHEM_HA, minimal_core_274_ESP8266_1M_OTA_Domoticz, custom_ESP8266_4M1M, hard_Shelly_1_2M256, minimal_core_274_ESP8285_1M_OTA_FHEM_HA, custom_IR_ESP8266_4M1M, normal_ESP8266_1M, custom_alt_wifi_ESP8266_4M1M, test_A_alt_wifi_ESP8266_4M1M_VCC, display_ESP32_4M316k, test_C_ESP32-wrover-kit_4M316k_ETH, test_C_ESP32-wrover-kit_4M316k, custom_ESP32_4M316k_ETH, test_D_ESP8266_4M1M_VCC, normal_ESP8266_4M1M_VCC, custom_ESP8266_4M2M_LittleFS, max_ESP32_16M2M_LittleFS_ETH, max_ESP32_16M2M_LittleFS, display_ESP8266_4M1M, hard_Shelly_PLUG_S_2M256, test_D_ESP32_IRExt_4M316k, energy_ESP32_4M316k, test_A_ESP32_4M316k_ETH, minimal_core_274_sdk3_ESP8285_1M_OTA_Domoticz'
    default: Error: -e option requires an argument
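The error output lists every valid environment in one long line, which makes the closest candidates easy to miss. A sketch in Python that ranks them with the standard `difflib` module; the list is a small subset copied from the output above, and `suggest_envs` is a hypothetical helper, not a PlatformIO feature:

```python
import difflib

# Subset of the valid environment names printed in the error above.
VALID_ENVS = [
    "custom_beta_ESP8266_4M1M",
    "custom_IR_ESP8266_4M1M",
    "custom_ESP8266_4M1M",
    "custom_sdk3_ESP8266_4M1M",
    "spec_debug_custom_IR_ESP8266_4M1M",
    "normal_ESP8266_4M1M",
    "minimal_IRext_ESP8266_4M1M",
]


def suggest_envs(requested, valid, n=3):
    """Return up to n valid names most similar to the requested one."""
    return difflib.get_close_matches(requested, valid, n=n, cutoff=0.6)


suggestions = suggest_envs("custom_beta_IR_ESP8266_4M1M", VALID_ENVS)
print(suggestions)
```

Name similarity says nothing about the plugin set, so whether a suggested environment actually includes the IR plugins still has to be checked in the project's PlatformIO configuration.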