letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.26k stars 2.21k forks source link

Home Assistant (openHAB) MQTT controller is exhausting RAM during reconnect #2684

Closed ghtester closed 3 years ago

ghtester commented 4 years ago

I am testing a custom build (by Vagrant) from official sources 20190926 with a plugin set customized Custom.h.txt The firmware details: Build:⋄ | 20104 - Mega System Libraries:⋄ | ESP82xx Core 2.6.0-dev, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support Git Build:⋄ | My Build: Sep 26 201908:20:00 Plugins:⋄ | 34 [Normal] Build Md5: | e6d1ef3eb972c31b6659c853501e48 Md5 check: | passed. Build Time:⋄ | Sep 26 2019 08:09:00 Binary Filename:⋄ | ESP_Easy_20190926_vagrant_custom_ESP8266_4M1M.bin

The ESP8266 node with weak WiFi reception (about -90 or even worse) is reconnecting to AP quite often. Without MQTT controller active it's not an issue but as soon as the MQTT controller is enabled and a data from sensor are transferred to MQTT broker, soon or later the RAM is exhausted (regardless on controller settings), node stops sending data and reboots in the end (usually with Exception 29). The current controller settings is here (started with defaults, tried to change everything without positive effect): MQTT I am sending data from BMP280 every 15 seconds. Also tried 2 different MQTT brokers (openHAB 2.5 embedded - unusable and external - CentOS 7 default) with the same issue.

TD-er commented 4 years ago

Does MQTT also receive data from the broker? Data which is perhaps not processed where it should be processed.

It does subscribe to some rather generic topic so it seems. How much data is one message for this controller? The queue depth is already quite low, so that should not be the most resource demanding.

You could try lowering the minimal send interval (10 msec is also fine for internal networks) Disable check for reply Lower max. retries. Lower client timeout (100 msec is often just fine).

You could also try to fetch the code of PR #2667 and try it. (I will start a test build here, but it may take about an hour) It has some bug fixes related to WiFi reconnects. (changes in publish commands do not relate to the controllers)

ghtester commented 4 years ago

Yes but just occasionally (not every 15 secs) and MQTT import plugin looks working fine. The data are just a few bytes but the openHAB is running on cloud Virtual Machine (yes I know it should not but it's just for a test purposes and the firewall rules are set properly). Thanks for the configuration hints, IMHO I already tried to disable Reply check but no way... Also lowering retries does not help. As I am not sure how to properly put the right sources to Vagrant (so far I let it to download the sources automatically), I'll perhaps wait until the next official release is published. It looks to me that the Home Assistant (openHAB) MQTT controller does not release a memory properly in case it needs to reconnect to broker even though it's configured to discard a data that couldn't be sent due to bad WiFi connectivity. When the WiFi signal is good, no visible issue happens...

TD-er commented 4 years ago

Well it is for sure possible you found bug. Testing issues related to bad WiFi are hard to reproduce.

ghtester commented 4 years ago

Yes I know that WiFi connection stability depends on many things so it's a nightmare to reproduce it somewhere else. But in this case it shouldn't be so hard, I suppose the issue will occur when the connection to broker is interrupted at intervals due to another reason (than a bad WiFi signal) as well. I am going to try... Edit: BTW. I am currently testing the MQTT controller through an AP without antenna and I am interrupting the WiFi signal using metal shield. The MQTT controller is filled in by BMP280 sensor's data every 15 sec and I am also quickly sending the IR codes by Remote Control through IRRX plugin (Communication - TSOP4838). The node usually does not survive 25 - 40 WiFi reconnects and it can be reproduced in several minutes in my case.

TD-er commented 4 years ago

Since you're building yourself. Can you test a new build with the current mega branch? I just merged a very big PR I made which includes a lot of fixes. Maybe argument parsing was one of then causing this issue?

TD-er commented 4 years ago

I've looked a bit into the code of C005, but I really have no clue where to start looking for the memory usage.

So if it still happens, you have to give a bit more detailed information on your setup and what is happening. What data should the ESP handle (also UDP traffic) and what should it reply to.

ghtester commented 4 years ago

Thanks a lot for your effort, unfortunately I can't make a fresh build right now as I am currently away from my development environment. Hopefully I'll give it a try on Tuesday and let you know.

ghtester commented 4 years ago

I have quickly tested the fresh Vagrant build (almost the same Custom.h, only enabled the "Input switch - PCF8574" plugin in addition) but the issue still persists. When the RAM fall under about 3500, the node stops publishing data to MQTT broker (no error in serial console). Usually the RAM is not released anymore. One time after several minutes the free RAM increased to about 5300 but it did not restore the MQTT transfers to broker. The import plugin was still working fine and it was receiving the data from MQTT broker without issue even with low memory.

TD-er commented 4 years ago

What log level do you have active? At least set the log level on the serial port to "none". Maybe the logs for the serial port cannot be flushed in due time and that buffer is hogging the memory?

ghtester commented 4 years ago

Currently I have debug level on the serial port (connected to PC) and Info on Web Log Level but in past it was one of first things I tried...

giig1967g commented 4 years ago

which MQTT broker are you using? Mosquitto?

ghtester commented 4 years ago

@giig1967g Yes, Mosquitto on Centos 7 - looks to work stable. @TD-er Just tried to disable both Web and Serial logging completely... But it's strange sometimes I see the IR output (when pressing buttons on IR Remote Control) in serial console like this:

IR: RAW TIMINGS: uint16_t rawData[133] = {538, 1710, 564, 1684, 540, 1734, 516, 608, 516, 584, 538, 586, 540, 582, 540, 5IR: RAW TIMINGS: uint16_t rawData[103] = {538, 586, 544, 1730, 514, 1734, 516, 614, 510, 1710, 564, 558, 540, 590, 534, 5IR: RAW TIMINGS: uint16_t rawData[135] = {4536, 4514, 540, 1708, 538, 1710, 538, 1734, 540, 558, 542, 582, 540, 582, 540,IR: RAW TIMINGS: uint16_t rawData[135] = {4510, 4540, 538, 1710, 540, 1708, 540, 1734, 514, 586, 540, 608, 514, 584, 538,

So there's perhaps an issue with IRRX plugin which is sending the RAW data (when the IR code is not recognized) to serial even though the logging is turned off. The web log (also turned off) window does not show anything.

BTW.1: Could it be possible that the Max Queue Depth parameter in Home Assistant (openHAB) MQTT Controller settings is ignored somehow? I tried to decrease it from 2 to 1 and when I put WiFi Shield Cover to ESP node (to interrupt the connection), press a several times different RC keys and uncover the node (so the MQTT communication is restored), on openHAB Control Panel I see the value of the IR_Input is quickly changed several times after while... BTW.2: I can confirm even with Web Log and Serial Log disabled the node is rebooting in minutes / hours with Exception 29 when sending BMP280's data every 15 sec and IRRX data at quick intervals to MQTT broker while the ESP node is kept in bad WiFi signal (RSSI about -90) so the MQTT connection is permanently dropping and reconnecting.

TD-er commented 4 years ago

There are 2 buffers active with MQTT controllers:

So it may very well be the controller queue had already sent the message to the MQTT client code and that one stored it in the output buffer.

TD-er commented 4 years ago

Just disable the IRRX plugin and try again. Not sure what the IRRX plugin is doing, but I can imagine it will be blocking for the entire node while receiving. So then it will disrupt WiFi functionality and since the MQTT communication is rather chatty, even when the node itself is not sending a lot of messages, it may fill up buffers which are not being read. Also I don't know how well the IR plugin is handling its buffers when it cannot flush them.

ghtester commented 4 years ago

Thanks for the hint / explanation, unfortunately IRRX is quite important for me and I would like to keep using it. But even though it's not in use (and only BMP280 is sending data every 15 sec), the issue is the same (node reboots, maybe not so often). But I'll try to disable it to test if there's a change...

TD-er commented 4 years ago

"Not in use" as in disabled? (and reboot just to be sure nothing is left active from the IR plugin)

ghtester commented 4 years ago

Not in use = not pressing RC keys but plugin was enabled. Next step is to disable the IRRX plugin as you recommended. Need to move the node to test location so it'll be rebooted for sure. ;-)

TD-er commented 4 years ago

You don't know whether the IR receiver is receiving something and thus keeping the GPIO pin active while not knowing it. So I guess disabling it is the best test to make sure the IR plugin isn't the one causing issues here. Maybe even disconnect the IR receiver, or putting some cap over it.

ghtester commented 4 years ago

IR plugin disabled, logging disabled, node rebooted with Exception 5 times during last 4 hours.

TD-er commented 4 years ago

Exception as in error 29?

Can you try to get some more info, like logs. Also can you log (chart?) the free memory, to see if it degrades over time. Please show what plugins and controllers are active.

ghtester commented 4 years ago

I suppose the error is still the same (Exception 29), as logging is now disabled. I did not make any memory occupation chart, just observing but the degradation was visible. It depends on WiFi signal level & amount data sending to MQTT broker. That means if the WiFi connection is great, no issue with RAM degradation occurs, no reboots. Also if I stop sending MQTT data, even with bad WiFi signal, everything looks OK. There's only one Controller configured and enabled (as number 2 because first controller, default in deep past, was deleted ago). Plugins are currently in accordance with the screenshot but I don't think it's a primary issue: plugins

TD-er commented 4 years ago

OK, I will change some things I'm working on and I'll have a test build for you in about 2 hours. I did fix some potential issues last night after running a static code analysis tool on the code base and will also have a look at the reports of non-working publish commands made this morning. Maybe those are related?

giig1967g commented 4 years ago

it reminds me of the issue that I opened regarding the fact that if the openHAB server is switched off, all my units reboot. I also use Mosquitto. It seems that the MQTT implementation has some issues with Mosquitto and/or OpenHAB.

ghtester commented 4 years ago

@giig1967g It's not exactly the same in my case, when the openHAB server is not accessible, my ESP node still tries to connect it but does not reboot. I see the issue when the plugins are filling MQTT queue and the MQTT client (Controller) is periodically disconnecting and reconnecting... MQTT Import plugin works fine all the time.

TD-er commented 4 years ago

Well seen from the node's perspective, it doesn't really matter whether the MQTT broker is offline or the node is having a very bad WiFi connection. So they could very well be related issues.

giig1967g commented 4 years ago

@TD-er : it's time for you to setup an OpenHAB test plant ! :) ah ah

ghtester commented 4 years ago

@TD-er Maybe I am mistaken but I really see a worse situation with a bad WiFi signal (so the MQTT controller is disconnecting and reconnecting quite often) than if the MQTT broker is inaccessible.

TD-er commented 4 years ago

From the node it is about equally bad. If the node does assume it is "connected" to the network (not the MQTT broker), it will start collecting data. When the broker cannot be reached (either due to bad network connection or broker down), the messages will be queued. Maybe the "node connected, broker offline" is even worse, since the node is then continuing collecting data for the controller.

ghtester commented 4 years ago

OK but I would focus on the queue algorithm, especially what (in fact) happens during MQTT disconnect / reconnect and if the buffer length is not increasing over the designed limit. Also I suppose the ESP node which is "permanently" reconnecting to MQTT broker does make the integrated openHAB's MQTT broker Moquette inoperable in 48 - 72 hours (reproduced 3 times so I moved to external Mosquitto which so far seems to be working stable).

TD-er commented 4 years ago

Can you try this test build: ESPEasy_mega-20191028-21-PR_2698.zip Please let me know if it does work and it would be great if you could do some more thorough testing so I can merge it for tomorrow's build.

It does have some minor fixes on determine the actual size of a queued message object. Not sure if that is the true issue, but it may lead to false positives on free memory checks used in the queue mechanism.

jimmys01 commented 4 years ago

For the record I have 200+ nodes on OpenHab and Mosquitto with absolutely no issues.

giig1967g commented 4 years ago

@TD-er I will be able to test it only on saturday @jimmys01 have you aver tried to switch off just the openHAB server (keeping the wifi on)? If so, do the units reboot with HW WD? In my case they do reboot.

ghtester commented 4 years ago

@TD-er Unfortunately I can't test it right now, perhaps at late evening or tomorrow.

ghtester commented 4 years ago

Still experiencing the issue (with IRRX and logging enabled again to be able to test quickly) on fresh Vagrant build from sources 20191104.

giig1967g commented 4 years ago

Hi @ghtester Which core are you using? I am using 2.6.0 (ESP_Easy_mega-20191104_normal_core_260_sdk222_alpha_ESP8266_4M1M) and so far no reboots.

ghtester commented 4 years ago

@TD-er I believe it can be reproduced quickly - I have removed an antenna from AP and put a grounded metal shield between AP and ESP Easy node to tune up RSSI about -90. Then filling quickly the MQTT queue by data from BMP280 sensor (and by IR RC, optionally). Moving the shield to perform WiFi disconnect / reconnect. Usually after 20 / 40 reconnects, the node's free RAM fall below a safe value and node reboots with Exception 29. @giig1967g Build:⋄ 20104 - Mega System Libraries:⋄ ESP82xx Core 2.6.0-dev stage, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support Git Build:⋄ My Build: Nov 5 201913:13:00 Plugins:⋄ 35 [Normal] Build Md5: 77687d79dc8e559d0b52a0afa452ff Md5 check: passed. Build Time:⋄ Nov 5 2019 13:14:36 Binary Filename:⋄ ESP_Easy_20191105_vagrant_custom_ESP8266_4M1M.bin

I don't know how to change the core so it's default in Vagrant environment. Just customized the plugin list. If you have a good WiFi connectivity, you can't see an issue. My trouble is that I need to operate one ESP Easy node with bad WiFi reception. As MQTT protocol is designed to overcome the bad connectivity, it should work fine without exhausting memory. I would expect losing some data if the queue is not large enough to keep it due to lost connectivity but I think MQTT Controller should not "eat" all memory because of too many reconnects.

TD-er commented 4 years ago

Not sure if I can have a look at it today, since I'm in Germany right now. Will be back on Friday.

giig1967g commented 4 years ago

I have a setup with OpenHAB+Mosquitto. This weekend I can try to simulate the poor wifi, just creating a new unit and placing it far away and acting as you say.

TD-er commented 4 years ago

putting your finger on the ESP's antenna should also give quite a big attenuation on the WiFi reception. Or placing metal close to the antenna.

TD-er commented 4 years ago

Can you test by removing this line: https://github.com/letscontrolit/ESPEasy/blob/510afb2553b28501db5f5bc0de761c1eb851e2cf/src/Controller.ino#L183

And this line: https://github.com/letscontrolit/ESPEasy/blob/68cae5519fe4043d109d99170fc0cd85c2b071fa/src/_P037_MQTTImport.ino#L49-L50

See this comment: https://github.com/knolleary/pubsubclient/issues/673#issuecomment-548270079

ghtester commented 4 years ago

Thanks a lot, I have edited the outlined files in Vagrant virtual machine, recompiled, updated node and quickly tested but still experiencing the issue with FreeMem decreasing and rebooting due to Fatal exception 29. BTW. it looks that after node reboot due to Exception (or due to REBOOT command invoked) there's an issue with WiFi reconnecting to AP (AP not found). The WIFISCAN command from serial console does show about 5 - 7 APs in my site even after REBOOT command invoked and the node can't reconnect to AP anymore until node is powered off/on. Then the node connects to WiFi AP immediately and WIFISCAN does show about 25 APs around.

TD-er commented 4 years ago

You could check the Restart WiFi after lost connection (or whatever I called it), since that will put the WiFi actually off and on again before reconnecting.

What core lib do you use? Normally I would suggest core 2.6.0, but then only for builds of before November. There has been a recent merge in the core libs which does require the newer build tools or else you will end up with unstable WiFi...

ghtester commented 4 years ago

Yes the "Restart WiFi Lost Conn:" check in Tools - Advanced works fine and the ESP_Easy node reconnects to AP reliably even after warm reboot when this option is selected.

Edit - the statement above is not correct - after several reboots the node can't reconnect after warm reboot anymore and it looks like the WiFi sensitivity is decreased. Cold boot fixes it.

I am using Vagrant default settings regarding to core version so the release is:

Build:⋄ 20104 - Mega
System Libraries:⋄ ESP82xx Core 2.6.0-dev stage, NONOS SDK 2.2.2-dev(5ab15d1), LWIP: 2.1.2 PUYA support
Git Build:⋄ My Build: Nov 9 201923:33:07
Plugins:⋄ 35 [Normal]
Build Md5: 6df129eee8a7d9114df3b7be98d0c9ce
Md5 check: passed.
Build Time:⋄ Nov 9 2019 23:07:30
Binary Filename:⋄ ESP_Easy_20191109_vagrant_custom_ESP8266_4M1M.bin
TD-er commented 4 years ago

I think you are now running into the issue where there needs to be an update of the build tools for the current core 2.6.0 See: https://github.com/platformio/platform-espressif8266/issues/180

ghtester commented 4 years ago

OK but what should I do to reflect that in Vagrant please? I am always creating the fresh Vagrant VM (vagrant destroy before vagrant up) when the ESP_Easy sources are updated.

TD-er commented 4 years ago

OK but what should I do to reflect that in Vagrant please? I am always creating the fresh Vagrant VM (vagrant destroy before vagrant up) when the ESP_Easy sources are updated.

Right now, the staged build (core 2.6.0 effectively) is broken when building with PlatformIO and has to be fixed by PlatformIO. (as you can see in the issue I linked)

The problem is, you need to change the platformio.ini file and that one is fetched in the Vagrant build and cannot (yet) be updated by the user when running the standard Vagrant provision steps.

It will work, but you may experience unstable WiFi. I think the issue will be solved quite soon by the PlatformIO guys, since anyone is running into this, who is building based on core 2.6.0 (or the stage branch which is now core 2.7.0-alpha)

ghtester commented 4 years ago

OK, thanks for the details. I don't currently see any serious issue related to WiFi if the signal is strong, only all the issues mentioned above due to bad WiFi coverage. I already know how to connect to the Vagrant VM and edit necessary files (so I was able to recompile firmware with changes you adviced yesterday). So if it's enough to update the platformio.ini by the similar way and if I would know what to change, I could try. Otherwise I'll wait until some good soul updates the Vagrant environment.

TD-er commented 4 years ago

The thing you have to change: https://github.com/letscontrolit/ESPEasy/blob/510afb2553b28501db5f5bc0de761c1eb851e2cf/platformio.ini#L416-L426

Change both occurrences of beta_platform into regular_platform. It will change the core lib version from 2.6.0 to 2.5.2. (which does have other bugs)

By the way, I just got of the phone with someone reporting the same issues as what this issue is about.

ghtester commented 4 years ago

OK, Gijs, thanks a lot for details, I'll give it a try and let you know if it helps. I am glad if the issues are confirmed so it looks my report was not a hoax... ;-)

giig1967g commented 4 years ago

@ghtester I tested in my environment with your settings, but I can't reproduce your problem. Could it be Vagrant related?