letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.28k stars 2.21k forks source link

MQTT stops working since 20181008 #1918

Closed fraeggle closed 5 years ago

fraeggle commented 6 years ago

BUILDS! ---> 20181017

Summarize of the problem/feature request

Since build 20181008 I have the problem that MQTT "hangs" regularly. Then no more values are transferred. For example, I see that the LWT status of Connected is no longer there in the IOBroker. If I take build 20180927, everything is stable again immediately.

Expected behavior

Stable connection of MQTT

Actual behavior

ESP looses connection

Steps to reproduce

using newer builds than 20180927 (have used 20181008..till 20181011 and 20181017) Sreenshot is with build 20180927. With newer Builds connection "Connected" disappears and f. ex. Spannung is not been updated anymore.

Yes, After some time (not always after same time), IOBroker looses connection to client.

Still exits

System configuration

Hardware: ESP Wroom2

ESP Easy version: Release mega-20181017 mqtt device mqttconf

Rules or log data

only one rule: on MQTT#Connected do Publish %sysname%/status/IP,%ip% endon

TD-er commented 6 years ago

In 20181004 the following has changed:

And in 20181006:

Could you perhaps test the 20181003 build and maybe these other 2 also to narrow down the problem?

fraeggle commented 6 years ago

i will trybut I'm afraid I won't be able to do it until Saturday. ;-)

TD-er commented 6 years ago

Do you have any idea on how long it takes before the MQTT connection is lost? And does it somehow coincide with a wifi reconnect?

fraeggle commented 6 years ago

As far as I could see, there was no Wifi loss. that the MQTT connection was lost was quite fast this morning (10 min) but I also had hours according to the last entry from the IOBroker... I will try to start at least one of the 3 builds tonight.

TD-er commented 6 years ago

Just as a check; Do you build it yourself, or use nightly builds?

Sasch600xt commented 6 years ago

another thing to check:

are you sure MQTT is stopping working ? My one here shows after a while "Connection LOST" but everything is still working with MQTT.

The only bug i see is the message "Connection Lost" while everything is still working. For example i get uptime minutes by MQTT controller. After a while (between 10minutes and 10 hours) i see "Connection Lost" but the minutes still counts up send over MQTT.

Greetz Sascha

TD-er commented 6 years ago

MQTT should do a reconnect as soon as it has detected it was no longer connected. See also @Sasch600xt other related issue: https://github.com/letscontrolit/ESPEasy/issues/1873#issuecomment-431170001

A few builds before 20180927, the one you report to be working just fine, I added a fix for MQTT controllers in which the MQTT client on the ESP will give itself a new client ID, just to prevent the broker to refuse a new connection when the broker still assumes an existing connection from that client and the client thinks it is disconnected.

Can you increase the timeout setting of the controller? The default was 1000 msec, before I introduced that setting and the first build had 100 msec as default. Later I changed the default (for new settings) to 300 msec. So maybe you can try to set it to 300 or higher (no higher than 1000 msec) just as a test. (no matter what version, at least one that used to fail)

blb4github commented 6 years ago

I did experience the same issue with 1 (out of 5) units. I have played a bit with the MQTT controller settings (Minimum Send Interval, Max Queue depth, Max retries & Full Queue Action); settings which work now for me with this unit: 1000ms, 5 / 5 and "Delete Oldest".

fraeggle commented 6 years ago

Ok news... Looks like it is the same issue as blb4github commented 12 hours ago is talking about. Took a NEW Unit and it works until now without Problems!!!! Looks like the "old" one does have memory problem or timing problems. I will try to increase the timout in this one... keep you informed ;-) Btw.. (Sorry.. Selfbuild with Atom.. because i don't want all Plugins... But until now it always went good..) You all are doing great work on this... and as far as i discovered it is only this (and therefore the one with the newest SW) which does have this problem. I will increase settings and let you know.

Regards Peter

uzi18 commented 6 years ago

@fraeggle is it possible for you to paste here screenshot of good and bad modules systeminfo pages? (esp and storage sections)

TD-er commented 6 years ago

And some description of the hardware. For example I have the feeling there have been some changes recently in sonoff modules. Sonoff basic and the S20

TD-er commented 6 years ago

@fraeggle Could you perhaps also perform a full wipe of the problematic node and start clean? Empty bin files are included in the release ZIP files.

fraeggle commented 6 years ago

Hi TD-er i had erased the bad node before... Since i have changed Timeout Settings to 800ms it is stable. Voltage is being updated every 120 sec.

Hardware nothing special.... becaus still waiting for INA219 to arrive.. ;-)

esp_good esp_bad mqtt_neu devices Regards Peter

fraeggle commented 6 years ago

ups i unfortunately closed it..

But i think with this "workaroud" it can be closed... I really think it is a Unit problem now.. What do you think about it TD-er?

TD-er commented 6 years ago

Still it is strange the units differ in this aspect. It would suggest the wifi stability is worse in one unit

fraeggle commented 6 years ago

i agree but I think that this unit simply has too large a tolerance range. As i told you, since i changed to 800 ms it works fine...
so if you agree i will close this one, due to the fact that there is the ability to change the timeout.

Thanks for your help..

regards Peter

fraeggle commented 6 years ago

i found "since 20181007 MQTT Open Hab shows "Connection Lost" #1873. Maybe the same issue? TD-er tell me if i can help by creating logs or something like that... Using Openhab (with IOBroker). but until now no error after setting Timeout to 800ms

TD-er commented 6 years ago

Is the MQTT broker for OpenHAB you are running, installed on a computer in your own network, or external? And if local, is it perhaps installed on a computer which lacks resources?

Sasch600xt commented 6 years ago

i have also a test running since 394 minutes at 20181020. Timeout at 800ms. Sometimes it shows up as Disconnected after 24 hours or even longer. But i never reached 2 days. And again, fo me all is still working after it shows "Connection Lost". So it´s just the message what is confusing for me. i´ll keep you informed

TD-er commented 6 years ago

Connection Lost is only an information message indicating... the connection was lost. But it should restart a new connection. In the sysinfo page you can see how often the WiFi connection was lost and rebuilt and also how long it is connected to WiFi since the last (re)connect to the accesspoint.

As soon as the WiFi connection is interrupted, it will also assume the MQTT connection has to be rebuilt, so it will consider the connection to the MQTT broker to be lost. Until 4 weeks ago, it was possible the MQTT broker would not accept the new connection attempt, since the broker still considered the client to be connected. This could result in indefinite reconnect attempts, when the client kept trying to connect and the broker does not accept the connection. I changed the MQTT client ID on every reconnect (adding the number of reconnects to the client ID) so the broker would consider it a new client. This results in a speedy reconnect, with as only drawback, the broker would send the LWT (last will and testament) since it assumes the client to be disconnected.

If that results in undesired behavior, I can change it into a new strategy. For example on a reconnect attempt which fails, I can try to send a proper disconnect message first.

In short, there are numerous options on which the connection to the broker can get lost and I'm not sure why in your case it is lost. Could be a timeout, WiFi disconnect, some unknown response of the broker or any other reason.

Sasch600xt commented 6 years ago

ok.....got it. Very good statement, thank you. I am at minute 507 right now / Open HAB controller / 800ms timeout. So far all is good :)

TD-er commented 6 years ago

Should I set the default timeout for new instances of a controller back to 1000 msec?

Sasch600xt commented 6 years ago

well......gimme 36 more hours.....then i know better about the bug is comming also with 800ms or not.

Minute 629 now without problems...

fraeggle commented 6 years ago

weird... discoverd following after change to 800ms. Last Disconnect Reason | (200) Beacon timeout Number reconnects | 3 What das Beacon Timeout mean? MQTT still running but right now the same as Sasch600txt. But different than before... MQTT still working only broker tells this one is not connected. mqtt

TD-er commented 6 years ago

For more information on the Beacon frame, see Wikipedia In short, the access point sends periodically a packet containing information about the network. This interval is typically 100 TU (102.4 msec). The ESP module is trying to listen to this beacon every time, but for a number of reasons it may miss a beacon frame. The timeout is longer than 100 TU, so it must miss a number of these beacon frames to report a beacon timeout. Reasons to miss such a beacon frame are:

So the "beacon timeout" is happening every now and then on the ESP nodes. I am working hard to get every plugin/controller to use short time slices to make tasks as little blocking as possible.

Sasch600xt commented 6 years ago

i can confirm all fraeggle said. Somewhen tonight i got "Connection Lost"

Have a nice weekend :) Sascha

@fraeggle: if it shows up as "Connection Lost" try to go into the Controllersettings of ESPEasy, simple disable the controller -> save , enable it again -> save. Then at least for me it shows up as connected again. Not a solution, but at least intresting :) is this IPSymcon you are using ?

@TD-er: too busy tasks ? well......at the "TEST UNIT" i have running here, i only send uptime minute every 60 seconds......nothing more is running at this unit.....so probably the smallest possible system

TD-er commented 6 years ago

@Sasch600xt the terms "too busy" is maybe not the best one describing the real issue. The Arduino way of doing things is:

On top of that, the Arduino version for ESP8266 (and ESP32 Arduino) are running some tasks outside the loop() for the Arduino part. These tasks are about background processes, like handling network connection and incoming traffic, etc. The background tasks are executed only at:

If a loop run takes more than 102.4 msec and no calls to yield() or delay() are made, the ESP node will have missed a beacon interval. Also if it is running several blocking tasks which are always busy right at the moment the WiFi access point is sending the beacon, a number of beacons will be missed.

When you look at the serial log (with Debug level enabled) you will see the timing statistics. Some of them may make several tens of msec and thus are candidate for causing these 'disconnect' reasons.

I could add a task to the scheduler to try to listen to the beacon every 102.4 msec. Only thing is, I am not sure how to see when such a beacon has been seen.

About this issue, I could look into the disconnect/reconnect of MQTT when a connection has been lost. What broker are you using? I am using Mosquito here and it is working fine with the current behavior.

Sasch600xt commented 6 years ago

ok, "too busy" was a bit easy saying of me :)

I am using a script running in my ipsymcon housecontrol eviroment for MQTT broker

Greetz Sascha

fraeggle commented 6 years ago

Hi Sascha.. Using IOBroker. @TD-er went back to Build 27092018. Broker Tells me still Connected... Really Confusing.. NO errors within 14 Hours ( Number reconnects | 0)

TD-er commented 6 years ago

I am installing IObroker now on my computer, just to see what's going on.

Edit: 45 minutes later and I stil am not able to get IObroker to run on my computers. Not sure what's going on here, but the Windows installer just doesn't work (the service bat file is nowhere to be found) and installation on Linux just doesn't finish. It keeps trying to do the same npm install over and over again. Tested on Ubuntu 18.04 on an Intel CPU host and Bash on Windows (Ubuntu 18.04)

fraeggle commented 6 years ago

not sure with windows.. i think there are some software dependencies. Runing it on Raspberry. for Windows ioBroker verwendet Node.js als Plattform und setzt diese voraus. (Download: http://nodejs.org/download/) first node.js has to be installed.

fluppie commented 6 years ago

I also have an issue with one module with 4 DS18B20 sensors connected. I thought it was my RasPi but took a clean Raspbian Streth image and installed mosquitto and node-red onto it. Same issue, connection breaks after 6-12 hours.

afbeelding

afbeelding

Dashboard: https://emoncms.org/Edegem/scrtmp2e The 4 curves from the ESP are CV_aanvoer, CV_retour and Sanitair_warm, Sanitair_koud

Grovkillen commented 6 years ago

@fluppie if you use official builds?

fluppie commented 6 years ago

No, I build myself with PlatformIO/Atom EDIT: Ha, I didn't read well, I'll try an official build.

fluppie commented 6 years ago

afbeelding

Let's watch this!

redskinhu commented 5 years ago

Hi

I have/had a same problem, i using HomeAssistant but after ESPEasy_mega-20181111 fw is the problem looks fixed to me.

Thx T

fluppie commented 5 years ago

Mine was still loosing connection after 2-3 days. I'll update to: ESP_Easy_mega-20181112_normal_ESP8266_4096.bin

Let's see!

redskinhu commented 5 years ago

Mine was lose the conn. after 1-10 hours. I have ~10h30min uptime and all (5pcs) the related modules connected. :-)

fraeggle commented 5 years ago

seems it would be worth to try... still on 2709 because need a stable connection. Pls keep me informed. :-))

fluppie commented 5 years ago

You can also follow via: https://emoncms.org/Edegem/scrtmp2e and check if the CV and Sanitair graphs are there :).

redskinhu commented 5 years ago

1 day uptime and still ok. :-)

fraeggle commented 5 years ago

You can also follow via: https://emoncms.org/Edegem/scrtmp2e and check if the CV and Sanitair graphs are there :).

:-D Sanitair_warm nearly 60 C ? little Campfire? ;-) I will try the firm.. Thanks...

fluppie commented 5 years ago

I was showering ;-)

fraeggle commented 5 years ago

1 day... still connected :-D using 12112018

redskinhu commented 5 years ago

After ~3days 3hours one of my unit lost the MQTT connection. :-( ( mega-20181112 4096 VCC fw )

Sasch600xt commented 5 years ago

@redskinhu: just to make sure, is it still working well after the unit lost MQTT connection ? because on my side, it is only showing up as "connection lost" but still working fine with all MQTT actions.

Greetz Sascha

TD-er commented 5 years ago

Indeed, a lost connection is not uncommon every now and then. As long as the connection is being rebuilt without human intervention, there is no problem. WiFi connections will be reset every now and then and there is nothing you can do to prevent it. As soon as such a connection is lost, it notifies MQTT clients on the same node to reconnect.

redskinhu commented 5 years ago

It looks some kind of LWT related problem. MQTT is not totally disconnected, ESPEasy sends / recieve MQTT messages but the LWT is not renewed so the HomeAssistant didn't get the LWT connected message therfore it shows the relevant sensor/switch is unavailable. This is my theory...

fraeggle commented 5 years ago

Mine Still connected and unlike my first post even MQTT LWT tells me connected. Looks good.

Indeed, a lost connection is not uncommon every now and then. As long as the connection is being rebuilt without human intervention, there is no problem. WiFi connections will be reset every now and then and there is nothing you can do to prevent it. As soon as such a connection is lost, it notifies MQTT clients on the same node to reconnect.

OK is there a possibility to renew LWT connected in this case?

TD-er commented 5 years ago

It is supposed to send the LWT at the moment it renews the connection.