letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.23k stars 2.19k forks source link

MQTT disconnects every minute, re-publishes all data on connect #2840

Open jefft4 opened 4 years ago

jefft4 commented 4 years ago

Checklist

I have...

Steps already tried...

Summarize of the problem/feature request

MQTT disconnects every minute, then immediately reconnects and subscribes to the topics (in this case, domoticz). ESP publishes all sensor data after MQTT connects, causing the sensor to report every minute - excessive.

Need to make the MQTT timeout user-configurable, I think, or at least make it much longer. Why so short? MQTT can stay connected for hours with no problem, if it sends heartbeats occasionally. It should only disconnect if told to, or if it loses the connection.

System configuration

Wifi AP 3m from the ESP, good signal. Zero wifi reconnects shown on status page. MQTT broker is mosquitto, server (cabled) on same subnet as ESP.

ESP Easy version:

Build:⋄20104  - Mega
System Libraries:⋄ESP82xx Core 2_5_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.1.2 PUYA support
Git Build:⋄mega-20191103
Plugins:⋄46 [Normal]
Build Time:⋄Nov  3 2019 03:18:10
Binary Filename:⋄ESP_Easy_mega-20191103_normal_ESP8266_4M1M.bin

Replicated also on build 20191208; exactly the same behaviour. Replicated on three different ESP modules.

Rules

Rules on one ESP are minimal - just a timer to reboot if wifi is disconnected for 5 mins. Rules on the other include 7-segment display of sensor values.

Log data

15371459: EVENT: Rules#Timer=1
15371671: ACT  : 7dt,                
15371677: 7DGT : Show Temperature=0.00
15371691: ACT  : TaskValueSet,12,1,4+1             
15371703: Command: taskvalueset
15371716: ACT  : TaskValueSet,12,1,1
15371728: Command: taskvalueset
15371736: ACT  : timerSet,1,5
15371748: Command: timerset
15377463: EVENT: Rules#Timer=1
15377554: ACT  : 7dt,23.3             
15377560: 7DGT : Show Temperature=23.30
15377696: ACT  : TaskValueSet,12,1,1+1             
15377709: Command: taskvalueset
15377727: ACT  : timerSet,1,5
15377738: Command: timerset
15377778: MQTT : Connection lost, state: Connection timeout
15377779: EVENT: MQTT#Disconnected
15377896: MQTT : Connected to broker with client ID: GymClimate_12
15377897: Subscribed to: domoticz/out
15377900: EVENT: MQTT#Connected
15378123: DHT  : Temperature: 25.10
15378123: DHT  : Humidity: 42.20
15378127: EVENT: DHT#Temperature=23.34
15378205: EVENT: DHT#Humidity=42.20
15378306:  Domoticz: Sensortype: 2 idx: 15 values: 23.3;42.2;1
15379749: DS   : Temperature: 32.44 (28-ff-c4-43-91-16-4-b9)
15379752: EVENT: DA#Temperature=32.44
15379853:  Domoticz: Sensortype: 1 idx: 13 values: 32.4
15379885: DS   : Temperature: 36.63 (28-ff-83-a6-c4-17-4-43)
15379888: EVENT: DB#Temperature=36.63
15379988:  Domoticz: Sensortype: 1 idx: 14 values: 36.6
15383459: EVENT: Rules#Timer=1
15383612: ACT  : 7dn,42.2          
15383617: 7DGT : Show Number=42
15383754: ACT  : TaskValueSet,12,1,2+1             
15383769: Command: taskvalueset
15383845: ACT  : timerSet,1,5
15383857: Command: timerset
15388459: EVENT: Clock#Time=Thu,11:58
15389459: EVENT: Rules#Timer=1
15389663: ACT  : 7dt,32.4            
15389668: 7DGT : Show Temperature=32.40
15389751: ACT  : TaskValueSet,12,1,3+1             
15389764: Command: taskvalueset
15389809: ACT  : timerSet,1,5
15389823: Command: timerset
15392730: WD   : Uptime 257 ConnectFailures 0 FreeMem 18104 WiFiStatus 3
15395466: EVENT: Rules#Timer=1
15395739: ACT  : 7dt,36.6            
15395744: 7DGT : Show Temperature=36.60
15395787: ACT  : TaskValueSet,12,1,4+1             
15395800: Command: taskvalueset
15395839: ACT  : TaskValueSet,12,1,1
15395852: Command: taskvalueset
15395860: ACT  : timerSet,1,5
15395872: Command: timerset
15401459: EVENT: Rules#Timer=1
15401559: ACT  : 7dt,23.3             
15401565: 7DGT : Show Temperature=23.30
15401747: ACT  : TaskValueSet,12,1,1+1             
15401760: Command: taskvalueset
15401805: ACT  : timerSet,1,5
15401818: Command: timerset
15407459: EVENT: Rules#Timer=1
15407610: ACT  : 7dn,42.2          
15407616: 7DGT : Show Number=42
15407749: ACT  : TaskValueSet,12,1,2+1             
15407762: Command: taskvalueset
15407807: ACT  : timerSet,1,5
15407819: Command: timerset
15413463: EVENT: Rules#Timer=1
15413664: ACT  : 7dt,32.4            
15413670: 7DGT : Show Temperature=32.40
15413759: ACT  : TaskValueSet,12,1,3+1             
15413772: Command: taskvalueset
15413819: ACT  : timerSet,1,5
15413831: Command: timerset
15419459: EVENT: Rules#Timer=1
15419705: ACT  : 7dt,36.6            
15419710: 7DGT : Show Temperature=36.60
15419752: ACT  : TaskValueSet,12,1,4+1             
15419765: Command: taskvalueset
15419806: ACT  : TaskValueSet,12,1,1
15419818: Command: taskvalueset
15419827: ACT  : timerSet,1,5
15419840: Command: timerset
15421163: Dummy: value 1: 1.00
15421163: Dummy: value 2: 0.00
15421163: Dummy: value 3: 0.00
15421163: Dummy: value 4: 0.00
15421165: EVENT: MyVars#cycle=1.00
15421243: EVENT: MyVars#=0.00
15421321: EVENT: MyVars#=0.00
15421399: EVENT: MyVars#=0.00
15422729: WD   : Uptime 257 ConnectFailures 0 FreeMem 19064 WiFiStatus 3
15425459: EVENT: Rules#Timer=1
15425552: ACT  : 7dt,23.3             
15425558: 7DGT : Show Temperature=23.30
15425727: ACT  : TaskValueSet,12,1,1+1             
15425739: Command: taskvalueset
15425757: ACT  : timerSet,1,5
15425768: Command: timerset
15431459: EVENT: Rules#Timer=1
15431589: ACT  : 7dn,42.2          
15431594: 7DGT : Show Number=42
15431693: ACT  : TaskValueSet,12,1,2+1             
15431706: Command: taskvalueset
15431725: ACT  : timerSet,1,5
15431737: Command: timerset
15437459: EVENT: Rules#Timer=1
15437630: ACT  : 7dt,32.4            
15437635: 7DGT : Show Temperature=32.40
15437720: ACT  : TaskValueSet,12,1,3+1             
15437734: Command: taskvalueset
15437752: ACT  : timerSet,1,5
15437764: Command: timerset
15440916: MQTT : Connection lost, state: Connection timeout
15440917: EVENT: MQTT#Disconnected
15441032: MQTT : Connected to broker with client ID: GymClimate_12
15441034: Subscribed to: domoticz/out
15441036: EVENT: MQTT#Connected
15441260: DHT  : Temperature: 25.10
15441260: DHT  : Humidity: 42.20
15441264: EVENT: DHT#Temperature=23.34
15441342: EVENT: DHT#Humidity=42.20
15441442:  Domoticz: Sensortype: 2 idx: 15 values: 23.3;42.2;1
TD-er commented 4 years ago

The connection timeout reported is probably the timeout of the broker, not the ESP side. In the logs you showed the time between the reconnect is about 63 seconds. So maybe the keep alive is not sent often enough, or the broker has a timeout which is too short? What keep alive (or whatever it is called) time do you use in Mosquito? I think we use 10 or 15 seconds as timeout in PubSubClient.

jefft4 commented 4 years ago

As far as I understand mosquitto & MQTT, the client decides the keepalive timeout. The mosquitto broker appears able to put a limit on it, in MQTT v5+ only, but not to fix it. From what I can find, its default timeout seems to be 60 seconds; that should result in clients being cut off if they've not sent a ping for 90 seconds (spec calls for a +50% tolerance). The client could request any timeout up to 18 hours in its initial connection request.

From the MQTT spec:

When the client establishes a connection to the broker, the client communicates a time interval in seconds to the broker. This interval defines the maximum length of time that the broker and client may not communicate with each other.
...
The broker must disconnect a client that does not send a a message or a PINGREQ packet in one and a half times the keep alive interval. Likewise, the client is expected to close the connection if it does not receive a response from the broker in a reasonable amount of time.

Here's a snippet from my mosquitto log file; interesting to see what the broker is seeing:

1577946354: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946354: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946416: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946416: New connection from 192.168.8.174 on port 1883.
1577946416: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946416: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946458: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946458: New connection from 192.168.8.174 on port 1883.
1577946458: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946458: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946479: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.

The socket error is interesting - suggests that the ESP has either dropped the connection or had some other problem with it. It seems to happen with this ESP, 42 seconds after the connection is first made.

I'm not sure what's going on here. If the broker timeout is 60s, the client end is 10s and it's the client's responsibility to initiate the ping, then we should never get anywhere close to the 90s broker limit. Maybe the broker is trying to send the ping response message when it hits that socket error.

If you can think of how to get any more information about what's going on at the ESP end to cause that socket error, I'm happy to try some more debugging...

TD-er commented 4 years ago

Can you keep sending a ping from any host on the network to this ESP? It looks like the ESP is not receiving a reply from the broker or not replying to some request from the broker and thus is considered to be offline.

jefft4 commented 4 years ago

I set a 2-second repeating ICMP ping running from the server to that ESP and the MQTT problems continued; the ICMP ping was fine, no packets dropped at all. Also, I have the web UI of the ESP open while I'm doing this, watching its log, which is working fine. It looks like it's not a generic network issue, rather something MQTT-specific.

Sending a MQTT PINGREQ from anything else won't be possible, as MQTT is client-to-server. I will try subscribing the ESP to some random topic and using mosquitto_pub to spam that topic every 5 seconds, see if that keeps it alive.

How do (can?) one convert the ESP log timestamps into real times? If I could line up the broker and ESP logs, we might be able to see which disconnects first (I suspect the ESP as it says 'timeout', but I think we need to confirm that).

jefft4 commented 4 years ago

Thought I'd try it from the ESP side first...

jefft4 commented 4 years ago

Re-flashed the ESP with the 20191208 test binary to get the MQTTImport plug-in. Interestingly, I see this in mosquitto's log with the test version - looks like k10 might indicate the client asked for a keepalive of 10s: New client connected from 192.168.8.66 as TESTESP-Import (c0, k10)

With MQTTimport configured, but nothing publishing its topic, we get disconnections roughly every 30 seconds! If I spam the MQTTimport topic from another machine, the ESP is still disconnecting every 30s, even with spamming every 5s.

Analysis:

jefft4 commented 4 years ago

[Sorry if you get repeated notifications of edits... trying not to make the thread any longer than necessary!]

Test broker worked fine, no disconnections and the PINGREQ/PINGRESP pairs are there. In the end, I had to run mosquitto in the foreground on the live system with debugging on; not pleasant, but I think it may have exposed what's causing the problems.

On my live server, I saw the socket errors; they were coming after a barrage of domoticz/out messages being sent to the ESP - sometimes as many as 20 or 30 messages in a couple of seconds. I think that what's happening, is that the ESP is simply being swamped and either the MQTT client crashes out or simply stops responding, causing the server to see the ESP as unavailable.

I've found a workaround, I think: if the ESP doesn't need to subscribe, then set its subscription topic to 'notinterested'. This needs a restart of mosquitto, as the ESP doesn't seem to unsubscribe and disconnect cleanly when its config is changed, so otherwise the broker remembers its old subscription. Fortunately, this applies to most of my ESPs :-)

Does the 'swamped' hypothesis sound correct, TD-er? If so, is there anything we can do with the ESPEasy controller parameters (timeout, queue, etc) to lessen its impact for ESPs that do need to subscribe to domoticz's spammy output?

I've also suggested to the domoticz developers, that its protocol needs to be changed to include the 'idx' in the MQTT topic - e.g. 'domoticz/out/999' - so that devices have the option to subscribe more precisely and reduce the network traffic.

giig1967g commented 4 years ago

Latest versions of ESPEasy are very stable in my opinion. I have been using in my production ESP units (10 units so far) version 20191104 for 37 days with MQTT Mosquitto.. In one unit I use 3 instances of MQTT Import without any problem. In the last 37 days I have never had any disconnection issue, not a single one. (I have a rule that creates an alert if the ESP units reconnects).

See screenshots: Screenshot 2020-01-03 at 17 44 14

Screenshot 2020-01-03 at 17 44 05

Screenshot 2020-01-03 at 17 45 35

jefft4 commented 4 years ago

You don't appear to be using domoticz. Like I said, the issue appears (subject to TD-er, who knows more than either of us, confirming my hypothesis) to stem from the ESP's inability to cope with the volume of MQTT traffic created by domoticz's broadcast approach.

TD-er commented 4 years ago

Well I am using Domoticz with MQTT on my nodes. image Obviously this one is not running the latest builds, but it is running a self made build, I made on Sept 24.

So you could try a build from around that time just to make sure I didn't break anything recently.

But I do think your Mosquito may be using a shorter timeout. It is also possible the node gets swamped, by traffic, but this one was also running with about 15 nodes all subscribing to the same MQTT server. I've also ran a few tests every now and then to see if the nodes can handle very short "time between messages" and other stuff to flood the network. Having a log level on debug or "debug more" may also put a strain on the node as that does result in quite a few verbose log lines. The serial log and webserver log both use their own log buffer, which may put quite some strain on the resources. Also I have the feeling high frequent reloads, which the log viewer does with lots of logs, does sometimes trigger a crash or at least keep some resources allocated.

jefft4 commented 4 years ago

Interesting; potentially a number of things combining to overload the ESP then. My 'spammer' script caused the ESP to drop when it was sending around 25~30 messages in a second. They weren't huge messages, about 50 bytes. It didn't have to be sustained; one second like that was enough to break things. MQTT keepalive timeout, FYI, is set by the client; mosquitto has no control over it.

I do have a lot of sensors - screenshot below is just the temperature page; utility page is even longer - so high message volume is quite likely if several reporting intervals coincide. Some of my ESPs are publishing 4 or 5 sensors each time; I guess I could adjust the reporting intervals of each sensor by a few seconds +/-, might help there.

I changed all my ESPs to subscribe to a dummy topic where possible and they're happy little things now; with less in the log, I then spotted that devices running Tasmota were suffering in the same way... so the problem is something that ESP and Tasmota have in common (which I understand is quite a lot) - buffer space definitely plausible, either at the MQTT or the TCP level.

image

I'll pester the domoticz dev's a bit about their spammy messaging approach; I think that's inefficient on large systems and unwise for small-CPU IoT devices, so I'd like them to think about some small improvements to it.

Do you want to close this issue, or keep it open for investigation of what's breaking at the ESP end and whether the performance could degrade somehow more gracefully?

For what it's worth, I think that a feature request to make the keepalive timeout configurable would be useful; and also the issue of not unsubscribing when the ESP MQTT config is changed needs a fix (though low-priority). I have a couple of other ideas, which I'll raise if they've not already been raised... and send a few coffees over via Patreon!

TD-er commented 4 years ago

Well the subscribing itself should not really be an issue as they are not really causing a lot of memory allocations. It is keeping the node busy I guess, but should not cause these issues. Like I said, it may cause excessive memory usage when you have the log levels set high.

Tasmota is using our version of PubSubClient, so there should not be a lot of difference in there.

Delfuego-NL commented 3 years ago

I had the same issue. I changed the mqtt settings in Domoticz to: Hierarchical - publish outgoing messagen on topic domoticz/out/${floorplan name}/${plan name} After that I put the ESP in a seperate room in Domoticz and subscribed the controller on the ESP to that specific topic. After that the Lost connection events did not happen anymore. Maybe this helps?!

TD-er commented 3 years ago

Hmm did not know about this feature of Domoticz. Makes sense to have it.

What do you use as floor plan name and plan name? Do you use variables in the topic of ESPEasy, or hard coded per unit?

Delfuego-NL commented 3 years ago

I don't think it matters what names you use. Just create a room/plan (I used "Tuin" (Dutch for garden)) and then select the devices you want in that room/plan, then create a floor plan (I used "Lunar") and put that room in your floor plan. That's all there is to it. In ESPEasy, under Controllers, select Domoticz MQTT. At Controller Subscribe, fill in: domoticz/out/$Lunar/$Tuin Of course, in Domoticz you must choose Hierarchical as described in my previous post.

Interesting is that when you choose Combined in Domoticz, the Disconnections appear again. I had not expected this.

TD-er commented 3 years ago

Hmm that's strange indeed. I will have a look at it as I'm also curious about this (for me) new feature of Domoticz.

I have not seen disconnects, but it may have to do with the ESP subscribing to the same topic as all nodes do which may cause a lot of traffic. Whet this traffic reaches some kind of threshold, it may cause the ESP to run out of resources or time to process the wifi traffic causing the disconnect.

Delfuego-NL commented 3 years ago

Perhaps it is also worth adding that I use 38 shelly devices that all communicate via mqtt. Each device has more than one IDX. A total of 185 IDXs are shouting on the mqtt channel. I can imagine a very compact ESP28 device having trouble evaluating all those messages.

TD-er commented 3 years ago

Do you also have logging enabled to either a log server or serial (or both) ? And if so, what log levels?

Delfuego-NL commented 3 years ago

I'm sorry. I don't have any logging enabled. Only log I use is the standard log in de Web-UI. This ESP is my first and only one and I only got it up and running for three days now.

I'm happy to provide some logfiles but I have no clue how to set it up (log server).

TD-er commented 3 years ago

Nope, that's fine. It is just that generating logs may also use resources and thus push the ESP into instability a bit sooner.

H4nsie commented 1 year ago

Had the same connect/reconnects of MQTT and Domoticz. Changed the setting of the MQTT gateway on Domoticz from topic 'Flat' to topic 'Index (with retain) and no more reconnects now.

Scherm­afbeelding 2023-07-11 om 10 15 10

Thanks!

TD-er commented 1 year ago

OK, seems that we should update the Domoticz controller(s) to support the new hierarchical structures.

However I do not understand why you can now use "Index" since ESPEasy will not publish messages like this. Only reason I can think of why it may experience disconnects when using the "Flat" structure is that it may cause a lot of incoming messages to each node as there is a lot of traffic to the same topics. If the node is already on a poor WiFi connection, this may cause all kinds of issues.

H4nsie commented 1 year ago

That setting is for publishing Domoticz's outgoing messages only. So that's why published messages from ESP to Domoticz will keep working.

The solution only worked for 1 ESP. My 'ESP_Easy_WATERMETER_1' has stable connection, my 'ESP_Rfid_reader_2' is still reconnecting every approx. 20 sec. Both ESP's have (almost) same config.

mosquitto.log

2023-07-11T11:41:53: New client connected from 192.168.10.223:56787 as ESP_Easy_WATERMETER_1 (p2, c0, k10).
2023-07-11T11:42:01: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:01: New connection from 192.168.10.220:55688 on port 1883.
2023-07-11T11:42:01: New client connected from 192.168.10.220:55688 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:39: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:39: New connection from 192.168.10.220:54374 on port 1883.
2023-07-11T11:42:39: New client connected from 192.168.10.220:54374 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:59: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:59: New connection from 192.168.10.220:55002 on port 1883.
2023-07-11T11:42:59: New client connected from 192.168.10.220:55002 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:20: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:43:20: New connection from 192.168.10.220:55444 on port 1883.
2023-07-11T11:43:20: New client connected from 192.168.10.220:55444 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:43: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:10: New connection from 192.168.10.220:52634 on port 1883.
2023-07-11T11:44:10: New client connected from 192.168.10.220:52634 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:44:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:48: New connection from 192.168.10.220:65239 on port 1883.
2023-07-11T11:44:48: New client connected from 192.168.10.220:65239 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:08: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:08: New connection from 192.168.10.220:59602 on port 1883.
2023-07-11T11:45:08: New client connected from 192.168.10.220:59602 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:28: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:28: New connection from 192.168.10.220:57266 on port 1883.
2023-07-11T11:45:28: New client connected from 192.168.10.220:57266 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:48: New connection from 192.168.10.220:64739 on port 1883.
2023-07-11T11:45:48: New client connected from 192.168.10.220:64739 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:46:08: Client ESP_Rfid_reader_2 closed its connection.

On the manpage I read 'Only the first controller can be a MQTT capable controller.' What is meant by 'controller'? The configured controller in the ESP? Or the node-number of the ESP? (Since my 'ESP_Easy_WATERMETER_1' has node 1 and is working...)

TD-er commented 1 year ago

The controller in ESPEasy. You can have several controllers active, like P2P, some HTTP controller, etc. We have several MQTT controllers (Domoticz MQTT, OpenHAB MQTT, and some more). Of those MQTT controllers, you can only have one enabled since running multiple instances of PubSubClient (the library we use to connect to a MQTT broker) causes issues.

TD-er commented 1 year ago

It seems the ESP looses its connection every 20 seconds. Is it possible you have some keep-alive setting in your MQTT broker set to 20 seconds or less?

H4nsie commented 1 year ago

'Almost same config', but different builds...

ESP_Easy_WATERMETER (working correct) : Build: [ESP_Easy_mega_20230409_normal_ESP8266_4M1M Apr 9 2023](https://github.com/letscontrolit/ESPEasy/releases)
ESP_Rfid_reader (reconnects) : Build: [ESP_Easy_mega_20230623_normal_ESP8266_4M1M Jun 23 2023](https://github.com/letscontrolit/ESPEasy/releases)
TD-er commented 1 year ago

What is the Client Timeout in your controller settings on ESPEasy?

Does this behavior change when you start sending pings from whatever machine in your network to the ESPEasy node that reconnects?

Can you enable the TimingStats on the Tools-Advanced page of that reconnecting node and let it run for a few minutes and then visit the TimingStats page (button on Tools page)? N.B. on every load of that timing stats page, the stats are cleared.

H4nsie commented 1 year ago

MQTT Client Timeout is on both EPS's the default: 100ms Continuous Ping (from mosquitto server) does not change behavior. Timing Stats: ESP_Rfid_reader Timing Stats.pdf

Thanks!

TD-er commented 1 year ago

Hmm print to PDF and then posting it here is apparently not that useable for the timingstats. In PDF the last column gets lost and when trying to copy/paste the text into Excel/LibreOffice Calc results in horrible formatting.

You can copy/paste the text from the browser to GitHub and it will try to keep the table formatting. Or just make a screenshot and paste it here.

Anyway, from what I could read, it doesn't look like there is something causing a lot of delays.

Is your MQTT broker located on your local network? Have you tried increasing the timeout to something like 300 msec or even 1000 msec?

H4nsie commented 1 year ago

Mosquitto is on a Raspberry PI 4 with Raspbian and wired with UTP to the local lan. The ESP is connected to the same lan (and subnet) via Wifi.

Excel did the trick: esp_stats_13-7-23.xlsx

Changing client timeout of controller on ESP to 1000ms does not change anything.

I did some IP traffic analysis, will describe in following post.

H4nsie commented 1 year ago

I did some analysis on the TCP traffic.

The good ESP (ip end on 223) subscribes to the broker and pings every 20 secs:

4   0.004256    192.168.10.223  192.168.10.45   MQTT    118 Connect Command
5   0.004283    192.168.10.45   192.168.10.223  TCP 54  1883 → 53714 [ACK] Seq=1 Ack=65 Win=64176 Len=0
6   0.004415    192.168.10.45   192.168.10.223  MQTT    58  Connect Ack
7   0.019994    192.168.10.223  192.168.10.45   MQTT    67  Subscribe Request (id=2) [ha/out]
8   0.020077    192.168.10.45   192.168.10.223  MQTT    59  Subscribe Ack (id=2)
9   0.024145    192.168.10.223  192.168.10.45   MQTT    77  Publish Message [ha/out/LWT]
10  0.065278    192.168.10.45   192.168.10.223  TCP 54  1883 → 53714 [ACK] Seq=10 Ack=101 Win=64140 Len=0
11  10.269591   192.168.10.223  192.168.10.45   MQTT    60  Ping Request
12  10.269641   192.168.10.45   192.168.10.223  TCP 54  1883 → 53714 [ACK] Seq=10 Ack=103 Win=64138 Len=0
13  10.269700   192.168.10.45   192.168.10.223  MQTT    56  Ping Response
14  10.393458   192.168.10.223  192.168.10.45   TCP 60  53714 → 1883 [ACK] Seq=103 Ack=12 Win=5829 Len=0
15  20.323020   192.168.10.223  192.168.10.45   MQTT    60  Ping Request

The reconnecting ESP (ip end on 220) subscribes to the broker and get immediately loads of published messages from Domoticz from all devices ([ha/out/1206] etc). So much that tcp retransmissions follow (packet 22 below), and finally it chokes and the TCP is reset (packet 105) after 20 sec. It than reconnects and all is repeated.

4   0.003265    192.168.10.220  192.168.10.45   MQTT    114 Connect Command
5   0.003286    192.168.10.45   192.168.10.220  TCP 54  1883 → 51962 [ACK] Seq=1 Ack=61 Win=64180 Len=0
6   0.003384    192.168.10.45   192.168.10.220  MQTT    58  Connect Ack
7   0.010168    192.168.10.220  192.168.10.45   MQTT    67  Subscribe Request (id=2) [ha/out]
8   0.010242    192.168.10.45   192.168.10.220  MQTT    59  Subscribe Ack (id=2)
9   0.015244    192.168.10.220  192.168.10.45   MQTT    77  Publish Message [ha/out/LWT]
10  0.015365    192.168.10.45   192.168.10.220  MQTT    77  Publish Message [ha/out/LWT]
11  0.130747    192.168.10.45   192.168.10.220  MQTT    1514    Publish Message [ha/out/1216], Publish Message [ha/out/1206], Publish Message [ha/out/1207], Publish Message [ha/out/1208]
12  0.148227    192.168.10.220  192.168.10.45   TCP 60  51962 → 1883 [ACK] Seq=97 Ack=1493 Win=4348 Len=0
13  0.148318    192.168.10.45   192.168.10.220  MQTT    663 Publish Message [ha/out/1209], Publish Message [ha/out/1210], Publish Message [ha/out/1211]
14  0.177052    192.168.10.220  192.168.10.45   TCP 60  51962 → 1883 [ACK] Seq=97 Ack=2102 Win=3739 Len=0
15  0.231674    192.168.10.45   192.168.10.220  MQTT    333 Publish Message [ha/out/1212]
16  0.333079    192.168.10.45   192.168.10.220  MQTT    1514    Publish Message [ha/out/1213], Publish Message [ha/out/1214], Publish Message [ha/out/1547], Publish Message [ha/out/1548]
17  0.333690    192.168.10.45   192.168.10.220  MQTT    1514    Publish Message [ha/out/1525], Publish Message [ha/out/1526], Publish Message [ha/out/1527], Publish Message [ha/out/1545]
18  0.390942    192.168.10.220  192.168.10.45   TCP 60  51962 → 1883 [ACK] Seq=97 Ack=3841 Win=2000 Len=0
19  0.390987    192.168.10.45   192.168.10.220  MQTT    403 Publish Message [ha/out/1564], Publish Message [ha/out/1567]
20  0.394633    192.168.10.220  192.168.10.45   TCP 60  51962 → 1883 [ACK] Seq=97 Ack=5650 Win=191 Len=0
21  0.904314    192.168.10.45   192.168.10.220  TCP 245 [TCP Window Full] 1883 → 51962 [PSH, ACK] Seq=5650 Ack=97 Win=64144 Len=191 [TCP segment of a reassembled PDU]
22  1.192307    192.168.10.45   192.168.10.220  TCP 245 [TCP Window Full] [TCP Retransmission] 1883 → 51962 [PSH, ACK] Seq=5650 Ack=97 Win=64144 Len=191
23  1.202518    192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=97 Ack=5841 Win=0 Len=0
24  1.203774    192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=97 Ack=5841 Win=0 Len=0

..

93  17.513747   192.168.10.45   192.168.10.220  MQTT    1514    [TCP Window Full] , Publish Message [ha/out/1214], Publish Message [ha/out/1547], Publish Message [ha/out/1548], Publish Message [ha/out/1216]
94  17.731548   192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=0 Len=0
95  18.120343   192.168.10.45   192.168.10.220  TCP 54  [TCP Keep-Alive] 1883 → 51962 [ACK] Seq=26280 Ack=99 Win=64142 Len=0
96  18.132489   192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=0 Len=0
97  18.770064   192.168.10.220  192.168.10.45   TCP 60  [TCP Window Update] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=1460 Len=0
98  18.770124   192.168.10.45   192.168.10.220  MQTT    1514    [TCP Window Full] , Publish Message [ha/out/1206], Publish Message [ha/out/1207], Publish Message [ha/out/1208], Publish Message [ha/out/1209], Publish Message [ha/out/1210], Publish Message [ha/out/1211]
99  18.994143   192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=0 Len=0
100 19.432315   192.168.10.45   192.168.10.220  TCP 54  [TCP Keep-Alive] 1883 → 51962 [ACK] Seq=27740 Ack=99 Win=64142 Len=0
101 19.435748   192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=0 Len=0
102 20.013626   192.168.10.220  192.168.10.45   TCP 60  [TCP Window Update] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=1460 Len=0
103 20.013687   192.168.10.45   192.168.10.220  MQTT    1514    [TCP Window Full] , Publish Message [ha/out/1212], Publish Message [ha/out/1213], Publish Message [ha/out/1214], Publish Message [ha/out/1547]
104 20.044206   192.168.10.220  192.168.10.45   TCP 60  [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=29201 Win=0 Len=0
105 20.268474   192.168.10.220  192.168.10.45   TCP 60  51962 → 1883 [RST, ACK] Seq=99 Ack=29201 Win=53270 Len=0

What wonders me, why is ESP 223 not getting all info of the Domoticz devices, and ESP 220 does? The subscribe request is the same on both ESP’s (see below)

When I stop Domoticz, the messages on MQTT stop and the ESP 220 stays connected!

I cannot understand why the broker is sending to ESP 220 so much messages, and not to ESP 223.

Connect to Broker ESP 220

MQ Telemetry Transport Protocol, Connect Command
    Header Flags: 0x10, Message Type: Connect Command
    Msg Len: 58
    Protocol Name Length: 4
    Protocol Name: MQTT
    Version: MQTT v3.1.1 (4)
    Connect Flags: 0x24, Will Retain, QoS Level: At most once delivery (Fire and Forget), Will Flag
        0... .... = User Name Flag: Not set
        .0.. .... = Password Flag: Not set
        ..1. .... = Will Retain: Set
        ...0 0... = QoS Level: At most once delivery (Fire and Forget) (0)
        .... .1.. = Will Flag: Set
        .... ..0. = Clean Session Flag: Not set
        .... ...0 = (Reserved): Not set
    Keep Alive: 10
    Client ID Length: 17
    Client ID: ESP_Rfid_reader_2
    Will Topic Length: 10
    Will Topic: ha/out/LWT
    Will Message Length: 15
    Will Message: 436f6e6e656374696f6e204c6f7374

Subscribe request from ESP 220

MQ Telemetry Transport Protocol, Subscribe Request
    Header Flags: 0x82, Message Type: Subscribe Request
        1000 .... = Message Type: Subscribe Request (8)
        .... 0010 = Reserved: 2
    Msg Len: 11
    Message Identifier: 2
    Topic Length: 6
    Topic: ha/out
    Requested QoS: At most once delivery (Fire and Forget) (0)

Publish from 220

MQ Telemetry Transport Protocol, Publish Message
    Header Flags: 0x31, Message Type: Publish Message, QoS Level: At most once delivery (Fire and Forget), Retain
        0011 .... = Message Type: Publish Message (3)
        .... 0... = DUP Flag: Not set
        .... .00. = QoS Level: At most once delivery (Fire and Forget) (0)
        .... ...1 = Retain: Set
    Msg Len: 21
    Topic Length: 10
    Topic: ha/out/LWT
    Message: 436f6e6e6563746564

Connect to Broker ESP 223

MQ Telemetry Transport Protocol, Connect Command
    Header Flags: 0x10, Message Type: Connect Command
    Msg Len: 62
    Protocol Name Length: 4
    Protocol Name: MQTT
    Version: MQTT v3.1.1 (4)
    Connect Flags: 0x24, Will Retain, QoS Level: At most once delivery (Fire and Forget), Will Flag
        0... .... = User Name Flag: Not set
        .0.. .... = Password Flag: Not set
        ..1. .... = Will Retain: Set
        ...0 0... = QoS Level: At most once delivery (Fire and Forget) (0)
        .... .1.. = Will Flag: Set
        .... ..0. = Clean Session Flag: Not set
        .... ...0 = (Reserved): Not set
    Keep Alive: 10
    Client ID Length: 21
    Client ID: ESP_Easy_WATERMETER_1
    Will Topic Length: 10
    Will Topic: ha/out/LWT
    Will Message Length: 15
    Will Message: 436f6e6e656374696f6e204c6f7374

Subscribe request from ESP 223

MQ Telemetry Transport Protocol, Subscribe Request
    Header Flags: 0x82, Message Type: Subscribe Request
        1000 .... = Message Type: Subscribe Request (8)
        .... 0010 = Reserved: 2
    Msg Len: 11
    Message Identifier: 2
    Topic Length: 6
    Topic: ha/out
    Requested QoS: At most once delivery (Fire and Forget) (0)

Publish from 223

MQ Telemetry Transport Protocol, Publish Message
    Header Flags: 0x31, Message Type: Publish Message, QoS Level: At most once delivery (Fire and Forget), Retain
        0011 .... = Message Type: Publish Message (3)
        .... 0... = DUP Flag: Not set
        .... .00. = QoS Level: At most once delivery (Fire and Forget) (0)
        .... ...1 = Retain: Set
    Msg Len: 21
    Topic Length: 10
    Topic: ha/out/LWT
    Message: 436f6e6e6563746564
TD-er commented 1 year ago

The keep-alive is set to 10 seconds on your broker. This at least explains the 20 seconds interval for disconnects as the broker considers a client being no longer connected when the client doesn't send any message in 2x seconds

So can you set the keep-alive on the broker to 20 or 30 seconds, to see if this makes any difference? Apparently 223 does respond frequently enough, either by sending messages more often, or maybe a better WiFi connection.

H4nsie commented 1 year ago

In the MQTT protocol the keep-alive is set bij the client in the connect header. In the above connect command you can see that both ESP's request a 10 sec keep-alive from the broker and that it is accepted.

H4nsie commented 1 year ago

Both ESP's are 1m apart, both on the same Access Point.

Can we explain why ESP 220 receives all published (Domoticz) messages from the broker and 223 doesn't?

tonhuisman commented 1 year ago

Does either of the ESPs have the MQTT Import plugin active?

H4nsie commented 1 year ago

Found the cause!

There are too many messages published. With MQTTlens I monitored the messages. There are approx 750 messages / minute, that is too much for the ESP to handle.

[TCP Analysis Flags] [Expert Info (Warning/Sequence): TCP window specified by the receiver is now completely full]

The broker then sends a zero window, and that causes the ESP to disconnect and reconnect again. I think the issue starter (@jefft4 ) has the same experience "My 'spammer' script caused the ESP to drop when it was sending around 25~30 messages in a second."

"Window Full" from Wireshark means that Wireshark has calculated that the current packet will fill the receiver's buffer. Window Full is normally, but not always, followed by zero window from the receiver. (source: https://osqa-ask.wireshark.org/questions/40461/the-transmission-window-is-now-completely-full/ )

My solution...

...(workaround) sending less messages by Domoticz. I noticed that 80% of the messages were updates from my P1 meter (electric). That is unnecessary! In the hardware settings I limited the P1 update frequency to once in 5 sec:

Scherm­afbeelding 2023-07-13 om 16 11 38

Now all ESP's are happy and not reconnecting anymore.

Thanks all for thinking with me.

TD-er commented 1 year ago

OK, so the good news is that ESPEasy can handle 25 - 30 MQTT messages per second :) That's more than I expected for an ESP8266, especially since it does quite a lot of work decoding Domoticz MQTT messages.

kr0815 commented 9 months ago

If somebody with same problem finds this thread - i found a solution

I´m using the PiDome Controller on a mosquitto-server My goal is to fetch data from a panasonic heat pump (heishamon)

The unit also sends to much messages

I thought i have to subscribe to the main topic in controller settings - then filter the relevant subtopic in Generic - MQTT Import Device

It looks like this device subscribes to he wanted topic, regardles of what topic is subscribed to in controller settings

As soon as i change to another topic there, my problems were gone

tonhuisman commented 9 months ago

It looks like this device subscribes to he wanted topic, regardles of what topic is subscribed to in controller settings

The documentation for the Generic - MQTT Import plugin has a paragraph dedicated to this behavior, specifically to avoid/prevent overloading the ESP... 🤔

kr0815 commented 9 months ago

yes, but what i mean is something different:

I need the value of these 4 topics / these are my settings in MQTT Import:

image

So i expected i need to specify in the PiDome MQTT Controller under "Controller subscribe" the Topic: /panasonic_heat_pump/main/#

I changed it to something completely different now.

tonhuisman commented 9 months ago

The MQTT Controller needs to be connected to the MQTT server you are trying to use, the topics are indeed subscribed independently for the MQTT Import plugin, as you can have multiple instances/tasks using that plugin (that's only advisable on ESP32 though, because of the resources it uses).

I'll clarify the documentation on this.

kr0815 commented 9 months ago

btw, what is the difference between the 3 different MQTT Controller? (PiDome, Domoticz and HA) I´d say onle the pre-filled subscribe and publish topic?

Is there a prefered one to use for a generic MQTT installation like mosquitto?

tonhuisman commented 9 months ago

Is there a prefered one to use for a generic MQTT installation like mosquitto?

Well, the Domoticz MQTT Controller handles the quirks required for sending data to Domoticz (often used with mosquitto as the MQTT 'frontend'), so that won't be really usable as a generic MQTT Controller, and the PiDome Controller is about as basic as it gets, so for the most versatile MQTT Controller you'd best use the Home Assistant (openHAB) MQTT controller, or when using a Collection build, you can also select the Homie MQTT Controller, that's also quite versatile, and supports the Homie auto-discovery feature.

Via PR #4770 I'm bringing some of the features available only in a some of the MQTT controllers to all MQTT controllers, like the ability to set the value of a Dummy Device task directly via the controller /set topic, or execute a command on ESPEasy using the /cmd topic (both should be the last part of the topic). That's currently most mature in the Home Assistant (openHAB) MQTT Controller.

kr0815 commented 9 months ago

now it gets really strange: 2 ESP32, both the same settings, copied back and forth one is still throwing out "connection lost", the other seems fine

One is from AZDeliverys, the other, i don´t know

So next try, i changed the setup first i had it like this, as the topic was to long to fit in : image

i changed the topic "panasonic_heat_pump" to" Heishamon", so all fits in one Line --> no more errors ( I know the last topic is now not fully :written, out, i checked already without that topic subscribed)

image

How handles ESPEasy "Prefix for all topics" ? Are they all subscribed?

tonhuisman commented 9 months ago

The prefix is added before all Topics when subscribing the topics, as if it was part of the topic, nothing special there, just to overcome the 40 character limit of the Topic field (that has to stay that size for backward compatibility). Any spaces before or after the prefix or topic are removed when appending these values.

When looking at the log when starting the plugin you can see the exact topic that is subscribed to: IMPT : [<taskname>#<valuename>] subscribed to <topic> (The values in sharp brackets are replaced with the matching names, as expected.) You can (re)start the task by using command taskrun,<taskname> from the Tools page.

TD-er commented 9 months ago

A MQTT broker has only one way of letting you know about an error and that's by disconnecting the client.

For example if there would have been a bug in the MQTT code where a packet header states ESPEasy would send X bytes and the broker would see something after those X bytes other than a valid header of a new packet, then the client will be disconnected immediately... Ask me how I know :)

Anyway, there are lots of things that can go wrong here and one of them could be that you try to publish to a topic to which the user has no access to. Or someone is logged in to the MQTT broker using the same credentials twice.

So if you experience constant MQTT broker disconnects and those are not caused by WiFi disconnected, then please check for those possible issues like credentials, access rights, simultaneous logged in users with the same credentials.

Another thing that might be wrong with MQTT connections is when a client doesn't send out a keep-alive packet every N seconds (typically 10 sec). This can happen if the ESP can't reach the broker in due time, or when multiple ESP's use the same IP-config or when your node is missing packets. This last one can be checked by sending pings from another host to the ESP and see if there are missing replies or maybe very high or unstable ping response times.

sincze commented 4 months ago

Got a new Smartmeter and experienced the timeouts as well.

Modified Domoticz P1 settings to rate-limit to 10 (was 5) image

Both ESP's now purring like kittens ;-) with far less disconnects. Normally the log would fill up like a Christmas tree.

Republishing the topics also worked but would break the rest of my automation. image

Tnx @H4nsie for clarifying.

Running: [ESP_Easy_mega_20240229]

As my ESP do not need incoming traffic (to switch something) I guess I can also remove the subscribe/to topic, my units only report stuff to the main controller.

Did some further digging and moved from Domoticz MQTT to Home Assistant (openHAB) MQTT, so nothing more to do with domoticz/in or out (except publishing to it through rules). With a mqtt timeout of 200 I do still see

142493089: 
MQTT : Connection lost, state: Disconnected
142493118: MQTT : Connected to broker with client ID: Watermeter

On a regular basis, although not that often as before.

TD-er commented 4 months ago

Can you check the GH Actions test build which is now running for this PR: https://github.com/letscontrolit/ESPEasy/pull/5004

Your issues might be related to some weird behavior on ESP8266 regarding how union types are handled.

sincze commented 4 months ago

Ok let me check, currently on Build: ESP_Easy_mega_20240311_collection_B_ESP8266_4M1M Mar 11 2024

Do you want me to try specific settings ??

53820: WD : Uptime 1 ConnectFailures 0 FreeMem 14056 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
65859: MQTT : Connection lost, state: Disconnected
65898: MQTT : Connected to broker with client ID: Jablotron
65902: Subscribed to: ESP_Easy/Jablotron/in/#
65906: EVENT: MQTT#Disconnected
65944: EVENT: MQTT#Connected
65958: ACT : PublishR,homeassistant/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','state_t
65977: ACT : PublishR,homeassistant/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','stat
65992: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','sta
66008: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','
82484: EVENT: Rules#Timer=1,1
82513: ACT : TimerSet,1,60
82523: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-58.00'}'
82546: EVENT: Jablotron_RSSI#rssi=-58.00
83820: WD : Uptime 1 ConnectFailures 0 FreeMem 19384 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
98287: EVENT: Clock#Time=Mon,18:50
113820: WD : Uptime 2 ConnectFailures 0 FreeMem 14816 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
123413: MQTT : Connection lost, state: Disconnected
123451: MQTT : Connected to broker with client ID: Jablotron
123456: Subscribed to: ESP_Easy/Jablotron/in/#
123461: EVENT: MQTT#Disconnected
123496: EVENT: MQTT#Connected
123510: ACT : PublishR,homeassistant/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','state_t
123526: ACT : PublishR,homeassistant/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','stat
123545: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','sta
123561: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','
TD-er commented 4 months ago

Do you have any specific configuration in the MQTT broker? For example a relatively short keep alive interval?

Can you post the settings of the MQTT controller page? Is the MQTT broker local or some online service? Can you try to setup some sysinfo task and try to send some info like free RAM (doesn't matter what) periodically to this MQTT broker? For example every 10 seconds.