Open jefft4 opened 4 years ago
The connection timeout reported is probably the timeout of the broker, not the ESP side. In the logs you showed the time between the reconnect is about 63 seconds. So maybe the keep alive is not sent often enough, or the broker has a timeout which is too short? What keep alive (or whatever it is called) time do you use in Mosquito? I think we use 10 or 15 seconds as timeout in PubSubClient.
As far as I understand mosquitto & MQTT, the client decides the keepalive timeout. The mosquitto broker appears able to put a limit on it, in MQTT v5+ only, but not to fix it. From what I can find, its default timeout seems to be 60 seconds; that should result in clients being cut off if they've not sent a ping for 90 seconds (spec calls for a +50% tolerance). The client could request any timeout up to 18 hours in its initial connection request.
From the MQTT spec:
When the client establishes a connection to the broker, the client communicates a time interval in seconds to the broker. This interval defines the maximum length of time that the broker and client may not communicate with each other.
...
The broker must disconnect a client that does not send a a message or a PINGREQ packet in one and a half times the keep alive interval. Likewise, the client is expected to close the connection if it does not receive a response from the broker in a reasonable amount of time.
Here's a snippet from my mosquitto log file; interesting to see what the broker is seeing:
1577946354: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946354: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946416: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946416: New connection from 192.168.8.174 on port 1883.
1577946416: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946416: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946458: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
1577946458: New connection from 192.168.8.174 on port 1883.
1577946458: Client ESPClient_5C:CF:7F:6E:43:C4 disconnected.
1577946458: New client connected from 192.168.8.174 as ESPClient_5C:CF:7F:6E:43:C4 (c0, k10, u'').
1577946479: Socket error on client ESPClient_5C:CF:7F:6E:43:C4, disconnecting.
The socket error is interesting - suggests that the ESP has either dropped the connection or had some other problem with it. It seems to happen with this ESP, 42 seconds after the connection is first made.
I'm not sure what's going on here. If the broker timeout is 60s, the client end is 10s and it's the client's responsibility to initiate the ping, then we should never get anywhere close to the 90s broker limit. Maybe the broker is trying to send the ping response message when it hits that socket error.
If you can think of how to get any more information about what's going on at the ESP end to cause that socket error, I'm happy to try some more debugging...
Can you keep sending a ping from any host on the network to this ESP? It looks like the ESP is not receiving a reply from the broker or not replying to some request from the broker and thus is considered to be offline.
I set a 2-second repeating ICMP ping running from the server to that ESP and the MQTT problems continued; the ICMP ping was fine, no packets dropped at all. Also, I have the web UI of the ESP open while I'm doing this, watching its log, which is working fine. It looks like it's not a generic network issue, rather something MQTT-specific.
Sending a MQTT PINGREQ from anything else won't be possible, as MQTT is client-to-server. I will try subscribing the ESP to some random topic and using mosquitto_pub to spam that topic every 5 seconds, see if that keeps it alive.
How do (can?) one convert the ESP log timestamps into real times? If I could line up the broker and ESP logs, we might be able to see which disconnects first (I suspect the ESP as it says 'timeout', but I think we need to confirm that).
Thought I'd try it from the ESP side first...
Re-flashed the ESP with the 20191208 test binary to get the MQTTImport plug-in. Interestingly, I see this in mosquitto's log with the test version - looks like k10 might indicate the client asked for a keepalive of 10s: New client connected from 192.168.8.66 as TESTESP-Import (c0, k10)
With MQTTimport configured, but nothing publishing its topic, we get disconnections roughly every 30 seconds! If I spam the MQTTimport topic from another machine, the ESP is still disconnecting every 30s, even with spamming every 5s.
Analysis:
[Sorry if you get repeated notifications of edits... trying not to make the thread any longer than necessary!]
Test broker worked fine, no disconnections and the PINGREQ/PINGRESP pairs are there. In the end, I had to run mosquitto in the foreground on the live system with debugging on; not pleasant, but I think it may have exposed what's causing the problems.
On my live server, I saw the socket errors; they were coming after a barrage of domoticz/out messages being sent to the ESP - sometimes as many as 20 or 30 messages in a couple of seconds. I think that what's happening, is that the ESP is simply being swamped and either the MQTT client crashes out or simply stops responding, causing the server to see the ESP as unavailable.
I've found a workaround, I think: if the ESP doesn't need to subscribe, then set its subscription topic to 'notinterested'. This needs a restart of mosquitto, as the ESP doesn't seem to unsubscribe and disconnect cleanly when its config is changed, so otherwise the broker remembers its old subscription. Fortunately, this applies to most of my ESPs :-)
Does the 'swamped' hypothesis sound correct, TD-er? If so, is there anything we can do with the ESPEasy controller parameters (timeout, queue, etc) to lessen its impact for ESPs that do need to subscribe to domoticz's spammy output?
I've also suggested to the domoticz developers, that its protocol needs to be changed to include the 'idx' in the MQTT topic - e.g. 'domoticz/out/999' - so that devices have the option to subscribe more precisely and reduce the network traffic.
Latest versions of ESPEasy are very stable in my opinion. I have been using in my production ESP units (10 units so far) version 20191104 for 37 days with MQTT Mosquitto.. In one unit I use 3 instances of MQTT Import without any problem. In the last 37 days I have never had any disconnection issue, not a single one. (I have a rule that creates an alert if the ESP units reconnects).
See screenshots:
You don't appear to be using domoticz. Like I said, the issue appears (subject to TD-er, who knows more than either of us, confirming my hypothesis) to stem from the ESP's inability to cope with the volume of MQTT traffic created by domoticz's broadcast approach.
Well I am using Domoticz with MQTT on my nodes. Obviously this one is not running the latest builds, but it is running a self made build, I made on Sept 24.
So you could try a build from around that time just to make sure I didn't break anything recently.
But I do think your Mosquito may be using a shorter timeout. It is also possible the node gets swamped, by traffic, but this one was also running with about 15 nodes all subscribing to the same MQTT server. I've also ran a few tests every now and then to see if the nodes can handle very short "time between messages" and other stuff to flood the network. Having a log level on debug or "debug more" may also put a strain on the node as that does result in quite a few verbose log lines. The serial log and webserver log both use their own log buffer, which may put quite some strain on the resources. Also I have the feeling high frequent reloads, which the log viewer does with lots of logs, does sometimes trigger a crash or at least keep some resources allocated.
Interesting; potentially a number of things combining to overload the ESP then. My 'spammer' script caused the ESP to drop when it was sending around 25~30 messages in a second. They weren't huge messages, about 50 bytes. It didn't have to be sustained; one second like that was enough to break things. MQTT keepalive timeout, FYI, is set by the client; mosquitto has no control over it.
I do have a lot of sensors - screenshot below is just the temperature page; utility page is even longer - so high message volume is quite likely if several reporting intervals coincide. Some of my ESPs are publishing 4 or 5 sensors each time; I guess I could adjust the reporting intervals of each sensor by a few seconds +/-, might help there.
I changed all my ESPs to subscribe to a dummy topic where possible and they're happy little things now; with less in the log, I then spotted that devices running Tasmota were suffering in the same way... so the problem is something that ESP and Tasmota have in common (which I understand is quite a lot) - buffer space definitely plausible, either at the MQTT or the TCP level.
I'll pester the domoticz dev's a bit about their spammy messaging approach; I think that's inefficient on large systems and unwise for small-CPU IoT devices, so I'd like them to think about some small improvements to it.
Do you want to close this issue, or keep it open for investigation of what's breaking at the ESP end and whether the performance could degrade somehow more gracefully?
For what it's worth, I think that a feature request to make the keepalive timeout configurable would be useful; and also the issue of not unsubscribing when the ESP MQTT config is changed needs a fix (though low-priority). I have a couple of other ideas, which I'll raise if they've not already been raised... and send a few coffees over via Patreon!
Well the subscribing itself should not really be an issue as they are not really causing a lot of memory allocations. It is keeping the node busy I guess, but should not cause these issues. Like I said, it may cause excessive memory usage when you have the log levels set high.
Tasmota is using our version of PubSubClient, so there should not be a lot of difference in there.
I had the same issue. I changed the mqtt settings in Domoticz to: Hierarchical - publish outgoing messagen on topic domoticz/out/${floorplan name}/${plan name} After that I put the ESP in a seperate room in Domoticz and subscribed the controller on the ESP to that specific topic. After that the Lost connection events did not happen anymore. Maybe this helps?!
Hmm did not know about this feature of Domoticz. Makes sense to have it.
What do you use as floor plan name and plan name? Do you use variables in the topic of ESPEasy, or hard coded per unit?
I don't think it matters what names you use. Just create a room/plan (I used "Tuin" (Dutch for garden)) and then select the devices you want in that room/plan, then create a floor plan (I used "Lunar") and put that room in your floor plan. That's all there is to it. In ESPEasy, under Controllers, select Domoticz MQTT. At Controller Subscribe, fill in: domoticz/out/$Lunar/$Tuin Of course, in Domoticz you must choose Hierarchical as described in my previous post.
Interesting is that when you choose Combined in Domoticz, the Disconnections appear again. I had not expected this.
Hmm that's strange indeed. I will have a look at it as I'm also curious about this (for me) new feature of Domoticz.
I have not seen disconnects, but it may have to do with the ESP subscribing to the same topic as all nodes do which may cause a lot of traffic. Whet this traffic reaches some kind of threshold, it may cause the ESP to run out of resources or time to process the wifi traffic causing the disconnect.
Perhaps it is also worth adding that I use 38 shelly devices that all communicate via mqtt. Each device has more than one IDX. A total of 185 IDXs are shouting on the mqtt channel. I can imagine a very compact ESP28 device having trouble evaluating all those messages.
Do you also have logging enabled to either a log server or serial (or both) ? And if so, what log levels?
I'm sorry. I don't have any logging enabled. Only log I use is the standard log in de Web-UI. This ESP is my first and only one and I only got it up and running for three days now.
I'm happy to provide some logfiles but I have no clue how to set it up (log server).
Nope, that's fine. It is just that generating logs may also use resources and thus push the ESP into instability a bit sooner.
Had the same connect/reconnects of MQTT and Domoticz. Changed the setting of the MQTT gateway on Domoticz from topic 'Flat' to topic 'Index (with retain) and no more reconnects now.
Thanks!
OK, seems that we should update the Domoticz controller(s) to support the new hierarchical structures.
However I do not understand why you can now use "Index" since ESPEasy will not publish messages like this. Only reason I can think of why it may experience disconnects when using the "Flat" structure is that it may cause a lot of incoming messages to each node as there is a lot of traffic to the same topics. If the node is already on a poor WiFi connection, this may cause all kinds of issues.
That setting is for publishing Domoticz's outgoing messages only. So that's why published messages from ESP to Domoticz will keep working.
The solution only worked for 1 ESP. My 'ESP_Easy_WATERMETER_1' has stable connection, my 'ESP_Rfid_reader_2' is still reconnecting every approx. 20 sec. Both ESP's have (almost) same config.
mosquitto.log
2023-07-11T11:41:53: New client connected from 192.168.10.223:56787 as ESP_Easy_WATERMETER_1 (p2, c0, k10).
2023-07-11T11:42:01: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:01: New connection from 192.168.10.220:55688 on port 1883.
2023-07-11T11:42:01: New client connected from 192.168.10.220:55688 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:39: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:39: New connection from 192.168.10.220:54374 on port 1883.
2023-07-11T11:42:39: New client connected from 192.168.10.220:54374 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:42:59: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:42:59: New connection from 192.168.10.220:55002 on port 1883.
2023-07-11T11:42:59: New client connected from 192.168.10.220:55002 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:20: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:43:20: New connection from 192.168.10.220:55444 on port 1883.
2023-07-11T11:43:20: New client connected from 192.168.10.220:55444 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:43:43: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:10: New connection from 192.168.10.220:52634 on port 1883.
2023-07-11T11:44:10: New client connected from 192.168.10.220:52634 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:44:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:44:48: New connection from 192.168.10.220:65239 on port 1883.
2023-07-11T11:44:48: New client connected from 192.168.10.220:65239 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:08: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:08: New connection from 192.168.10.220:59602 on port 1883.
2023-07-11T11:45:08: New client connected from 192.168.10.220:59602 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:28: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:28: New connection from 192.168.10.220:57266 on port 1883.
2023-07-11T11:45:28: New client connected from 192.168.10.220:57266 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:45:48: Client ESP_Rfid_reader_2 closed its connection.
2023-07-11T11:45:48: New connection from 192.168.10.220:64739 on port 1883.
2023-07-11T11:45:48: New client connected from 192.168.10.220:64739 as ESP_Rfid_reader_2 (p2, c0, k10).
2023-07-11T11:46:08: Client ESP_Rfid_reader_2 closed its connection.
On the manpage I read 'Only the first controller can be a MQTT capable controller.' What is meant by 'controller'? The configured controller in the ESP? Or the node-number of the ESP? (Since my 'ESP_Easy_WATERMETER_1' has node 1 and is working...)
The controller in ESPEasy. You can have several controllers active, like P2P, some HTTP controller, etc. We have several MQTT controllers (Domoticz MQTT, OpenHAB MQTT, and some more). Of those MQTT controllers, you can only have one enabled since running multiple instances of PubSubClient (the library we use to connect to a MQTT broker) causes issues.
It seems the ESP looses its connection every 20 seconds. Is it possible you have some keep-alive setting in your MQTT broker set to 20 seconds or less?
'Almost same config', but different builds...
ESP_Easy_WATERMETER (working correct) : Build: [ESP_Easy_mega_20230409_normal_ESP8266_4M1M Apr 9 2023](https://github.com/letscontrolit/ESPEasy/releases)
ESP_Rfid_reader (reconnects) : Build: [ESP_Easy_mega_20230623_normal_ESP8266_4M1M Jun 23 2023](https://github.com/letscontrolit/ESPEasy/releases)
What is the Client Timeout in your controller settings on ESPEasy?
Does this behavior change when you start sending pings from whatever machine in your network to the ESPEasy node that reconnects?
Can you enable the TimingStats on the Tools-Advanced page of that reconnecting node and let it run for a few minutes and then visit the TimingStats page (button on Tools page)? N.B. on every load of that timing stats page, the stats are cleared.
MQTT Client Timeout is on both EPS's the default: 100ms Continuous Ping (from mosquitto server) does not change behavior. Timing Stats: ESP_Rfid_reader Timing Stats.pdf
Thanks!
Hmm print to PDF and then posting it here is apparently not that useable for the timingstats. In PDF the last column gets lost and when trying to copy/paste the text into Excel/LibreOffice Calc results in horrible formatting.
You can copy/paste the text from the browser to GitHub and it will try to keep the table formatting. Or just make a screenshot and paste it here.
Anyway, from what I could read, it doesn't look like there is something causing a lot of delays.
Is your MQTT broker located on your local network? Have you tried increasing the timeout to something like 300 msec or even 1000 msec?
Mosquitto is on a Raspberry PI 4 with Raspbian and wired with UTP to the local lan. The ESP is connected to the same lan (and subnet) via Wifi.
Excel did the trick: esp_stats_13-7-23.xlsx
Changing client timeout of controller on ESP to 1000ms does not change anything.
I did some IP traffic analysis, will describe in following post.
I did some analysis on the TCP traffic.
The good ESP (ip end on 223) subscribes to the broker and pings every 20 secs:
4 0.004256 192.168.10.223 192.168.10.45 MQTT 118 Connect Command
5 0.004283 192.168.10.45 192.168.10.223 TCP 54 1883 → 53714 [ACK] Seq=1 Ack=65 Win=64176 Len=0
6 0.004415 192.168.10.45 192.168.10.223 MQTT 58 Connect Ack
7 0.019994 192.168.10.223 192.168.10.45 MQTT 67 Subscribe Request (id=2) [ha/out]
8 0.020077 192.168.10.45 192.168.10.223 MQTT 59 Subscribe Ack (id=2)
9 0.024145 192.168.10.223 192.168.10.45 MQTT 77 Publish Message [ha/out/LWT]
10 0.065278 192.168.10.45 192.168.10.223 TCP 54 1883 → 53714 [ACK] Seq=10 Ack=101 Win=64140 Len=0
11 10.269591 192.168.10.223 192.168.10.45 MQTT 60 Ping Request
12 10.269641 192.168.10.45 192.168.10.223 TCP 54 1883 → 53714 [ACK] Seq=10 Ack=103 Win=64138 Len=0
13 10.269700 192.168.10.45 192.168.10.223 MQTT 56 Ping Response
14 10.393458 192.168.10.223 192.168.10.45 TCP 60 53714 → 1883 [ACK] Seq=103 Ack=12 Win=5829 Len=0
15 20.323020 192.168.10.223 192.168.10.45 MQTT 60 Ping Request
The reconnecting ESP (ip end on 220) subscribes to the broker and get immediately loads of published messages from Domoticz from all devices ([ha/out/1206] etc). So much that tcp retransmissions follow (packet 22 below), and finally it chokes and the TCP is reset (packet 105) after 20 sec. It than reconnects and all is repeated.
4 0.003265 192.168.10.220 192.168.10.45 MQTT 114 Connect Command
5 0.003286 192.168.10.45 192.168.10.220 TCP 54 1883 → 51962 [ACK] Seq=1 Ack=61 Win=64180 Len=0
6 0.003384 192.168.10.45 192.168.10.220 MQTT 58 Connect Ack
7 0.010168 192.168.10.220 192.168.10.45 MQTT 67 Subscribe Request (id=2) [ha/out]
8 0.010242 192.168.10.45 192.168.10.220 MQTT 59 Subscribe Ack (id=2)
9 0.015244 192.168.10.220 192.168.10.45 MQTT 77 Publish Message [ha/out/LWT]
10 0.015365 192.168.10.45 192.168.10.220 MQTT 77 Publish Message [ha/out/LWT]
11 0.130747 192.168.10.45 192.168.10.220 MQTT 1514 Publish Message [ha/out/1216], Publish Message [ha/out/1206], Publish Message [ha/out/1207], Publish Message [ha/out/1208]
12 0.148227 192.168.10.220 192.168.10.45 TCP 60 51962 → 1883 [ACK] Seq=97 Ack=1493 Win=4348 Len=0
13 0.148318 192.168.10.45 192.168.10.220 MQTT 663 Publish Message [ha/out/1209], Publish Message [ha/out/1210], Publish Message [ha/out/1211]
14 0.177052 192.168.10.220 192.168.10.45 TCP 60 51962 → 1883 [ACK] Seq=97 Ack=2102 Win=3739 Len=0
15 0.231674 192.168.10.45 192.168.10.220 MQTT 333 Publish Message [ha/out/1212]
16 0.333079 192.168.10.45 192.168.10.220 MQTT 1514 Publish Message [ha/out/1213], Publish Message [ha/out/1214], Publish Message [ha/out/1547], Publish Message [ha/out/1548]
17 0.333690 192.168.10.45 192.168.10.220 MQTT 1514 Publish Message [ha/out/1525], Publish Message [ha/out/1526], Publish Message [ha/out/1527], Publish Message [ha/out/1545]
18 0.390942 192.168.10.220 192.168.10.45 TCP 60 51962 → 1883 [ACK] Seq=97 Ack=3841 Win=2000 Len=0
19 0.390987 192.168.10.45 192.168.10.220 MQTT 403 Publish Message [ha/out/1564], Publish Message [ha/out/1567]
20 0.394633 192.168.10.220 192.168.10.45 TCP 60 51962 → 1883 [ACK] Seq=97 Ack=5650 Win=191 Len=0
21 0.904314 192.168.10.45 192.168.10.220 TCP 245 [TCP Window Full] 1883 → 51962 [PSH, ACK] Seq=5650 Ack=97 Win=64144 Len=191 [TCP segment of a reassembled PDU]
22 1.192307 192.168.10.45 192.168.10.220 TCP 245 [TCP Window Full] [TCP Retransmission] 1883 → 51962 [PSH, ACK] Seq=5650 Ack=97 Win=64144 Len=191
23 1.202518 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=97 Ack=5841 Win=0 Len=0
24 1.203774 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=97 Ack=5841 Win=0 Len=0
..
93 17.513747 192.168.10.45 192.168.10.220 MQTT 1514 [TCP Window Full] , Publish Message [ha/out/1214], Publish Message [ha/out/1547], Publish Message [ha/out/1548], Publish Message [ha/out/1216]
94 17.731548 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=0 Len=0
95 18.120343 192.168.10.45 192.168.10.220 TCP 54 [TCP Keep-Alive] 1883 → 51962 [ACK] Seq=26280 Ack=99 Win=64142 Len=0
96 18.132489 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=0 Len=0
97 18.770064 192.168.10.220 192.168.10.45 TCP 60 [TCP Window Update] 51962 → 1883 [ACK] Seq=99 Ack=26281 Win=1460 Len=0
98 18.770124 192.168.10.45 192.168.10.220 MQTT 1514 [TCP Window Full] , Publish Message [ha/out/1206], Publish Message [ha/out/1207], Publish Message [ha/out/1208], Publish Message [ha/out/1209], Publish Message [ha/out/1210], Publish Message [ha/out/1211]
99 18.994143 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=0 Len=0
100 19.432315 192.168.10.45 192.168.10.220 TCP 54 [TCP Keep-Alive] 1883 → 51962 [ACK] Seq=27740 Ack=99 Win=64142 Len=0
101 19.435748 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=0 Len=0
102 20.013626 192.168.10.220 192.168.10.45 TCP 60 [TCP Window Update] 51962 → 1883 [ACK] Seq=99 Ack=27741 Win=1460 Len=0
103 20.013687 192.168.10.45 192.168.10.220 MQTT 1514 [TCP Window Full] , Publish Message [ha/out/1212], Publish Message [ha/out/1213], Publish Message [ha/out/1214], Publish Message [ha/out/1547]
104 20.044206 192.168.10.220 192.168.10.45 TCP 60 [TCP ZeroWindow] 51962 → 1883 [ACK] Seq=99 Ack=29201 Win=0 Len=0
105 20.268474 192.168.10.220 192.168.10.45 TCP 60 51962 → 1883 [RST, ACK] Seq=99 Ack=29201 Win=53270 Len=0
What wonders me, why is ESP 223 not getting all info of the Domoticz devices, and ESP 220 does? The subscribe request is the same on both ESP’s (see below)
When I stop Domoticz, the messages on MQTT stop and the ESP 220 stays connected!
I cannot understand why the broker is sending to ESP 220 so much messages, and not to ESP 223.
Connect to Broker ESP 220
MQ Telemetry Transport Protocol, Connect Command
Header Flags: 0x10, Message Type: Connect Command
Msg Len: 58
Protocol Name Length: 4
Protocol Name: MQTT
Version: MQTT v3.1.1 (4)
Connect Flags: 0x24, Will Retain, QoS Level: At most once delivery (Fire and Forget), Will Flag
0... .... = User Name Flag: Not set
.0.. .... = Password Flag: Not set
..1. .... = Will Retain: Set
...0 0... = QoS Level: At most once delivery (Fire and Forget) (0)
.... .1.. = Will Flag: Set
.... ..0. = Clean Session Flag: Not set
.... ...0 = (Reserved): Not set
Keep Alive: 10
Client ID Length: 17
Client ID: ESP_Rfid_reader_2
Will Topic Length: 10
Will Topic: ha/out/LWT
Will Message Length: 15
Will Message: 436f6e6e656374696f6e204c6f7374
Subscribe request from ESP 220
MQ Telemetry Transport Protocol, Subscribe Request
Header Flags: 0x82, Message Type: Subscribe Request
1000 .... = Message Type: Subscribe Request (8)
.... 0010 = Reserved: 2
Msg Len: 11
Message Identifier: 2
Topic Length: 6
Topic: ha/out
Requested QoS: At most once delivery (Fire and Forget) (0)
Publish from 220
MQ Telemetry Transport Protocol, Publish Message
Header Flags: 0x31, Message Type: Publish Message, QoS Level: At most once delivery (Fire and Forget), Retain
0011 .... = Message Type: Publish Message (3)
.... 0... = DUP Flag: Not set
.... .00. = QoS Level: At most once delivery (Fire and Forget) (0)
.... ...1 = Retain: Set
Msg Len: 21
Topic Length: 10
Topic: ha/out/LWT
Message: 436f6e6e6563746564
Connect to Broker ESP 223
MQ Telemetry Transport Protocol, Connect Command
Header Flags: 0x10, Message Type: Connect Command
Msg Len: 62
Protocol Name Length: 4
Protocol Name: MQTT
Version: MQTT v3.1.1 (4)
Connect Flags: 0x24, Will Retain, QoS Level: At most once delivery (Fire and Forget), Will Flag
0... .... = User Name Flag: Not set
.0.. .... = Password Flag: Not set
..1. .... = Will Retain: Set
...0 0... = QoS Level: At most once delivery (Fire and Forget) (0)
.... .1.. = Will Flag: Set
.... ..0. = Clean Session Flag: Not set
.... ...0 = (Reserved): Not set
Keep Alive: 10
Client ID Length: 21
Client ID: ESP_Easy_WATERMETER_1
Will Topic Length: 10
Will Topic: ha/out/LWT
Will Message Length: 15
Will Message: 436f6e6e656374696f6e204c6f7374
Subscribe request from ESP 223
MQ Telemetry Transport Protocol, Subscribe Request
Header Flags: 0x82, Message Type: Subscribe Request
1000 .... = Message Type: Subscribe Request (8)
.... 0010 = Reserved: 2
Msg Len: 11
Message Identifier: 2
Topic Length: 6
Topic: ha/out
Requested QoS: At most once delivery (Fire and Forget) (0)
Publish from 223
MQ Telemetry Transport Protocol, Publish Message
Header Flags: 0x31, Message Type: Publish Message, QoS Level: At most once delivery (Fire and Forget), Retain
0011 .... = Message Type: Publish Message (3)
.... 0... = DUP Flag: Not set
.... .00. = QoS Level: At most once delivery (Fire and Forget) (0)
.... ...1 = Retain: Set
Msg Len: 21
Topic Length: 10
Topic: ha/out/LWT
Message: 436f6e6e6563746564
The keep-alive is set to 10 seconds on your broker.
This at least explains the 20 seconds interval for disconnects as the broker considers a client being no longer connected when the client doesn't send any message in 2x
So can you set the keep-alive on the broker to 20 or 30 seconds, to see if this makes any difference? Apparently 223 does respond frequently enough, either by sending messages more often, or maybe a better WiFi connection.
In the MQTT protocol the keep-alive is set bij the client in the connect header. In the above connect command you can see that both ESP's request a 10 sec keep-alive from the broker and that it is accepted.
Both ESP's are 1m apart, both on the same Access Point.
Can we explain why ESP 220 receives all published (Domoticz) messages from the broker and 223 doesn't?
Does either of the ESPs have the MQTT Import plugin active?
Found the cause!
There are too many messages published. With MQTTlens I monitored the messages. There are approx 750 messages / minute, that is too much for the ESP to handle.
[TCP Analysis Flags] [Expert Info (Warning/Sequence): TCP window specified by the receiver is now completely full]
The broker then sends a zero window, and that causes the ESP to disconnect and reconnect again. I think the issue starter (@jefft4 ) has the same experience "My 'spammer' script caused the ESP to drop when it was sending around 25~30 messages in a second."
"Window Full" from Wireshark means that Wireshark has calculated that the current packet will fill the receiver's buffer. Window Full is normally, but not always, followed by zero window from the receiver. (source: https://osqa-ask.wireshark.org/questions/40461/the-transmission-window-is-now-completely-full/ )
My solution...
...(workaround) sending less messages by Domoticz. I noticed that 80% of the messages were updates from my P1 meter (electric). That is unnecessary! In the hardware settings I limited the P1 update frequency to once in 5 sec:
Now all ESP's are happy and not reconnecting anymore.
Thanks all for thinking with me.
OK, so the good news is that ESPEasy can handle 25 - 30 MQTT messages per second :) That's more than I expected for an ESP8266, especially since it does quite a lot of work decoding Domoticz MQTT messages.
If somebody with same problem finds this thread - i found a solution
I´m using the PiDome Controller on a mosquitto-server My goal is to fetch data from a panasonic heat pump (heishamon)
The unit also sends to much messages
I thought i have to subscribe to the main topic in controller settings - then filter the relevant subtopic in Generic - MQTT Import Device
It looks like this device subscribes to he wanted topic, regardles of what topic is subscribed to in controller settings
As soon as i change to another topic there, my problems were gone
It looks like this device subscribes to he wanted topic, regardles of what topic is subscribed to in controller settings
The documentation for the Generic - MQTT Import plugin has a paragraph dedicated to this behavior, specifically to avoid/prevent overloading the ESP... 🤔
yes, but what i mean is something different:
I need the value of these 4 topics / these are my settings in MQTT Import:
So i expected i need to specify in the PiDome MQTT Controller under "Controller subscribe" the Topic:
/panasonic_heat_pump/main/#
I changed it to something completely different now.
The MQTT Controller needs to be connected to the MQTT server you are trying to use, the topics are indeed subscribed independently for the MQTT Import plugin, as you can have multiple instances/tasks using that plugin (that's only advisable on ESP32 though, because of the resources it uses).
I'll clarify the documentation on this.
btw, what is the difference between the 3 different MQTT Controller? (PiDome, Domoticz and HA) I´d say onle the pre-filled subscribe and publish topic?
Is there a prefered one to use for a generic MQTT installation like mosquitto?
Is there a prefered one to use for a generic MQTT installation like mosquitto?
Well, the Domoticz MQTT Controller handles the quirks required for sending data to Domoticz (often used with mosquitto as the MQTT 'frontend'), so that won't be really usable as a generic MQTT Controller, and the PiDome Controller is about as basic as it gets, so for the most versatile MQTT Controller you'd best use the Home Assistant (openHAB) MQTT controller, or when using a Collection build, you can also select the Homie MQTT Controller, that's also quite versatile, and supports the Homie auto-discovery feature.
Via PR #4770 I'm bringing some of the features available only in a some of the MQTT controllers to all MQTT controllers, like the ability to set the value of a Dummy Device task directly via the controller /set
topic, or execute a command on ESPEasy using the /cmd
topic (both should be the last part of the topic). That's currently most mature in the Home Assistant (openHAB) MQTT Controller.
now it gets really strange: 2 ESP32, both the same settings, copied back and forth one is still throwing out "connection lost", the other seems fine
One is from AZDeliverys, the other, i don´t know
So next try, i changed the setup first i had it like this, as the topic was to long to fit in :
i changed the topic "panasonic_heat_pump" to" Heishamon", so all fits in one Line --> no more errors ( I know the last topic is now not fully :written, out, i checked already without that topic subscribed)
How handles ESPEasy "Prefix for all topics" ? Are they all subscribed?
The prefix is added before all Topics when subscribing the topics, as if it was part of the topic, nothing special there, just to overcome the 40 character limit of the Topic field (that has to stay that size for backward compatibility). Any spaces before or after the prefix or topic are removed when appending these values.
When looking at the log when starting the plugin you can see the exact topic that is subscribed to:
IMPT : [<taskname>#<valuename>] subscribed to <topic>
(The values in sharp brackets are replaced with the matching names, as expected.)
You can (re)start the task by using command taskrun,<taskname>
from the Tools page.
A MQTT broker has only one way of letting you know about an error and that's by disconnecting the client.
For example if there would have been a bug in the MQTT code where a packet header states ESPEasy would send X bytes and the broker would see something after those X bytes other than a valid header of a new packet, then the client will be disconnected immediately... Ask me how I know :)
Anyway, there are lots of things that can go wrong here and one of them could be that you try to publish to a topic to which the user has no access to. Or someone is logged in to the MQTT broker using the same credentials twice.
So if you experience constant MQTT broker disconnects and those are not caused by WiFi disconnected, then please check for those possible issues like credentials, access rights, simultaneous logged in users with the same credentials.
Another thing that might be wrong with MQTT connections is when a client doesn't send out a keep-alive packet every N seconds (typically 10 sec). This can happen if the ESP can't reach the broker in due time, or when multiple ESP's use the same IP-config or when your node is missing packets. This last one can be checked by sending pings from another host to the ESP and see if there are missing replies or maybe very high or unstable ping response times.
Got a new Smartmeter and experienced the timeouts as well.
Modified Domoticz P1 settings to rate-limit to 10 (was 5)
Both ESP's now purring like kittens ;-) with far less disconnects. Normally the log would fill up like a Christmas tree.
Republishing the topics also worked but would break the rest of my automation.
Tnx @H4nsie for clarifying.
Running: [ESP_Easy_mega_20240229]
As my ESP do not need incoming traffic (to switch something) I guess I can also remove the subscribe/to topic, my units only report stuff to the main controller.
Did some further digging and moved from Domoticz MQTT to Home Assistant (openHAB) MQTT, so nothing more to do with domoticz/in or out (except publishing to it through rules). With a mqtt timeout of 200 I do still see
142493089:
MQTT : Connection lost, state: Disconnected
142493118: MQTT : Connected to broker with client ID: Watermeter
On a regular basis, although not that often as before.
Can you check the GH Actions test build which is now running for this PR: https://github.com/letscontrolit/ESPEasy/pull/5004
Your issues might be related to some weird behavior on ESP8266 regarding how union types are handled.
Ok let me check, currently on Build: ESP_Easy_mega_20240311_collection_B_ESP8266_4M1M Mar 11 2024
Do you want me to try specific settings ??
53820: WD : Uptime 1 ConnectFailures 0 FreeMem 14056 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
65859: MQTT : Connection lost, state: Disconnected
65898: MQTT : Connected to broker with client ID: Jablotron
65902: Subscribed to: ESP_Easy/Jablotron/in/#
65906: EVENT: MQTT#Disconnected
65944: EVENT: MQTT#Connected
65958: ACT : PublishR,homeassistant/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','state_t
65977: ACT : PublishR,homeassistant/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','stat
65992: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','sta
66008: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','
82484: EVENT: Rules#Timer=1,1
82513: ACT : TimerSet,1,60
82523: ACT : Publish domoticz/in,'{'command':'udevice', 'idx':1526, 'svalue':'-58.00'}'
82546: EVENT: Jablotron_RSSI#rssi=-58.00
83820: WD : Uptime 1 ConnectFailures 0 FreeMem 19384 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
98287: EVENT: Clock#Time=Mon,18:50
113820: WD : Uptime 2 ConnectFailures 0 FreeMem 14816 WiFiStatus: 3 ESPeasy internal wifi status: Conn. IP Init
123413: MQTT : Connection lost, state: Disconnected
123451: MQTT : Connected to broker with client ID: Jablotron
123456: Subscribed to: ESP_Easy/Jablotron/in/#
123461: EVENT: MQTT#Disconnected
123496: EVENT: MQTT#Connected
123510: ACT : PublishR,homeassistant/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','state_t
123526: ACT : PublishR,homeassistant/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','stat
123545: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_door/config,'{'name': 'Jablotron_door','device_class': 'opening','sta
123561: ACT : PublishR,espeasy-discovery/binary_sensor/Jablotron_contact/config,'{'name': 'Jablotron_alarm','device_class': 'safety','
Do you have any specific configuration in the MQTT broker? For example a relatively short keep alive interval?
Can you post the settings of the MQTT controller page? Is the MQTT broker local or some online service? Can you try to setup some sysinfo task and try to send some info like free RAM (doesn't matter what) periodically to this MQTT broker? For example every 10 seconds.
Checklist
I have...
ESP_Easy_mega-20181001_test_ESP8266_4096_VCC.bin
)Steps already tried...
.bin
files are included in the ZIP)Summarize of the problem/feature request
MQTT disconnects every minute, then immediately reconnects and subscribes to the topics (in this case, domoticz). ESP publishes all sensor data after MQTT connects, causing the sensor to report every minute - excessive.
Need to make the MQTT timeout user-configurable, I think, or at least make it much longer. Why so short? MQTT can stay connected for hours with no problem, if it sends heartbeats occasionally. It should only disconnect if told to, or if it loses the connection.
System configuration
Wifi AP 3m from the ESP, good signal. Zero wifi reconnects shown on status page. MQTT broker is mosquitto, server (cabled) on same subnet as ESP.
ESP Easy version:
Replicated also on build 20191208; exactly the same behaviour. Replicated on three different ESP modules.
Rules
Rules on one ESP are minimal - just a timer to reboot if wifi is disconnected for 5 mins. Rules on the other include 7-segment display of sensor values.
Log data