letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

MQTT Connection Lost in heavy WIFI environment #2861

Closed Heinz-Peter closed 2 years ago

Heinz-Peter commented 4 years ago

Summary of the problem/feature request

I have an ESP8266 connected to an MQTT broker via a WiFi AP that has more than 40 WiFi devices connected. The log shows MQTT connection lost after some seconds, then reconnecting, and so on; you already know this behaviour. With exactly the same configuration, but connected to another WiFi AP (in parallel) that serves only this ESP8266, the connection is never lost. This suggests that the connection timeouts are too short; the TCP/IP timing does not seem to cope well with this many stations. Is there anywhere to raise them to 6 seconds or more? Other forums found 6 seconds workable... Is something like this available?

#define CONNECT_TIMEOUT_MS 6000

#define PUBLISH_TIMEOUT_MS 6000 //set from 500 to 1000 to see if its better for discons/hr

#define PING_TIMEOUT_MS 6000 //set from 500 to 1000 to see if its better for discons/hr

#define SUBACK_TIMEOUT_MS 6000 //set from 500 to 1000 to see if its better for discons/hr

Thanks Peter

TD-er commented 4 years ago

What access point are you using? Some (cheap) access points can only handle 32 connected devices.

Heinz-Peter commented 4 years ago

Hi, the ESP8266 is connected to an AVM Fritz!Box 6490 on channel 13, with around 50 WiFi modules also connected: connection lost. A second Fritz!Box 7270 runs in parallel on channel 1 and goes via LAN through the same Fritz!Box 6490 to the same MQTT broker, but with only one ESP8266 connected on channel 1: never a lost connection. When I watch the traffic on the MQTT broker with MQTT.fx, the data coming from the ESP8266 via channel 13 stutters a lot, while the data via channel 1 arrives smoothly, message after message. I think there is too much traffic on channel 13. Maybe it is a hidden-station or exposed-station problem in CSMA/CA or something similar, although both times the ESP8266 was only 1 meter away from the Fritz!Box. I wanted to use Wireshark, but the Fritz!Box 6490 does not allow it :-( as its firmware is controlled by the cable internet provider :-(

TD-er commented 4 years ago

Channel 13 is not a standard channel. It is not allowed in all regions.

What I do for Wireshark is having a separate AP connected to a "smart" switch (108e from TP-link or Netgear, mind the "E" in the number) which can be set to mirror a port. Then connect the laptop to that mirrored port and you can sniff the data.

Heinz-Peter commented 4 years ago

Hi TD-er, channel 13 is not only allowed in Europe, it is one of the channels (1, 5, 9, 13) that should be used for 802.11g and 802.11n!! See https://de.wikipedia.org/wiki/Wireless_Local_Area_Network#Rechtliche_Lage_in_Deutschland, and it can also be selected in the Fritz!Box. The problem is this: if I hook up the ESP8266 alone to that TP-Link, it will work without any error. I cannot move all of my 50 and more devices over to the TP-Link just to reproduce the error. I will see if I can catch the error another way, with a TP-Link as a bridge to the 6490 main WiFi ....

TD-er commented 4 years ago

You should also look at the mode the nodes use. Some APs don't like to switch a lot between B/G/N. So you could force B/G mode on all ESPs and let those be the only ones to use that AP.

Also keep the channel fixed to either 1, 6 or 11. I know channel 13 can be legally used in a lot of countries, but it can also be one of those channels that may act strangely on some devices if their regional regulation setting is incorrect.

Also enable sending ARP packets, which will help in routing packets to the ESP, as ESPs do not always reply to ARP requests. Especially on busy APs they will miss some ARP requests.
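
For readers wondering what "sending ARP packets" can look like on the wire, here is a minimal Python sketch that only builds (and does not send) a gratuitous ARP frame, one common way for a device to announce its own IP/MAC pair. Whether ESPEasy sends exactly this kind of frame is an assumption, the MAC address is a made-up placeholder, and the IP is simply the one used later in this thread.

```python
import socket
import struct

def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, operation 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sender_mac + sender_ip    # sender hardware / protocol address
    arp += b"\x00" * 6 + sender_ip   # target MAC unknown, target IP = own IP
    return ethernet + arp

# Hypothetical MAC address; the IP is an example taken from this thread.
frame = build_gratuitous_arp("02:00:00:5d:89:b0", "192.168.26.191")
print(len(frame), frame.hex())
```

Because the frame goes to the broadcast address, the access point and other hosts can refresh their ARP tables without the ESP having to answer an individual request in time.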

Heinz-Peter commented 4 years ago

Hi TD-er, I got a Wireshark capture that shows the lost connection twice, with a third occurrence starting at the end. Every two seconds the ESP8266 sends some SYS.. That works, but twice within the capture the connection is lost. I do not know why; it seems the broker doesn't answer quickly enough?! Maybe you will find the time one day to have a look at it. wlan-128_01.02.20_1946.zip

Heinz-Peter commented 4 years ago

Here is another Wireshark capture in which the connection lost error is easier to see. It is from another ESP8266, 200 km away, connected via VPN to my server at 192.168.26.2, which runs the MQTT broker (among other things). Every 30 seconds the ESP8266 sends an MQTT message.... MQTT Error 4 11_171.zip

Heinz-Peter commented 4 years ago

Here are two more Wireshark captures of another ESP8266, taken directly from the LAN port of the MQTT broker server (WIN 10; yes, I have now installed Wireshark there). Exactly the same ESP8266 hardware and the same conditions regarding WiFi and so on, once with FW 147.8 and once with FW 2019_12_08. You will be surprised how much the new firmware struggles.... Both times the ESP8266-01 was cleared / flashed with zeros before loading the FW. I tried it several times. The new FW seems unstable, as the ESP8266 sometimes cannot be reached with a browser (timeout).... Something seems wrong with WiFi in general in the newer FW ..... As I do not know what more I can do beyond this investigation, I would appreciate it if someone could have a look and give new hints... Heinz-Peter 2020_02_02_26_191_good_bad_FW.zip

TD-er commented 4 years ago

Can you try this test build? And then especially a build with "beta" in the name?

Heinz-Peter commented 4 years ago

I tested the above-mentioned beta build. WiFi seems a little more stable, but I think it is not really better; please have a look. 2020_02_04_16_56_ESP_Easy_mega-20191208-40-PR_2864_test_beta_ESP8266_4M1M.bin_20104_Mega.zip

TD-er commented 4 years ago

What settings do you use regarding WiFi? (ARP / wifi sleep / BG mode) And is the broker also connected via WiFi or is that one connected to wired network? Can you post the timing stats?

Heinz-Peter commented 4 years ago

Fritz!Box: 192.168.26.1
Server WIN 10 with MQTT: 192.168.26.2
Subnet mask: 255.255.254.0
DNS: 192.168.0.1
No special host table anywhere; ARP works as usual via broadcast (ARP Who has ..? Tell ...). No WiFi sleep, 100 percent power, 2.4 GHz. There is no way to switch to B/G or N or anything else, see here: https://avm.de/produkte/fritzbox/fritzbox-6490-cable/technische-daten/ The broker runs on the WIN 10 server, which is connected via 1 GBit LAN. MQTT timing is every 30 seconds; which other timing do you mean, please?

TD-er commented 4 years ago

I meant ESPEasy settings :)

Heinz-Peter commented 4 years ago

Can you view this video, stepping through the ESP8266? https://www.dropbox.com/s/8visz68iuybo6ai/Video_2020-02-05_112255.wmv?dl=0 Regards, Peter

TD-er commented 4 years ago

Please try to experiment with these: image

With force B/G you will improve sensitivity of the wifi. Force no sleep should (not always, not sure why) keep the WiFi awake. Sending ARP will also help keep the unit responsive on the network.

Heinz-Peter commented 4 years ago

Force WiFi B/G: no success; same with no sleep. ARP is on. But the error now occurs only about a fifth as often as before. The FW is your new one from today, not the beta. I have a new capture for you: I set up a second Fritz!Box in parallel with a new SSID, connected to the same LAN, broker and so on. The new SSID is PETER_2123_300_T, with a T at the end. Only the ESP8266 is connected to this WiFi. The dump was taken from the WLAN AP interface ath0 inside the Fritz!Box. I will send the complete dump file; you have to set the filter yourself in Wireshark. Espressi_5d:89:b0 is the one with IP 192.168.26.191. The MQTT server has 192.168.26.2; the AP is the AVM Fritz!Box. I wonder about No 745 in the capture; it is a duplicate, isn't it? No 1648 has port number 60637; why does it change to the new port 55253? You are the chief with networks :-) Maybe you can see what is going on there. 192.168.26.40 is the Fritz!Box itself, 192.168.26.42 is my mobile phone.... I do not know what that one is doing there, hihi. Sorry, I will not post the dump here as I do not know what all is inside. I would send it to your private address if possible. Where can I find your PA? Regards, Peter

TD-er commented 4 years ago

The sending port number may vary, as long as the receiving port is determined for the protocol.

But maybe I'm missing something here? I cannot see a Wireshark dump.
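
The point about port numbers can be illustrated with a small Python sketch that uses only a local stand-in server on the loopback interface (no real broker; all names here are made up for the example). Each new TCP connection is given a source port by the operating system, while the destination port stays the one the server listens on, so a new source port in a capture typically just means that a new connection was opened, for example after a reconnect.

```python
import socket

# Local stand-in for the broker; port 0 lets the OS pick a free listening port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(5)
server.settimeout(2.0)
broker_port = server.getsockname()[1]

def connect_once():
    """Open one client connection and return (source_port, destination_port)."""
    client = socket.create_connection(("127.0.0.1", broker_port), timeout=2.0)
    conn, _ = server.accept()
    src_port = client.getsockname()[1]   # chosen by the OS for this connection
    dst_port = client.getpeername()[1]   # always the server's listening port
    conn.close()
    client.close()
    return src_port, dst_port

first = connect_once()
second = connect_once()
server.close()
print("first connection :", first)
print("second connection:", second)
```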

Heinz-Peter commented 4 years ago

The Wireshark file has been sent to your PA. 2020-02-06_224012

Is it possible that there is an issue with the timers? In this log, wlan-130_06.02.20_1559.eth, line 421 is a retransmission of line 420 within some 20 milliseconds, with the Retransmission flag set. Why after such a short time? Did I miss something? I will have a look at the rest soon...

RFC793 says: Retransmission Timeout

Because of the variability of the networks that compose an internetwork system and the wide range of uses of TCP connections the retransmission timeout must be dynamically determined. One procedure for determining a retransmission time out is given here as an illustration.

An Example Retransmission Timeout Procedure

  Measure the elapsed time between sending a data octet with a
  particular sequence number and receiving an acknowledgment that
  covers that sequence number (segments sent do not have to match
  segments received).  This measured elapsed time is the Round Trip
  Time (RTT).  Next compute a Smoothed Round Trip Time (SRTT) as:

    SRTT = ( ALPHA * SRTT ) + ((1-ALPHA) * RTT)

  and based on this, compute the retransmission timeout (RTO) as:

    RTO = min[UBOUND,max[LBOUND,(BETA*SRTT)]]

  where UBOUND is an upper bound on the timeout (e.g., 1 minute),
  LBOUND is a lower bound on the timeout (e.g., 1 second), ALPHA is
  a smoothing factor (e.g., .8 to .9), and BETA is a delay variance
  factor (e.g., 1.3 to 2.0).   
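
As a rough sketch, the example procedure quoted above could be written like this in Python. ALPHA = 0.85 and BETA = 1.5 are picked from the ranges the RFC gives, the RTT samples are made up, and a real stack (including the LWIP on the ESP8266) may use different bounds or a different algorithm altogether.

```python
def update_srtt(srtt, rtt, alpha=0.85):
    """Smoothed RTT from the RFC 793 example: SRTT = ALPHA*SRTT + (1-ALPHA)*RTT."""
    return alpha * srtt + (1.0 - alpha) * rtt

def compute_rto(srtt, beta=1.5, lbound=1.0, ubound=60.0):
    """RTO = min(UBOUND, max(LBOUND, BETA*SRTT)), all values in seconds."""
    return min(ubound, max(lbound, beta * srtt))

srtt = 0.020  # assumed starting SRTT of 20 ms
for rtt in (0.020, 0.025, 0.400, 0.030):  # hypothetical RTT samples in seconds
    srtt = update_srtt(srtt, rtt)
    print(f"rtt={rtt:.3f}s srtt={srtt:.4f}s rto={compute_rto(srtt):.3f}s")
```

With the example lower bound of 1 second, this procedure would never schedule a retransmission after only 20 milliseconds, which is why the short gap in the capture looks odd to me.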

Heinz-Peter commented 4 years ago

Same with lines 497, 498 and 501: retransmission within milliseconds

Heinz-Peter commented 4 years ago

Why does frame 728 have the Retransmission flag set? It is the first transmission. And why is frame 729 exactly the same frame? Maybe this is the reason why in frame 829 the server 192.168.26.2 sends FIN? In frames 830 ... 834 the Seq and Ack numbers are wrong, I think: Seq 169 comes after Seq 171, and why 171 in frame 830 when the length before was zero? I do not understand ..... Maybe I have to study again :-)
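
For reference, TCP sequence numbers do not advance by payload length alone: SYN and FIN each consume one sequence number even when the segment carries no data, and a retransmitted segment reuses its original sequence number, so an older number can show up after a newer one. The Python sketch below only illustrates that bookkeeping with made-up values modelled on the numbers above; it is not an analysis of the actual capture.

```python
def next_seq(seq, payload_len, syn=False, fin=False):
    """Return the sequence number that follows a segment.

    Payload bytes advance the number by their length; SYN and FIN
    each consume one extra sequence number even with no payload.
    """
    return (seq + payload_len + int(syn) + int(fin)) % 2**32

# Hypothetical values loosely modelled on the numbers in the comment above.
seq = 169
seq_after_data = next_seq(seq, payload_len=2)          # 2 payload bytes
seq_after_fin = next_seq(seq_after_data, 0, fin=True)  # empty segment with FIN
print(seq_after_data, seq_after_fin)  # prints: 171 172
```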

TD-er commented 4 years ago

I have not yet had the time to look at your file. Please be aware that a retransmission can also come from the access point. It is possible that ESPEasy sent out a packet which needs an acknowledgement. If the ESP does not receive an ack, it may send the packet again. But if the access point only receives (or processes) the initial packet at that moment, it may look in Wireshark as if the delay was only 20 msec. The access point may also send a packet again itself if it does not receive an acknowledgement in due time. In that case it is not the ESP that is sending the retransmission.

We simply have too many variables and unknowns here:

All these (and lots of other possible issues) may lead to lost connections, or an access point kicking out connected "stations" thus forcing them to reconnect. This makes interpreting Wireshark dumps next to impossible as you simply don't know why and in what order packets happen as they appear when sniffing.
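
To illustrate why a capture alone cannot tell who resent a packet, here is a minimal Python sketch of how a duplicate data segment might be flagged: a segment counts as a retransmission when the same source, destination, sequence number and length were already seen. The packet records are made up, and the real analysis in Wireshark is considerably more elaborate than this.

```python
def flag_retransmissions(packets):
    """Mark data segments whose (src, dst, seq, length) was seen before.

    `packets` is a list of dicts with keys time, src, dst, seq, length.
    Returns a list of (index, seconds_since_first_copy) tuples.
    """
    first_seen = {}
    flagged = []
    for index, pkt in enumerate(packets):
        if pkt["length"] == 0:
            continue  # pure ACKs legitimately repeat the same sequence number
        key = (pkt["src"], pkt["dst"], pkt["seq"], pkt["length"])
        if key in first_seen:
            flagged.append((index, pkt["time"] - first_seen[key]))
        else:
            first_seen[key] = pkt["time"]
    return flagged

# Hypothetical capture: the second packet repeats the first one 20 ms later.
capture = [
    {"time": 0.000, "src": "192.168.26.191", "dst": "192.168.26.2", "seq": 100, "length": 40},
    {"time": 0.020, "src": "192.168.26.191", "dst": "192.168.26.2", "seq": 100, "length": 40},
    {"time": 0.050, "src": "192.168.26.191", "dst": "192.168.26.2", "seq": 140, "length": 12},
]
print(flag_retransmissions(capture))
```

The check only sees that two identical segments passed the capture point; it has no way of knowing whether the ESP or the access point produced the second copy.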

Heinz-Peter commented 4 years ago

Hmmmmm, I wonder why version 147.8 works under exactly the same conditions and in the same environment...... I have 10 ESPs here with firmware 147.8 in the same network, working for two years without any lost connections etc. See the file above, 2020_02_02_26_191_good_bad_FW.zip. I will check this again this weekend in a completely different environment 200 km away from here.

TD-er commented 4 years ago

Well, the older firmware versions (before March 2018, as I remember) were all running on core 2.3.0 or older, and the LWIP (IP stack) has also been changed to version 2 since then. N.B. I still regret that I moved on and have not been able to go back to those older core versions, but that's how it is now.

But anyway, we are now approaching stable versions at last, so this is also something we will be able to fix. The LWIP does a lot of things differently, and one of them is to allow more connections. This does put a strain on the resources, and I guess that's also one of the reasons we now see disconnects like these. Another thing I changed in the WiFi connectivity is that we now react to WiFi events. This does make the unit a lot (!!) more responsive, but it may also result in disconnects we did not notice before.

So it is very hard to compare the version from now with a version of over 2 years ago. The versions do differ so much, it can be just about anything causing these issues.

TD-er commented 2 years ago

Should be fixed by these PRs:

Please let me know if it is indeed fixed now... finally