letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.26k stars 2.21k forks source link

[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects #1987

Closed clumsy-stefan closed 5 years ago

clumsy-stefan commented 5 years ago

Summarize of the problem/feature request

When there is a lot of traffic on the WiFi network or the node is too busy it seems that some send/ack frames on layer 2 get lost and are net or not in time resent by the ESP. Therefore the connection on layer 2 is dropped by the accesspoint.

The ESP does not seem to handle this situation correctly and still tries to send data to the controller/server. This increases the load on the node to 100% and a renegotiation of the WiFi handshake fails (possibly due to not enough time in the WiFi core to do the handshake).

After some time (1-2min) the ESP runs into an exception (mostly 3 or 29) and reboots. Depending on the state of the WiFi and AP the connection to the AP is never established anymore.

See also discussion here with detailed information about the issue and possible workaround

Expected behavior

The ESP should check for that condition and reinitiate a handshake/connection to the AP before continuing to send data to the controller.

Actual behavior

The ESP sends data to the controller until it raises an exception

Steps to reproduce

  1. Reduce the time to wait for a frame ack on the router (eg. on Mikrotik set distance to "indoors" or below 5(km)
  2. make a lot of ESP's (~20) send regular data to a controller
  3. wait for it to crash

Problem persists after powercycle as well as normal reboots.

Current workaround is increasing the time for frame ack's to a higher value (eg. on Mikrotiks set the "distance" value of the interface to 50(km).

System configuration

Hardware: wemos D1 mini, Sonoff Basic, Sonoff Pow, Wemos Pro, others

ESP Easy version: SELF COMPILED!! Latest GIT version! esp8266 core 2.4.2 LWIP 2.0.1 low memory

Rules or log data

All debug logs and other information documented in #1957 See also PR #1979 for additional debug feature and basic check of sending data.

TD-er commented 5 years ago

Do you have the WiFi channel fixed, or is it variable? I've been reading on the trouble-shooting page of Tasmota and they strongly advice to have the WiFi channel fixed. If channel 14 is not usable, then something regarding regional settings may have changed somewhere, since only channel 1-11 are allowed in all countries. Also channel 1, 6 & 11 are actually the only channels usable without interference from other channels.

0ki commented 5 years ago

only channels 1-10 work, not 11. See this: https://github.com/letscontrolit/ESPEasy/issues/1337#issuecomment-394118989

Wifi channel is fixed, of course.

Any channel can have interference from other channels, no way to control it. Hell, there can even be interference from the same channel.

btw, regarding the issue I'm facing - wifi status led (inversed GPIO13) is normally solid blue, but when the device is offline, it goes in the pattern 0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,... of brightness.

NorbertRoller commented 5 years ago

Fixed IP and DCS, but the channels never changed in the past.

Von: Gijs Noorlander [mailto:notifications@github.com] Gesendet: Freitag, 11. Januar 2019 21:47 An: letscontrolit/ESPEasy Cc: Subscribed Betreff: Re: [letscontrolit/ESPEasy] [BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects (#1987)

Do you have the WiFi channel fixed, or is it variable? I've been reading on the trouble-shooting page of Tasmota and they strongly advice to have the WiFi channel fixed. If channel 14 is not usable, then something regarding regional settings may have changed somewhere, since only channel 1-11 are allowed in all countries. Also channel 1, 6 & 11 are actually the only channels usable without interference from other channels. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/1987#issuecomment-453651579 , or mute https://github.com/notifications/unsubscribe-auth/AomgSxPjVKWrp0MSQTtbESKxFqaPVt90ks5vCPgxgaJpZM4YDivP the thread. https://github.com/notifications/beacon/AomgS_fhHQnmg1AdpX11mVL9yNra8S_kks5vCPgxgaJpZM4YDivP.gif

jimmys01 commented 5 years ago

Mikrotik stuff give great power but even default settings for wifi are not right.

I don’t generally have disconnects (except when i had this PMKID thing)

It would be insightfull anyway to post your AP settings anyway :-)

0ki commented 5 years ago

Those are good tips. I am quite well versed with networks though and even more experienced in mikrotik and I've configured all as you say except group key.

Group key update only has to do with the group key, not wit the unicast key where the problem lies as well, but I've downgraded that setting for the purposes of the experiment.

jimmys01 commented 5 years ago

@0ki Cool didn't know you where an expert on RouterOS!

0ki commented 5 years ago

I've concluded that it's clearly a bug in the ESPEasy software. You can see the problematic behaviour on the right that matches with RouterOS disconnecting the misbehaving client on the left. The radio spectrum is clear as you see. test

Unfortunately I don't have the time to delve into ESPEasy codebase right now, but I trust this comparison of bad vs good communication will be of use. Legend: red - access point | blue - espeasy | green - another device on my iot network test2

I believe that ESP_easy for some reason decides to switch to AP mode after receiving assoc response (frame 380 on left) and not continue the four-way handshake. Note the change of espeasy macaddress as well - the first octet changes from 80 to 82.

TD-er commented 5 years ago

@0ki

I've concluded that it's clearly a bug in the ESPEasy software.

Me too, but I have not an idea where the bug should be. It seems like ESPeasy 's internal state concerning disconnected state is not in sync with the true state. It is very well possible I've made some error in writing that code, or it is a bug in the core libraries.

0ki commented 5 years ago

Which file specifically do you mean by "that code"? If you point to a file or two, I could take a look at it.

Otherwise - since now you see the exact moment where it happens, I hope you can take a closer look at it.

clumsy-stefan commented 5 years ago

As I'm olso digging into this issue (for about 2 months now) I'm not completely sure anymore if it's the ESPEasy code itself. more an interaction between ESPEasy and the WiFi core/library... My debugs/logs show that the "Disconnect" always happens after a group key exchange / rekeeying. BUT the callback is only called after the rekeying timeout and subsequently after the authentication failed... So curently my guess is, that ESPEasy does not leave enough cpu-time to the core to actually do the rekeeying... up to now I didn't find a way though to find out when the node is in this rekeeying node or in the reauth mode and therefore beeing able to add enough delay() calls so the rekeeying reauth is succesfull...

0ki commented 5 years ago

To help pinpoint the problem timewise - it was working ok with:

Build   20100 - Mega (core 2_3_0)
GIT version mega-20180224
Plugins 48 [Normal]
Build Md5   719b94a4d6bc257b86916e4989eed3a0
Md5 check   passed.
Build time  Feb 24 2018 03:03:12

And by ok I mean that not only there were no disconnect, but also that connecting to hidden AP on channel 14 was not a problem.

0ki commented 5 years ago

@clumsy-stefan, I fully support catching the problem that causes it do disconnect in the first place, but honestly, there may be a number of reasons why a wifi assoc may drop, of course a group key exchange shouldn't be one of them.

The more pressing problem is - why can't it reconnect after it has disconnected. It can't be a timing issue, as I'm giving espeasy a whopping 5000ms to reply with message 2 of the 4way HS.

clumsy-stefan commented 5 years ago

I think the issue is, that you only get the disconnect AFTER the keyexchange fails... so you never now when the exchange actually starts, that's where you would need to give more time to the core...

in the logs you can see that ESPEasy thinks for quite some time it's still connected, even if it's not (even a ping to the node fails), but the node still tries to send data (and obviously connection fails then)...

the second issue I agree completely is, even if it gets a disconnect, why can't it reconnect with a completely new connection sequence... probbaly the module or core is stuck in some strange state?

EDIT: PS: the 4way HS does ot always fail, (I do it every 10min.), so it seems to be a combination of the state / busyiness of the ESPEasy and the time then the HS happens..

TD-er commented 5 years ago

What I can do, is try to completely drop WiFi (turn off modem also) and start a complete fresh reconnect when an unexpected disconnect is happening. Not sure if it was in this thread, but in one of these issues, someone replied with a tweet of someone at Shelly, who mentioned they had to restart the wifi stack completely after disconnect.

clumsy-stefan commented 5 years ago

@TD-er that's also what I put in in my latest version. within processDisconnect() I added a setWifiMode(WIFI_OFF); not sure though, if that's enough... currently it's running on some test nodes (since about 1h)...

0ki commented 5 years ago

@TD-er that's a cool idea. push a git release and I'll flash my stuff!

I just started reading some of the codebase 5 minutes ago, so this may be irrelevant, but: Also make sure that wifi_connect_attempt is set to 0 at that point.

clumsy-stefan commented 5 years ago

hmm... adding some delay(0) in backgroundtasks() after each of the subroutine calls, seems to further stabilize the first issue a bit (4way-HS). Not sure if it is a coincidence tough... @TD-er could it be, that backgroundtasks() is blocking execution at some point in time?

clumsy-stefan commented 5 years ago

generally speaking: adding delay(0) in all kind of different places where the timingstats show long processing times seems to improve the 4way-HS handling a lot. I have more SW/HW watchdogs kicking in now than rekeeying issues (but still have some)... symptomatically so I would say it's really related to the core/wifi part not getting enough cycles to actually do (quite swift) things like rekeeying and the AP then getting tired of waiting for an answer...

What I also see is that sometimes the client.connect() or the sendData() can take quite a lot of time (up to 2sec). In some documentation I found that if something takes more than 50ms, you should call delay(0) inbetween.... but you can't split these calls as they are performed by the library...

There is also another callback onStationModeAuthModeChanged in the core. Probably it would be worth looking at, as this one is probbaly triggered before the actual disconnect happens. but it's very little documented... Anyone have experience with this?

EDIT: it seems to be a coincidence/race condition with some other things beeing done at the time when the AP requests a rekeying.... but again just observations...

TD-er commented 5 years ago

I don't know how often these requests sent by the AP are retransmitted. Normally the beacon signal from the AP is sent every 102.4 msec. If some tasks takes more than that, the ESP misses one of such beacon frames. So I'm not sure how bad this is. I do know that ARP requests are missed sometimes, which leads to issues that a switch does not know anymore how to route packets to that MAC address and thus renders the ESP unreachable, while it can still send data itself.

0ki commented 5 years ago

missing beacon frames is nothing. it should not even need beacon frames as it should be actively probing for my hidden network anyway.

TD-er commented 5 years ago

I am no expert in WiFi. As I understood it, the beacon interval is the only -somewhat- guaranteed moment the ESP is trying to listen to the WiFi and inbetween it will try to sleep as much as possible.

And as I understood it, the only difference between 'hidden' networks and 'not hidden' networks is that the SSID is not being broadcast. So in fact it is connecting only based on BSSID (MAC-address of AP). That's also what the ESP does when connecting to (not hidden) WiFi networks, with the only difference that it will perform a scan before and try to find a BSSID which has a SSID which matches the given one. When performing an automatic reconnect, it will try the last known BSSID + channel combination first without performing a scan. In other words, it only differs when making a connection.

So to my knowledge, as long as there is a connection to a network, it will also listen to the beacon frames.

jimmys01 commented 5 years ago

I am no expert in WiFi. As I understood it, the beacon interval is the only -somewhat- guaranteed moment the ESP is trying to listen to the WiFi and inbetween it will try to sleep as much as possible.

And as I understood it, the only difference between 'hidden' networks and 'not hidden' networks..........

Oh that reminds me something that may be quite important. I run the my ESPs in a hidden SSID By the way still no disconnects or reboots on my nodes (trying 5 min group key updates)

TD-er commented 5 years ago

One of my nodes was rebooted at December 5th due to a power-outage.

This is from that one:

Unit: 5
Local Time: 2019-01-15 23:27:32
Uptime: 41 days 12 hours 46 minutes
Load: 15.80% (LC=9135)
Free Mem: 12160 (5360 - sendContentBlocking)
Free Stack: 3552 (1520 - SensorSendTask)
Boot: Cold boot (0)
Reset Reason: External System

Network ❔ Wifi: | 802.11N (RSSI -67 dB) IP config: | DHCP IP / subnet: | 192.168.1.89 / 255.255.255.0 GW: | 192.168.1.1 Client IP: | 192.168.1.67 DNS: | 192.168.1.1 / 0.0.0.0 Allowed IP Range: | All Allowed STA MAC: | 5C:CF:7F:B1:3F:12 AP MAC: | 5E:CF:7F:B1:3F:12 SSID: | Lurch4 (34:31:C4:B1:8D:C7) Channel: | 11 Connected: | 20h04m Last Disconnect Reason: | (202) Auth fail Number reconnects: | 61 Firmware Build:⋄ | 20102 - Mega Libraries:⋄ | ESP82xx Core 2_4_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3 PUYA support GIT version:⋄ | mega-20181109 Plugins:⋄ | 46 [Normal] Build Md5: | 47f19d99d0c3f083314b45668b1f5c Md5 check: | passed. Build time:⋄ | Nov 9 2018 03:23:22 Binary filename:⋄ | ESP_Easy_mega-20181109_normal_ESP8266_4096.bin

As you can see, on average more than one disconnect a day. It is connected to an AVM Fritzbox 1750E

clumsy-stefan commented 5 years ago

I think the beacon frames are not so critical... I'm still not clear why after a disconnect the normal WiFi.begin() sequence always fails with a '(2) Auth expire'.

The only indication is, that CPU Load at this point always shows 100% and hints that the core just does not get enough time to do the handshake...

TD-er commented 5 years ago

My guess is that there's something wrong in the core libraries (LWIP or Arduino or ESP) and we should reset the wifi. Thus turn off wifi and start all over again. Maybe also wait between those steps to make sure buffers are emptied or at least flushed and outstanding requests are actively been cancelled.

0ki commented 5 years ago

@TD-er can you push this change , so I can see if it works?

TD-er commented 5 years ago

The change discussed in the last 2 posts? It hasn't been implemented yet, but I can try this evening to code it and make a test build.

0ki commented 5 years ago

Yep, resetting the wifi.

clumsy-stefan commented 5 years ago

@TD-er wrote:

Thus turn off wifi and start all over again. Maybe also wait between those steps to make sure buffers are emptied or at least flushed and outstanding requests are actively been cancelled.

Not sure aout flushing... since I removed all client.flush() calls before the client.stop() It also seemed that the disconnect happens less often. While reading some documentation I saw that the flush emties all (tcp)input buffers, so theoretically it could happen that replies get lost...

LeeNX commented 5 years ago

Just adding my 2c, I have a few Mikrotik APs, with a few sonoffs running espurna and a few NodeMCUs running ESPEasy. I had to add an extra AP for the ESP devices, because during a high traffic period in my network, like backups, I would lose my ESP devices. I think this was WiFi signal strength and traffic related. Once the extra AP was added, I don't remember seeing this problem again.

But my NodeMCUs and ESPEasy devices do have some disconnects. I am going to setup a netdata ping monitor to devices and see if I can figure out when, how and why this might be happening and see if I give some info back.

LeeNX commented 5 years ago

Another note, I don't have 40+ devices like @clumsy-stefan , only about 5 users and their mobile devices and the odd smart device like TV, but could it be that either the Mikrotik device is overloaded or the WiFi RF network saturated?

I wish I had a way to see the amount of RF noise. I have an app (WiFi Analyzer) that shows me the number of APs and how they spread of the WiFi channels. I used this tool to set my WiFi channels for different WiFi AP's.

But can't really tell if there might be other RF interference.

As for the idea of overloaded Mikrotik device, I am using MikroTik 951UI-2HND for my main router and MikroTik RB9412nD hAP lite as my ESP AP.

I wonder if that info is useful in any way?

clumsy-stefan commented 5 years ago

@LeeNX thanks for the hints. my nodes are distributed over 3 AP's and the RF is somewhat balanced (max. 20 clients per AP or so), so that there are as little interference as possible (of course there can't be none...). So I don't think the MT's are overloaded (I work in a company providing network backbone connectivity and WLAN for professional events, etc. so I have quite some in depth tests of the infrastructer), but I agree that overloaded air-time could be an issue... however even if so, a node never beeing able to reconnect for >250 times after a disconnect and then after rebooting it it reconnects immediately I think can't be an issue of airtime... besides I don't see that on any other device... so I still think it's some coincidence or race condition between the state of the internal wifi chip, the amount of cpu-time the wifi core gets and what other tasks are running on the ESPEasy part...

LeeNX commented 5 years ago

@clumsy-stefan I concur with your insights. I just wanted to point out that using the little Mikrotik AP with 40+ nodes, could have been a Mikrotik CPU or RAM limited, but that does not seem to be the problem.

I am wondering if dpeloying a ESP32 and NodeMCU device with nothing other than the base ESPEasy core and let that run, so that we could test ESP CPU and no other plugin influence. See what I can deploy and netdata ping test.

If this gives us any useful info, I will report back.

Thanks Guys!!!

TD-er commented 5 years ago

I do have a node here running next to nothing (IR RX/TX) and it is running for weeks without an issue. The other node which has an uptime of over 40 days is running Domoticz MQTT and BME280 + MH-Z19 CO2. So it will probably also depend on which plugin is running and how often and maybe also if the read-interval always matches the read-interval of other plugins.

I already have set some periodical function calls (10/sec, 1/sec and 30sec) with an initial offset in the scheduler so they keep running as much as possible in different loop calls.

Maybe I should also add some interrupt timer driven event like Tasker does (or simply use tasker instead of the scheduler I wrote) and set one task at 10 msec interval to run delay(0). Only thing is, it may interrupt other GPIO transfers and thus disturb sensor readings.

clumsy-stefan commented 5 years ago

that would probably solve the 4way-HS issue, but I guess not the reconnect problem... that's still the most strange one... I just had my test node failing again for 12 hours trying to reconnect for >300times to the AP without success. after (soft-)rebooting it, it reconnected instantly (reboot+reconnect took some 5sec. or so)...

TD-er commented 5 years ago

I just found this issue which looks quite related to what we're seeing here: https://github.com/esp8266/Arduino/issues/5527

clumsy-stefan commented 5 years ago

looks similar, yes... but calling wifi OFF does not seem to change it (tried that), currently I'm trying with WiFi.setOutputPower(0) before calling WiFi.begin() as suggested here https://github.com/esp8266/Arduino/issues/2235 and here https://github.com/esp8266/Arduino/issues/2186#issuecomment-233853152 still waiting for it to happen now tough...

giig1967g commented 5 years ago

it's similar but it's not the same. Infact in our case rebooting the AP solves the issue, in the issue posted by Gijs, rebooting the AP does not solve the issue.

clumsy-stefan commented 5 years ago

@giig1967g rebooting the AP does not solve it for me! only rebooting the node makes it reconnect again (in my environment though)...

TD-er commented 5 years ago

@giig1967g In the issue mentioned by @clumsy-stefan this reboot of the AP is also mentioned: https://github.com/esp8266/Arduino/issues/2235#issuecomment-248851270

So it seems we have several issues here at hand.

giig1967g commented 5 years ago

@clumsy-stefan Ah ok. And in my case only with Mikrotik... :( @TD-er maybe you should ask which AP brand/model they are using...

TD-er commented 5 years ago

@giig1967g I also had some nodes here being unreachable, but the interval of those events was much more than 10 days, so it is a bit hard to say if it is related to one of these issues. About half of my nodes now are connected to a Mikrotik and the other half is connected to Fritzbox 1750E

clumsy-stefan commented 5 years ago

@giig1967g I also only have MT's... and it seems to happen more often to the nodes with weaker signal and nodes with GPIO's in use (specifically PCF's)...

0ki commented 5 years ago

I wish I had a way to see the amount of RF noise. I have an app (WiFi Analyzer) that shows me the number of APs and how they spread of the WiFi channels. I used this tool to set my WiFi channels for different WiFi AP's.

But can't really tell if there might be other RF interference.

You can do /interface wireless registration-table print interval=1 on your mikrotik to see the snr.

clumsy-stefan commented 5 years ago

unfortunately the WiFi.setOutputPower(0) did not help either... after nearly 6 hours smoothly running (without any changes to the wifi infrastructure, suddenly it get's into the deadlock again (see serial output below) from which it never ever recovers, except if by coincidence a SW or HW WDT kicks in...

this only seems to happen after a "Group key update timeout" or a "4way handshake failed". if I kick the node off the AP manually it also gets a "auth expire" but can reconnect instantly...

serial output (nothe the load goes to 100% then)...

20457441 : WIFI : Disconnected! Reason: '(16) Group key update timeout' Connected for 2h19m
20457593 : WIFI : Connecting clumsy_ap2 attempt #0
20458607 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20459707 : EVENT: WiFi#Disconnected
20459763 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2012 ms
20470762 : SYS  : 31.00,20048.00,66.70,341.00
20470766 : EVENT: sysinfo#rssi=31.00
20470824 : EVENT: sysinfo#mem=20048.00
20470882 : EVENT: sysinfo#load=66.70
20470940 : EVENT: sysinfo#uptime=341.00
20472658 : EVENT: Clock#Time=Thu,14:22
20473264 : WD   : Uptime 341 ConnectFailures 22 FreeMem 20040 WiFiStatus 0
20477609 : WIFI : Connecting clumsy_ap2 attempt #1
20478135 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20482577 : EVENT: WiFi#Disconnected
20482635 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 4771 ms
20503270 : WD   : Uptime 342 ConnectFailures 22 FreeMem 18440 WiFiStatus 0
20505709 : WIFI : Connecting clumsy_ap2 attempt #2
20506235 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20509784 : EVENT: WiFi#Disconnected
20509845 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3879 ms
20530897 : SYS  : 31.00,16248.00,100.00,342.00
20530902 : EVENT: sysinfo#rssi=31.00
20530963 : EVENT: sysinfo#mem=16248.00
20531025 : EVENT: sysinfo#load=100.00
20531087 : EVENT: sysinfo#uptime=342.00
20532688 : EVENT: Clock#Time=Thu,14:23
20533158 : WD   : Uptime 342 ConnectFailures 22 FreeMem 16240 WiFiStatus 0
20533716 : WIFI : Connecting clumsy_ap2 attempt #3
20534242 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20537788 : EVENT: WiFi#Disconnected
20537851 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3865 ms
clumsy-stefan commented 5 years ago

just a quick remark: adding wifi_set_phy_mode(PHY_MODE_11G); and forcing the node into 802.11g mode seems to be at least a workaround. I have no reboots of any of the units since a couple of hours and also once or twice a unit recovered from the group key timeout issue above... probably related to

• ESP8266 SoftAP only support 802.11b/g.

which is noted in the docs...

I'll update later/tomorrow if it stays stable...

TD-er commented 5 years ago

As already noted (a few times) by someone else, the sensitivity should also be increased by switching to 802.11G. My objective against it is (was?) that most AP have issues when switching between N, G and B. Do you have mixed clients? (G and N) active on the same AP?

clumsy-stefan commented 5 years ago

I use different modes on all Mikrotik. B/G/N/AC mixed, also both bands 2/5GHz including virtual AP's defined... Different clients connect with different modes on the same AP, without issues (except the esp8266)..

clumsy-stefan commented 5 years ago

good news (at least for me)... my nodes ran for the last 12 hours without disconnects/reboots/freezes!

forcing the node into 802.11g seems to work smoothly together with the MikroTik AP's. at least this is a simple workaround (as long as you don't need the additional speed of 802.11n).

if someone want's to test themselves, just add in ESPEasyWifi.ino around line 644 in function tryConnectWiFi() the following line (after setupStaticIPconfig();): WiFi.setPhyMode(WIFI_PHY_MODE_11G);

I can only guess that the issues raises because the SoftAP mode does not support 802.11n and the node is connected in N mode the core gets "confused" somehow... but I guess this is an issue to raise with the core-people...

For ESPEasy this could be made configurable I guess (@TD-er ?) so that it could be choosed on the config page which mode the ESP should run in (B/G/N)...

TD-er commented 5 years ago

Yep I will add a selection option in the advanced settings. I forgot who asked it first, but will look it up and credit him too in the commit ;)