letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

[BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects #1987

Closed clumsy-stefan closed 4 years ago

clumsy-stefan commented 5 years ago

Summary of the problem/feature request

When there is a lot of traffic on the WiFi network or the node is too busy, it seems that some send/ack frames on layer 2 get lost and are not resent by the ESP, or not resent in time. The access point therefore drops the connection on layer 2.

The ESP does not seem to handle this situation correctly and still tries to send data to the controller/server. This increases the load on the node to 100% and a renegotiation of the WiFi handshake fails (possibly due to not enough time in the WiFi core to do the handshake).

After some time (1-2 min) the ESP runs into an exception (mostly 3 or 29) and reboots. Depending on the state of the WiFi and the AP, the connection to the AP may then never be re-established.

See also discussion here with detailed information about the issue and possible workaround

Expected behavior

The ESP should check for that condition and reinitiate a handshake/connection to the AP before continuing to send data to the controller.

Actual behavior

The ESP sends data to the controller until it raises an exception

Steps to reproduce

  1. Reduce the time to wait for a frame ack on the router (e.g. on MikroTik set distance to "indoors" or below 5 (km))
  2. Make a lot of ESPs (~20) send regular data to a controller
  3. Wait for it to crash

Problem persists after a power cycle as well as after normal reboots.

Current workaround is increasing the time for frame acks to a higher value (e.g. on MikroTik set the "distance" value of the interface to 50 (km)).

System configuration

Hardware: wemos D1 mini, Sonoff Basic, Sonoff Pow, Wemos Pro, others

ESP Easy version: SELF COMPILED!! Latest GIT version! esp8266 core 2.4.2 LWIP 2.0.1 low memory

Rules or log data

All debug logs and other information are documented in #1957. See also PR #1979 for an additional debug feature and a basic check of sending data.

wolverinevn commented 5 years ago

Sounds like I have the same issue. Sometimes my ESPs lose their WiFi connection and stay “lost” from WiFi for hours. I thought they would reboot and reconnect to WiFi thanks to the watchdog, but they don't. A cold boot solves the issue.

TD-er commented 5 years ago

That last thing described by @wolverinevn is something I have seen happening here too.

clumsy-stefan commented 5 years ago

As we discussed in #1957, I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I've seen all kinds of strange behaviour before changing my AP to some increased timings...

wolverinevn commented 5 years ago

As we discussed in #1957, I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I've seen all kinds of strange behaviour before changing my AP to some increased timings...

Hope you will find the solution. One of my NodeMCUs hung and has been gone from the router for 5 hours now, for no apparent reason. I hoped the watchdog would recover it, but nothing happened. I have to reboot it manually now. Very annoying!

clumsy-stefan commented 5 years ago

@wolverinevn when you have access to the node again, go to Tools => Advanced and set the "Connection Failure Threshold" to something other than 0 (I suggest something between 50 and 100, depending on the number of tasks you have). This does not actually fix the problem, but it significantly increases the chances that the node reboots and reconnects!

The other workaround would be to tweak some parameters in your access point, if your type of AP allows that...

TD-er commented 5 years ago

@clumsy-stefan Should we set the default in ESPeasy to this level too?

And maybe we should also display this value in the sysinfo page and make it available for rules?

clumsy-stefan commented 5 years ago

@TD-er setting it to some level by default is probably not a bad thing if it's not too low (it can always happen that a connection fails).

When debugging the issue I thought about how this is done: currently every unsuccessful connection increases the counter and every successful connection decreases it. I wondered whether it would be more logical to reset it to 0 as soon as a successful connection happens, but I guess it's a bit of an ideological question which makes more sense.

The issue with that number is, if you have 10 tasks, each of them with a retry count of 10 and a resend delay of 100ms, the reboots happen quite quickly if there is a real comms problem (100 retries within about 10 sec.).

Now if you have, for example, always 5 comms failing and 1 successful, you'll be continuously increasing the connection failures. If this happens all the time you will reboot the node sooner or later, even though all data could be delivered.
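
To make the counter arithmetic above concrete, here is a minimal standalone C++ sketch that replays the "5 failures, 1 success" pattern against both policies discussed: decrement on success (as described) and reset on success. `FailureCounter`, `CounterPolicy` and `attemptsUntilReboot` are hypothetical names invented for this illustration and are not taken from the ESPEasy source; flooring the counter at zero is also an assumption.

```cpp
// Hypothetical model of the connection-failure counter; not ESPEasy code.
enum class CounterPolicy { DecrementOnSuccess, ResetOnSuccess };

struct FailureCounter {
    CounterPolicy policy;
    int threshold;  // models "Connection Failure Threshold"; 0 disables the reboot
    int failures;

    FailureCounter(CounterPolicy p, int t) : policy(p), threshold(t), failures(0) {}

    // Records one connection attempt. Returns true once the threshold is
    // reached, i.e. the point where the node would reboot.
    bool record(bool success) {
        if (!success) {
            ++failures;
        } else if (policy == CounterPolicy::ResetOnSuccess) {
            failures = 0;
        } else if (failures > 0) {
            --failures;  // assumption: the counter never goes below zero
        }
        return threshold > 0 && failures >= threshold;
    }
};

// Replays a repeating pattern of `failsPerCycle` failed attempts followed by
// one successful attempt. Returns the 1-based attempt number at which the
// reboot would trigger, or -1 if it never triggers within `maxAttempts`.
int attemptsUntilReboot(FailureCounter counter, int failsPerCycle, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        const bool success = (attempt % (failsPerCycle + 1)) == 0;
        if (counter.record(success)) {
            return attempt;
        }
    }
    return -1;
}
```

With a threshold of 100 and 5 failures per success, the decrement policy in this model gains 4 per cycle and reaches the threshold on attempt 148, while the reset policy never exceeds 5 and so never reboots. Which behaviour is preferable depends on whether a node that still delivers one message in six should count as healthy.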

The main issue I'm seeing though is that somehow the node does not realize that the connection on layer 2 is actually gone and continues to send data (I guess). Besides this, what I realized tonight: what happens to syslog (and other comms like NTP etc.) if there is no WiFi connection? Is this also stopped? This could explain why my nodes suddenly jump to 100% CPU when layer 2 is gone. Probably no more task data is sent, but it tries to get rid of the UDP syslog packets and can't... just a guess though...

Sorry, long text for two simple questions... In short: default level: yes, I'd set it to the max (100) or so by default... if everything is OK it does no harm, if not, the unit gets accessible again... sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

wolverinevn commented 5 years ago

@clumsy-stefan I've already set it to 50. Let's wait. ;)

clumsy-stefan commented 5 years ago

@wolverinevn hmm... 50 should be reached quite quickly depending on the number of tasks, interval and retries you have (5-15 min.)... If this does not help, I think the node is actually not frozen, but it just can't reconnect to the network. I had this also even after a WD reboot. Can you see if the node tries going to AP mode? Do you see the AP-WLAN of the node?

TD-er commented 5 years ago

sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects. After all, it is a performance statistics value

clumsy-stefan commented 5 years ago

I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects. After all, it is a performance statistics value

ah, yes, agree, that would make sense! that's also a bit related to the issue #1993. Having a plugin that sends a number of system/performance variables regularly to the controller (without wasting the limited available tasks) would be really great!

wolverinevn commented 5 years ago

@wolverinevn hmm... 50 should be reached quite quickly depending on the number of tasks, interval and retries you have (5-15 min.)... If this does not help, I think the node is actually not frozen, but it just can't reconnect to the network. I had this also even after a WD reboot. Can you see if the node tries going to AP mode? Do you see the AP-WLAN of the node?

I have 9 tasks, 3 of them are Dummy and MQTT_import. I think the rules are a little bit busy with computing and reading sensors; I tried to limit mqtt_publish by calling it in rules every few minutes. Load is around 29%. As I remember, the last time it was frozen was this morning, and I couldn't find the AP of ESPEasy (if by AP-WLAN you mean it operating in AP mode). My setup (network, location of ESP) was working great with another NodeMCU running 2.3 or 2.4, which was released in March.

Uptime is 7hrs and 20mins, RSSI is -71dBm, there are a few WiFi networks around me.
Last Disconnect Reason (200) Beacon timeout
Number reconnects 35

clumsy-stefan commented 5 years ago

@wolverinevn the problem with this issue is that it happens completely randomly. I have ~30 nodes running; some of them faced the issue, some of them did not, some rebooted, some went to AP mode...

It really seems to be a combination of how busy the node is, how busy the air is (e.g. number of WiFi devices) and how your AP actually handles certain conditions (missing layer 2 acks etc.)...

so I guess until we find a way within the application (ESPEasy) to reliably detect this condition and act on it, there is no "real" solution....

clumsy-stefan commented 5 years ago

@wolverinevn PS: you're not using mikrotik AP's by chance?

TD-er commented 5 years ago

@wolverinevn About the number of reconnects (in your edit): 35 reconnects in about 8 hours is a lot. I have nodes here running for days which only have a handful of reconnects. The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

Connected 19d22h33m
Last Disconnect Reason (202) Auth fail
Number reconnects 1

wolverinevn commented 5 years ago

@wolverinevn PS: you're not using mikrotik AP's by chance?

No. I'm using router running Padavan firmware (kind of ASUS).

@TD-er I know. I'm investigating the reason; it may be noise from a buck module nearby. Another one has 0 reconnects after 2 hours.

clumsy-stefan commented 5 years ago

No. I'm using router running Padavan firmware (kind of ASUS).

Unfortunately I don't know this FW at all... Any chance to tweak layer 2 parameters? Something like frame ack timeouts or similar? Some kind of "distance" setting?

wolverinevn commented 5 years ago

@clumsy-stefan Unfortunately, I don’t see anything like that.

wolverinevn commented 5 years ago

@clumsy-stefan The unit rebooted 2 times last night with the failure threshold set to 50. The good news is that it no longer freezes. Today I will try to improve the WiFi connection with some minor changes in the hardware setup.

Domosapiens commented 5 years ago

3 Wemos units in the same room, connected to the same AP. Reconnects in the last 16 hours or so, With Rule: On WiFi#Connected ....

26 WD reboots and 104 re-connections: muc21_capture

9 WD reboots and 32 re-connections: muc19_capture

2 WD reboots and 40 re-connections: muc14_capture

All have 50 failure threshold set

clumsy-stefan commented 5 years ago

@Domosapiens & @wolverinevn one more thing you can try is increasing the group-key-timeout on your AP (if you have such an option). Normally that's around 5 min. You can try to increase it to 30 min. or even 1h and see if it improves (as long as it's not in a super high security network, which I don't assume if you have IoT devices in it)...

clumsy-stefan commented 5 years ago

@TD-er

The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

I also currently have units that have run for over 3 days now and others that rebooted within a day...

I did see some issues with the rekeying of the group key. It somehow seems that in newer versions of the core it can happen that the rekeying runs into a timeout... however the application should act on this and not go into some high-load, unresponsive mode... but I'm not sure where it's failing..

Domosapiens commented 5 years ago

@clumsy-stefan thanks for your suggestion. I see in my dd-wrt router a parameter "Key Renewal Interval" with value 3600 (in seconds). So that should be fine?

rekeying runs into a timeout

That would explain only an hourly re-connect imho. Thanks to the great rule feature WiFi#Connected we can see this. No idea how older versions performed on this. A hidden problem already for a long time?

clumsy-stefan commented 5 years ago

After setting my group key renewal from 5 min. to 1h the units run much more stably. I assume that with 40 clients connected to one AP, doing a rekey every 5 min. just made the air a bit too busy.. After that I could even decrease the frame-ack timeout again and the units got more responsive again; I also got fewer "connection timeouts" after changing these parameters...

@TD-er I still think this "group key timeout" and the "assoc fail" afterwards need to be captured and handled by the application. I get the feeling that the unit continues to try to send queued messages to the controller/server while trying to do a rekey, which makes the unit too slow so that the rekeying fails... therefore disconnecting on layer 2.

Just a rough idea though. I'm still trying to really pin it down, but that's the closest I could get until now.

clumsy-stefan commented 5 years ago

@TD-er just now I again experienced a node that goes into an indefinite loop of reconnect tries and always gets "Auth expire" (see log below). Interesting is that it obviously realizes it's not connected and does not try to send data to the controller, which means the "failed connect attempts" count does not increase anymore and therefore never hits the failed connection limit.

Also the load always shows 100%, that's why I guess it does not succeed in reconnecting, as it's always too slow to actually do the handshake.... seems like it is chasing its own tail to me... only the memory is slowly decreasing (until, I assume, it crashes because it is out of memory)...

I'm going to test with #2073 and see if this situation occurs again... however, it occurs very rarely on the two serial connected nodes, so that I'm able to really track what's going on...

105587 : EVENT: WiFi#Disconnected
3105617 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 129 ms
3131325 : WD   : Uptime 52 ConnectFailures 84 FreeMem 12976
3134621 : EVENT: WiFi#Disconnected
3134652 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5141 ms
3141487 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3141488 : WIFI : Connecting clumsy_ap2 attempt #53
3142671 : EVENT: WiFi#Disconnected
3142700 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3153443 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3153444 : WIFI : Connecting clumsy_ap2 attempt #54
3153713 : EVENT: WiFi#Disconnected
3153743 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 157 ms
3161324 : WD   : Uptime 53 ConnectFailures 84 FreeMem 12976
3165518 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3165519 : WIFI : Connecting clumsy_ap2 attempt #55
3178542 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3178543 : WIFI : Connecting clumsy_ap2 attempt #56
3179728 : EVENT: WiFi#Disconnected
3179757 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3189704 : SYS  : 31.00,12808.00,100.00,53.00
3189708 : EVENT: sysinfo#rssi=31.00
3189743 : EVENT: sysinfo#mem=12808.00
3189773 : EVENT: sysinfo#load=100.00
3189804 : EVENT: sysinfo#uptime=53.00
3191297 : WD   : Uptime 53 ConnectFailures 84 FreeMem 12800
3191441 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3191442 : WIFI : Connecting clumsy_ap2 attempt #57
3191576 : EVENT: WiFi#Disconnected
3191606 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 125 ms
3200507 : EVENT: Clock#Time=Tue,10:25
3204458 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3204459 : WIFI : Connecting clumsy_ap2 attempt #58
3217493 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3217493 : WIFI : Connecting clumsy_ap2 attempt #59
3218677 : EVENT: WiFi#Disconnected
3218707 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3221325 : WD   : Uptime 54 ConnectFailures 84 FreeMem 12800
3245444 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3245445 : WIFI : Connecting clumsy_ap2 attempt #61
3249673 : SYS  : -80.00,12632.00,100.00,54.00
3249677 : EVENT: sysinfo#rssi=-80.00
3249709 : EVENT: sysinfo#mem=12632.00
3249741 : EVENT: sysinfo#load=100.00
3249772 : EVENT: sysinfo#uptime=54.00
3250620 : EVENT: WiFi#Disconnected
3250650 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5130 ms
3251307 : WD   : Uptime 54 ConnectFailures 84 FreeMem 12624
3259435 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3259436 : WIFI : Connecting clumsy_ap2 attempt #62
3260490 : EVENT: Clock#Time=Tue,10:26
3260650 : EVENT: WiFi#Disconnected
3260679 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1121 ms
3273521 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3273522 : WIFI : Connecting clumsy_ap2 attempt #63
3273659 : EVENT: WiFi#Disconnected
3273689 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 125 ms
3281281 : WD   : Uptime 55 ConnectFailures 84 FreeMem 12624
3287445 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3287446 : WIFI : Connecting clumsy_ap2 attempt #64

uzi18 commented 5 years ago

It is hard to connect with an RSSI of -80.

clumsy-stefan commented 5 years ago

The RSSI value is not reliable when the ESP node is not connected! The same node has less than -70 when connected.... on deep-sleep nodes they even show the "not connected" default of +31 when sending the values! And after a reboot it runs without issues... (without moving it...) If you read above, it's a layer 2 issue which is reproducible when you tweak parameters on the AP....

PS: you can try it yourself! Just connect 20-30 nodes and lower the group-key timeout to 5 min. and the frame reply threshold value to something like 7... see what happens!

uzi18 commented 5 years ago

I think this is the wrong place for layer 2 issues; you should file an issue on the esp8266 core or, maybe better, on the NONOS SDK project. But please don't shout here.

clumsy-stefan commented 5 years ago

It only happens with ESPEasy and not with other types of firmware I tried. I assume the node gets too busy at some point in time and does not leave enough time for the core to do the rekeying (everything explained a number of times), so feel free to read the debugs and explanations... but I agree, you don't need to answer actually... PS: @uzi18 have you ever had 30 nodes running successfully at the same time?

clumsy-stefan commented 5 years ago

@TD-er: in ESPEasyWifi.ino lines 650 - 669 the switch statement's default match breaks out of the switch and therefore tryConnectWiFi() returns true even though WiFi.status() is not necessarily WL_CONNECTED but can be any other state (only 2 false return states are checked..).

Changing this and returning true only if WiFi.status() actually returned WL_CONNECTED solves at least one of the layer 2 disconnect/exception issues I'm facing!

What do you think? Am I missing something, or why should tryConnectWiFi() return true when WiFi.status() is not WL_CONNECTED?
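
The difference between the two return conventions can be sketched in standalone C++ as follows. This is a model of the behaviour described in this comment, not the actual contents of ESPEasyWifi.ino: `WifiStatus` is a local stand-in for the Arduino core's `wl_status_t`, both function names are hypothetical, and which two states the existing code treats as failures is an assumption (the comment only says that two are checked).

```cpp
// Local stand-in for the Arduino core's wl_status_t so this compiles on its own.
enum class WifiStatus {
    Idle,            // WL_IDLE_STATUS
    NoSsidAvailable, // WL_NO_SSID_AVAIL
    ScanCompleted,   // WL_SCAN_COMPLETED
    Connected,       // WL_CONNECTED
    ConnectFailed,   // WL_CONNECT_FAILED
    ConnectionLost,  // WL_CONNECTION_LOST
    Disconnected     // WL_DISCONNECTED
};

// Behaviour as described above: two states report failure (assumed here to be
// NoSsidAvailable and ConnectFailed); every other state falls through the
// default branch and is reported as success.
bool connectResultPermissive(WifiStatus status) {
    switch (status) {
        case WifiStatus::NoSsidAvailable:
        case WifiStatus::ConnectFailed:
            return false;
        default:
            break;
    }
    return true;
}

// Proposed behaviour: report success only when the status really is "connected".
bool connectResultStrict(WifiStatus status) {
    return status == WifiStatus::Connected;
}
```

In this model a status such as `Disconnected` or `ConnectionLost` is reported as success by the permissive version and as failure by the strict one, which is the mismatch the proposed change removes.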

TD-er commented 5 years ago

@clumsy-stefan Good to see you're still digging into the WiFi code.

https://github.com/letscontrolit/ESPEasy/blob/5ee18ec556c9c58802af29f5fd78593905ef35c1/src/ESPEasyWifi.ino#L604-L671

The initial idea of this function was to start the WiFi connect sequence. Maybe its function has a bit changed throughout all the changes since. But you may be on to something here. I think it may be a proper change to return true (in the end of that function) only if the status returns it is connected.

Can you describe what seems to be the other layer 2 issue you're facing?

clumsy-stefan commented 5 years ago

I can't say exactly when the other exception occurs, but at least there is a difference if this function returns true only when the status is WL_CONNECTED. I attach the debug output from before and after the change..

before

156874 : EVENT: WiFi#Disconnected Processing time:74 milliSeconds
156876 : WIFI : Disconnected! Reason: '(0) Unknown' Connected for 2 m 32 s
156877 : WIFI  : Arduino wifi status: WL_CONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
157208 : WIFI : Connecting clumsy_ap2 attempt #0
157212 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
scandone
160069 : Fatal exception 9(LoadStoreAlignmentCause):
epc1=0x40105cd4, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000003, depc=0x00000000

Exception (9):
epc1=0x40105cd4 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000003 depc=0x00000000

after

108304 : EVENT: WiFi#Disconnected Processing time:73 milliSeconds
108307 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2014 ms
108308 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
109217 : WIFI : Connecting clumsy_ap2 attempt #1
109220 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt

connected with clumsy_ap2, channel 9
dhcp client start...
112113 : WIFI : Connected! AP: clumsy_ap2 (4C:5E:0C:39:F6:55) Ch: 9 Duration: 2895 ms
112115 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_CONNECTED
ip:10.0.10.117,mask:255.255.0.0,gw:10.0.0.2
113751 : WIFI : DHCP IP: 10.0.10.117 (wemos-mini-17-17) GW: 10.0.0.2 SN: 255.255.0.0   duration: 1636 ms
113765 : EVENT: WiFi#Connected

Hmm.. I just realize, that after this the internal status does not (yet) match the Arduino status... this could also lead to issues I guess...

uzi18 commented 5 years ago

@clumsy-stefan this status exists because we can't rely on the Arduino WiFi status; that's why @TD-er introduced the ESPEasy status. But OK, maybe we can try to double check whether every status in the code is properly checked.

TD-er commented 5 years ago

It could be this wifi status has been fixed in core 2.5.0, so maybe our own status has become obsolete. That would be nice, since it makes the WiFi code rather complicated and thus prone to errors.

Edit: I'm looking at this error you gave: Fatal exception 9(LoadStoreAlignmentCause): One of the more recent fixes in core 2.5.0 is about the constructor of IPAddress, which should fix problems when the alignment of the given byte sequence isn't 32-bit aligned. Maybe this is something similar?

clumsy-stefan commented 5 years ago

That's one of the guesses I have, that the ESPEasy status is not always in sync with the Arduino status. Especially temporary disconnects on layer 2 (e.g. WiFi rekeyings) are probably not really handled / noticed.

One other thing could be the opposite: ESPEasy thinks it's disconnected and tries to reconnect, but the core is still connected, and this leads to an exception. But I can't prove that one yet...

About the alignment, yes, can be, but I can't nail this down either currently...

So the only thing I'm quite sure about currently is that the return code of tryConnectWiFi() should match the actual connection status, or at least check for WL_CONNECTED...

clumsy-stefan commented 5 years ago

@TD-er I'm somewhat more concerned about

connected with clumsy_ap2, channel 9
dhcp client start...
112113 : WIFI : Connected! AP: clumsy_ap2 (4C:5E:0C:39:F6:55) Ch: 9 Duration: 2895 ms
112115 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_CONNECTED

For me, the last two lines indicate that the core did not yet update the status, even though ESPEasy thinks it did... so it could end up in some race condition here...

after this happens sometimes I start to see a lot of

7989956 : Read settings: ControllerSettings index: 0
7989997 : Read settings: ControllerSettings index: 0
7990130 : Read settings: ControllerSettings index: 0
7990267 : Read settings: ControllerSettings index: 0
7990399 : Read settings: ControllerSettings index: 0
7990531 : Read settings: ControllerSettings index: 0
7990664 : Read settings: ControllerSettings index: 0
7990799 : Read settings: ControllerSettings index: 0
7990938 : Read settings: ControllerSettings index: 0

from which it never recovers...

a bit later it tells me:

8185850 : ip:169.254.37.119,mask:255.255.0.0,gw:0.0.0.0
Read settings: ControllerSettings index: 0
8185975 : WIFI : DHCP IP: 169.254.37.119 (wemos-mini-18-18) GW: (IP unset) SN: 255.255.0.0
8185990 : EVENT: WiFi#Connected

No clue where this comes from... but after this it starts to try to connect to the controller/server which obviously fails until it runs out of tries (100) and reboots...

EDIT: btw, you can force this behaviour if you kick the node off the AP manually... sometimes it just reconnects, sometimes what's described above happens...
EDIT2: sometimes it crashes with Exception 9... so it seems to be some kind of race condition in how exactly it recovers from a disconnect! My fault, I had an addLog() in the onDisconnect() callback...

clumsy-stefan commented 5 years ago

It's more or less always the same situation. After the 4-way handshake fails (rekeying) it never recovers anymore... not sure how to force a recovery from this..

At least adding some delay(100) in processConnect() and processDisconnect(), thus giving the core time to update the WiFi status, brings the WiFi statuses in sync again. This also makes the units get into the below situation much less often!

900695 : WIFI : DHCP renew probably failed
900697 : Reset WiFi.
900699 : WIFI : Connecting clumsy_ap2 attempt #0
901713 : EVENT: WiFi#Disconnected
901772 : WIFI : Disconnected! Reason: '(8) Assoc leave' Connected for 14 m 56 s
902326 : WD   : Uptime 15 ConnectFailures 44 FreeMem 20248 WiFiStatus 0
907048 : EVENT: WiFi#Disconnected
907106 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 6172 ms
907786 : WIFI : Connecting clumsy_ap2 attempt #1
911821 : EVENT: WiFi#Disconnected
911879 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3860 ms
911894 : WIFI : Connecting clumsy_ap2 attempt #2
912793 : EVENT: Clock#Time=Sat,08:29
919974 : EVENT: WiFi#Disconnected
920033 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 7962 ms
920824 : WIFI : Connecting clumsy_ap2 attempt #3
922083 : EVENT: WiFi#Disconnected
922141 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1125 ms
922805 : WIFI : Connecting clumsy_ap2 attempt #4
923138 : EVENT: WiFi#Disconnected
923196 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 133 ms
925831 : WIFI : Connecting clumsy_ap2 attempt #5
931179 : EVENT: WiFi#Disconnected
931238 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5165 ms
931775 : WIFI : Set WiFi to AP+STA
932701 : EVENT: WiFi#APmodeEnabled
932778 : WIFI : AP Mode ssid will be wemos-mini-17_17 with address 192.168.4.1
932778 : WIFI : Connecting clumsy_ap2 attempt #6
933023 : WD   : Uptime 16 ConnectFailures 44 FreeMem 17856 WiFiStatus 0
934065 : EVENT: WiFi#Disconnected
934122 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1123 ms
935712 : WIFI : Connecting clumsy_ap2 attempt #7
936042 : EVENT: WiFi#Disconnected
936106 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 169 ms
938745 : WIFI : Connecting clumsy_ap2 attempt #8
939079 : EVENT: WiFi#Disconnected
939138 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 131 ms
941778 : WIFI : Connecting clumsy_ap2 attempt #9
947130 : EVENT: WiFi#Disconnected
947189 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5140 ms
947725 : WIFI : Connecting clumsy_ap2 attempt #10
948976 : EVENT: WiFi#Disconnected
949035 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1121 ms
951805 : WIFI : Connecting clumsy_ap2 attempt #11
952140 : EVENT: WiFi#Disconnected
952199 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 134 ms
955778 : WIFI : Connecting clumsy_ap2 attempt #12
956115 : EVENT: WiFi#Disconnected
956174 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 142 ms
959734 : WIFI : Connecting clumsy_ap2 attempt #13
960064 : EVENT: WiFi#Disconnected
960123 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 129 ms

clumsy-stefan commented 5 years ago

@TD-er tryConnectWiFi() returns true or false depending on whether the connection was successful or not... however WiFiConnectRelaxed() never actually checks for this...

is this function somehow from before event-based WiFi? it seems like it never reaches the last 2 lines in that function...

TD-er commented 5 years ago

Yes it was some kind of left-over from before the event-based wifi. I think we really should consider having a good look at that WiFi code again, since it isn't as structured as it should be.

clumsy-stefan commented 5 years ago

ok... I'm still debugging what exactly happens due to the 4way handshake timeout and why the node won't reconnect again.... but I think I'm still poking a bit in the dark however finding small bits here and there which could add up though....

clumsy-stefan commented 5 years ago

One other small one seems to be in WifiCheck()... in there, checking for layer 2 connectivity is only done when the IP is not valid anymore (i.e. all octets are 0). This could lead to the situation that layer 2 is dis-(or re-)connecting/handshaking, etc. but the IP is still valid as the DHCP lease has not yet expired. That's probably why the "DHCP renew probably failed" only happens after a long time, when the lease is actually gone... but I'm still verifying this...
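
The gap described here can be illustrated with a small standalone C++ model. It is only a sketch of the reasoning, not the real check in ESPEasy: `LinkSnapshot` and both `needsReconnect...` functions are hypothetical names, and the association flag stands in for whatever layer 2 state the firmware can actually query.

```cpp
#include <array>
#include <cstdint>

// Hypothetical snapshot of the link state at the time of a periodic check.
struct LinkSnapshot {
    bool associated;             // layer 2: station is associated with the AP
    std::array<uint8_t, 4> ip;   // layer 3: current IPv4 address
};

// An address counts as valid here when at least one octet is non-zero.
bool hasValidIp(const LinkSnapshot& link) {
    for (uint8_t octet : link.ip) {
        if (octet != 0) {
            return true;
        }
    }
    return false;
}

// Check as described above: layer 2 is only examined once the IP is gone,
// so a lost association hides behind a DHCP lease that has not yet expired.
bool needsReconnectIpOnly(const LinkSnapshot& link) {
    if (hasValidIp(link)) {
        return false;
    }
    return !link.associated;
}

// Alternative sketch: treat a lost association as a problem on its own,
// independent of whether the IP address still looks valid.
bool needsReconnectBothLayers(const LinkSnapshot& link) {
    return !link.associated || !hasValidIp(link);
}
```

For a node that has lost its association but still holds an address such as 10.0.10.117, the first check reports nothing to do while the second asks for a reconnect.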

clumsy-stefan commented 5 years ago

Also there are wifiCheck(), WiFiConnected() and connectionCheckHandler(), which all do some kind of connection checking and call each other as well as resetWiFi() under certain conditions... especially connectionCheckHandler() seems only to be called when mqtt_reconnect_count > 10. So what happens in a non-MQTT environment?

PS: I'm just documenting my findings here while searching for the underlying WiFi troubles... So I'm happy for any thoughts on it, but they are not necessarily expected...

clumsy-stefan commented 5 years ago

There has to be some kind of mysterious race condition somewhere. When the AP initiates a reauth or rekeying, sometimes the node/core either does not get enough CPU time to complete the handshake or it gets interrupted while doing so, especially on nodes with low signal (and therefore the handshake taking much longer)... (see below).

These rekeyings/reauths seem to generate a disconnect event by the core, even though the core would do an auto renegotiation or reconnect (as I understand it). But then it seems this handshake process gets interrupted by the ESPEasy state machine as a manual reconnect is triggered... this process repeats itself and never ends (or in some cases generates WDTs)..

810114 : EVENT: WiFi#Disconnected
810146 : WIFI : Disconnected! Reason: '(16) Group key update timeout' Connected for 13 m 16 s
810977 : WIFI : Connecting clumsy_ap2 attempt #0
811081 : WIFI : Connection lost to: clumsy_ap2
813089 : EVENT: WiFi#Disconnected
813120 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2011 ms
814977 : EVENT: Clock#Time=Thu,08:55
821529 : WD   : Uptime 14 ConnectFailures 0 FreeMem 22000 WiFiStatus 0
831976 : WIFI : Connecting clumsy_ap2 attempt #1
832079 : WIFI : Connection lost to: clumsy_ap2
836831 : EVENT: WiFi#Disconnected
836863 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 4753 ms
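
One way to picture the race described above is a hold-off between the disconnect event and any manual reconnect, so that the core's own renegotiation gets a chance to finish first. The standalone C++ sketch below only illustrates that idea; it is not a change proposed in this thread and not ESPEasy code. `ReconnectGate` and the grace period are hypothetical, and timestamps are assumed to be millisecond counters like the ones in the logs.

```cpp
#include <cstdint>

// Hypothetical gate that delays a manual reconnect for a grace period after
// a disconnect event, leaving the core time to renegotiate on its own.
class ReconnectGate {
public:
    explicit ReconnectGate(uint32_t graceMs)
        : graceMs_(graceMs), lastDisconnectMs_(0), pending_(false) {}

    // Call from the disconnect event handler.
    void onDisconnect(uint32_t nowMs) {
        lastDisconnectMs_ = nowMs;
        pending_ = true;
    }

    // Call from the connect event handler; cancels any pending manual reconnect.
    void onConnect() { pending_ = false; }

    // True once the grace period has elapsed without a connect event.
    // Unsigned subtraction keeps this correct across counter wrap-around.
    bool shouldReconnect(uint32_t nowMs) const {
        return pending_ && (nowMs - lastDisconnectMs_) >= graceMs_;
    }

private:
    uint32_t graceMs_;
    uint32_t lastDisconnectMs_;
    bool pending_;
};
```

With a grace period of 3000 ms, a disconnect at 810114 would block a manual reconnect at 810977 (as seen in the log above) and allow one only from 813114 onwards, unless a connect event arrives first. Whether such a delay helps in practice, and what value is suitable, would need to be verified on real nodes.
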

TD-er commented 5 years ago

So maybe we should use the wifi events only to monitor the process, not take action ourselves? Changing this will render 2.3.0 and perhaps 2.4.0 builds unusable though.

clumsy-stefan commented 5 years ago

No, I wouldn't do that... In my branch I already made a couple of (not so intrusive) changes which seem to make these cases occur less often...

Just found another small thing: in tryConnectWiFi() the WiFi.status() check towards the end, after the WiFi.begin() call, starts to return WL_DISCONNECTED. According to the documentation this means "if module is not configured in station mode". I'm trying to find out why this happens, or whether it actually helps to put the ESP explicitly in AP mode before calling WiFi.begin().

So I still hope to find the "real" underlying issue why these stalls happen... If so (fingers crossed) I would suggest, that I do a PR with all changes for you (and others) to review....

TD-er commented 5 years ago

Please know that I have also seen lots and lots of situations where the state of WiFi.status() was not correct. So maybe the core libraries now have fixes and very likely I also messed up somewhere in all the attempts to get the WiFi to behave stably. That debugging was very hard to do, since I cannot reproduce these issues myself and had to act on reports by other users. Lately I have a module which is also behaving badly regarding WiFi, so that's my favorite WiFi test module. But it may also be an indication that the problem gets worse when some parts are close to being out of spec. For example the power supply, missing decoupling capacitors, (too) thin PCB traces, less shielding, etc.

clumsy-stefan commented 5 years ago

Sure, not that I'm ruling HW issues out. But I have ~40 nodes running, with all different kinds of power supplies, different boards, brands, etc... and at some point in time most of them face connectivity issues... especially the ones with weak WiFi coverage or lots of GPIOs in use..

And I currently do have a bit of time to do some debugging and I still find it very interesting and challenging ;) By now I even start to understand how the whole sequence of connecting, checking, disconnecting and so on works ;)

So if it's OK for you I'll keep digging deeper in these connectivity issues.... you're the boss though....

TD-er commented 5 years ago

Please continue digging :) I really want to be freed from all those disconnect issues which are next to impossible to reproduce. They already have taken way too long now and it would be really great if they are fixed.

0ki commented 5 years ago

I'm getting about 4-24 hours of uptime followed by 2-10 hours of downtime on average. During downtime the node continues to work, but there is no wifi connection.

The access point (MikroTik) shows:

18:16:15 wireless,info 80:xx:xx:xx:xx:xx@iotnet: connected, signal strength -62 
18:16:20 wireless,info 80:xx:xx:xx:xx:xx@iotnet: disconnected, unicast key exchange timeout 
18:17:31 wireless,info 80:xx:xx:xx:xx:xx@iotnet: connected, signal strength -60 
18:17:36 wireless,info 80:xx:xx:xx:xx:xx@iotnet: disconnected, unicast key exchange timeout 

ESP Easy Information
Build: 20103 - Mega
Libraries: ESP82xx Core 2_4_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3 PUYA support
GIT version: mega-20190110
Plugins: 7 [Normal] [Sonoff POW R1/R2]
Build time: Jan 10 2019 03:21:19
Binary filename: ESP_Easy_mega-20190110_hard_SONOFF_POW_4M.bin

0ki commented 5 years ago

This was not a problem on the older version (when I could still use channel 14 and have my hidden-SSID network).