letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.26k stars 2.21k forks source link

WLAN deauth reason 3 - random - seems to reboot unit #2203

Closed NorbertRoller closed 5 years ago

NorbertRoller commented 5 years ago

It happens randomly but within 20-60 Minutes regularly. Is not a hardware issue as it happens on several units. WLAN deauth and associated / authorized is between 1 Second and several minutes.

It can not be triggered or at least I have not found out how,

It did not happen in the June, 6th version.

Log of AP: 2018-12-27 19:55:52 Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

2018-12-27 19:55:51 Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

2018-12-27 19:55:51 Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

At the same time the log of espeasy shows a line with "failed to fetch": 5971946: Memtrace^0: lowest: 10832 sendContentBlocking-> 12216 sendContentBlocking-> 11544 sendContentBlocking-> 10872 sendContentBlock 5971947: 1: lowest: 10720 parseTemplate3-> 13264 ruleMatch-> 13200 ruleMatch2-> 13136 parseTemplate-> 13304 parseTemplate3-> 13264 rule 5971947: 2: lowest: 10872 PluginCall_s (0)-> 14416 PluginCall_s (0)-> 14416 PluginCall_s (0)-> 14416 PluginCall_s (0)-> 14416 PluginCal 5971948: WD : Uptime 100 ConnectFailures 0 FreeMem 15992 5972836: LoopStats: shortestLoop: 131 longestLoop: 901266 avgLoopDuration: 165.21 loopCounterMax: 229007 loopCounterLast: 172045 countFi 5972836: Scheduler stats: (called/tasks/max_length/idle%) 172045/2162/7/83.50 5973054: sendcontent free: 14880 chunk size:400 5973058: sendcontent free: 14208 chunk size:400 5973063: sendcontent free: 13536 chunk size:400 5973068: sendcontent free: 14424 chunk size:145 5973070: sendcontent free: 14424 chunk size:0 5975088: sendcontent free: 15096 chunk size:360 5975090: sendcontent free: 14424 chunk size:0 5976111: sendcontent free: 15096 chunk size:280 5976112: sendcontent free: 14424 chunk size:0 ### >> Failed to fetch << 21262: sendcontent free: 15648 chunk size:118 21263: sendcontent free: 14976 chunk size:0 22486: sendcontent free: 15648 chunk size:276 22488: sendcontent free: 14976 chunk size:0 23817: sendcontent free: 15648 chunk size:276 23819: sendcontent free: 14976 chunk size:0 25288: sendcontent free: 15648 chunk size:276 25290: sendcontent free: 14976 chunk size:0 26678: sendcontent free: 15648 chunk size:276 26680: sendcontent free: 14976 chunk size:0 27821: sendcontent free: 15648 chunk size:276 27823: sendcontent free: 14976 chunk size:0 29403: RuleDebug: 101: gpio,5,0 29403: RuleDebug: 101: endif 29404: RuleDebug: 000: endon 29407: RuleDebug: 100: taster1#taster1=0 do 29408: RuleDebug: 101: if [relay1#relay1]=0 29408: RuleDebug: 101: gpio,4,1 29409: RuleDebug: 101: else 29409: RuleDebug: 101: gpio,4,0 29410: RuleDebug: 101: endif 29410: RuleDebug: 000: endon 29413: RuleDebug: 100: system#boot do 29414: RuleDebug: 101: gpio,5,0 29415: RuleDebug: 101: gpio,4,0 29416: EVENT: Clock#Time=Thu,19:59 Processing time:21 milliSeconds

Information of the unit: Load: 16.20% (LC=5863) Free Mem: 15032 (13592 - sendWebPage) Free Stack: 3600 (1716 - parseTemplate) Boot: Manual reboot (1) Reset Reason: Hardware Watchdog Network Wifi: 802.11N (RSSI -65 dB) IP config: Static IP / subnet: 192.168.0.150 / 255.255.255.0 GW: 192.168.0.9 Client IP: 192.168.0.101 DNS: 8.8.8.8 / 0.0.0.0 Allowed IP Range: 192.168.0.0 - 192.168.0.255 STA MAC: 84:F3:EB:BB:20:A5 AP MAC: 86:F3:EB:BB:20:A5 SSID: KACR (5C:E2:8C:81:54:1D) Channel: 1 Connected: 50 m 51 s Last Disconnect Reason: (1) Unspecified Number reconnects: 0 Firmware Build:⋄ 20103 - Mega Libraries:⋄ ESP82xx Core 2_4_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3 PUYA support GIT version:⋄ mega-20181218 Plugins:⋄ 46 [Normal] Build Md5: 17c53d944edae329df8bbdaa2d76520 Md5 check: passed. Build time:⋄ Dec 18 2018 03:05:31 Binary filename:⋄ ESP_Easy_mega-20181218_normal_ESP8266_1024.bin System Status Syslog Log Level: None Serial Log Level: None Web Log Level: Info ESP board ESP Chip ID: 12263589 (0xBB20A5) ESP Chip Freq: 80 MHz ESP Board Name: PLATFORMIO_ESP01_1M Storage Flash Chip ID: Vendor: 0x68 Device: 0x4014 Flash Chip Real Size: 1024 kB Flash IDE Size: 1024 kB Flash IDE speed: 40 MHz Flash IDE mode: DOUT Flash Writes: 0 daily / 0 boot Sketch Size: 769 kB (104 kB free) SPIFFS Size: 110 kB (36 kB free)

Rule1 (only one) on Taster2#Taster2=0 do if [Relay2#Relay2]=0 gpio,5,1 else gpio,5,0 endif endon

on Taster1#Taster1=0 do if [Relay1#Relay1]=0 gpio,4,1 else gpio,4,0 endif endon

on System#Boot do gpio,5,0 // Prevent relay turning on during boot gpio,4,0 endon

TD-er commented 5 years ago

Do several units experience this wifi disconnect at the same time?

NorbertRoller commented 5 years ago

Most of them are connected to one AP and I have not noticed that others do it at the same time.

Von: Gijs Noorlander [mailto:notifications@github.com] Gesendet: Freitag, 28. Dezember 2018 09:49 An: letscontrolit/ESPEasy Cc: NorbertRoller; Author Betreff: Re: [letscontrolit/ESPEasy] WLAN deauth reason 3 - random - seems to reboot unit (#2203)

Do several units experience this wifi disconnect at the same time? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/2203#issuecomment-450319685 , or mute https://github.com/notifications/unsubscribe-auth/AomgS2x5HiRVSz4OL-COFs8F1rO0v4Hlks5u9dsMgaJpZM4Zjlbh the thread. https://github.com/notifications/beacon/AomgS4nm_dm7-MdxhOziFSJwnjIl59stks5u9dsMgaJpZM4Zjlbh.gif

NorbertRoller commented 5 years ago

Hi, I exported parts of the log of last night, when there was no access from any other unit, neither a button had been pushed. So the units had been left alone.

There are 2 units within 2m to the AP in the same room: 84:F3:EB:BB:20:A5 84:F3:EB:BB:20:1A

I highlighted the occasion when both deauthed last night. It is at the same time. But the ...:A5 is doing it more frequently. As I expected a NTP issue, I have it disabled it on this unit and left it on with ...:1A, but obviously it makes no difference.

71 2018-12-28 05:59:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

72 2018-12-28 05:59:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

73 2018-12-28 05:59:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

74 2018-12-28 04:36:29 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

75 2018-12-28 04:36:29 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

76 2018-12-28 04:36:29 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

77 2018-12-28 04:19:30 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

78 2018-12-28 04:19:29 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

79 2018-12-28 04:19:29 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

80 2018-12-28 04:02:47 Wireless LAN Station: 84:F3:EB:BB:20:1A has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

81 2018-12-28 04:02:46 Wireless LAN Station: 84:F3:EB:BB:20:1A has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

82 2018-12-28 04:02:46 Wireless LAN Station: 84:F3:EB:BB:20:1A has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

83 2018-12-28 04:02:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

84 2018-12-28 04:02:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

85 2018-12-28 04:02:46 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

86 2018-12-28 03:47:20 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

87 2018-12-28 03:47:20 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

88 2018-12-28 03:47:20 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

89 2018-12-28 03:46:06 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

90 2018-12-28 03:46:06 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

91 2018-12-28 03:46:06 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

Von: Gijs Noorlander [mailto:notifications@github.com] Gesendet: Freitag, 28. Dezember 2018 09:49

Do several units experience this wifi disconnect at the same time?

clumsy-stefan commented 5 years ago

Can confirm this:

1482231 : WD   : Uptime 25 ConnectFailures 66 FreeMem 20752 WiFiStatus 3
1510231 : SYS  : -81.00,20848.00,67.30,25.00
1510235 : EVENT: sysinfo#rssi=-81.00
1510275 : EVENT: sysinfo#mem=20848.00
1510316 : EVENT: sysinfo#load=67.30
1510356 : EVENT: sysinfo#uptime=25.00
1510446 : HTTP : C009 connection failed
1510571 : HTTP : C009 connection failed
1510696 : HTTP : C009 connection failed
1510699 : Reset WiFi.
1510701 :
Exception (9):
epc1=0x40105cd4 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000003 depc=0x00000000

resetWiFi() seems to generate an exception...

Probably related to the WiFi stability/layer 2 issues (#1987), as it seems, that when a auth expired or 4way handshake failed disconnect happens, the internal state of the WiFi statemachine is not updated. but I'm still verifying/debugging this...

TD-er commented 5 years ago

Could you also test with a core 2.5.0 beta build?

clumsy-stefan commented 5 years ago

This is core 2.5.0 beta 2.... ;) compiled with Arduino IDE on a Mac.

Booting version: SMY_2.68W Dec 28 2018 10:13:53 (ESP82xx Core 2_5_0_BETA2, NONOS SDK 3.0.0-dev(c0f7b44), LWIP: 2.1.2)
clumsy-stefan commented 5 years ago

PS: while debugging and trying to find out if the internal wifiStatusis updated correctly, I was wondering if line 492 in ESPEasy.ino is compiled correctly: if (wifiStatus != ESPEASY_WIFI_SERVICES_INITIALIZED || unprocessedWifiEvents()) { Is this interpreted LtR or is the || done before != (RtL)? As it seems it never gets in this condition after first succesful connection, even if WiFi is beeing disconnected later on...

TD-er commented 5 years ago

C++ compiler will execute that OR like this:

bool success = wifiStatus != ESPEASY_WIFI_SERVICES_INITIALIZED;
if (!success) { 
  if (unprocessedWifiEvents()) {
    success = true;
  }
}
if (success) {
  ...

And indeed, there is something wrong with handling wifi disconnects. Either something in my code, or something in the core libraries (or both)

zdzichu commented 5 years ago

@NorbertRoller could you try with older builds, particularly 2018-09-14 and 2018-09-10? For me, -10 is the last working correctly, and -14 is the first having troubles with wifi.

Barracuda09 commented 5 years ago

Some compilers will give an warning if you do not use brackets around the comparison in an if statement and using an OR (AND) operator at the same time.

So make it more prominent for the compiler what the intent is like so:


if ( ( wifiStatus != ESPEASY_WIFI_SERVICES_INITIALIZED ) || unprocessedWifiEvents()) {
clumsy-stefan commented 5 years ago

if ( ( wifiStatus != ESPEASY_WIFI_SERVICES_INITIALIZED ) || unprocessedWifiEvents()) {

yep, that's also how I changed it for my versions..

@TD-er unfortunately I could not yet find where it fails... it does not seem to be above line...

also the exception is raised reliably in the resetWiFi() function.... probably because of the ETS_UART_INTR_DISABLE() and ETS_UART_INTR_ENABLE() calls... it happens somewhere around there...

while googling a bit for wifi_station_disconnect() I found some issues and different implementations, one says that calling this function only removes the IP interface but not the AP connection.... so probably worth looking into this a bit deeper... at least the return value should probably be checked, if the disconnect was succesful...

NorbertRoller commented 5 years ago

Hi Tomasz & all, the electronic is hidden behind a wall switch. Since I cannot OTP the Firmware it is quite a hassle to take it apart and not been beaten up by the better half of mine J

Anyway, I will flash 09-10 in a minute and will let it run until tonight to report back.

Many thanks for your help.

Von: Tomasz Torcz [mailto:notifications@github.com] Gesendet: Freitag, 28. Dezember 2018 21:22 An: letscontrolit/ESPEasy Cc: NorbertRoller; Mention Betreff: Re: [letscontrolit/ESPEasy] WLAN deauth reason 3 - random - seems to reboot unit (#2203)

@NorbertRoller https://github.com/NorbertRoller could you try with older builds, particularly 2018-09-14 and 2018-09-10? For me, -10 is the last working correctly, and -14 is the first having troubles with wifi. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/2203#issuecomment-450421345 , or mute https://github.com/notifications/unsubscribe-auth/AomgS8kvXVxChNqZP14c4pXdiE7Fca1qks5u9n18gaJpZM4Zjlbh the thread. https://github.com/notifications/beacon/AomgS7uVLIKMhzzRnhO9o91RzqeqgWYsks5u9n18gaJpZM4Zjlbh.gif

clumsy-stefan commented 5 years ago

@TD-er I forgot to mention, that I included pr #2194 in my builds...

NorbertRoller commented 5 years ago

It has happened with the 09-10 as well. Took only 50 Minutes 3 2018-12-29 11:36:42 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

4 2018-12-29 11:36:42 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

5 2018-12-29 11:36:42 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

6 2018-12-29 11:35:48 Wireless LAN Station: 84:F3:EB:BB:20:A5 has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

7 2018-12-29 11:35:48 Wireless LAN Station: 84:F3:EB:BB:20:A5 has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

8 2018-12-29 11:35:48 Wireless LAN Station: 84:F3:EB:BB:20:A5 has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1. notice

21 2018-12-29 10:45:29 Wireless LAN Station: 84:F3:EB:BB:20:1A has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. notice

22 2018-12-29 10:45:29 Wireless LAN Station: 84:F3:EB:BB:20:1A has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1. no

@NorbertRoller https://github.com/NorbertRoller could you try with older builds, particularly 2018-09-14 and 2018-09-10? For me, -10 is the last working correctly, and -14 is the first having troubles with wifi.

chunter1 commented 5 years ago

Just wanna confirm that i experience exactly the same issues as described before with several of my units.

micheldejongh commented 5 years ago

Can confirm it happens here as well on two of my units (all Wemos D1 mini pro)

NorbertRoller commented 5 years ago

Hi, a short update after a few days with 09-10. I seems to have even more deauths/reconnects, but it seems not to reset the unit. So it is a more stable situation. The 18-12 seems to also lock down completely once in a while. Meaning it is not reachable via WLAN and doesn't react to push buttons / rules. Power On/Off solves the issue until next time.

I have converted some more units to 09-10.

PS: I will not be at the house from 2.-4.1. for further testing.

TD-er commented 5 years ago

So the de-auth is something specific to your setup and is not really an issue, unless you depend on quick reply via WiFi. (reconnect may take up-to 7-10 seconds in some configurations) But the crashing is more serious and I've seen that happen also on one of my test nodes here running a stress test for about a week now.

NorbertRoller commented 5 years ago

Hi, happy new year, at first. During the last 24 hours I have also seen a few resets with version 09-10. Much less then in 18-12, but nevertheless it happened and the log of the EspEasy doesn't show it. (Or I don't know enough on how to enable the correct log level.) I also noticed that one unit has more resets then all other together. The Hardware is identical. So it could be a combination of hardware and software (memory timing etc...) For testing it could help to reduce CPU frequency. Can this be done by a parameter?

TD-er commented 5 years ago

CPU frequency is set at compile time and I only know of setting a higher frequency of 160 MHz (set in platformio.ini file, or Arduino IDE) You may want to try that also, some users have reported better stability when using 160 MHz.

There is a number of things you could also try on a hardware level:

This is more of a guideline, but it may help RF stability. Make sure the area around the antenna is as free as possible from metals in a range of 15 mm around it. image If some metal is close to the antenna, it may cause an impedance mismatch, which will affect RX and TX performance.

Add a 10 uF capacitor as close as possible to the 3V3 line (to GND) of the ESP8266. (note the polarity of these capacitors) If you're using one of those boards like Wemos/NodeMCU, there should be one soldered on the board, yellow with "106" written on top of it. If you have a board without one, or with a different number written on top, please show a picture. This capacitor will help reduce ripple on the 3v3 line, which also improves RF receiving quality.

Depending on the power supply used, you may also want to add an additional capacitor to keep the voltage stable. The ESP8266 has some hefty current spikes, so you may need to add something like 470 uF capacitor to the main power line to the ESP (position doesn't matter really). It is best to combine several capacitor sizes, since they all help filter out a different part of the spectrum. Best practice is to add the smallest ones as close to the part requiring the stable voltage. (the length of the traces is key factor here)

Make sure no coil or magnetic field is close to the crystal oscillator.

Edit: Last but no least; You may also want to try to rotate the ESP. (or the accesspoint) to see if the reception will improve. Antennas have a polarity, which makes reception optimal if sender and receiver have their antennas in the same orientation.

clumsy-stefan commented 5 years ago

@TD-er hmmm... even though I completely agree with all you mention, I still think calling a function like resetWiFi()and the underlying wifi_station_disconnect() should never raise an exception! No matter on HW or RF performance... IMHO....

NorbertRoller commented 5 years ago

Hi, since I am working for a Taiwanese Networking company that is also very skilled in WLAN / WiFi, I am well aware of Antenna Technology as well as hardware design. I am using a design from luani.de that fits into the round European outlets and will sit behind a switch that has some metal parts. However the unit in question is in the same room like the AP and this is a premium business model with 1 Watt / 2.4GHz. So by far strong enough to support the unit with -70 to -74 dbi. However, I can try to add another cap next to the CPU, when I have to flash it, again. PS: Reboot and reconnect also happened, when the module wasn't hidden behind the switch. Also the unit was swapped against another one some days ago. So it is not an issue of a particular modul.

TD-er commented 5 years ago

@clumsy-stefan I agree with the exception part, which is also a problem we have to look into. But that's software, not hardware and if a particular node is experiencing a disconnect more often, then it sounds like a hardware problem. So it seems like we're having 2 separate issues here:

@NorbertRoller The AP used may be very good, but if the ESP is not behaving like it should, it will still not work. I've been reading a lot about the reference design of the ESP chips, antenna design, etc. Espressif themselves describe possible design guidelines with regard to RF issues which can be related to power supply and quality of the used crystal. So that's why I mention all those aspects too.

kischde commented 5 years ago

TD-er Maybe you remember the watchdog reboots I had, possible caused from the oled frame plugin. In the last week, i had a few times, that the esp was again not reachable any more. I had to power off/on. Cause i need that for light control (via IR) and i am on holiday at the moment, I had to disable all apart from the IR. It was not running longer than half a day, even less, when it was stressed (sending sensor data and many display on /off, cause the whole family was at home) Now it is running (and i reboot it evey night)

However the Timing log shows delays:

ESP Easy Mega: esp_Rollo

⌂⚙💬📌🔌✉🔧 Description Function #calls call/sec min (ms) Avg (ms) max (ms) P_35_Communication - IR Transmit ONCE_A_SECOND 51845 1.00 0.006 0.011 0.094 P_35_Communication - IR Transmit TEN_PER_SECOND 518433 10.00 0.006 0.011 0.053 P_35_Communication - IR Transmit WRITE 7 0.00 110.226 110.311 110.402 P_35_Communication - IR Transmit FIFTY_PER_SECOND 2592111 50.00 0.006 0.012 0.062 Load File 4 0.00 2.544 15.797 23.023 Loop 622971794 12015.86 0.071 0.009 1107.445 Plugin call 50 p/s 2592111 50.00 0.409 0.638 1.725 Plugin call 10 p/s 518433 10.00 0.403 0.603 1.467 Plugin call 10 p/s U 518433 10.00 0.606 0.820 1.156 Plugin call 1 p/s 51845 1.00 0.827 1.161 1107.054 setNewTimerAt() 3373222 65.06 0.059 0.120 0.223 hostByName() 16 0.00 0.148 26.016 101.348 WiFi.isConnected() 623176756 12019.81 0.003 0.000 0.108 Statistics
Start Period: 2019-01-01 00:25:01 Local Time: 2019-01-01 14:49:06 Time span: 51845.80 sec

With the old Esp Easy (RC147) it was running for a year without any issues

clumsy-stefan commented 5 years ago

including PR #2194 seems to triger the exception at wifi_station_disconnect() (more often) than without it... I'm currently trying a build wihtout that PR just cherrypicking single commits which are WiFi related (b99c9c880b6a1c210635a8ae8669d0940476473c / 1ab5667e99a5166410f422c3768194fb1e01529d).... we'll see how it goes...

EDIT: probably unrelated, but deepsleep is also not working anymore. the units always wake after about 10 sec. no matter what is configured as deepsleep time!

chunter1 commented 5 years ago

From the two modules that show the issue most frequently, one has got 12pcs DS18B20 sensors and the other only two gpios configured. In my oppinion, they couldn't be more different in terms of cpu utilization but show the same connection issue. What they have in common is the FHEM HTTP controller and the AP. However, there are two more ESP8266 connected to the same accress point which show the issue muuuuch less often. Since the modules are essential for my heating system, i use the following rule to sucessfully prevent them from geting stuck.

`On System#Boot=0 do timerSet,1,60 endon

On Rules#Timer=1 do timerSet,1,60 if %rssi% < 0 timerSet,2,610 endif endon

On Rules#Timer=2 do reboot endon`

clumsy-stefan commented 5 years ago

probably unrelated here, but using ESP.deepSleepInstant()instead of ESP.deepSleep()makes deep sleep working again. according to the documentation the only difference in the ESP.deepSleepInstant() call is that it does not wiat for WiFi shutdown...

so it could be that the 2.5.0 wifi libraries still have some flaws which could cause these strange WiFi issues again...

PS (EDIT): @TD-er the in thedeepSleep functions in Misc.ino variables probably should not be called delay as this can collide with the delay() function... at least if you want to use delay()within one of these functions. eg. to give the addLog(LOG_LEVEL_INFO, F("SLEEP: Powering down to deepsleep...")); line enough time to send it's payload, some delay() afterwards is needed...

NorbertRoller commented 5 years ago

I am using FHEM as well with this units. Might it be related to the FHEM daemon within EspEasy?

TD-er commented 5 years ago

You could try to set the controller settings for FHEM a bit lower, like 10 msec minimal delay between messages and a timeout of 100 msec (if on local LAN, see the average timings to get a good default of 2 - 3x the average)

clumsy-stefan commented 5 years ago

@NorbertRoller I'm also using ESPEasy with FHEM controller(uniquely) running ~40Nodes on 3 AP's. For me the situation improved when setting the timeout higher (2000ms) and reduced the queue size.

The issue seems to be less on the ESPEasy controller side but more on the FHEM server side. As FHEM is basically single-threaded it can take quite a bit to accept the connection and handle the updates. so the more units you have the moreconnection losses you get.

Also improving the situation is if you can set the layer 2 ack timeout on your router to something higher than the default... (eg. on Mikrotik's the "distance")...

hope that helps...

clumsy-stefan commented 5 years ago

had now one of the serial attached debug units fail (after 500min.) Exception while trying to connect to AP..

29982838 : WD   : Uptime 500 ConnectFailures 36 FreeMem 19648 WiFiStatus 3
30010299 : WIFI : DHCP renew probably failed
30010300 : Reset WiFi.
30010303 : WIFI : Connecting clumsy_ap2 attempt #0
30010308 : EVENT: Clock#Time=Thu,00:44
30010369 : EVENT: WiFi#Disconnected
30010427 : WIFI : Disconnected! Reason: '(1) Unspecified'
30010831 : SYS  : -76.00,19560.00,35.00,500.00
30010835 : EVENT: sysinfo#rssi=-76.00
30010893 : EVENT: sysinfo#mem=19560.00
30010952 : EVENT: sysinfo#load=35.00
30011010 : EVENT: sysinfo#uptime=500.00
30012951 : WD   : Uptime 500 ConnectFailures 36 FreeMem 19552 WiFiStatus 0
30030423 : WIFI : Connecting clumsy_ap2 attempt #1
30034700 :
Exception (9):
epc1=0x40105cd4 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000003 depc=0x00000000
kischde commented 5 years ago

TD-er Maybe you remember the watchdog reboots I had, possible caused from the oled frame plugin. In the last week, i had a few times, that the esp was again not reachable any more. I had to power off/on. Cause i need that for light control (via IR) and i am on holiday at the moment, I had to disable all apart from the IR. It was not running longer than half a day, even less, when it was stressed (sending sensor data and many display on /off, cause the whole family was at home) Now it is running (and i reboot it evey night)

Back home, activating all, the unit crashes again. I´ve integrated the rule from chunter1, which does not work... As the unit crashed (the first time) with display on, I see the display and esp is complete freezed. So i have to switch off/on the houses fuse to re-activate it again...

NorbertRoller commented 5 years ago

Quick Update: I converted 3 units to the 20190106 Version and also added a bigger cap for the 3,3V supply. The cap was just in case and not that I had seen issues with the power supply.

1 unit didn't report back into FHEM after running for +/-12h. However I could still talk to the unit and switch the relay.

While checking the log of FHEM I detected another unit to have issues, too.

2019.01.07 08:59:55 2: ESPEasy espBridge: connect to http://192.168.0.153:80 timed out [set ESPEasy_ESP_Mobil1_Relay1 gpio 4,0] 2019.01.07 08:59:55 2: ESPEasy ESPEasy_ESP_Mobil1_Relay1: WARNING: connect to http://192.168.0.153:80 timed out 2019.01.07 08:59:58 2: ESPEasy ESPEasy_ESP_Mobil1_Relay1: RESOLVED: connect to http://192.168.0.153:80 timed out

The log of the AP, that connect all 3 units, has only shown one deauth. 2019-01-07 05:01:13 notice Wireless LAN Station: 84:F3:EB:BB:20:1A has authorized on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1.

2019-01-07 05:01:13 notice Wireless LAN Station: 84:F3:EB:BB:20:1A has associated on Channel: 1, SSID: KACR, 2.4GHz. Interface: wlan-1-1.

2019-01-07 05:01:13 notice Wireless LAN Station: 84:F3:EB:BB:20:1A has deauth by STA Logout on Channel: 1, SSID: KACR, 2.4GHz. reason 3, Interface: wlan-1-1.

TD-er commented 5 years ago

The problem with ESP82xx WiFi is that if it is too busy to address the wifi from time to time, it will miss packets, or at least it seems it may reply to the other side (TCP ack) but the data is no longer in the buffers. So then it will wait for an acknowledgement which was already sent and thus will never arrive.

The connection management should be looked at. Can you also try a core 2.5.0 (beta2) build?

AxelMilb commented 5 years ago

Hello I recently found this because I have such devices. Can it be the problem? https://www.facebook.com/photo.php?fbid=10217351035990428&set=gm.1896712807094741&type=3&theater&ifg=1

( Firmware release 1.4.4 the stable wi-fi fix for Shelly1 and Shelly2:

Change log: --- Wi-Fi fix: We implement the fix which detect that ESP based devices lost Wi-Fi connectivity. In 30 seconds we reboot the network stack. There is no another way, issue is coming from ESP SDK.)

TD-er commented 5 years ago

Do you have a link to their code?

AxelMilb commented 5 years ago

No

kischde commented 5 years ago

Today I had a new issue: The unit was not reachable via WLAN anymore (no ping/access to esp website/receiving commands/etc.) Luckily the display was on at this moment: time was shown correct, WIFI strenght signal was 100%. The display was jumping to every second page (somehow like it is very busy and can not display every single page). I was waiting around ten minutes, nothing changed. I was checking the router, there was the esp showed connected with 100% signal. So I switched off the WLAN router and the esp rebooted cause of the reboot routine mentioned above in the post of chunter1

Now it is working again

giig1967g commented 5 years ago

It’s not a new issue. I have the same symptoms , but only with router mikrotik.

chunter1 commented 5 years ago

I can confirm that when i use a mikrotik ap, it is happening more frequently than with a wrt54gl. However it‘s happening with both aps.

kischde commented 5 years ago

It’s not a new issue. I have the same symptoms , but only with router mikrotik.

For mine it was a new behavior, because before it crashed complete, so i had to power it off/on

TD-er commented 5 years ago

@kischde We added some more calls to delay(0) so it possibly all timeouts due to this unnoticed disconnect now fall within the timeout used by the Watchdog.

giig1967g commented 5 years ago

I can try and reproduce very easily. In my setup I have a mikrotik a Linksys both working at the same time. When I set the wifi of the ESP to the mikrotik within 24 hours it freezes as reported by @kischde . As soon as I set the wifi to Linksys the unit stays on for months without issues.

So for me the solution was to keep the mikrotik out of the equation and everything is fine.

clumsy-stefan commented 5 years ago

@giig1967g what "distance" setting do you have in your mikrotik (wifi-interface -> advanced)?

giig1967g commented 5 years ago

@clumsy-stefan: I tried several. Now I have "indoors"

clumsy-stefan commented 5 years ago

@giig1967g "indoors" seems to be around 10... if you go up to 30 or so it get a little bit more stable.... or increase the group-key-timeout to 1h or more... but it still happens....

giig1967g commented 5 years ago

@clumsy-stefan I have "disconnect timeout" = 00:00:15

clumsy-stefan commented 5 years ago

@giig1967g that's a different setting... if you go to wireless -> security profiles and select the one you're using, there i the "group key update" parameter... which is default at 5min. if you increase that one it get's also more stable (as the disconnects always happen at the rekeeying).

giig1967g commented 5 years ago

@clumsy-stefan I have "7d 00:00:00" in Group Key Update !

clumsy-stefan commented 5 years ago

@giig1967g oh, wow... ok then... that's why you get up to 24h uptime I guess... I'm always belwo 3h... but intensionally force it to do some debugging what exactly happens...