letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.27k stars 2.21k forks source link

[WiFi] read all lines from slow servers #1720

Closed TD-er closed 4 years ago

TD-er commented 6 years ago

As suggested by @clumsy-stefan here: https://github.com/letscontrolit/ESPEasy/pull/1669#issuecomment-419620199

The suggested fix is based on this PR in the ESP8266 Arduino examples. (PR 5113)

I would like to know if someone who is experiencing issues with slow servers, if they see this line in the logs:

"Timeout while reading input data!"

If so, please report it, so I can make the timeout setting configurable. Later I will make the function safeReadStringUntil use a callback construction and let it operate async. For now the function is blocking, so it may cause all kind of issues.

sassod commented 6 years ago

I saw a lot of these when thingspeak plugin was active and mandatory 15s interval was applied, the data was sent successfully and after that the warning appeared. But next data sent was again successful and again the warning appeared.

On Sat, Sep 8, 2018, 15:19 Gijs Noorlander notifications@github.com wrote:

As suggested by @clumsy-stefan https://github.com/clumsy-stefan here: #1669 (comment) https://github.com/letscontrolit/ESPEasy/pull/1669#issuecomment-419620199

The suggested fix is based on this PR https://github.com/esp8266/Arduino/commit/4fdff072e836a03df5ccd3ddf98ff8ba39b29db8 in the ESP8266 Arduino examples. (PR 5113 https://github.com/esp8266/Arduino/pull/5113)

I would like to know if someone who is experiencing issues with slow servers, if they see this line in the logs:

"Timeout while reading input data!"

If so, please report it, so I can make the timeout setting configurable. Later I will make the function safeReadStringUntil use a callback construction and let it operate async. For now the function is blocking, so it may cause all kind of issues.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/1720, or mute the thread https://github.com/notifications/unsubscribe-auth/ADPy_58jLgO0CHcqDEj1yOJloFt-MhPlks5uY8PdgaJpZM4Wf3kW .

TD-er commented 6 years ago

@sassod Was this with the controller queue included? (PR merged yesterday, but I made a lot of test builds.)

sassod commented 6 years ago

I haven't updated all the nodes some are running builds dated May or April but some have been updated to early September releases, I will need to check if the most recent releases are still producing this info. Unfortunately I'm not gonna be able to check because almost all have deep sleep enabled and I'm away from home till 16th of September. Maybe someone else can check the status if thingspeak plugin is sending multiple data one after ankther with 15s delay

On Sat, Sep 8, 2018, 15:30 Gijs Noorlander notifications@github.com wrote:

@sassod https://github.com/sassod Was this with the controller queue included? (PR merged yesterday, but I made a lot of test builds.)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/1720#issuecomment-419642250, or mute the thread https://github.com/notifications/unsubscribe-auth/ADPy_2_Dus7A50_B74rlxa-oXQdOzvVPks5uY8aCgaJpZM4Wf3kW .

TD-er commented 6 years ago

The change introduced with yesterdays merge is meant to be used for controllers like Thingspeak. The controllers themselves manage their own queue and only send new values when the minimal interval has passed since last send. So you don't have to use the Message Delay value anymore. (that one blocks the whole ESP)

Only thing I haven't checked is if the result of millis() increases when in deep sleep. (this just occurred to me)

clumsy-stefan commented 6 years ago

I only have "HTTP : C009 connection failed" in the logs (on firsth sight) and no TImeouts as described... but probably the same root cause?!

TD-er commented 6 years ago

@clumsy-stefan Looking at the code, that error will occur when

As far as I can see, the timeout cannot be set, but is 5000 msec for WiFiClient.

clumsy-stefan commented 6 years ago

I've upgraded my server to a newer and faster HW/SW.... interestingly since then the units run without issues and reboots.... it really seems, that when the server is really slow or has tcp/network problems the client can get into some strange state... eg. having to many parallel open sessions or waiting too long for a result...

clumsy-stefan commented 6 years ago

quick update, all units runing with latest commit from yesterday (self compiled):

nb: now I also do get the timeouts in the logs ... I only see reboots, when the server is not running/answering... other than this they seem to work fine..

Name / date / UPTIME (over 30 hours, most of them).... sonoff_basic_06_sysinfo 2018-09-12 08:20:08 2125 sonoff_basic_08_sysinfo 2018-09-12 08:20:21 2125 sonoff_dual_01_sysinfo 2018-09-09 16:37:51 1594 sonoff_pow_01_sysinfo 2018-09-12 08:20:17 1364 sonoff_pow_02_sysinfo 2018-09-12 08:20:23 2126 wemos_mini_01_sysinfo 2018-09-12 04:24:45 1896 wemos_mini_02_sysinfo 2018-09-12 08:19:59 2131 wemos_mini_03_sysinfo 2018-09-12 08:20:14 2131 wemos_mini_04_sysinfo 2018-09-12 08:20:07 1991 wemos_mini_05_sysinfo 2018-09-11 19:40:42 1371 wemos_mini_06_sysinfo 2018-09-12 08:20:07 2130 wemos_mini_07_sysinfo 2018-09-12 08:20:19 1585 wemos_mini_08_sysinfo 2018-09-12 08:20:16 2129 wemos_mini_09_sysinfo 2018-09-12 08:20:30 2129 wemos_mini_10_sysinfo 2018-09-12 08:19:55 2128 wemos_mini_11_sysinfo 2018-09-12 08:20:14 2128 wemos_mini_12_sysinfo 2018-09-12 08:20:31 2128 wemos_mini_13_sysinfo 2018-09-12 08:19:49 2127 wemos_mini_14_sysinfo 2018-09-12 08:20:03 2127 wemos_mini_15_sysinfo 2018-09-12 08:20:15 2127 wemos_mini_16_sysinfo 2018-09-12 08:20:26 2127 wemos_pro_01_sysinfo 2018-09-12 08:20:14 2126

clumsy-stefan commented 6 years ago

Another quick update... As per list below lots of nodes have >5000min uptime!! great... As I had to shutdown and work on the server it seems some of the nodes (quite randomly) rebooted while others survived without issues...

However, for my part the current version(s) seem to be really stable!! appreciate it!! Good work!!

Name / date / UPTIME sonoff_basic_02_sysinfo 2018-09-14 05:21:41 27 sonoff_basic_03_sysinfo 2018-09-14 10:27:28 1665 sonoff_basic_06_sysinfo 2018-09-14 10:27:06 5131 sonoff_basic_08_sysinfo 2018-09-14 10:27:19 5132 sonoff_pow_01_sysinfo 2018-09-14 10:27:15 4371 sonoff_pow_02_sysinfo 2018-09-14 10:27:20 5133 wemos_mini_01_sysinfo 2018-09-14 10:26:49 4528 wemos_mini_02_sysinfo 2018-09-14 10:27:01 5137 wemos_mini_03_sysinfo 2018-09-14 10:27:29 2745 wemos_mini_04_sysinfo 2018-09-14 10:27:08 4997 wemos_mini_05_sysinfo 2018-09-14 10:26:47 3933 wemos_mini_06_sysinfo 2018-09-14 10:27:05 5136 wemos_mini_07_sysinfo 2018-09-14 10:27:11 760 wemos_mini_08_sysinfo 2018-09-14 10:27:12 5136 wemos_mini_09_sysinfo 2018-09-14 10:27:27 5135 wemos_mini_10_sysinfo 2018-09-14 10:27:01 793 wemos_mini_11_sysinfo 2018-09-14 10:27:20 5135 wemos_mini_12_sysinfo 2018-09-14 10:26:37 5133 wemos_mini_13_sysinfo 2018-09-14 10:26:57 5134 wemos_mini_14_sysinfo 2018-09-14 10:27:11 5079 wemos_mini_15_sysinfo 2018-09-14 10:27:19 5134 wemos_mini_16_sysinfo 2018-09-14 10:27:23 5134 wemos_pro_01_sysinfo 2018-09-14 10:26:57 1994

TD-er commented 6 years ago

OK, then for now, I will close it

giig1967g commented 6 years ago

ok to close it, but still the fact that some units rebooted just because the server was shutdown needs further investigation, I think.

clumsy-stefan commented 6 years ago

problem is, since updating my server it seems to be fast enough so the "slow response" issue does not occur anymore ;) so I'm having troubles to test it..

giig1967g commented 6 years ago

but if you shutdown the server some units do reboot?

TD-er commented 6 years ago

And what is the reboot reason then? Watchdog?

clumsy-stefan commented 6 years ago

@giig1967g: yes, true, but can't be sure it's because of that @TD-er: as an example units 7 & 10: both of them: Boot Manual reboot (2) Reset Reason Hardware Watchdog Free Mem 16504 (13504 - addToLog) (Unit 7) Free Mem 13368 (5800 - LoadTaskSettings) (Unit 10)

I think it's a difference if the server doesn't answer at all or if it answer's opens the socket and is too slow or even doesn't respond at all... I guess these are different situations from the unit's pov...

it's easier to deal with if the server is not reachable at all, than having open sockets to deal with..

giig1967g commented 6 years ago

I am asking this because I also was having serious issues that stopped when I swapped my router (from Mikrotik to Lynksys). With Mikrotik I was also getting some reboots (HW watchdog) if the router rebooted or even if just the wifi on the router was set to off. With Linksys this does not happen. So I just wanted to highlight that the fact that for me and for @clumsy-stefan the problem disappeared, is not because the root cause has been identified, but because we changed hardware. Many other could face this problem until the root cause is found.

What do you think?

clumsy-stefan commented 6 years ago

had a really strange issue today: after nearly all units ran for over 6'000Min. I updated all of them with the newst commit this morning. After that they all ran fine for about 4 hours until in the afternoon I had to fix some things on the server and shut it down for about an hour... after it restartet all units were "absent". Looking at the log (via UDP syslog or webpage) they all said "Timeout while reading input data!"... even after a (software) reboot, the problem persisted. I finally solved it by going to the controller tab editing the controller and just pressed "save"... then the units rebooted once again and all ran fine after that.... Anyone any clue what happened here?

TD-er commented 4 years ago

These timeout patches are already present in the code for a while now. Core 2.6.0 also has a lot of fixes related to proper handling of timeouts.

So I will close this one for now. Please create a new issue and refer to this one if it is still an issue.