vinteo / hass-opensprinkler

OpenSprinkler Integration for Home Assistant

Logbook full of "unavailable" entries; log shows "Timeout fetching resource status" #155

Open · eddyg opened this issue 3 years ago

eddyg commented 3 years ago

First, let me say that I was very excited to find this Home Assistant integration!

A couple pieces of background info:

I am running OS hardware version 2.3 with firmware 2.1.7. (I had previously tried upgrading to 2.1.9(7), but ran into weird issues where, after a few minutes, API calls would start returning "result:2" (unauthorized), which required me to run with the "Ignore Password" option. I also had multiple occurrences of the clock doing a "time warp" (even though I have NTP sync enabled), which resulted in a watering program running twice on the same day! I eventually downgraded back to 2.1.7 and things are once again stable. I opened a ticket with OpenSprinkler about the weirdness I saw after upgrading to 2.1.9(7), but they had not heard of problems like what I described.)

I recently found this integration and successfully added it to my Home Assistant config. Everything seemed to be working great, but then I noticed my Logbook was full of "OpenSprinkler became unavailable" followed 5 seconds later by "OpenSprinkler turned [on|off]" entries.

I checked my Home Assistant log, and it's full of messages like this:

2021-07-16 02:45:55 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 02:46:50 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 02:48:00 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 02:48:20 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 02:49:10 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 05:01:10 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 06:49:45 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 06:55:50 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 06:57:05 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 06:57:35 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data
2021-07-16 06:58:30 ERROR (MainThread) [custom_components.opensprinkler] Timeout fetching OpenSprinkler resource status data

Hoping to reproduce the problem, I've been running this loop (the pw hash is for the default password), polling the "all" endpoint every 5 seconds:

while true; do date; curl -v 'http://sprinkler/ja?pw=a6d82bced638de3def1e9bbb4983225c'; echo; sleep 5; done

but have not seen any failures.

Any idea what might be causing the timeouts in HA? My first thought is that it could be related to needing to supply the controller's MAC address when you're not running 2.1.9(4) or later, so maybe it's ARP-related. (I have not yet looked at the code to see why this is a requirement...)

eddyg commented 3 years ago

OK, I've now looked through the code and the MAC address appears to only be used to set a unique ID for HA, so my theory about it being an ARP issue is a non-starter.

vinteo commented 3 years ago

Unfortunately, this is "normal". I have looked into it several times and OS just does not consistently respond to all calls. I get it too, about 3-4 times a day. Since it is sporadic, it doesn't really affect the operation. The amount you get also seems to depend on how good the connection between the two devices is.

One thing I might look at is to downgrade the log message to a warning instead of an error.
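
Something along these lines (a sketch; the actual logging call in the integration may differ):

import logging

_LOGGER = logging.getLogger(__name__)

# instead of flagging every missed poll as an error...
# _LOGGER.error("Timeout fetching OpenSprinkler resource status data")
# ...log it as a warning, since the next poll usually succeeds:
_LOGGER.warning("Timeout fetching OpenSprinkler resource status data")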

Also note that I have not tried any firmware beyond 2.1.9(4). I will have to find some time to upgrade and test.

eddyg commented 3 years ago

Thanks for the reply, Vincent!

My OS is wired directly into my switch (as is my HA server), so I don't think it's the network. I had considered that the OS hardware is only 10Mb/s half-duplex, so I reset the port statistics a couple days ago to see how quickly errors accumulated. But the stats don't indicate that many collisions or errors as a percentage of packets sent/received:

  Half-duplex, 10Mb/s, media type is 10/100/1000BaseTX
  Last clearing of "show interface" counters 2d13h
  5 minute input rate 3000 bits/sec, 1 packets/sec
  5 minute output rate 8000 bits/sec, 5 packets/sec
     361124 packets input, 88564443 bytes, 0 no buffer
     Received 3 broadcasts (0 multicasts)
     0 runts, 0 giants, 0 throttles
     352 input errors, 352 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 0 multicast, 0 pause input
     0 input packets with dribble condition detected
     1094625 packets output, 214785694 bytes, 0 underruns
     183 output errors, 112 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out

If this only happened "a few times a day" for me, I'd probably be OK ignoring it. But, looking at the last 61h (= 2d13h, the same period since I cleared the counters on the switch port) of my HA logs:

❯ docker logs home-assistant --since 61h 2>&1 | grep '^2021-07-14.*Timeout fetching OpenSprinkler' | wc -l
37
❯ docker logs home-assistant --since 61h 2>&1 | grep '^2021-07-15.*Timeout fetching OpenSprinkler' | wc -l
109
❯ docker logs home-assistant --since 61h 2>&1 | grep '^2021-07-16.*Timeout fetching OpenSprinkler' | wc -l
172

As you can see, it happens a lot (and today's not even over yet). Of course, each one of those timeouts corresponds to a ton of Logbook entries, and it makes my History graphs look bad (as you know, this entity is only one of many associated with OS). Here is the last hour, as an example:

[screenshot: History graph, Screen Shot 2021-07-16 at 10 55 44 PM]

I'm very interested in trying to at least improve the situation, whether that's through retries, caching responses, etc.

The first thing I wonder: would reducing the status query interval from every 5s to every 30s make a difference in how often a timeout occurs? Is this easily configurable to test?

vinteo commented 3 years ago

changing the interval should be as easy as changing this line:

https://github.com/vinteo/hass-opensprinkler/blob/92f8cb82daf77cde6cfeb3f9b55c6d91270ef83c/custom_components/opensprinkler/const.py#L13

However, this will mean all switches, sensors, etc. take much longer to update, which might not be the best experience.
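
For example (a sketch; check the linked line for the actual constant and its current value):

# custom_components/opensprinkler/const.py
DEFAULT_SCAN_INTERVAL = 30  # seconds between polls; the shipped default is 5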

Another thing you could take a look at is the WIP MQTT branch, to see if that works well for you. I haven't looked at it in a long time, so it might need a bit of updating if you are comfortable with code (which it seems like you are). This would make it a push rather than a poll system. There is a PR for this: https://github.com/vinteo/hass-opensprinkler/pull/111

You will need to have MQTT set up with HASS and also configure it in the settings in OpenSprinkler. I think the minimum firmware is 2.1.9(4).
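
If you want to sanity-check that the controller is actually publishing before trying the branch, something like this should do it (a sketch using paho-mqtt 1.x; the broker hostname and the "opensprinkler/#" topic filter are assumptions, so match them to your MQTT settings):

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # subscribe to everything under the assumed OpenSprinkler topic prefix
    client.subscribe("opensprinkler/#")

def on_message(client, userdata, msg):
    print(msg.topic, msg.payload.decode(errors="replace"))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.local", 1883)  # hypothetical broker hostname
client.loop_forever()

If messages show up here but HASS still misses updates, the problem would be on the HASS side rather than the controller.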

eddyg commented 3 years ago

OK, I will look into this a bit more next week. Thanks again!

I’m reluctant to upgrade the firmware, but maybe I can try 2.1.9(4) and see if it behaves any differently than (7).

Out of curiosity, what hardware version do you have? Wondering if that’s a factor…

vinteo commented 3 years ago

I am on hardware 3.0

eddyg commented 3 years ago

I wonder how much difference hardware v3.0 makes...


FWIW, I changed DEFAULT_SCAN_INTERVAL to 30s on July 17th and restarted HA. It seems to have made a difference:

❯ for d in {17..26}; { print -n "$d: "; docker logs home-assistant --since 250h 2>&1 | grep "2021-07-$d.*Timeout fetching OpenSprinkler" | wc -l }
17: 1
18: 1
19: 1
20: 13
21: 1
22: 2
23: 3
24: 5
25: 20
26: 27

It looked really good for a while there, but the last couple of days it has spiked up again (even though I haven't made any changes). I will let this run for another few days, and then try changing the scan interval to 60s to see how that affects things.

jandado commented 2 years ago

I have this issue as well. The frequency of the timeouts seemed suspiciously regular, so I enabled debug logging and tried to find out more. It looked like every 25th request timed out with the defaults (5s scan interval, 10s timeout). Like clockwork.

Motivated by @eddyg, I tried querying the sprinkler using curl instead, to see if I could reproduce this behavior. It turns out I can: it is important to add the timeout option ("-m 10" to mimic the integration). I then played with various request frequencies and timeouts. I have concluded my OpenSprinkler fails to respond in a timely fashion for ~16s every ~125s. It replies to ping requests, accepts a connection, and accepts the request, but it does not reply for the "blackout" period of roughly 16s.

What is required is not changing the scan interval, but the TIMEOUT in __init__.py. The OpenSprinkler integration is not at fault here, but changing the timeout does resolve the frequent unavailability of entities, as well as the huge number of entries generated in Home Assistant because of the state changes.
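
The change I made, roughly (a sketch; the exact line in __init__.py may differ):

# custom_components/opensprinkler/__init__.py
TIMEOUT = 30  # was 10; comfortably longer than the ~16s blackout windows I observed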

I hope this helps.

eddyg commented 2 years ago

Wow, that's some very interesting data, @jandado! Thanks for following up here! 👍

I'm going to try changing the TIMEOUT value in __init__.py from 10 to 30 right now and see what happens.

Out of curiosity, which hardware version are you using and which firmware? (I have hardware version 2.3, and continue to run the older 2.1.7 release.) I've noticed that my LCD backlight no longer "dims" even though I have the "Idle Brightness" setting set quite low... I know that used to work before I experimented with running 2.1.9(7)... (I see now that 2.1.9(9) is available...)

vinteo commented 2 years ago

Let me know how you guys go after running it for a while; if it works, I will add the change to the integration.

eddyg commented 2 years ago

I set TIMEOUT = 30 on Apr 6, but it hasn't helped my situation. 😐 Grepping for "Timeout fetching OpenSprinkler" in my logs, it got much worse recently; here are the daily counts for April 4-10. (But remember, I'm running OS 2.3 hardware with 2.1.7 firmware...)

4: 17
5: 18
6: 18
7: 18
8: 288
9: 963
10: 567

I plan to try the 2.1.9(9) firmware and see if it makes a difference since the release notes mention "hopefully improving wired Ethernet reliability", but I had issues last time I upgraded from 2.1.7 where API calls would start returning "result:2" (unauthorized)...

jandado commented 2 years ago

I have HW version 3.1 - AC, firmware 2.1.9(9).

I have had zero failures since the modification (that's almost 5 days now).

@eddyg, I would suggest you try the following (assuming the token is still valid):

while true; do curl 'http://sprinkler/ja?pw=a6d82bced638de3def1e9bbb4983225c' --output /dev/null -m 1 --silent; echo -n $?; sleep 1; done

It echoes the curl exit code and times out in 1s. The reason for the short timeout is to try to identify whether the sprinkler times out randomly or in larger "windows" like mine. It outputs 0 for each success and 28 for a timeout. After each call it sleeps for only 1 second. So the double-digit exit code (28) represents ~2 seconds (1s timeout, 1s sleep), while a success (0) represents ~1 second (very short request time, 1s sleep). Thus, each character represents ~1s.

This is what my output looks like after running it for something over 4 minutes:

00000000000000000000000000000000000000000000000000000000000
00000000000028282828282828280000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000
00000000000000002828282828282828000000000000000000000000000
00000000000000000000000000000

@vinteo, I would say the results are, at this point, inconclusive. I can always make the modification locally if it only works for me. Let's see if Eddy is able to find a pattern. I find it odd that the change initially did nothing for him (on the 6th & 7th), only to get much worse later (8th-10th). It seems like there is something else at play.

Cheers, J.

jandado commented 2 years ago

I have an update: I was able to identify the source of my problem. I had long forgotten that I had MQTT configured on my sprinkler, from before I had this integration installed. But I have renamed the host running the MQTT service since then.

Fixing the hostname in the sprinkler's MQTT configuration resolved the "blackouts". The name lookup was the culprit: I assume there must have been multiple attempts by the sprinkler before it gave up on each request, but it blocked replies to the HTTP requests for the duration.

@vinteo, I no longer need the timeout increased.

@eddyg, what's the chance of something similar happening in your environment? I believe your FW does not yet support MQTT, but if you use the weather service, IFTTT integration, or NTP, perhaps any of them could be the source of what you're experiencing, if they have the potential to block the reply to the integration's request and can occasionally take longer than expected.

Just my last two cents. The issue is now resolved for my specific case.

travisghansen commented 2 years ago

Wow. That’s some amazing debugging work!

cybergrimes commented 1 year ago

> Unfortunately, this is "normal". I have looked into it several times and OS just does not consistently respond to all calls. I get it too, about 3-4 times a day. Since it is sporadic, it doesn't really affect the operation. The amount you get also seems to depend on how good the connection between the two devices is.
>
> One thing I might look at is to downgrade the log message to a warning instead of an error.
>
> Also note that I have not tried any firmware beyond 2.1.9(4). I will have to find some time to upgrade and test.

If the log message were downgraded to a warning, would that stop it from showing up in the HA system logs?