livysdad27 / tempestWS

Weatherflow Tempest WebSocket Driver
GNU General Public License v3.0

Driver crashes when internet is lost #10

Closed vogelnr closed 7 months ago

vogelnr commented 1 year ago

I figured I'd make a new issue, as I can directly correlate the driver crashing with my security gateway flaking out last night and rebooting itself (UniFi USG, which takes about 15-20 minutes to recover after a reboot). The timing lines up with my weewx logs generating the timeout error and crashing. So having some mechanism that can recover or restart the service if there is an internet outage would be beneficial IMO. I didn't notice until I looked at my weewx page and saw it hadn't updated this afternoon. Anyway, I thought it would be useful to know that this appears to occur whenever the websocket isn't reachable. A stop/restart of the weewx service worked fine to come back online.

Mar 5 23:20:02 wx weewx[1593] ERROR user.tempestWS: Caught a <class 'websocket._exceptions.WebSocketTimeoutException'>, attempting to reconnect! Try 0
Mar 5 23:21:09 wx weewx[1593] INFO weewx.engine: Main loop exiting. Shutting engine down.
Mar 5 23:21:09 wx weewx[1593] INFO weewx.engine: Shutting down StdReport thread
Mar 5 23:21:09 wx weewx[1593] INFO user.tempestWS: Stopping messages and closing websocket
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: Caught unrecoverable exception:
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: [Errno -3] Temporary failure in name resolution
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: Traceback (most recent call last):
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_socket.py", line 108, in recv
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: bytes_ = _recv()
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_socket.py", line 87, in _recv
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: return sock.recv(bufsize)
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: File "/usr/lib/python3.9/ssl.py", line 1226, in recv
Mar 5 23:21:09 wx weewx[1593] CRITICAL main: return self.read(buflen)

livysdad27 commented 1 year ago

Thanks for the report and it does sound like you have the root cause nailed down. Based on the logs the websocket died during the reconnect attempt because it couldn't get outside of your home network. This means it's not going to get past try 0 I suppose so let me ponder that a bit. Alternately I think you can tell your init.d to attempt restarts. If I have an article handy I'll drop it here. Between now and then I'll think about how to refactor the reconnect code to catch this case and do another retry.

Thanks for the report. It's good to see that we caught a Timeout exception and put that bug in a box.

livysdad27 commented 1 year ago

This fix is going to take a bit longer. Here's the logic I am proposing....

If there's a failure upon startup, don't catch the exception and attempt to restart. I'm thinking it's important to have a stable system at startup.

After that, if the network goes down, update the reconnect logic to capture the exception and keep trying until a max-retries is reached.
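The post-startup half of that proposal could look roughly like the sketch below. It is not the driver's actual code: `connect_with_retries`, its parameters, and the `connect` callable are hypothetical names used for illustration, and only standard-library exceptions are retried by default. A real driver would also pass the websocket library's timeout exception in `retryable`.

```python
import socket
import time


def connect_with_retries(connect, max_retries=5, delay=30.0,
                         retryable=(OSError,), sleep=time.sleep):
    """Call connect() until it succeeds or max_retries retries have failed.

    connect   -- hypothetical zero-argument callable that opens the websocket
    retryable -- exception types treated as transient network failures;
                 socket.gaierror (name resolution failure) is an OSError
    """
    attempt = 0
    while True:
        try:
            return connect()
        except retryable as err:
            if attempt >= max_retries:
                raise  # give up and let WeeWX (or the init system) handle it
            attempt += 1
            print(f"Reconnect failed ({err!r}); retry {attempt} of {max_retries}")
            sleep(delay)


# Demo with a fake connection that fails twice with a DNS error, then succeeds.
outcomes = [
    socket.gaierror(-3, "Temporary failure in name resolution"),
    socket.gaierror(-3, "Temporary failure in name resolution"),
    "connected",
]


def fake_connect():
    result = outcomes.pop(0)
    if isinstance(result, Exception):
        raise result
    return result


print(connect_with_retries(fake_connect, max_retries=3, delay=0))
```

Re-raising once the retry budget is spent keeps the failure visible, so a service manager can still restart WeeWX if the outage outlasts the retries.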

Billy

vogelnr commented 1 year ago

Yeah, I think that's a great idea to have failure on startup caught and have it restart. I have noticed when rebooting the pi (for updates, and now while installing and configuring an RTC) that the weewx service appears to start before network services come up, and it hits the timeout exception right away. A simple restart of weewx once I log in fixes it. I suspect there is a way to delay the start of the weewx service on boot, but I haven't looked into that.

Nick

livysdad27 commented 1 year ago

Ya, that's a challenge for your startup order on the pi. Weewx overall isn't going to be happy if it's starting prior to the network. You should be able to google how to make the startup scripts wait on a dependency OR just set the order.

I might have more time this weekend. Thanks for the patience and the report!

livysdad27 commented 1 year ago

This might help in the short term....

This thread covers how to modify the weewx systemd scripts to "wait" on network startup etc... I suspect systemd is already set up on your pi. If not, there's a link on converting from init.d to systemd. This might be a good long-term solution TBH, as it would cover the startup case.

It'll take me a bit to get my code potentially handling this so hopefully it'll help a bit sooner. https://groups.google.com/g/weewx-user/c/wNTLs5DTZzc

Billy

vogelnr commented 1 year ago

Thanks, I'll take a look at it and implement.

Just to be clear, I was just reporting findings to help correlate a known outage and boot experiences as feedback in case it was something you wanted to address. In no way am I expecting you to spend time working on it just for me ;) I greatly appreciate the driver you've made and I'm more than willing to continue to use it as-is for as long as needed. Knowing the quirks and dealing with them is helpful.

Thanks! Nick

vogelnr commented 1 year ago

After some digging I think I found an easier option. Modifying the weewx.conf and setting loop_on_init = True has the WeeWX service retry every 60 seconds on a hardware driver failure. The Userguide states for loop_on_init:

"Normally, if the hardware driver fails to load, WeeWX will exit. The assumption is that there is a configuration problem and so retries are useless. However, in some cases, drivers can fail to load for intermittent reasons, such as a network failure. In these cases, it may be useful to have WeeWX do a retry. Setting this option to 1 will cause WeeWX to keep retrying indefinitely."

While stumbling upon loop_on_init, I did see another WeeWX driver, WLLDriver, specifically mention loop_on_init in case of driver failure.

I modified my weewx.conf, did a reboot and watched the syslog. As expected WeeWX displayed a failure but instead of exiting it gave a message "Mar 10 20:24:00 wx weewx[529] CRITICAL main: **** Waiting 60 seconds then retrying..." It was successful on the retry and WeeWX finished starting up once eth0 came online. I've snipped the syslog and included everything from when the failure happened to it succeeding so you can see the process.

In my opinion setting the option for loop_on_init = True is the answer for startup failures. I don't think this would resolve internet outages w/o the service restarting. I'm going to test that out by unplugging the network cable and letting it sit for 10 minutes before plugging it back in.

weewx-retry.txt

Nick

vogelnr commented 1 year ago

Confirming on network disconnect and reconnecting that it does cause weewx to crash and need a manual restart.

duvalljm commented 1 year ago

I too have encountered this issue on three occasions since setting up weewx and have a question. I can't tell from the last comment whether setting loop_on_init = True resolves this. In my case, internet drops for several minutes or flaps several times over a period of a few minutes, both of which caused weewx to fail. What I found interesting is the behavior once internet was back. If I just start weewx, it doesn't work. If I stop weewx, I get a response that weewx is not running. However, if I stop and then start, it works every time. Does loop_on_init fix this?

livysdad27 commented 1 year ago

I recently had to deal with some internet downtimes. @vogelnr I found this article, implemented it and it seems to address the issue. https://github.com/weewx/weewx/wiki/systemd . I had to modify the startup script to call python3 explicitly, and my Ubuntu implementation puts the service in a slightly different location than the Debian instructions. In the [Service] stanza you want Restart=on-failure. You can also find other options for systemd startup params that help. After converting mine to systemd and implementing that param, basically any weewx crash/failure will restart eventually.
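For reference, a minimal sketch of the unit-file lines this refers to, assuming WeeWX runs as a systemd service. The restart-on-failure setting is the one mentioned above; the `network-online.target` lines and the `RestartSec=60` value are assumptions added for illustration, so compare them with the wiki article before using them.

```
[Unit]
# Assumption: start WeeWX only once the network is reported online.
Wants=network-online.target
After=network-online.target

[Service]
# Restart WeeWX automatically whenever it exits with an error.
Restart=on-failure
# Assumption: wait 60 seconds between restart attempts.
RestartSec=60
```

After editing the unit file, `systemctl daemon-reload` followed by a restart of the service picks up the change.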

I still want to build some retry code into the driver, but the websocket library I use makes it a bit tricky.

@duvalljm you might also try the above. WRT your other question, I don't use loop_on_init. That question might be a better one for the weewx google group. I get a ton of help there.

livysdad27 commented 7 months ago

I added the retry code back in and in my most recent power failure (cable to the battery pack failed) the retry routine worked. In the course of this eventually weewx failed and the instructions WRT systemd referenced above worked as well. Considering this closed.