livysdad27 / tempestWS

Weatherflow Tempest WebSocket Driver
GNU General Public License v3.0
5 stars 0 forks source link

Driver crashes weewx after a day or two of running #7

Closed vogelnr closed 1 year ago

vogelnr commented 1 year ago

I've noticed after running fine for at least a day that suddenly my Weewx will stop updating data. When I looked at the syslog I find the error output below. Stopping and starting the Weewx resolves the issue. I did try adjusting the restart sleep interval to 30 thinking maybe it's retrying too fast for the API but I have had no success on having Weewx run for longer than 2 days before needing to restart it. I'm adjusting the sleep interval to 60 now to see if that helps but thought I should post here incase there is another solution I'm missing.

Below is my syslog output: Feb 19 11:20:25 wx weewx[1263] INFO weewx.reportengine: Copied 0 files to /var/www/html/weewx Feb 19 11:25:36 wx weewx[1263] INFO weewx.engine: Main loop exiting. Shutting engine down. Feb 19 11:25:36 wx weewx[1263] INFO weewx.engine: Shutting down StdReport thread Feb 19 11:25:36 wx weewx[1263] INFO user.tempestWS: Stopping messages and closing websocket Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Caught unrecoverable exception: Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Connection timed out Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Traceback (most recent call last): Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_socket.py", line 108, in recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: bytes_ = _recv() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_socket.py", line 87, in _recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: return sock.recv(bufsize) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/lib/python3.9/ssl.py", line 1226, in recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: return self.read(buflen) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/lib/python3.9/ssl.py", line 1101, in read Feb 19 11:25:36 wx weewx[1263] CRITICAL main: return self._sslobj.read(len) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: TimeoutError: [Errno 110] Connection timed out Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Feb 19 11:25:36 wx weewx[1263] CRITICAL main: During handling of the above exception, another exception occurred: Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Traceback (most recent call last): Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/share/weewx/weewxd", line 154, in main Feb 19 11:25:36 wx weewx[1263] CRITICAL main: engine.run() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/share/weewx/weewx/engine.py", line 208, in run Feb 19 11:25:36 wx weewx[1263] CRITICAL main: for packet in self.console.genLoopPackets(): Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/share/weewx/user/tempestWS.py", line 152, in genLoopPackets Feb 19 11:25:36 wx weewx[1263] CRITICAL main: raw_resp = self.ws.recv() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_core.py", line 362, in recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: opcode, data = self.recv_data() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_core.py", line 385, in recv_data Feb 19 11:25:36 wx weewx[1263] CRITICAL main: opcode, frame = self.recv_data_frame(control_frame) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_core.py", line 406, in recv_data_frame Feb 19 11:25:36 wx weewx[1263] CRITICAL main: frame = self.recv_frame() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_core.py", line 445, in recv_frame Feb 19 11:25:36 wx weewx[1263] CRITICAL main: return self.frame_buffer.recv_frame() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_abnf.py", line 338, in recv_frame Feb 19 11:25:36 wx weewx[1263] CRITICAL main: self.recv_header() Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_abnf.py", line 294, in recv_header Feb 19 11:25:36 wx weewx[1263] CRITICAL main: header = self.recv_strict(2) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_abnf.py", line 373, in recvstrict Feb 19 11:25:36 wx weewx[1263] CRITICAL main: **** bytes = self.recv(min(16384, shortage)) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_core.py", line 529, in _recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: return recv(self.sock, bufsize) Feb 19 11:25:36 wx weewx[1263] CRITICAL main: File "/usr/local/lib/python3.9/dist-packages/websocket/_socket.py", line 110, in recv Feb 19 11:25:36 wx weewx[1263] CRITICAL main: raise WebSocketTimeoutException("Connection timed out") Feb 19 11:25:36 wx weewx[1263] CRITICAL main: websocket._exceptions.WebSocketTimeoutException: Connection timed out Feb 19 11:25:36 wx weewx[1263] CRITICAL main: Exiting.

livysdad27 commented 1 year ago

Hi,

Thanks for the report. A few quick questions to help....

About three weeks ago I updated the driver with new reconnect code but on that line it captures the WebSocketConnectionClosedException and not the WebSocketTimeoutException. I could try adding code to address that as well if you'd be game to run it and share the results so that we can reproduce whatever is causing it. That would at least keep the server up and running.

The next challenge would be for you to sort out why you're getting timeouts. Something seems to be interrupting your connection or slowing it down periodically. That's why I ask about where your server is setup.

Let me know if you're game to try a new version and I'll add the exception handling and we'll see how it does.

Billy

vogelnr commented 1 year ago

Hello,

I'm running Weewx 4.10.1-1 on pi2 hardware at my house and all installed mid last week. It's a newly built pi but it was the only one I had laying around to setup at the moment. I'd prefer it on pi3 but the supply chain isn't allowing that to happen any time soon. Nothing else runs on it, just Weewx. It's hardwired into my gig network and I have fiber to my home w/o any bandwidth constraints. I do have it using 9.9.9.9 public DNS to avoid any issues w/ my dns filtering in house. I've been keeping a ssh session open tailing the syslog to catch it when it crashes so I can restart it. The only possible constraint I could see is the pi2 hardware but I would think it could handle running queries as I have another pi2 running flightaware w/o issues as well as a couple running pihole. I could be wrong though!

I'm game to try the new code. I assume I'd just need to uninstall the driver and install the new one once I downloaded that code?

Thanks for the quick reply on this! Nick

On Tue, Feb 21, 2023 at 11:06 AM Billy @.***> wrote:

Hi,

Thanks for the report. A few quick questions to help....

  • Are you running weewx 4 or 5?
  • Is your server in the cloud, at your house or somewhere else where there might be a slow connection?

About three weeks ago I updated the driver with new reconnect code but on that line it captures the WebSocketConnectionClosedException and not the WebSocketTimeoutException. I could try adding code to address that as well if you'd be game to run it and share the results so that we can reproduce whatever is causing it. That would at least keep the server up and running.

The next challenge would be for you to sort out why you're getting timeouts. Something seems to be interrupting your connection or slowing it down periodically. That's why I ask about where your server is setup.

Let me know if you're game to try a new version and I'll add the exception handling and we'll see how it does.

Billy

— Reply to this email directly, view it on GitHub https://github.com/livysdad27/tempestWS/issues/7#issuecomment-1438821067, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6BIZQF44OE6N6H6TDWP35TWYTYZ3ANCNFSM6AAAAAAVDGAXMY . You are receiving this because you authored the thread.Message ID: @.***>

livysdad27 commented 1 year ago

Thanks Nick,

Sounds good. Let me see if I can get a new branch updated for you tonight with some instructions for trying it out and we'll see if we can at least capture the exception and keep your server up and running.

On my system I get periodic dropped connections from Weatherflow's websocket server. I wonder too if that impacts your pi or somehow triggers the timeout sometimes. At a minimum we'll try to keep it up and running. I also have a side project of setting up Weewx to restart after a crash but not a ton of time for that just yet.

Happy to help. I don't think I have a lot of users of this code yet so the more people who share things the easier it is to keep it out there.

Billy

On Tue, Feb 21, 2023 at 9:17 AM vogelnr @.***> wrote:

Hello,

I'm running Weewx 4.10.1-1 on pi2 hardware at my house and all installed mid last week. It's a newly built pi but it was the only one I had laying around to setup at the moment. I'd prefer it on pi3 but the supply chain isn't allowing that to happen any time soon. Nothing else runs on it, just Weewx. It's hardwired into my gig network and I have fiber to my home w/o any bandwidth constraints. I do have it using 9.9.9.9 public DNS to avoid any issues w/ my dns filtering in house. I've been keeping a ssh session open tailing the syslog to catch it when it crashes so I can restart it. The only possible constraint I could see is the pi2 hardware but I would think it could handle running queries as I have another pi2 running flightaware w/o issues as well as a couple running pihole. I could be wrong though!

I'm game to try the new code. I assume I'd just need to uninstall the driver and install the new one once I downloaded that code?

Thanks for the quick reply on this! Nick

On Tue, Feb 21, 2023 at 11:06 AM Billy @.***> wrote:

Hi,

Thanks for the report. A few quick questions to help....

  • Are you running weewx 4 or 5?
  • Is your server in the cloud, at your house or somewhere else where there might be a slow connection?

About three weeks ago I updated the driver with new reconnect code but on that line it captures the WebSocketConnectionClosedException and not the WebSocketTimeoutException. I could try adding code to address that as well if you'd be game to run it and share the results so that we can reproduce whatever is causing it. That would at least keep the server up and running.

The next challenge would be for you to sort out why you're getting timeouts. Something seems to be interrupting your connection or slowing it down periodically. That's why I ask about where your server is setup.

Let me know if you're game to try a new version and I'll add the exception handling and we'll see how it does.

Billy

— Reply to this email directly, view it on GitHub < https://github.com/livysdad27/tempestWS/issues/7#issuecomment-1438821067>, or unsubscribe < https://github.com/notifications/unsubscribe-auth/A6BIZQF44OE6N6H6TDWP35TWYTYZ3ANCNFSM6AAAAAAVDGAXMY

. You are receiving this because you authored the thread.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/livysdad27/tempestWS/issues/7#issuecomment-1438835617, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNPU3EK2OFGQPSEAH7P34TWYT2CHANCNFSM6AAAAAAVDGAXMY . You are receiving this because you commented.Message ID: @.***>

-- Billy Jackson @.***

vogelnr commented 1 year ago

Sounds great! I appreciate it and more than willing to try things out. I liked the idea of using the websocket than the other Weatherflow driver I had seen out there. It seems to work great except this exception that occurs.

Nick

livysdad27 commented 1 year ago

Hi Nick

I made my updates to a branch called timeout_test. I smoke tested it on my cloud server and it didn't die right away but it of course doesn't throw the exception you're running into. I find the easiest way to update the file is to do a new git clone and then cd into the tempestWS directory and a git checkout timeout_test. After that you can stop weewx, replace the driver file and restart weewx. You probably have your own process.

The change that I made is to update the code that catches the WebSocketConnectionClosed exception to also catch the WebSocketTimeoutException and then updated the logging to tell us which one was caught. It's a hair hacky but hopefully works.

What I'm not sure of is if this will be enough because the exception is handled by waiting the amount of time set in the reconnect_sleep_interval value in your weewx.conf and then firing up the reconnection code. That code currently assumes things are going to startup smoothly or you sholdn't bother. If you find that it dies here you could play with different values of reconnect_sleep_interval to see if the issue/state that is causing it goes away after more than the default number of seconds (I think 20).

Give it a whirl and send along log files for when it either succeeds or fails and we'll iterate. Let me know how it goes.

Billy

vogelnr commented 1 year ago

Sounds good, I believe I ran the git functions correctly (fairly new to using github). Nothing screamed at me when I went through the process at least. Installed the driver again into weewx and started the process back up. It looks to be running now. I'll keep an eye on the logs and report back in a few days with success or sooner if there is a failure.

Thanks! Nick

vogelnr commented 1 year ago

Hello Billy,

I just wanted to follow up that so far Weewx has not crashed since updating the driver. I did keep the reconnect_sleep_interval set to 60 seconds. I've attached my log file which contains the original crashes to this morning. Below is the only message I see generated by the user.tempestWS driver which to me looks like the reconnecting is successful. Either case I've attached the log file for you to look at if interested. I did redact my station ID and serial from where I saw it but other than that it is the entire raw log. So far it's looking promising.

Feb 23 10:30:00 wx weewx[11572] ERROR user.tempestWS: Caught a <class 'websocket._exceptions.WebSocketConnectionClosedException'>, attempting to reconnect! Try 0 Feb 23 10:31:00 wx weewx[11572] INFO user.tempestWS: Successfully received open connection response! Feb 23 10:31:00 wx weewx[11572] INFO user.tempestWS: Received a positive ack response for listen_rapid_start Feb 23 10:31:00 wx weewx[11572] INFO user.tempestWS: Received a positive ack response for listen_start

weewx2-24-23.log

livysdad27 commented 1 year ago

Hi Nick,

Thanks for the logs. Looks like there were four resets captured recently. They didn't spit out timeout errors which I find interesting. I'm going to keep this open for a couple more days. I'd love to know if you see any timeout errors caught and restarted in that time and/or if you have a crash. After that though I hope we can call this one good. I've merged the changes into the main branch with the knowledge that it's at least helping.

Let me know if you run into anything new.

Billy

livysdad27 commented 1 year ago

No new reports so closing.

vogelnr commented 1 year ago

Hey Billy,

You'll love this, the day after you closed the problem it crashed. However... I think WeatherFlow might be having server issues as the webpage for my station on their system is having retry/timeouts. Anyways, I thought I'd add the output of the error here incase you wanted some exception in the code for handling it. I tried a manual restart and it presents the same error. To me it looks like it can't connect to the auth server or database. Outside of this it has been solid. I'll check back later and see if the WeatherFlow site is responding and check WeeWX then.

Nick

weewx3-2-23.log

livysdad27 commented 1 year ago

Thanks for sharing Nick,

I saw the same thing in my own implementation. After troubleshooting I sent tempest a ticket and just got a response back....

- Please type your reply above this line -

Your ticket has been updated. To provide additional information, please reply. Hi Billy, thanks for writing in. We pushed an API update this afternoon and some parts of it didn't go quite as planned; I'm not 100% sure if it's responsible for the errors you're seeing but it seems at least possible. Everything should be back up shortly - but if you're still getting those errors in an hour or so, please let us know!

Kind Regards,

WeatherFlow Support

On 2023 M03 2 20:38:50 UTC, Billy Jackson @.*** wrote:

Starting today every time I try to attach to the websocket API I receive the message below..

'status_code': 5, 'status_message': 'AUTHORIZATION_REQUIRED - Rate limit exceed or your API Key or Oauth2 Access Token missing'

This started out of the blue. I re-generated a key to see if that was the issue but I get the same problem. I've not changed my code (using a weewx driver) and it usually keeps the websocket open for long periods of time just listening.

More data below for timestamps. The first one is the initial failure. The second two are attempts to restart. Third is attempting to restart with a new key.

Mar 02 10:53:57 ip-172-26-6-233 python3[620]: weewx[620] ERROR user.tempestWS: I don't recognize this at all: {'status_code': 5, 'status_message': 'AUTHORIZATION_REQUIRED - Rate limit exceed or your API Key or Oauth2 Access Token missing'} Mar 02 12:28:53 ip-172-26-6-233 python3[7123]: weewx[7123] ERROR user.tempestWS: I don't recognize this at all: {'status_code': 5, 'status_message': 'AUTHORIZATION_REQUIRED - Rate limit exceed or your API Key or Oauth2 Access Token missing'} Mar 02 12:33:05 ip-172-26-6-233 python3[7174]: weewx[7174] ERROR user.tempestWS: I don't recognize this at all: {'status_code': 5, 'status_message': 'AUTHORIZATION_REQUIRED - Rate limit exceed or your API Key or Oauth2 Access Token missing'}

Thanks

On Thu, Mar 2, 2023 at 12:50 PM vogelnr @.***> wrote:

Hey Billy,

You'll love this, the day after you closed the problem it crashed. However... I think WeatherFlow might be having server issues as the webpage for my station on their system is having retry/timeouts. Anyways, I thought I'd add the output of the error here incase you wanted some exception in the code for handling it. I tried a manual restart and it presents the same error. To me it looks like it can't connect to the auth server or database. Outside of this it has been solid. I'll check back later and see if the WeatherFlow site is responding and check WeeWX then.

Nick

weewx3-2-23.log https://github.com/livysdad27/tempestWS/files/10875659/weewx3-2-23.log

— Reply to this email directly, view it on GitHub https://github.com/livysdad27/tempestWS/issues/7#issuecomment-1452524172, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNPU3BWLW3DGZKXM2N2CLTW2EBXVANCNFSM6AAAAAAVDGAXMY . You are receiving this because you modified the open/close state.Message ID: @.***>

-- Billy Jackson @.***

livysdad27 commented 1 year ago

Looks like a tempest issue BUT I might open a new issue to cover catching this and keeping the server up and running.

vogelnr commented 1 year ago

I see the weatherflow page for my station is back up and working. However, weewx continues to have the same authorization error message. Hopefully they figure it out.

livysdad27 commented 1 year ago

Thanks to the tempest forum I found a fix. I just did an emergency merge. Reinstall and I suspect you'll be good. They changed the API parameter for using your personal token to no longer work named as ?api_key. Now it has to be ?token. I just validated on my installation. Give it a whirl!

vogelnr commented 1 year ago

That did it. I cloned, reinstalled and forced a couple reports and everything is updating again. Thanks for your work on this! I tried searching their forums for anything about today's outage and saw nothing.

livysdad27 commented 1 year ago

Happy to help, I got lucky with a post on https://community.weatherflow.com/t/websocket-api-stopped-working/20253/9 and one person there sorted it out. I thought it only applied to the rest api though based on the tempest response until I went to the websocket api docs and saw the change. At least it was easy to fix!

On Thu, Mar 2, 2023 at 5:23 PM vogelnr @.***> wrote:

That did it. I cloned, reinstalled and forced a couple reports and everything is updating again. Thanks for your work on this! I tried searching their forums for anything about today's outage and saw nothing.

— Reply to this email directly, view it on GitHub https://github.com/livysdad27/tempestWS/issues/7#issuecomment-1452800086, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNPU3E6INK24WLCSVRHWVLW2FBX7ANCNFSM6AAAAAAVDGAXMY . You are receiving this because you modified the open/close state.Message ID: @.***>

-- Billy Jackson @.***

vogelnr commented 1 year ago

Makes you wonder if they didn't make the change completely when they were depreciating ?api_key and whatever merge or change they did today updated it. Glad you were able to find it and it did work!