michael-slx / weewx-weatherlink-live

WeeWX driver for WeatherLink Live
MIT License

Persistent failures after DNS glitch #42

Closed michael-slx closed 2 months ago

michael-slx commented 5 months ago

Discussed in https://github.com/michael-slx/weewx-weatherlink-live/discussions/41

Originally posted by **gdt** April 15, 2024

I'm helping someone set up weewx with a WLL for a VP2. Both are on ethernet. The router is glitchy and sometimes fails to respond to DNS queries for mdns. The router does not support static-mapped DHCP. That's all a mess, and not the driver's fault.

The computer is a RPI3, running NetBSD with python 3.11. An identical computer with identical sw -- just not this driver -- works 100% solidly with a VP2, staying up except for power, and not having problems at all.

From time to time, the DNS lookup fails, and it keeps trying. Often it's OK. Sometimes there are further issues, and it gets into a state where I get three lines

```
ERROR user.weatherlink_live.scheduler: Error caught in scheduler tick. Not rescheduling
WARNING user.weatherlink_live.driver: No data since 1 iterations
CRITICAL __main__: Caught WeeWxIOError: Error while receiving or processing packets: ConnectionError(MaxRetryError('HTTPConnectionPool(host=\'wea\
therlinklive.lan\', port=80): Max retries exceeded with url: /v1/current_conditions (Caused by NameResolutionError(": Failed to re\
solve \'weatherlinklive.lan\' ([Errno 7] No address associated with hostname)"))'))
```

followed by

```
CRITICAL __main__: Caught WeeWxIOError: Error while receiving or processing packets: OSError(48, 'Address already in use')
```

I'm way out on a limb, but it looks like a failure to close the listening socket for UDP when handling the error. And the scheduler tick error looks suspicious too.

Code is an up-to-date driver (git head from a week ago), and old 4.8 weewx. It's on my list to update weewx to 5.x, but that's harder because of the changing of the install method, so it will take me a bit. Reading commit logs I don't see a breakage of older weewx, and this error seems like it's in the driver.

I'm going to try to read the code and figure this out, but thought I'd mention it and see if anybody has thoughts.
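To illustrate the hypothesis above (a UDP listening socket left open after an error, so the next bind fails with "Address already in use"), here is a minimal Python sketch. It is not the driver's code: `open_listener`, `poll_once`, `can_rebind` and the `fetch` callback are hypothetical names, and the sketch binds to localhost only so it can be run anywhere.

```python
import errno
import socket


def open_listener(port: int = 0) -> socket.socket:
    """Bind a UDP socket on localhost; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        sock.close()  # do not leak the descriptor if bind itself fails
        raise
    return sock


def poll_once(port: int, fetch) -> None:
    """Hypothetical polling step: listen on `port`, then call `fetch`.

    `fetch` stands in for the HTTP request that failed with a DNS error.
    The `with` block closes the socket even when `fetch` raises.
    """
    with open_listener(port):
        fetch()


def can_rebind(port: int) -> bool:
    """Return True if `port` can be bound again right now."""
    try:
        open_listener(port).close()
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
```

If a socket from an earlier attempt is still open, `can_rebind` returns `False`, which corresponds to the `OSError(48, ...)` seen in the log (48 is `EADDRINUSE` on BSD systems). When the socket is closed on every exit path, as in `poll_once`, a later attempt can bind the same port again.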

gdt commented 4 months ago

There was another connection problem. First error 165018Z today, and last error 171340Z. I got a reconnection alert at 1715, due to it posting data over mqtt at the 5min archive interval.

The only thing I'd say isn't right is that there is no report of success at the same critical log level, once, following a critical report of failure. Perhaps that should be inferred, and this is just my preference. But obviously that is really not a big deal and the important thing is that it reconnected.

So I'd say merge this to master, please. Thanks for addressing my bug report in discussions and hoisting it to issues.
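The logging preference described in this comment (one success report at the same level, following a critical failure report) could be sketched roughly as below. This is an illustration only, not the driver's implementation: the `FailureTracker` class, the logger name, and the threshold of three consecutive failures are all assumptions.

```python
import logging

log = logging.getLogger("example.recovery")  # hypothetical logger name


class FailureTracker:
    """Count consecutive failures and report failure and recovery once each."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # assumed number of failures before alerting
        self.consecutive = 0
        self.alerted = False

    def record_failure(self) -> bool:
        """Return True exactly once, when the threshold is first reached."""
        self.consecutive += 1
        if self.consecutive >= self.threshold and not self.alerted:
            self.alerted = True
            log.critical("No data after %d consecutive failures", self.consecutive)
            return True
        return False

    def record_success(self) -> bool:
        """Return True if this success ends a previously reported outage."""
        recovered = self.alerted
        if recovered:
            # Same level as the failure report, emitted only once.
            log.critical("Data flow restored after %d failures", self.consecutive)
        self.consecutive = 0
        self.alerted = False
        return recovered
```

With this shape, isolated single failures stay silent at the critical level, and every critical failure message is eventually paired with exactly one critical recovery message.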

gdt commented 4 months ago

The same instance from the previous comment is still running. There have been no more 3x timeouts, just singles, so the recovery code isn't being invoked. But still, all is well.

michael-slx commented 2 months ago

Have you had any issues recently?

gdt commented 2 months ago

I haven't had a "reporting lost" alert and hence haven't looked, which is a good sign. I just went to get the logs. There have been multiple instances of 3 failures, but the code is coping OK and recovers. So I think things are OK.

gdt commented 2 months ago

I would say I have enough experience, including with timeouts, to be as sure as I can be that the two commits are a huge improvement, so I recommend merging the bugfix/close-on-error to develop and then folding it into a release. (I'm not clear on how you are doing releases.) Thanks very much for addressing this. Let me know if I can do anything else to help. I still need to update this box to weewx 5, and it's not local, so things are a little tricky.

I have filed #46 and #47 for the two suggestions I made earlier, now edited out.