home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

NWS Integration, unstable data #64022

Closed Sudo-Rob closed 2 years ago

Sudo-Rob commented 2 years ago

The problem

I have two systems running HA OS and both are running the NWS integration. Each is connected to the same managed switch and router, and each is configured with the same latitude, longitude, and weather observation station. One system uses an ARM CPU (Odroid N2) and it reliably acquires NWS data as expected. The other system is an x86 system (Odroid H2), and it only occasionally acquires a full set of 'daynight' data. Sometimes, forecast data are only partially provided. Additionally, 'hourly' data are limited to the current observation set; typically, no hourly forecast data are present. The problem is mostly repeatable, particularly after a host reboot. The x86 system operates as expected with all other add-ons and integrations, including Nabu Casa and other API-based cloud services.

image

What version of Home Assistant Core has the issue?

core-2021.12.9

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

NWS

Link to integration documentation on our website

https://www.home-assistant.io/integrations/nws/

Example YAML snippet

N/A

Anything in the logs that might be useful for us?

Other than some snitun server errors, home-assistant.log is clean.

Additional information

Both captures show comparable messages: [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSecr=0 WS=128

I observe (via Wireshark) differences in the TCP responses:

This is the ARM system TCP info: image

Note the 'Conversation completeness' is good.

This is the x86 system TCP info: image

Note the 'Conversation completeness' is incomplete. When the NWS integration isn't properly acquiring forecast data, this condition appears in all packets. The same condition also manifested when the system was configured as a compliant Supervised installation (Debian bullseye).

probot-home-assistant[bot] commented 2 years ago

nws documentation nws source (message by IssueLinks)

probot-home-assistant[bot] commented 2 years ago

Hey there @matthewflamm, mind taking a look at this issue as it has been labeled with an integration (nws) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

MatthewFlamm commented 2 years ago

The NWS servers are definitely inconsistent, but I have no idea what would cause differences across architectures (or other factors). The only area I can think of from an integration perspective is whether you are running into rate limits by having multiple servers, presumably on the same network. But I would be surprised, as the integration only requests data every 15 minutes or so.

The NWS integration in homeassistant caches forecast data for 45 minutes, and reuses that data if a failure occurs. This is the reason why you can reproduce this more often during startup; there is no successfully cached data.
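Roughly, the behavior follows this pattern (a simplified sketch, not the integration's actual code; the names here are illustrative):

```python
# Simplified sketch of the reuse-on-failure behavior described above
# (illustrative only; not the integration's actual code).
from datetime import datetime, timedelta

CACHE_TTL = timedelta(minutes=45)  # forecast data treated as valid this long


class ForecastCache:
    """Serve fresh forecast data when possible, cached data on failure."""

    def __init__(self):
        self._data = None
        self._fetched_at = None

    async def get_forecast(self, fetch):
        try:
            self._data = await fetch()            # e.g. a pynws/aiohttp call
            self._fetched_at = datetime.utcnow()
        except Exception:
            # On failure, quietly reuse the cached copy if it is recent enough.
            # Right after a reboot nothing is cached yet, which is why the
            # problem is easiest to reproduce at startup.
            if self._fetched_at is None or (
                datetime.utcnow() - self._fetched_at > CACHE_TTL
            ):
                return None
        return self._data
```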

Sudo-Rob commented 2 years ago

Thanks for the response.

I've read there can be differences in the way Python behaves with different architectures. I'm also trying to find out if there's any dependency on the BIOS for the Ethernet port. FYI, the correct driver appears in the Host startup log.

I don't think server rate limits are the issue. I can put the API endpoint link in a browser and the error rate is pretty low; I've never seen a back-to-back error. I've tried both the daynight and hourly links. Moreover, the x86 system reliably acquires current observation data. That said, I have seen api.weather.gov resolve to a different IP address from time to time when doing manual DNS look-ups, but that should affect both systems. I have seen some server limits when getting Nexrad images, but I currently have those disabled. FYI, my two-way WAN bandwidth is 25 Mbps and I have plenty of headroom.
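If anyone wants to repeat the browser test in a loop, a rough sketch like this works (the gridpoint in the URLs is an example; the User-Agent contact address is a placeholder):

```python
# Rough sketch for polling the forecast endpoints and tallying bad responses.
# The gridpoint (LWX/100,87) is an example; the User-Agent contact is a placeholder.
import time

import requests

ENDPOINTS = [
    "https://api.weather.gov/gridpoints/LWX/100,87/forecast",
    "https://api.weather.gov/gridpoints/LWX/100,87/forecast/hourly",
]
HEADERS = {"User-Agent": "nws-debug-probe (you@example.com)"}

problems = 0
attempts = 20
for _ in range(attempts):
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, headers=HEADERS, timeout=15)
            periods = resp.json().get("properties", {}).get("periods", [])
            print(f"{resp.status_code} {len(periods):3d} periods  {url}")
            if resp.status_code != 200 or not periods:
                problems += 1
        except (requests.RequestException, ValueError) as exc:
            problems += 1
            print(f"request failed: {exc}")
    time.sleep(60)  # roughly one probe per minute

print(f"{problems} problem responses out of {attempts * len(ENDPOINTS)}")
```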

I suppose the significance might be that when rebooting, the ARM system gets an immediate refresh of the cache and the x86 system doesn't. Still, when I had an x86 Supervised installation (w/ Debian), I experienced the same problem with the NWS integration.

Not that it's significant, but I also see some differences in the behavior of the hourly weather card. During the few times when both systems have acquired hourly data, only the ARM system displays a few icons below the current observations; the x86 system doesn't.

I'm not sure where to go from here other than to see if there are any similar reports from those using NUCs or other x86 hosts. If you want me to do any specific testing, just ask.

MatthewFlamm commented 2 years ago

Interestingly enough, I just restarted my server that has x86_64 arch, which seemed to be working fine prior to the restart. I am now seeing this too. No data, but no error messages either.

NWS servers have had weird caching issues whereby using a specific user agent would provide old data. But I also tried a different user agent and I'm getting the same issue now.

This integration defers error handling in requests to a core utility provided by homeassistant, so this aspect of the problem (no error message on a failed update) should be shared with other integrations. Unfortunately, there is not any debug logging in this section to figure out what might be coming in. This would be a good feature request to add into the upstream library. One possibility is that an empty forecast is being returned (I'm not sure how this squares with your TCP observation).
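For context, the core utility in question is the shared data-update coordinator helper; the overall shape is roughly the sketch below (illustrative only, not the NWS integration's actual code, and the pynws-style calls are placeholders), and the fetch step is where extra debug logging would need to land:

```python
# Illustrative sketch of the shared coordinator pattern, with a debug log
# around the fetch (not the NWS integration's actual code; the pynws-style
# calls are placeholders).
import logging
from datetime import timedelta

from homeassistant.helpers.update_coordinator import (
    DataUpdateCoordinator,
    UpdateFailed,
)

_LOGGER = logging.getLogger(__name__)


async def async_setup_forecast_coordinator(hass, nws_client):
    async def _async_update():
        try:
            await nws_client.update_forecast()  # placeholder pynws-style call
        except Exception as err:
            raise UpdateFailed(err) from err
        # Logging the payload here would distinguish "request failed" from
        # "request succeeded but the server returned an empty forecast".
        forecast = nws_client.forecast or []
        _LOGGER.debug("Fetched %d forecast periods", len(forecast))
        return forecast

    coordinator = DataUpdateCoordinator(
        hass,
        _LOGGER,
        name="nws forecast (debug sketch)",
        update_method=_async_update,
        update_interval=timedelta(minutes=15),
    )
    await coordinator.async_refresh()
    return coordinator
```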

Sudo-Rob commented 2 years ago

I'll look at the packet captures later and see if I can see any differences in the packet lengths. Unfortunately, the application data packets are encrypted (if I'm correctly interpreting what I'm seeing). If you want me to try to post some of the captures, I can do that. You'll need Wireshark to interpret them.

Sudo-Rob commented 2 years ago

Hmm...I just tried something and I'd like to know if it's repeatable on your end.

I performed a controlled shutdown of the ARM system and rebooted the x86 system. There was no problem at startup.

MatthewFlamm commented 2 years ago

I don't have an ARM system running. But my server also was able to recently fetch valid data without restarting.

Sudo-Rob commented 2 years ago

My misbehaving system will sometimes get back into sync and then lose it again. So, that isn't a surprise. It seems like two client systems (of any architecture) on the same LAN might be the issue.

I do see a transaction that looks suspicious. It occurs at the beginning of a TLS handshake, at the point where the ACK is followed by a Client Hello packet. The TCP sequence number should have incremented, but didn't. Moreover, during the entire transaction, the Conversation Completeness (I need to find a definition) is marked incomplete.

https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/

image

I think I need to accumulate more data to see if I can capture the start of a train wreck. BTW, this is my first experience with Wireshark. Gotta start somewhere.

Edit: Conversation Completeness addressed at the end of this page: https://www.wireshark.org/docs/wsug_html_chunked/ChAdvTCPAnalysis.html

Sudo-Rob commented 2 years ago

I've accumulated some data and I have some observations. The setup is with two HA clients and a common server. Data are being captured by tcpdump in my router.

The first observation is more of what I've already described. These occur during the beginning of a TLS handshake when there's a duplicate sequence number. The server recognizes it and replies with (I think) a re-transmission request [SYN, ACK]. I think this is benign, but I don't think it should occur so often. Moreover, all occur with identical timing relative to the handshake. That is, the behavior is deterministic.

Another thing I've seen is a client TCP RST (reset) message three times in quick succession on both systems. What's interesting is that between the second and third instance, the server sends a data packet. It happens every time the signature appears. That said, it's a small packet with only about 24 unique bytes. The third reset differs from the first two by one byte.

In one instance, it is followed by a cluster of re-transmission errors, but I think the timing was happenstance. Both clients happened to be requesting a TLS handshake at nearly the same time.

image

I don't know what these observations mean relative to the data integrity or the integration's response to the events. The application data are encrypted, which means that even if packets of the correct size are transmitted, the data in them could be null. You'd never be able to tell the difference by looking at the encrypted packets.

I hope this helps.

TaylorFinklea commented 2 years ago

I'm also seeing issues but only with the Daynight weather sensor. Hourly seems to be consistently working. Daynight used to work, but now it just never shows any forecast.

I'm running core-2021.12.10

2022-01-18_15-42 2022-01-18_15-43

Sudo-Rob commented 2 years ago

> One possibility is that an empty forecast is being returned (I'm not sure how this squares with your TCP observation).

I wonder if it has any relationship to this stubborn issue: https://github.com/home-assistant/core/issues/41420#

2022-01-20 12:44:48 ERROR (MainThread) [snitun.client.client_peer] Challenge/Response error with SniTun server (0 bytes read on a total of 32 expected bytes)
2022-01-20 12:44:48 ERROR (MainThread) [hass_nabucasa.remote] Connection problem to snitun server

milnergroup commented 2 years ago

I was getting similar results where the temperature was being received whilst forecast was missing. I don't think the platform / architecture makes a difference in this case. I noticed that the pre-populated lat/long parameters were much longer than the weather station's lat/long.

I believe the issue is occurring on the NWS side when it tries to locate your local station code (using the lat/long pre-populated in the parameters during setup). My default lat/long parameters have 15-decimal precision. I reduced it to 2 (which doesn't really matter for NWS purposes).

So far, this works for me! Example:

BEFORE (NOT WORKING) IMG_8531

AFTER (WORKING) IMG_37D365A8EC53-1

cc: @Sudo-Rob , @MatthewFlamm @jnewland

MatthewFlamm commented 2 years ago

The lat/lon is only used in a single API call during integration setup. This in turn is used to gather the observation station and the NWS gridpoints. That information is what is then used to make requests for data. By reducing precision, it is possible that you are getting a different grid point for the forecast.
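For reference, that resolution chain can be reproduced with direct calls to the public API (a rough sketch using requests; the integration itself goes through pynws, and the coordinates and contact address below are placeholders):

```python
# Rough sketch of the lat/lon -> station/gridpoint resolution, using direct
# calls to the public API (the integration goes through pynws; the
# coordinates and contact address are placeholders).
import requests

HEADERS = {"User-Agent": "nws-debug-probe (you@example.com)"}
lat, lon = 38.89, -77.03  # example coordinates

# One-time lookup: which forecast office, grid cell, and station list
# serve this point?
props = requests.get(
    f"https://api.weather.gov/points/{lat},{lon}", headers=HEADERS, timeout=15
).json()["properties"]

print(props["gridId"], props["gridX"], props["gridY"])  # e.g. LWX 100 87
print(props["forecast"])             # daynight forecast URL for that cell
print(props["forecastHourly"])       # hourly forecast URL
print(props["observationStations"])  # endpoint used to pick a METAR station
```

Rounding the coordinates before this lookup is what can land you in an adjacent grid cell, which then changes the forecast URLs used for every subsequent request.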

NWS servers can sometimes start providing good data if anything changes about the request. This could also be what you are observing.

milnergroup commented 2 years ago

Thanks @MatthewFlamm. Definitely makes sense, although I was getting the same METAR station prior to the changes. The example works for me so far. I have 2 METAR stations near my location; I actually used the lat/long of the closest station, which I found on the NWS site, instead of my own default lat/long (see pic below).

You mentioned that it uses the lat/long only once during integration setup. Is it possible that the integration is reloaded on a timed basis and uses the params again? Or after a reboot?

Anyway, maybe this would help someone. I will monitor my results and will post in case it fails again...it's been working for a couple days now.

IMG_18DC6D0B4331-1

Sudo-Rob commented 2 years ago

Curiously, my two systems use identical lat/long entries in HA, yet the grid descriptors are slightly different:

https://api.weather.gov/gridpoints/LWX/100,87/forecast
https://api.weather.gov/gridpoints/LWX/100,87/forecast/hourly

vs.

https://api.weather.gov/gridpoints/LWX/99,88/forecast
https://api.weather.gov/gridpoints/LWX/99,88/forecast/hourly

TaylorFinklea commented 2 years ago

> I was getting similar results where the temperature was being received whilst forecast was missing. I don't think the platform / architecture makes a difference in this case. I noticed that the pre-populated lat/long parameters were much longer than the weather station's lat/long.
>
> I believe the issue is occurring on the NWS side when it tries to locate your local station code (using the lat/long pre-populated in the parameters during setup). My default lat/long parameters have 15-decimal precision. I reduced it to 2 (which doesn't really matter for NWS purposes).
>
> So far, this works for me! Example:
>
> BEFORE (NOT WORKING) IMG_8531
>
> AFTER (WORKING) IMG_37D365A8EC53-1
>
> cc: @Sudo-Rob , @MatthewFlamm @jnewland

Your solution seems to be working for me for some reason, except that I also manually defined the METAR station. I'll see if this fixes the issue in the long run.

Sudo-Rob commented 2 years ago

The integration automatically populates the lat/long values from HA. If that's the case, why is there a difference in the grid points when presented with the same lat/long? Moreover, why should it matter that much for reliability? Am I wrong about the empty entries?

I think there's something else going on, but I really shouldn't speculate. I only know a limited amount about network stuff. We might have to wait for some debug code to be developed.

I'll try limiting the lat/long precision and report back later.

Sudo-Rob commented 2 years ago

I configured the integration with two-decimal-place precision. It puts me about 2,000 feet away from the boundary of a grid box. That's plenty good enough.
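For scale, a quick back-of-the-envelope check of how far two-decimal rounding can move the configured point (assuming a latitude of roughly 39° N, near the LWX grid referenced earlier):

```python
# Back-of-the-envelope check: rounding to two decimal places moves a point
# by at most ~0.005 degrees (latitude assumed ~39 N, near the LWX grid).
import math

LAT_DEG_FT = 69.0 * 5280                        # one degree of latitude in feet
lat_err = 0.005 * LAT_DEG_FT                    # worst-case north/south shift
lon_err = lat_err * math.cos(math.radians(39))  # east/west shrinks with latitude
print(round(lat_err), round(lon_err))           # ~1822 ft and ~1416 ft
```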

It has been over 5 hours and things have been relatively stable. There have been only two '500' (Internal Server Error) responses in the log. That's a pass in my book, as my well-behaved system typically doesn't do much better.

I'll check the log tomorrow and report back here.

Sudo-Rob commented 2 years ago

In the last 16+ hours (i.e. since last midnight) my stable system accumulated 3 errors. My problematic system accumulated 12 errors. Still, it's operating much better than before changing the lat/long precision. At least it's still providing forecast information rather than dropping it for hours on end. Note I didn't have to reduce the lat/long precision on the more-stable system.

I should mention that my problematic system is running my normal set of integrations, whereas the stable system is just a skeleton system I've recently loaded with HA, a few add-ons, and the NWS integration. Both CPUs are lightly loaded (6.1% and 2.7%, respectively) with plenty of RAM overhead for each (as reported by Glances plug-in).

MatthewFlamm commented 2 years ago

I highly suspect that this is related to the discussions like these:

https://github.com/weather-gov/api/discussions/492

There are supposedly upgrades in the works including a pre-production server. This could be patched into the pynws library if you are brave enough :).

Sudo-Rob commented 2 years ago

> I highly suspect that this is related to the discussions like these:
>
> weather-gov/api#492

It'd be wonderful if it eventually turned out to be the case. There's a comment about servers being out of sync. I had thoughts of this, too. I think the servers are cloud based, and are likely proxies. Perhaps there's a cache associated with specific MAC addresses that repeatedly allocates some clients to a bad server.

Edit: To @MatthewFlamm: Somehow, your last two sentences didn't click in my mind until after I replied to what I saw at https://github.com/weather-gov/api/discussions/492. FYI, I wouldn't mind trying it, but don't want you to expend any effort with the library, especially if it involves me tinkering with the innards of a container.

For the others here, there's a recent post listing a URL (preview-api.weather.gov). It changes the caching scheme, and they're looking for feedback, with the caveat that it's not intended for operational use. Initial reports are positive. So, it seems like there's progress that might result in a durable fix.
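For anyone who wants to spot-check the preview host against production, a rough sketch like this works (the gridpoint path is an example, the User-Agent contact is a placeholder, and note the caveat above about operational use):

```python
# Rough sketch for spot-checking the preview host against production
# (gridpoint path is an example; User-Agent contact is a placeholder).
import requests

HEADERS = {"User-Agent": "nws-debug-probe (you@example.com)"}
PATH = "/gridpoints/LWX/100,87/forecast"

for host in ("https://api.weather.gov", "https://preview-api.weather.gov"):
    resp = requests.get(host + PATH, headers=HEADERS, timeout=15)
    props = resp.json().get("properties", {})
    periods = props.get("periods", [])
    print(f"{host}: HTTP {resp.status_code}, {len(periods)} periods, "
          f"updated {props.get('updateTime', 'n/a')}")
```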

I'm still left wondering why two clients with the same lat/long and five decimal place precision can be assigned two different grid points. Moreover, I'm relatively certain there are other problems associated with the 500 errors. I never had any similar issues with DarkSky.

In case you were considering it, I hope you don't retract your help-wanted request for the debug logging just yet.

I can't speak for the others, but I'm good with closing the issue if I'm still getting good data after a couple more days.

Sudo-Rob commented 2 years ago

My HA instance with the NWS integration continues to play nice with the API. The latest post at https://github.com/weather-gov/api/discussions/492#discussioncomment-2053603 is a strong indicator the API server caching is the proximate cause of the missing forecast data.

I think I'm good until another server gets cramps. Thanks to all for the help.