nerves-hub / nerves_hub_link

Connect devices to NervesHub via a Phoenix channel
https://hex.pm/packages/nerves_hub_link
Apache License 2.0
36 stars 18 forks source link

check time synchronization before socket connection #195

Closed joshk closed 1 month ago

joshk commented 5 months ago

Similar to the recent addition of checking if the network is available before connecting the socket, this also checks if the time is synchronized.

Additionally, you can opt-out by using:

config :nerves_hub_link,
  wait_for_time_sync: false,

which is useful for tests or when running your code on your host machine (eg. Mac) where NTP isn't available and NervesTime will always report that time isn't synchronized.

fixes #191

fhunleth commented 5 months ago

I feel like there should be a warning. If you were to put this into production and you don't have control over where the device is installed, it might prevent the device from connecting. NTP is sometimes blocked by routers. I don't know if it's a blanket UDP ban or whether some admin really doesn't like NTP, but we see it happen in our installs. NTP synchronization also takes minutes, so if something causes the device to reboot soon after boot, it might also make it trickier to recover the device.

joshk commented 5 months ago

That is a good point, your time can be in sync without NTP.

I do wonder if having a HTTP based NTP adapter for NervesTime might be a slight better way to go for the case you mention above. As a devices clock drifts, it will cause issues with SSL.

For this PR, would the warning be during socket connection?

lawik commented 5 months ago

Good call.

So the problem seems to have been that the client tries to connect with a clock that is out of whack and fails to do the crypto dance. I get this on every restart as noted in #191.

It isn't harmful to fail that way until it succeeds but it is kind of weird.

Would time NervesTime.synchronized? give false even if an RTC has kept it accurate? Then synchronization is the wrong thing to look at.

lawik commented 4 months ago

I wonder if it is better to just try and figure out if the time is out of whack based on the request rather than not making it before time sync. We can handle the outcome if we know that it is the problem.

joshk commented 4 months ago

what is the error you get when the time isn't synchronized yet?

lawik commented 4 months ago

403, see https://github.com/nerves-hub/nerves_hub_link/issues/191

joshk commented 4 months ago

Thats right. 403 doesn't really tell us anything, that is a generic error, or as Mozilla tells me...

The HTTP 403 Forbidden response status code indicates that the server understands the request but refuses to authorize it.

lawik commented 4 months ago

Issue has more details. It fails to establish the websocket and I bekieve the 403 is only triggered by an invalid signature. The signature is invalid because it signs a couple of hours off before NervesTime syncs up.

we might not want to wait for NervesTime sync. But if we don't we should have a warning that this could be due to time kerfuffery so people don't scratch their heads for two minutes and then change the config for their perfectly correct firmware.

lawik commented 4 months ago

The server could pass its perceived time as part of the 403 in a header or something. Would let us check for this case. Maybe do some backoff?

fhunleth commented 4 months ago

I'm really having trouble with this one. I don't think that I'd ever enable it or recommend anyone enable it. The reason is that it assumes failure without trying. Maybe the time is wrong, but maybe it will never be fixed because NTP is down or blocked. If the devices never try, then the only option is to replace them in the field or hope that there's another backend connection that doesn't do this. On the other hand, if the devices try, then this can be addressed on the backend by deploying a cert with a better validity range or tolerating a larger time window.

I agree with the sentiment of just fixing the error message. How about if the SSL connection fails, then check if the time has been synchronized. Same logic, but be optimistic and try rather than assume failure if unsynchronized.

joshk commented 4 months ago

Thanks for the feedback @fhunleth

I'll look into the changes you mentioned.

One thing to note is that these 403 errors due to time synchronization push the Socket retry backoff to hit its max limit. This is a similar issue we saw several months ago related to the Shared Secrets work, and likely means there is some work to be done there (like resetting the backoff after a successful connection)

lawik commented 1 month ago

We've done work to mitigate this in another issue where we were using System.system_time instead of System.os_time and the system time would take a long time to update while OS time was more reasonable.

I think we should change the warning but this thread can be closed.