Closed joshk closed 6 months ago
I'm not sure that I want this for wired network devices that are quicker at connecting. We've also had cases in production where devices struggle to get online (for reasons) and we've figured out how to send an on-connect script to help them get past that issue. I'm pretty sure a delay like this would have forced a site visit unless we got lucky. Don't get me wrong, this is pretty rare, but you get scarred bad when it happens.
If the goal is to not waste CPU cycles trying to connect when you know it's going to fail, I think there are a couple of options. VintageNet can tell you if there's an internet connection or not. VintageNet won't work when running on the host, though. The second idea is to call :inet.getifaddrs/0 and see if there are network interfaces with IP addresses.
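A rough sketch of the second idea, in case it helps the discussion. The module and function names (`NetCheck`, `has_address?/1`, `network_ready?/0`) are made up for illustration and are not part of NervesHubLink or VintageNet. It assumes that "has an IP address" means an interface that is up, is not the loopback, and reports at least one `:addr` entry. Note that an IPv6 link-local address would also count here, so this is weaker than a real internet check.

```elixir
defmodule NetCheck do
  @moduledoc "Hypothetical helper for illustration; not part of NervesHubLink."

  # `ifaddrs` has the shape returned by :inet.getifaddrs/0:
  # a list of {interface_name, options} tuples, where options may
  # contain :flags and zero or more :addr entries.
  def has_address?(ifaddrs) do
    Enum.any?(ifaddrs, fn {_name, opts} ->
      flags = Keyword.get(opts, :flags, [])
      addrs = Keyword.get_values(opts, :addr)

      # Up, not the loopback interface, and at least one address assigned.
      :up in flags and :loopback not in flags and addrs != []
    end)
  end

  # Queries the live interfaces; treats any error as "not ready".
  def network_ready? do
    case :inet.getifaddrs() do
      {:ok, ifaddrs} -> has_address?(ifaddrs)
      {:error, _reason} -> false
    end
  end
end
```

Keeping `has_address?/1` separate from the `:inet` call means the decision logic can be exercised with hand-written interface lists on the host.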
I think there's a certain elegance to keeping things dumb and just polling in ignorance. I know there are errors, but it seems more robust. Maybe there's a way to not log anything until NHL has been disconnected for over a minute (or something).
@jjcarstens expressed similar concerns.
The thing to remember is that the retry backoff after a disconnect, e.g. due to a broken network connection, will climb like so: 1, 2, 4, 8, 16, 32, 60. This PR is only about delaying the first connection try on boot, and it's configurable.
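To make the sequence concrete, here is one way it could be produced. This is only a sketch: the thread gives the values, not how they are computed, so the doubling-with-a-cap rule and the `Backoff` module name are assumptions.

```elixir
defmodule Backoff do
  @moduledoc "Hypothetical illustration of the 1, 2, 4, 8, 16, 32, 60 sequence."

  @max_delay 60

  # `attempt` is zero-based: 0 is the first retry after a disconnect.
  # The exponent is clamped so large attempt counts don't build huge integers.
  def delay(attempt) when is_integer(attempt) and attempt >= 0 do
    min(Integer.pow(2, min(attempt, 6)), @max_delay)
  end
end
```

For example, `Enum.map(0..7, &Backoff.delay/1)` gives `[1, 2, 4, 8, 16, 32, 60, 60]`.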
What I saw when onboarding some new users to NervesHub is that the device (a Raspberry Pi) took about 10-15 seconds for network connectivity to be available. During that time there was a slew of log messages and connectivity issues, and due to the retry backoff, we could ssh into the device before Link would try to connect again.
I think a better approach would likely be to wait for :inet.getaddr(~c"nerveshub.host.url", :inet) to return {:ok, _} before trying to connect, polling every second or two.
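A minimal sketch of that polling loop, not the implementation in #181. The `DNSWait` module, its option names (`:interval_ms`, `:max_attempts`, `:resolver`), and the injectable resolver are all assumptions made so the loop can be tried without a network. `:inet.getaddr/2` takes the host as a charlist plus an address family, so the string is converted first.

```elixir
defmodule DNSWait do
  @moduledoc "Hypothetical helper for illustration; not part of NervesHubLink."

  # Blocks until `host` resolves, polling every `:interval_ms` milliseconds.
  # Returns :ok on success, or {:error, :timeout} once `:max_attempts`
  # lookups have failed (the default, :infinity, never gives up).
  def wait_for_dns(host, opts \\ []) do
    interval = Keyword.get(opts, :interval_ms, 1_000)
    attempts = Keyword.get(opts, :max_attempts, :infinity)
    resolver = Keyword.get(opts, :resolver, &:inet.getaddr/2)

    do_wait(String.to_charlist(host), interval, resolver, attempts)
  end

  defp do_wait(_host, _interval, _resolver, 0), do: {:error, :timeout}

  defp do_wait(host, interval, resolver, attempts) do
    case resolver.(host, :inet) do
      {:ok, _ip} ->
        :ok

      {:error, _reason} ->
        # Lookup failed (e.g. :nxdomain); wait, then try again.
        Process.sleep(interval)
        do_wait(host, interval, resolver, decrement(attempts))
    end
  end

  defp decrement(:infinity), do: :infinity
  defp decrement(n), do: n - 1
end
```

Something like `DNSWait.wait_for_dns("nerveshub.host.url", interval_ms: 2_000)` would then gate the first connect attempt, leaving the regular reconnect backoff untouched.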
You're right about the backoff algorithm. I like your idea of waiting until DNS starts working to start the connect, and I saw #181.
This makes me want to explore having Slipstream back off differently depending on what error it gets back from connecting. For example, :nxdomain wouldn't back off like what you've done here, and that way the reconnect timeouts work the same regardless of whether it's right after boot or not. I think this needs some more thought, and it involves modifying Slipstream, so nothing I want to do now since you addressed the most important case, imho.
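Purely as a thought experiment, the error-dependent policy could look something like the sketch below. Nothing here is Slipstream's API: the `ErrorAwareBackoff` module, the fixed one-second DNS retry, and the idea of returning the next attempt index are all assumptions. The only things taken from the thread are the 1 to 60 sequence and the special treatment of `:nxdomain`.

```elixir
defmodule ErrorAwareBackoff do
  @moduledoc "Hypothetical policy sketch; Slipstream does not provide this."

  @backoff [1, 2, 4, 8, 16, 32, 60]
  @dns_retry 1

  # Returns {delay, next_attempt}.
  # DNS not resolving yet: retry at a flat interval and do not advance
  # the backoff, so a slow boot does not push out later reconnects.
  def next(:nxdomain, attempt), do: {@dns_retry, attempt}

  # Any other error: walk the usual backoff list, staying at the last
  # value once the list is exhausted.
  def next(_reason, attempt) do
    {Enum.at(@backoff, attempt, List.last(@backoff)), attempt + 1}
  end
end
```

With this shape, a run of `:nxdomain` errors right after boot leaves the attempt counter at 0, so the first real connection failure still starts from the shortest delay.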
Thanks for the feedback @fhunleth.
I'm going to close this PR as it's superseded by #181.
It usually takes 10-20 seconds for the network to become available after the device is started. This PR adds a 15-second delay before the first connection attempt; the delay can be configured to whatever is preferred.