Open MaxMaximus opened 4 years ago
We started working on an improvement in the Windows installer to make the storagenode
and storagenode-updater
services dependent on the dnscache
service.
https://review.dev.storj.io/c/storj/storj/+/819
Meanwhile, you should be able to do this manually using the following commands:
sc config storagenode depend= dnscache
sc config storagenode-updater depend= dnscache
This should be a better workaround than the restarts.
Thanks! I have tested this option with services dependency and looks like it solved the problem with DNS errors during system boot. Although I left the autorestart option in case of any other problems may occur later, but for now it no longer triggered as storagenode starts properly from first try.
Looks like the issue is back after one of the node software updates (from 1.3.3 i think). It did not cause any problems so far thanks to my "plan B" in place (leaving simple auto restart of service if it is not running). But routinely checking node logs i have noticed that first start after computer reboot is failing again due to DNS errors. And only second try (which i set with 2 minute delay) succeed:
...............................
2020-06-16T04:31:54.406+0300 INFO Stop/Shutdown request received.
---------- short computer power off to clean dust from coolers --------------
2020-06-16T04:44:16.314+0300 INFO Configuration loaded {"Location": "C:\\Program Files\\Storj\\Storage Node\\config.yaml"}
2020-06-16T04:44:16.340+0300 INFO Invalid configuration file key {"Key": "check-interval"}
2020-06-16T04:44:16.403+0300 INFO tracing disabled
2020-06-16T04:44:16.416+0300 INFO Operator email {"Address": "*****"}
2020-06-16T04:44:16.416+0300 INFO Operator wallet {"Address": "*****"}
2020-06-16T04:44:16.899+0300 INFO db.migration Database Version {"version": 36}
2020-06-16T04:44:19.501+0300 WARN trust Failed to fetch URLs from source; used cache {"source": "https://tardigrade.io/trusted-satellites", "error": "HTTP source: Get https://tardigrade.io/trusted-satellites: dial tcp: lookup tardigrade.io: no such host", "errorVerbose": "HTTP source: Get https://tardigrade.io/trusted-satellites: dial tcp: lookup tardigrade.io: no such host\n\tstorj.io/storj/storagenode/trust.(*HTTPSource).FetchEntries:63\n\tstorj.io/storj/storagenode/trust.(*List).fetchEntries:90\n\tstorj.io/storj/storagenode/trust.(*List).FetchURLs:49\n\tstorj.io/storj/storagenode/trust.(*Pool).fetchURLs:240\n\tstorj.io/storj/storagenode/trust.(*Pool).Refresh:177\n\tstorj.io/storj/storagenode.(*Peer).Run:664\n\tmain.cmdRun:209\n\tstorj.io/private/process.cleanup.func1.4:344\n\tstorj.io/private/process.cleanup.func1:362\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:86\n\tstorj.io/private/process.Exec:63\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.501+0300 INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2020-06-16T04:44:19.501+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "rpccompat: dial tcp: lookup saltlake.tardigrade.io: no such host", "errorVerbose": "rpccompat: dial tcp: lookup saltlake.tardigrade.io: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "rpccompat: dial tcp: lookup asia-east-1.tardigrade.io: no such host", "errorVerbose": "rpccompat: dial tcp: lookup asia-east-1.tardigrade.io: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "rpccompat: dial tcp: lookup europe-north-1.tardigrade.io: no such host", "errorVerbose": "rpccompat: dial tcp: lookup europe-north-1.tardigrade.io: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "rpccompat: dial tcp: lookup us-central-1.tardigrade.io: no such host", "errorVerbose": "rpccompat: dial tcp: lookup us-central-1.tardigrade.io: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "rpccompat: dial tcp: lookup europe-west-1.tardigrade.io: no such host", "errorVerbose": "rpccompat: dial tcp: lookup europe-west-1.tardigrade.io: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 ERROR preflight:localtime unable to get satellite system time {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "rpccompat: dial tcp: lookup satellite.stefan-benten.de: no such host", "errorVerbose": "rpccompat: dial tcp: lookup satellite.stefan-benten.de: no such host\n\tstorj.io/common/rpc.Dialer.dialTransport:264\n\tstorj.io/common/rpc.Dialer.dial:241\n\tstorj.io/common/rpc.Dialer.DialAddressID:151\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).getSatelliteTime:110\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check.func1:67\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2020-06-16T04:44:19.516+0300 FATAL Failed preflight check. {"error": "system clock is out of sync: system clock is out of sync with all trusted satellites", "errorVerbose": "system clock is out of sync: system clock is out of sync with all trusted satellites\n\tstorj.io/storj/storagenode/preflight.(*LocalTime).Check:96\n\tstorj.io/storj/storagenode.(*Peer).Run:668\n\tmain.cmdRun:209\n\tstorj.io/private/process.cleanup.func1.4:344\n\tstorj.io/private/process.cleanup.func1:362\n\tgithub.com/spf13/cobra.(*Command).execute:840\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:945\n\tgithub.com/spf13/cobra.(*Command).Execute:885\n\tstorj.io/private/process.ExecWithCustomConfig:86\n\tstorj.io/private/process.Exec:63\n\tmain.(*service).Execute.func1:66\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
-------------- autorestart kicks in after 2 mins as it detect service is not running ----------
2020-06-16T04:46:19.927+0300 INFO Configuration loaded {"Location": "C:\\Program Files\\Storj\\Storage Node\\config.yaml"}
2020-06-16T04:46:19.951+0300 INFO Invalid configuration file key {"Key": "check-interval"}
2020-06-16T04:46:19.952+0300 INFO tracing disabled
2020-06-16T04:46:19.968+0300 INFO Operator email {"Address": "*****"}
2020-06-16T04:46:19.969+0300 INFO Operator wallet {"Address": "*****"}
2020-06-16T04:46:20.778+0300 INFO db.migration Database Version {"version": 36}
2020-06-16T04:46:21.378+0300 INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2020-06-16T04:46:22.310+0300 INFO preflight:localtime local system clock is in sync with trusted satellites' system clock.
2020-06-16T04:46:22.310+0300 INFO bandwidth Performing bandwidth usage rollups
2020-06-16T04:46:22.310+0300 INFO Node ***** started
2020-06-16T04:46:22.310+0300 INFO Public server started on [::]:28967
2020-06-16T04:46:22.310+0300 INFO Private server started on 127.0.0.1:7778
2020-06-16T04:46:22.310+0300 INFO trust Scheduling next refresh {"after": "4h11m34.882766653s"}
2020-06-16T04:46:51.374+0300 INFO piecestore download started {"Piece ID": "XP7RB3VJC6SAPAMSTSMDKA35TJQNV5IZE4PIAXQYKRELZWTYIQGA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET"}
...................................
P.S. dnscache service is still in the dependency list of both storagenode and storagenode-updater services. I chekced it. But shomehow it does not help any more, while it worked fine with versions prior 1.3.3.
@MaxMaximus can you confirm if this is still an issue?
It's also need to change the start type to delayed-auto
:
https://forum.storj.io/t/node-offline-after-latest-windows-update/8266/4?u=alexey
@MaxMaximus can you confirm if this is still an issue?
I have not seen such problems for a long time (more than a year at least), since I configured the launch of the Storj services with a delay (the launch type is "delayed-auto") and added "Server" in dependency list (in addition to DNS Cache). I have now tried it on one of the machines where i seen it before (running the latest version 1.99.2) switch to the regular automatic startup(without delay) and restart the computer, there were also no problems with DNS when starting the Storj service.
The "DNS cache" AND "Server" services are listed in dependencies of Storj service as before.
I don't have time right now for more detailed testing and an attempts to reproduce the problems, but at least for me it is resolved at the moment.
Hello.
After introducing preflight clock checks node startup fails after a single DNS error. If DNS is unavailable for any reasons at node startup and storagenode fails at resolving host names of satellites into IPs it does not retry and just terminate loading. Resulting in node going offline for a long period of time until owner manual intervention.
I see this situation almost at each of my computers boot/reboot. Looks like storagenode service loads faster and perform first DNS requests while system DNS resolver is not ready yet. And node fail startup. It it is not a big problem by itself as window will try to start services after failure 2 times by default (3 tries total, with 1 minute delay each) and DNS will be available at time of the second try. But current installation package disable this option. So after initial failure service is NOT restarted and node remains offline.
I saw this situation today on my node: there was short power outage (~10 min), UPS trigger correct system shutdown and system restart after power back online, all programs resume running successfully except for Storj due to single DNS error ==> failed preflight check ==> node remained offline for about 12 hours (until i have noticed it and restarted service manually).
I altered storagenode service setting back to windows default (2 additional tries to start after failure) and it fixed problem. But just for me - it still will be a problem for all default setups.
Here example log (i reproduced error few times - just simple correct reboot is enough to trigger it):
If not second try to start service (which a i have added manually) node will be offline after each / almost each system boot.
P.S. I am not sure but DNS delay at OS boot can be caused by a lot of autorun programs on my computers. So it probable will affect only minor % of all nodes (slow or heavy loaded by other programs during boot). But still it is not a normal program behavior when just few dozen seconds of DNS inaccessibility cause the fatal error. Also it probably not isolated to Window only. Can affect other OS too, but I did not test on other OS.