Open seanshahkarami opened 5 years ago
I would suggest, at the very least, patching this so it uses the RW partition for the adjtime file. Something like:
hwclock -w --adjfile=/wagglerw/adjtime
or, alternatively, placing a symlink at /etc/adjtime pointing to /wagglerw/adjtime.
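A minimal sketch of the symlink option, assuming /wagglerw is mounted read-write and the root filesystem has been temporarily unlocked so that /etc can be modified. The function name `link_adjtime` and its arguments are hypothetical, used here only for illustration. It copies any existing adjtime data to the RW partition before replacing the original file with a link.

```shell
#!/bin/sh
# Hypothetical helper: point an adjtime path at a file on the RW partition.
#   $1 = target file on the RW partition (e.g. /wagglerw/adjtime)
#   $2 = link location (e.g. /etc/adjtime)
link_adjtime() {
    target="$1"
    link="$2"

    # If the link location is still a regular file and the target does not
    # exist yet, copy it over so existing drift data is not lost.
    if [ -f "$link" ] && [ ! -L "$link" ] && [ ! -e "$target" ]; then
        cp "$link" "$target" || return 1
    fi

    # Replace whatever is at the link location with a symlink to the target.
    rm -f "$link" || return 1
    ln -s "$target" "$link"
}

# Usage (as root, with the root filesystem unlocked):
#   link_adjtime /wagglerw/adjtime /etc/adjtime
```

With the link in place, a plain `hwclock -w` should write its adjustment data to the RW partition without needing the `--adjfile` option on every call.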
It seems that the epoch service isn't working correctly. This may be due to an unexpected interaction between loss of connectivity to beehive and the read-only (RO) filesystem.
I detected that the time had rolled back on some nodes after seeing a negative uptime for a handful of devices. When I checked the date, it was stuck one day in the past until I unlocked the filesystem and restarted the epoch service. Unlocking the filesystem alone caused the incorrect date to be set (possibly a cached result from beehive?).
This needs to be corrected, as it impacts a number of key items in the system (SSL/TLS, data, health check uptimes, etc.).
Here are the original logs I saw after detecting a node whose time had been rolled back:
We should debug the corner case this appears in. It seems that restarting the epoch service corrected the issue in some cases. Maybe it occurs when we have faulty connectivity?
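One way to catch the rollback while debugging, rather than waiting for a negative uptime to show up, might be to keep the last known timestamp on the RW partition and compare it against the current clock. The sketch below is only an illustration: the function name `check_clock_rollback` and the state file location are hypothetical and not part of the existing epoch service.

```shell
#!/bin/sh
# Hypothetical check: compare the current time with the last recorded time.
#   $1 = state file on the RW partition (e.g. somewhere under /wagglerw)
#   $2 = current time in seconds since the epoch (defaults to `date +%s`)
# Returns 0 and records the time if the clock has not moved backwards,
# returns 1 and leaves the state file untouched if a rollback is detected.
check_clock_rollback() {
    state="$1"
    now="${2:-$(date +%s)}"

    # Read the last recorded time; treat a missing or malformed file as 0.
    last=$(cat "$state" 2>/dev/null)
    case "$last" in
        ''|*[!0-9]*) last=0 ;;
    esac

    if [ "$now" -lt "$last" ]; then
        echo "clock rollback detected: now=$now last=$last" >&2
        return 1
    fi

    echo "$now" > "$state"
    return 0
}
```

Running this periodically, and logging connectivity to beehive at the same time, could help confirm whether the rollback coincides with faulty connectivity.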