waggle-sensor / nodecontroller-v1

Node Controller (NC) Software

Waggle Epoch Service #20

Open seanshahkarami opened 5 years ago

seanshahkarami commented 5 years ago

It seems that the epoch service isn't working correctly. It may be due to an unexpected interaction between loss of connectivity to beehive and the read-only filesystem (RO FS).

I detected that the time had rolled back on some nodes after seeing a negative uptime for a handful of devices. When I checked the date, it was stuck one day in the past until I unlocked the filesystem and restarted the epoch service. Unlocking the filesystem alone caused the incorrect date to be set (possibly a cached result from beehive?).

This needs to be corrected as it impacts a number of key items in the system (SSL/TLS, data, health check uptimes, etc.).

Here are the original logs I saw after detecting a node whose time had been rolled back:

Mar 19 18:26:24 001e06117a38SD waggle_epoch.sh[462]: date: invalid date ‘2019/00/00 00:00:00’
Mar 19 18:26:24 001e06117a38SD waggle_epoch.sh[462]: Wagman epoch: 0
Mar 19 18:26:24 001e06117a38SD waggle_epoch.sh[462]: System epoch: 1553019984
Mar 19 18:26:25 001e06117a38SD waggle_epoch.sh[462]: Wagman build epoch: 1535751871
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: Guest Node epoch: 1553019986
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: Setting the system epoch to 1553019986...
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: Tue Mar 19 18:26:26 UTC 2019
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: Syncing the Node Controller hardware clock with the system date/time...
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: hwclock: cannot open /etc/adjtime: Read-only file system
Mar 19 18:26:26 001e06117a38SD waggle_epoch.sh[462]: Failed to set time. Retrying in 60 seconds...

We should debug the corner case this appears in. It seems that restarting the epoch service corrected the issue in some cases. Maybe it occurs when we have faulty connectivity?
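One thing that might help while debugging: the logs above show a Wagman epoch of 0 next to a known Wagman build epoch, so a time source could be sanity-checked before it is trusted. The following is only a sketch of that idea, not what waggle_epoch.sh currently does. The function name `is_plausible_epoch` and the use of the build epoch as a lower bound are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of waggle_epoch.sh): accept a candidate epoch
# only if it is a plain non-negative integer and not older than a floor value.

# Build epoch taken from the log excerpt in this issue; used here as the floor.
BUILD_EPOCH=1535751871

is_plausible_epoch() {
    candidate="$1"
    floor="$2"
    # Reject empty strings and anything containing a non-digit character.
    case "$candidate" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Reject values older than the floor (for example, an epoch of 0).
    [ "$candidate" -ge "$floor" ]
}

# Try the two epochs seen in the log: the Wagman epoch and the guest node epoch.
for epoch in 0 1553019986; do
    if is_plausible_epoch "$epoch" "$BUILD_EPOCH"; then
        echo "$epoch: plausible"
    else
        echo "$epoch: rejected"
    fi
done
```

A check like this would not explain the rollback by itself, but it could rule out the case where an obviously bad value from one source gets applied.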

seanshahkarami commented 5 years ago

I would suggest, at the very least, patching this so it uses the RW partition for the adjtime file. Something like:

hwclock -w --adjfile=/wagglerw/adjtime

or, alternatively, placing a symlink at /etc/adjtime to /wagglerw/adjtime.
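A rough sketch of the first option, assuming /wagglerw is mounted read-write on the node. The helper name `pick_adjfile` and the fallback to /etc/adjtime are assumptions for illustration; the block only prints the command it would run, so it is safe to try without touching the hardware clock.

```shell
#!/bin/sh
# Hypothetical helper: choose an adjtime path whose directory is writable.
# /wagglerw/adjtime is the path suggested in this issue; the fallback to
# /etc/adjtime (hwclock's default) is an assumption.
ADJFILE_RW="${ADJFILE_RW:-/wagglerw/adjtime}"

pick_adjfile() {
    preferred="$1"
    dir=$(dirname "$preferred")
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "$preferred"
    else
        echo "/etc/adjtime"
    fi
}

# Dry run: show the hwclock invocation instead of executing it.
echo "would run: hwclock -w --adjfile=$(pick_adjfile "$ADJFILE_RW")"
```

For the symlink option, something like `ln -s /wagglerw/adjtime /etc/adjtime` would have to be run once while the root filesystem is unlocked, since /etc itself is read-only.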