Open scottyeager opened 6 months ago
From analyzing the logs, I saw a few interesting things:
There is (was) a clock skew on this node for around 30 minutes!
There were also some network interruptions (but not for very long)
As a side effect, all RMB messages were invalidated because of their timestamps.
I am not sure if any of that is related, but the time skew is definitely a problem.
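To illustrate why a skewed clock can invalidate every message, here is a minimal sketch of a timestamp freshness check. It is not the actual RMB code: `is_message_fresh` and the 5 minute `MAX_DRIFT` window are assumptions made up for this example, and the 30 minute offset is only an illustrative value.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance; the real RMB validity window is not stated in this thread.
MAX_DRIFT = timedelta(minutes=5)


def is_message_fresh(sent_at, receiver_now, max_drift=MAX_DRIFT):
    """Accept a message only if its timestamp is close to the receiver's clock."""
    return abs(receiver_now - sent_at) <= max_drift


if __name__ == "__main__":
    sent_at = datetime(2024, 4, 19, 0, 22, 6, tzinfo=timezone.utc)

    # Receiver clock is roughly correct: the message is accepted.
    print(is_message_fresh(sent_at, sent_at + timedelta(seconds=2)))

    # Receiver clock is 30 minutes off: the same message is rejected.
    print(is_message_fresh(sent_at, sent_at + timedelta(minutes=30)))
```

With a check like this, any offset larger than the window rejects every incoming message, no matter how quickly it was delivered.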
There should be an error in the logs. On failure to receive the event, it seems we wait 10 seconds before retrying, but unfortunately we didn't log the failure.
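The retry behavior described above, with the missing log line added, could look roughly like the sketch below. This is not the Zos implementation: `receive_with_retry`, the `receive` callable, and `max_attempts` are hypothetical names for illustration. Only the 10 second delay comes from the description above.

```python
import logging
import time

log = logging.getLogger("events")


def receive_with_retry(receive, retry_delay=10.0, max_attempts=3, sleep=time.sleep):
    """Call receive() until it succeeds, logging every failure before retrying."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return receive()
        except Exception as err:  # broad on purpose: every failure must be logged
            last_error = err
            log.error(
                "failed to receive event (attempt %d/%d): %s",
                attempt, max_attempts, err,
            )
            if attempt < max_attempts:
                sleep(retry_delay)
    raise RuntimeError("giving up on receiving events") from last_error
```

The `sleep` parameter is there only so the delay can be stubbed out in tests. The important part is that the error is logged before the wait, so a silent retry loop still leaves a trace in the node logs.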
We will have to fix that missing log and wait until this happens again. The reboot probably fixed the time issue. (Note: ntpd gives up if the skew is too big.)
Side note: I am wondering if we can also have some code to monitor the time skew and, if it's too big, just restart ntpd. Restarting ntpd forces it to resync even if the skew is huge.
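A rough sketch of what such a monitor might look like follows. Everything here is an assumption for illustration: the 60 second `MAX_SKEW_SECONDS` threshold, the idea of comparing against a trusted reference time (for example, the timestamp of a recent chain block), and the `RESTART_CMD` placeholder, since the way ntpd is supervised on the node is not described in this thread.

```python
import subprocess

# Hypothetical threshold; pick whatever skew the node can safely tolerate.
MAX_SKEW_SECONDS = 60.0

# Placeholder command; replace with however ntpd is actually supervised on the node.
RESTART_CMD = ("systemctl", "restart", "ntpd")


def restart_ntpd():
    """Restart ntpd so that it resyncs from scratch."""
    subprocess.run(RESTART_CMD, check=True)


def check_and_resync(local_now, reference_now, restart=restart_ntpd,
                     max_skew=MAX_SKEW_SECONDS):
    """Restart ntpd if the local clock differs too much from the reference.

    Both times are Unix timestamps in seconds. Returns True if a restart
    was triggered, False otherwise.
    """
    skew = local_now - reference_now
    if abs(skew) > max_skew:
        restart()
        return True
    return False
```

In practice this would be called periodically, e.g. `check_and_resync(time.time(), reference_timestamp)`, where `reference_timestamp` comes from a source that does not depend on the local clock.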
Thanks for the investigation here. So far the farmer has not reported any further issues since rebooting the node. I'll keep an eye out for any other examples of this behavior.
A farmer using the farmerbot reported that their nodes did not wake up automatically after the signal from the bot. Upon inspecting Zos logs, I don't see any evidence that the single online node in the farm was detecting the power target changes and sending WoL packets.
The farm in question is 2405 on mainnet. It's configured with node 4465 always remaining on. The farmer has reported that none of the other nodes are responding to the power target changes.
Here is one example:
At block 12103846 the power target for node 4466 was changed to 'Up'. The timestamp for this block is Fri Apr 19 2024 00:22:06 GMT. No responses to power target changes can be found in the node logs at this time, nor indeed for any other target changes for nodes in this farm over the last couple of days.
Node 4465 is definitely working though and has active communication with tfchain:
I have asked the farmer to reboot 4465 to see if it helps, but this is of course a fairly serious concern due to the impact on minting if nodes don't respond promptly to power target changes.