threefoldtech / zos

Autonomous operating system
https://threefold.io/host/
Apache License 2.0

Node is not attempting to wake up its friends #2269

Open scottyeager opened 6 months ago

scottyeager commented 6 months ago

A farmer using the farmerbot reported that their nodes did not wake up automatically after the signal from the bot. Upon inspecting Zos logs, I don't see any evidence that the single online node in the farm was detecting the power target changes and sending WoL packets.
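For context, a Wake-on-LAN "magic packet" is a fixed 102-byte payload: six `0xFF` bytes followed by the target MAC address repeated 16 times. The sketch below only builds that payload. It is a generic illustration of the packet format, not the code Zos uses, and `buildMagicPacket` and the MAC address are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// buildMagicPacket returns the standard Wake-on-LAN payload for mac:
// 6 bytes of 0xFF followed by the 6-byte MAC repeated 16 times.
func buildMagicPacket(mac string) ([]byte, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return nil, err
	}
	// net.ParseMAC also accepts longer link-layer addresses; WoL needs 6 bytes.
	if len(hw) != 6 {
		return nil, fmt.Errorf("expected a 6-byte MAC, got %d bytes", len(hw))
	}
	packet := bytes.Repeat([]byte{0xFF}, 6)
	for i := 0; i < 16; i++ {
		packet = append(packet, hw...)
	}
	return packet, nil
}

func main() {
	// Hypothetical MAC address, for illustration only.
	packet, err := buildMagicPacket("00:11:22:33:44:55")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("payload bytes:", len(packet))
}
```

Such a payload is typically sent as a UDP broadcast on the local network, which is why a node on the same LAN has to stay online to wake the others.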

The farm in question is 2405 on mainnet. It's configured with node 4465 always remaining on. The farmer has reported that none of the other nodes are responding to the power target changes.

Here is one example:

At block 12103846 the power target for node 4466 was changed to 'Up'. The timestamp for this block is Fri Apr 19 2024 00:22:06 GMT. No responses to power target changes can be found in the node logs at this time, nor indeed for any other target changes happening for nodes in this farm over the last couple days.

Node 4465 is definitely working, though, and has active communication with tfchain:

[image]

I have asked the farmer to reboot 4465 to see if it helps, but this is of course a fairly serious concern due to the impact on minting if nodes don't respond promptly to power target changes.

muhamadazmy commented 6 months ago

From the log analysis, I saw a few interesting things.

As a side effect, all RMB messages were invalidated because of the timestamp.

I am not sure if any of that is related, but the time skew is definitely a problem.

muhamadazmy commented 6 months ago

There should be an error in the logs, but it seems that on failure to receive the event we wait 10 seconds before retrying and, unfortunately, we never log the failure.

We will have to fix that missing log and wait until this happens again. The reboot most likely fixed the time issue. (Note: ntpd gives up if the skew is too big.)
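The missing log could be addressed along these lines. This is a minimal sketch, not the actual Zos event loop: `retryWithLog` and `flakySource` are hypothetical names, the attempt limit exists only so the example terminates, and the demo uses a 10 ms delay in place of the 10-second wait described above.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// retryWithLog calls fn until it succeeds or attempts are exhausted.
// Every failure is logged before waiting for delay, so a broken
// subscription leaves a trace. It returns the number of failed attempts.
func retryWithLog(fn func() error, attempts int, delay time.Duration) (int, error) {
	failures := 0
	var lastErr error
	for i := 1; i <= attempts; i++ {
		lastErr = fn()
		if lastErr == nil {
			return failures, nil
		}
		failures++
		log.Printf("receiving events failed (attempt %d/%d): %v", i, attempts, lastErr)
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return failures, lastErr
}

// flakySource is a hypothetical stand-in for the event subscription:
// it fails the first failFirst calls and succeeds afterwards.
func flakySource(failFirst int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failFirst {
			return errors.New("connection lost")
		}
		return nil
	}
}

func main() {
	// 10ms stands in for the 10-second wait mentioned above.
	failures, err := retryWithLog(flakySource(2), 5, 10*time.Millisecond)
	fmt.Println("failures:", failures, "error:", err)
}
```

With the log line in place, a node that silently stops receiving power target events would at least show repeated failure entries around the time of the missed change.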

Side note: I am wondering if we can also have some code that monitors the time skew and simply restarts ntpd if it is too big. Restarting ntpd forces it to resync even if the skew is huge.
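A rough sketch of such a monitor follows. Everything here is an assumption for illustration: the 5-minute threshold, the names `checkClock` and `offsetReference`, and the injected `referenceNow` and `restart` functions. The thread does not say which time source would serve as the reference or how ntpd is restarted on Zos, so both are left as placeholders.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaxSkew is an assumed threshold; the thread does not specify one.
const defaultMaxSkew = 5 * time.Minute

// skewExceeds reports whether the absolute difference between the local
// clock and the reference is larger than maxSkew.
func skewExceeds(local, reference time.Time, maxSkew time.Duration) bool {
	skew := local.Sub(reference)
	if skew < 0 {
		skew = -skew
	}
	return skew > maxSkew
}

// checkClock compares the local clock with a trusted reference and calls
// restart when the skew is too large. The boolean result reports whether
// a restart was requested.
func checkClock(referenceNow func() (time.Time, error), restart func() error, maxSkew time.Duration) (bool, error) {
	ref, err := referenceNow()
	if err != nil {
		return false, fmt.Errorf("reading reference time: %w", err)
	}
	if !skewExceeds(time.Now(), ref, maxSkew) {
		return false, nil
	}
	if err := restart(); err != nil {
		return true, fmt.Errorf("restarting time daemon: %w", err)
	}
	return true, nil
}

// offsetReference is a stand-in reference clock that differs from the
// local clock by offsetSeconds. A real monitor would query a trusted source.
func offsetReference(offsetSeconds int) func() (time.Time, error) {
	return func() (time.Time, error) {
		return time.Now().Add(time.Duration(offsetSeconds) * time.Second), nil
	}
}

func main() {
	// Placeholder: the real restart mechanism is not described in the thread.
	restart := func() error {
		fmt.Println("would restart ntpd here")
		return nil
	}
	restarted, err := checkClock(offsetReference(3600), restart, defaultMaxSkew)
	fmt.Println("restart requested:", restarted, "error:", err)
}
```

Injecting the reference clock and the restart action keeps the skew decision testable without touching the system clock or the service manager.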

rawdaGastan commented 6 months ago

scottyeager commented 6 months ago

Thanks for the investigations here. So far the farmer did not report any further issue since rebooting the node. I'll keep an eye out for any other examples of this behavior.