threefoldtech / minting_v3

minting code for grid v3, using v3 tokenomics
Apache License 2.0

Nodes can receive a wake up violation when they are actually shutting down #29

Open scottyeager opened 2 months ago

scottyeager commented 2 months ago

I've observed a rare case in which a node receives a wake up violation for failing to boot within 30 minutes when the node is in fact shutting down.

Here's the sequence of events:

  1. Node boots due to farmerbot. Upon boot it sends an uptime report resulting in both power_managed and power_managed_boot set to None
  2. In the same block as that uptime event, however, there is also a power target change to Up for this node. Maybe this shouldn't happen in normal circumstances, but it can and actually has. Since the power state for this node is still Down at this point, power_managed_boot will be set
  3. The node sets its power state to Up only after its first uptime report, typically in the next block
  4. There is a power target change to Down for this node more than 30 minutes after the target change to Up
  5. When the node shuts down, it first sets its power state to Down and thus both power_managed and power_managed_boot are not None
  6. Next, the node sends a final uptime report before shutting down (usually in the next block after the power state change). At this point, minting interprets this uptime report as a wake up event and assigns the node a violation
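The sequence above can be replayed against a small model of the bookkeeping. This is only a sketch: the `NodeTracker` type, the event names, the timestamps, and the exact handling rules are assumptions inferred from the steps listed here, not the actual minting code. In particular, it assumes an uptime report counts as a wake up only when both power_managed and power_managed_boot are set.

```rust
// Hypothetical, simplified model of the state tracking described in this issue.
// It is NOT the minting_v3 implementation; the rules are inferred from the steps above.

const BOOT_DEADLINE_SECS: u64 = 30 * 60;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Power {
    Up,
    Down,
}

#[derive(Debug)]
enum Event {
    UptimeReport,
    PowerTargetChanged(Power),
    PowerStateChanged(Power),
}

#[derive(Debug)]
struct NodeTracker {
    state: Power,
    power_managed: Option<u64>,      // assumed: time the node reported power state Down
    power_managed_boot: Option<u64>, // assumed: time a boot was requested
    violations: u32,
}

impl NodeTracker {
    fn apply(&mut self, now: u64, event: &Event) {
        match event {
            // A request to power on only starts the boot timer while the node is Down.
            Event::PowerTargetChanged(Power::Up) => {
                if self.state == Power::Down {
                    self.power_managed_boot = Some(now);
                }
            }
            Event::PowerTargetChanged(Power::Down) => {}
            Event::PowerStateChanged(new_state) => {
                self.state = *new_state;
                if *new_state == Power::Down {
                    self.power_managed = Some(now);
                }
            }
            // With both fields set, the report is read as a wake up and checked
            // against the 30 minute deadline. Either way, both fields are cleared.
            Event::UptimeReport => {
                if let (Some(_), Some(requested)) = (self.power_managed, self.power_managed_boot) {
                    if now.saturating_sub(requested) > BOOT_DEADLINE_SECS {
                        self.violations += 1;
                    }
                }
                self.power_managed = None;
                self.power_managed_boot = None;
            }
        }
    }
}

// Replays steps 1 to 6 with made-up timestamps in seconds.
fn replay_issue_sequence() -> NodeTracker {
    // Starting point: the node is powered down and a boot was requested at t = 0.
    let mut node = NodeTracker {
        state: Power::Down,
        power_managed: Some(0),
        power_managed_boot: Some(0),
        violations: 0,
    };
    let events = [
        (600, Event::UptimeReport),                     // step 1: boot report clears both fields
        (600, Event::PowerTargetChanged(Power::Up)),    // step 2: same block, state still Down
        (606, Event::PowerStateChanged(Power::Up)),     // step 3: next block
        (3000, Event::PowerTargetChanged(Power::Down)), // step 4: more than 30 minutes later
        (3006, Event::PowerStateChanged(Power::Down)),  // step 5: both fields are now set
        (3012, Event::UptimeReport),                    // step 6: final report before shutdown
    ];
    for (now, event) in &events {
        node.apply(*now, event);
    }
    node
}

fn main() {
    let node = replay_issue_sequence();
    println!("violations: {}", node.violations);
}
```

Under these assumptions the replay ends with one violation, even though the node booted 10 minutes after the original request. The stale timestamp left over from step 2 is what trips the deadline check in step 6.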

If we accept that it's legitimate to send multiple power target changes until a node wakes up, then this definitely shouldn't result in a violation.

Perhaps the solution would be to reorder the sequence of operations in Zos, but I guess that it was implemented this way for a reason, and of course rolling out changes to Zos is slow.

LeeSmet commented 2 months ago

So if I understand correctly: the farmer bot requests a boot by switching the target from down to up. While the node is apparently booting, the bot switches the target back to down, then back to up to request a second boot.

The behavior is correct, since the node did not finish its expected boot sequence for the first request (it must both send an uptime report and switch its power state, the latter only happens if its target is up). When the farmer bot was initially implemented, it was agreed that for verification purposes, a node MUST answer every power on request by fully booting. This is also what underpins the random wakeups.

As a side note, there is no specific ordering of calls in zos atm, and calls from multiple tasks which happen at the same time are inherently racy.

scottyeager commented 1 month ago

> So if I understand correctly: the farmer bot requests a boot by switching the target from down to up. While the node is apparently booting, the bot switches the target back to down, then back to up to request a second boot.

The power target isn't switched back to down until after the node has fully booted and set its power state to up. tfchain allows creating additional events that set the power target to up, even when the target is already up. So in this case the bot is repeatedly attempting to wake the node before the node finishes booting.

> As a side note, there is no specific ordering of calls in zos atm, and calls from multiple tasks which happen at the same time are inherently racy.

That's good to know. In my observations, nodes tend to set their power state to up in the block immediately following their first uptime report after waking up.

LeeSmet commented 1 month ago

In that case this is a bug in tfchain. An event should be emitted to notify of a change in state, and the state isn't changed here. On top of that, the event is explicitly named PowerTargetChanged. I guess it should be easy to update the tfchain code to prevent events from being emitted if there is no actual change. Aside from that, if the intent is that something observes the current state, that something should just query the latest chain state.
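A minimal sketch of the "emit only on an actual change" idea follows. It is plain Rust, not tfchain pallet code: the `PowerTargets` store, the `set_target` method, and the `ChainEvent` type are hypothetical stand-ins. The sketch also assumes that a node with no recorded target always produces an event on its first request.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for chain storage and events; not the tfchain API.
#[derive(Clone, Copy, PartialEq, Debug)]
enum PowerTarget {
    Up,
    Down,
}

#[derive(Debug, PartialEq)]
enum ChainEvent {
    PowerTargetChanged { node_id: u32, target: PowerTarget },
}

#[derive(Default)]
struct PowerTargets {
    targets: HashMap<u32, PowerTarget>,
    events: Vec<ChainEvent>,
}

impl PowerTargets {
    // Stores the requested target and emits an event only if the stored value changes.
    // Returns true when an event was emitted.
    fn set_target(&mut self, node_id: u32, target: PowerTarget) -> bool {
        if self.targets.get(&node_id) == Some(&target) {
            // The target is already what was requested: nothing changed, so no event.
            return false;
        }
        self.targets.insert(node_id, target);
        self.events.push(ChainEvent::PowerTargetChanged { node_id, target });
        true
    }
}

fn main() {
    let mut chain = PowerTargets::default();
    chain.set_target(7, PowerTarget::Up); // first request: emits an event
    chain.set_target(7, PowerTarget::Up); // repeat request: no event
    chain.set_target(7, PowerTarget::Down); // real change: emits an event
    println!("events emitted: {}", chain.events.len());
}
```

With a guard like this, a repeated request to set the target to up would not produce a second PowerTargetChanged event, so an observer replaying events would not see it.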