threefoldtech / tfgrid-sdk-go

Nodes that fail to shutdown are assigned "Standby" status #362

Open scottyeager opened 1 year ago

scottyeager commented 1 year ago

While troubleshooting some issues with nodes responding to power target changes from the farmerbot, I noticed that nodes that are actually still online because they failed to respond to the power-off request show as "Standby" in the Dashboard.

The reason is the logic here in grid-proxy, since poweringOff is selected for standby.

I suggest labeling nodes that haven't yet set their power state to down as "Up".
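
The difference between the current behavior and the suggestion can be sketched in Go roughly as follows. This is a simplified illustration, not the actual grid-proxy code: the `Power` type and both function names are hypothetical, it assumes that "poweringOff" means state "Up" with target "Down", and it ignores unset power fields and the time of the node's last report.

```go
package main

import "fmt"

// Power is a hypothetical, simplified view of a node's power fields.
type Power struct {
	State  string // what the node itself reports: "Up" or "Down"
	Target string // what the farmerbot requested: "Up" or "Down"
}

// statusCurrent approximates the behavior described in this issue: only a
// node with both state and target "Up" is "up", so a node that is still
// running but has target "Down" (poweringOff) is shown as "standby".
func statusCurrent(p Power) string {
	if p.State == "Up" && p.Target == "Up" {
		return "up"
	}
	return "standby"
}

// statusProposed follows the suggestion: a node stays "up" until it has
// actually set its own power state to "Down".
func statusProposed(p Power) string {
	if p.State == "Down" {
		return "standby"
	}
	return "up"
}

func main() {
	// A node that was told to power off but never did.
	stuck := Power{State: "Up", Target: "Down"}
	fmt.Println(statusCurrent(stuck), statusProposed(stuck))
}
```

With the proposed rule, a node stuck in poweringOff stays visible as "up", so the farmer can see that it never actually shut down.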

xmonader commented 6 months ago

@Omarabdul3ziz have a look at this

Omarabdul3ziz commented 6 months ago

The decision to consider poweringOff as standby was intentional, to avoid scenarios where a node goes into standby shortly after being used. If a node fails to power off, it may retry, or this may indicate that there is an issue with it. The up status only applies to valid nodes that are ready for use. That is what I think. What do you think?

Omarabdul3ziz commented 6 months ago

I also checked from the farmerbot side. It sets the target to Down when trying to power off the node, and a node is only considered up when both state and target are Up, which is what we also do on the proxy. The decision was made for the reason I explained above.

scottyeager commented 5 months ago

When nodes are functioning properly, the current approach and given reasoning make sense, sure. The issue with this approach though is that it masks potential issues with nodes:

  1. The node was supposed to shut down but didn't
  2. The node shut down but didn't set its power state to "Down" (in this case, the proxy will eventually return "down" but only if the node doesn't come back online within 24 hours)
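
Case 2 can be illustrated with a small Go sketch. The 24-hour window comes from the description above; the function name, the `sinceLastReport` parameter, and the exact comparison are hypothetical and only approximate how the proxy might decide.

```go
package main

import (
	"fmt"
	"time"
)

// standbyWindow is the 24-hour period mentioned above. The constant name
// and its use here are assumptions made for illustration.
const standbyWindow = 24 * time.Hour

// standbyOrDown decides what to show for a node that has stopped reporting
// while its power target is "Down": it stays "standby" until the window
// has passed without a report, and only then becomes "down".
func standbyOrDown(sinceLastReport time.Duration) string {
	if sinceLastReport > standbyWindow {
		return "down"
	}
	return "standby"
}

func main() {
	fmt.Println(standbyOrDown(2 * time.Hour))  // still inside the window
	fmt.Println(standbyOrDown(30 * time.Hour)) // window has passed
}
```

Under this assumption, a node that shut down without updating its state looks like a healthy standby node for up to a day.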

From the perspective of someone deploying on the Grid, it doesn't matter much if these nodes are shown as standby or down, as long as they don't get selected for a deployment. But from the farmer's perspective, it makes the node look like it's functioning normally when in fact it is not.

Properly functioning nodes will generally spend only a matter of seconds in the poweringOff state. It's much more likely that someone would try to deploy to the node during the wake up period when the node is shown as up. Since the types of errors I describe above are rare, it's also rather unlikely that someone would try to deploy to a node in the error state.

So to me we are masking important errors in Zos that need to be addressed with a very marginal benefit for users who might try to deploy on these nodes in very rare cases. Furthermore, the deployment interface itself could take care of not allowing users to deploy to nodes with a power target of "Down".
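
As a rough sketch of that last point, a deployment interface could filter candidates on the power target in addition to the status. The `Node` type, its fields, and the `deployable` function are hypothetical and are not taken from the SDK or the Dashboard.

```go
package main

import "fmt"

// Node is a hypothetical, minimal view of a node as a deployment UI might
// see it.
type Node struct {
	ID          int
	Status      string // "up", "standby" or "down"
	PowerTarget string // "Up" or "Down"
}

// deployable keeps only nodes that are shown as "up" and that have not
// been asked to power off, so a node stuck in poweringOff is still
// excluded even if its status is "up".
func deployable(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Status == "up" && n.PowerTarget != "Down" {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{ID: 1, Status: "up", PowerTarget: "Up"},
		{ID: 2, Status: "up", PowerTarget: "Down"}, // failed to shut down
		{ID: 3, Status: "standby", PowerTarget: "Down"},
	}
	for _, n := range deployable(nodes) {
		fmt.Println("can deploy to node", n.ID)
	}
}
```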