near / mpc


Include node heights metric in the liveliness of a node #680

Closed ChaoticTempest closed 1 month ago

ChaoticTempest commented 1 month ago

We should use a node's block height as a signal of whether the node is alive. If a node is severely behind in block height, we should treat it as offline, since it cannot currently act as a participant to serve signature requests.

An honest node can self-report by returning NotRunning when queried on the /state endpoint. This is how a node can detect that it is behind:

error_margin = 25
is_behind = nodes[self].height < (avg_height(nodes[0..n] - nodes[self]) - error_margin)
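A minimal runnable sketch of the check above, assuming each node already knows its peers' reported heights. The function name, parameters, and the choice to return False when there are no peers are illustrative assumptions, not the actual implementation in the repo.

```python
from statistics import mean

# Margin in blocks, taken from the pseudocode above.
ERROR_MARGIN = 25


def is_behind(own_height: int, peer_heights: list[int],
              error_margin: int = ERROR_MARGIN) -> bool:
    """Return True when our height trails the peers' average by more than the margin.

    `peer_heights` must exclude our own height, mirroring
    `nodes[0..n] - nodes[self]` in the pseudocode.
    """
    if not peer_heights:
        # Assumption: with nothing to compare against, do not claim to be behind.
        return False
    return own_height < mean(peer_heights) - error_margin
```

A node for which this returns True would then answer NotRunning on /state.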

It's a bit harder for non-honest nodes, where our own node has to mark other nodes as offline based on their block heights. This will require a little more thought on how to resolve, and more research into how NEAR nodes deal with peers that are behind. One naive solution is to collect the nodes' heights and run an outlier detection algorithm over the series of heights. Something like this, where we select the heights below the 10th percentile or above the 90th percentile:

heights = heights[(heights < np.quantile(heights, 0.1)) | (heights > np.quantile(heights, 0.9))]

And then mark the corresponding nodes as offline from our own node's viewpoint. This definitely requires more testing, since every node will be running this offline-node assessment and a general consensus has to be reached.
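To make the idea concrete, here is a rough sketch of the percentile-based flagging using only the standard library. The function name and the node-id-to-height mapping are hypothetical, and `statistics.quantiles` with `method="inclusive"` stands in for `np.quantile`.

```python
from statistics import quantiles


def outlier_nodes(heights: dict[str, int]) -> set[str]:
    """Return ids of nodes whose height is below the 10th or above the 90th percentile."""
    if len(heights) < 2:
        # Too few samples to compute meaningful percentiles.
        return set()
    # n=10 yields nine cut points: the first is the 10th percentile,
    # the last is the 90th percentile.
    cuts = quantiles(list(heights.values()), n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return {node for node, height in heights.items() if height < low or height > high}
```

One caveat with this naive approach: unless the extreme heights are tied, the lowest and highest nodes are flagged even when every node is healthy, so it would likely need to be combined with an absolute margin before marking anyone offline.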

Also, if the network is congested, ideally all nodes should be behind by roughly the same amount, so this should not be an issue in that case.

ChaoticTempest commented 1 month ago

Also note that even if a node is offline, it can still be a participant for triple and presignature generation. It just cannot be a part of signature production since signature production requires the indexer to be roughly in line with everyone else to process a signature request.

volovyks commented 1 month ago

@ChaoticTempest does it mean that when N=8, T=5, and we have 6 online nodes, it is better to add everybody as active participants in the triple generation protocol?

ChaoticTempest commented 1 month ago

Not sure what you mean by that example. Currently, we tell whether a node is online by calling /state on that particular node; if it responds, we consider it online. But that breaks down when the node is unable to process signature requests because its indexer is offline. In our liveness tracking, it would still get counted as online because it was still able to serve /state.

What I'm proposing is to include the block height in our /state responses so that we can tell whether or not a node is running; alternatively, nodes can mark themselves as offline by not responding to /state.
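As a rough illustration of that proposal, the liveness check could look something like the sketch below. The `StateResponse` shape, its field names, the status strings, and the `reference_height` parameter are all assumptions made for the example; the real /state payload may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateResponse:
    status: str        # hypothetical encoding, e.g. "Running" or "NotRunning"
    block_height: int  # hypothetical field: the node's latest indexed block height


def is_online(resp: Optional[StateResponse], reference_height: int,
              error_margin: int = 25) -> bool:
    """Decide liveness from a /state response (None means the node did not respond)."""
    if resp is None:
        # No answer on /state: the node is offline or chose not to respond.
        return False
    if resp.status == "NotRunning":
        # The node self-reported that it cannot serve requests.
        return False
    # Otherwise require the height to be within the margin of the reference height.
    return resp.block_height >= reference_height - error_margin
```

Here `reference_height` would be whatever height our own node considers current, for example the average of the other participants' heights.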