What will happen if 4 of the 7 elected consensus nodes are offline?

lock9 commented 3 years ago

I'm studying the voting mechanism for Neo 3, and some issues came to my mind. I didn't see any validation to verify if the candidates are actually online/healthy. What will happen if the elected nodes are not online? If there is no consensus, how can we 'roll back' election results? If there are no consensus nodes, it is not possible to change the elected nodes.

Is this true? What am I missing?

Edit: If this is in fact true, then we need to add some transition phase to ensure that the elected CNs are producing blocks.

shargon commented 3 years ago

Maybe we should revert to the original candidates if the view number goes above some X.

shargon commented 3 years ago

Please check https://github.com/neo-project/neo/pull/2205

lock9 commented 3 years ago

Hmm, not sure if that will solve the issue. I also think 2 days is too much. If the network is offline for 2 days, people will hunt us with stones and sticks. Maybe we need committee members as backup nodes. I like your solution because it is pretty straightforward, but at the same time, I'm not confident that it will solve the issue 'forever'. Maybe it will; we need more feedback.

lock9 commented 3 years ago

@shargon can't new CN fork the network if they are not in sync? Can't they do this on purpose, like an attack? If so, it looks like a very profitable attack, meaning we need to add more security measures. If you think they can't fork it, I think that your solution may work.

ffantasy commented 3 years ago

I think the network will be stuck until enough nodes get online.

roman-khimov commented 3 years ago

> I didn't see any validation to verify if the candidates are actually online/healthy.

It's really hard to do that. What is an online/healthy candidate? We're on a P2P network. Even if we imagine some ping-address mechanism, it remains P2P, and the chain can only see the results of this interaction via an oracle.

> What will happen if the elected nodes are not online?

Of course, if 3 out of 7 nodes go down the chain will just stop, and we've seen that happen. At the same time, we have quite a number of blocks on Neo 2 mainnet/testnet, so it seems those cases were handled somehow. And I'd split this question into two cases:

1. we have a proper node elected, but it doesn't work
2. we have an intentionally non-existent node elected

I think we've only seen the first case happen, and it rarely affects three nodes at the same time. If it happens, we assume node maintainers are responsible people and ask them to fix the node. It usually works fine, and with proper distribution of nodes between various parties it's hard to expect 3 out of 7 to be unreachable/unresponsive simultaneously. Still, there is some probability of that.
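The arithmetic behind the 3-out-of-7 figure can be sketched as follows. This is a minimal illustration, assuming the usual dBFT rule that n consensus nodes tolerate f = (n - 1) / 3 faulty nodes (integer division) and need n - f signatures per block. The function names are hypothetical and are not taken from the Neo codebase.

```python
def max_faulty(n: int) -> int:
    """Largest number of faulty/offline nodes tolerated (assumed dBFT rule)."""
    return (n - 1) // 3


def required_signatures(n: int) -> int:
    """Signatures needed to accept a block: all nodes minus the tolerated faults."""
    return n - max_faulty(n)


def can_produce_blocks(n: int, offline: int) -> bool:
    """True if the nodes still online can reach the signature threshold."""
    return n - offline >= required_signatures(n)


if __name__ == "__main__":
    # With 7 consensus nodes, 5 signatures are required.
    for offline in range(5):
        print(f"{offline} offline -> blocks produced: {can_produce_blocks(7, offline)}")
```

Under this rule, 7 nodes keep working with 2 offline, but 3 offline already leaves only 4 signers against a threshold of 5.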

The second case is more interesting in that it could be an attack on the network, with bad guys voting for a random key with no node behind it and moving it into the list of CNs. This attack technically requires a lot of NEO (outweighing the votes of some other three proper nodes), and it's really hard to imagine any holder of substantial amounts of NEO doing that (ruining the network and making NEO worthless). But we can of course consider this scenario too.

In both cases we have some nodes not working and we can't get them back online. PR #2205 was dismissed already, so it's not a solution, let's look at #2226 now. I think there are several problems there:

- It gives more power to the standby validators list. The chain will last for many years; some of the standby keys may be lost, some may be stolen, and chances are we just couldn't get 5 proper nodes when there is a need to activate this mode. Depending on how these initial keys are distributed, it can also be quite time-consuming to set up a proper number of CNs with these old keys. And it makes holders of these keys very special from the governance point of view: they can always hard fork the network, bypassing everything.
- It requires configuring every node on the network. Changes are not reflected on-chain, and reconfiguring all nodes takes time to propagate these changes.

So I'm not sure #2226 is the best solution. What we can do first is try minimizing the chance of this happening:

- Make registering as a candidate cost more. Registering is very cheap now; anyone can do it and then try pumping some votes for this key. Raising the barrier to entry can help.
- Make registering as a candidate require approval from the committee, thus preventing random keys from being candidates. This is a more controversial change, but we don't want random people in the committee (and especially in the CN list) anyway.
- Add a mechanism to forcibly deactivate misbehaving nodes (with a committee-signed transaction, of course), thereby replacing them with other candidates. Instead of voting, the committee might just kick the dead CN off the list. There could be some rule like: if the CN maintainer can't fix their node in 48 hours, the committee does this.
- Some self-regulation logic (a monitoring module on committee nodes) could do that (removing misbehaving nodes) even quicker.

And then, if we're still in this situation, I think that instead of going back to the standby list it's much more appropriate to use the current committee again. We can add the possibility for the committee to sign a block (probably with some candidate-deactivating transactions inside), so that blocks could be signed either by CNs (normally) or by the committee (if CNs can't do that). We trust the committee anyway, it is the current committee as of when this happens, it shouldn't be a problem to collect the proper number of signatures, and it doesn't require any configuration. The mechanism can be optional, as it only makes sense when the committee is at least twice as large as the number of validators.
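The "CNs normally, committee as fallback" acceptance rule could look roughly like the sketch below. It rests on several assumptions that this thread does not specify: the CN threshold is the dBFT n - (n - 1) / 3 rule, the committee threshold is a simple majority, and all function names are hypothetical.

```python
from typing import Set


def bft_threshold(n: int) -> int:
    """Signatures required from n consensus nodes (assumed dBFT rule)."""
    return n - (n - 1) // 3


def majority_threshold(n: int) -> int:
    """Signatures required from n committee members (assumption: simple majority)."""
    return n - (n - 1) // 2


def block_is_acceptable(signers: Set[str], validators: Set[str], committee: Set[str]) -> bool:
    """Accept a block signed by enough CNs or, failing that, by enough committee members."""
    # Normal path: enough consensus nodes signed the block.
    if len(signers & validators) >= bft_threshold(len(validators)):
        return True
    # The fallback is only enabled when the committee is at least
    # twice as large as the validator set.
    if len(committee) < 2 * len(validators):
        return False
    # Fallback path: enough members of the current committee signed the block.
    return len(signers & committee) >= majority_threshold(len(committee))
```

With 7 validators inside a 21-member committee, this sketch accepts a block signed by 5 validators, or by any 11 committee members when fewer than 5 validators are available.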

In any event, this can be done post-preview5 or even later.

erikzhang commented 3 years ago

We voted for 21 nodes to become the committee, and the 7 nodes with the highest votes became consensus nodes. There is a process for a node to gradually become a consensus node from 0 votes. If it is offline, maybe we will realize it and vote it out when it becomes a committee member.
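The selection described above can be sketched as follows. This is only an illustration: the tie-break by public key and the function name `elect` are assumptions, and the real rule lives in the NEO native contract.

```python
from typing import Dict, List, Tuple

COMMITTEE_SIZE = 21   # committee members, as stated above
VALIDATOR_COUNT = 7   # consensus nodes taken from the top of the committee


def elect(votes: Dict[str, int],
          committee_size: int = COMMITTEE_SIZE,
          validator_count: int = VALIDATOR_COUNT) -> Tuple[List[str], List[str]]:
    """Rank candidates by votes and return (committee, consensus nodes)."""
    # Highest vote count first; ties broken by key (an assumption).
    ranked = sorted(votes, key=lambda key: (-votes[key], key))
    committee = ranked[:committee_size]
    validators = committee[:validator_count]
    return committee, validators


if __name__ == "__main__":
    sample_votes = {f"key{i:02d}": 1000 - i for i in range(30)}
    committee, validators = elect(sample_votes)
    print(len(committee), validators)
```

Note that the ranking uses vote counts only. Nothing in it checks whether a node is actually running behind a key, which is the gap this issue is about.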

erikzhang commented 3 years ago

> The second case is more interesting in that it could be an attack on the network with bad guys voting for random key with no node behind it and moving it to the list of CNs.

It is impossible because users can vote for candidates only. And to be a candidate, his public key is verified.

> Make registering as a candidate cost more.

Agree.

roman-khimov commented 3 years ago

> If it is offline, maybe we will realize it and vote it out when it becomes a committee member.

Proactive monitoring and voting is probably still the best way to prevent this. The question, though, is how we monitor a non-CN node. But if we have confirmation that committee nodes are alive, then it becomes very easy to quickly replace (vote out) non-functional CNs, simply because we'll always know that there are other nodes that can immediately replace them.

> And to be a candidate, his public key is verified.

Right, but this verification only means that the key in question existed when the registration transaction was created. It doesn't mean that there is a node on the network with this key, and it doesn't prevent throwing the key away after registration. This is very theoretical, but one can still register a key, never run a node, and organize some "vote for X" campaign to gather enough votes.

EdgeDLT commented 3 years ago

Also not a fan of #2226 in its current form... who knows what state the default nodes are in. Maybe they are still trusted, maybe they are lost keys, maybe they were sold on the black market a long time ago and are now in the hands of a single malicious actor. We turn a liveness fail into a potential safety fail.

Moving back in the direction of the lightning voting proposal... Why not use a PoW-based fallback to facilitate voting and carrying any other critical messages until the CN error is resolved? Committee nodes can be your miners, keep block times short (1 minute target?).

It would allow us to keep moving forward until dBFT is restored, also serves as a check to see which CNs or candidates are online and pulling their weight. I imagine it could have other uses in the future too, e.g. provide entropy for PRNG (#2019).

neo-project / neo

What will happen if 4 of the 7 elected consensus nodes are offline? #2203