[BUG REPORT] Pocket nodes getting stuck during sync

bvanderlugt commented 2 years ago

Describe the bug The node runner community is reporting stuck nodes. Also, on restart the db is corrupt and will not sync

To Reproduce Not sure how to repro. Node runners are reporting the chain stopped syncing around block 71479, also 68389 . Expected behavior Pocket nodes should stay syncd

Screenshots

Snippets from Pocket Network #node-runners chat:

yup, Ive got a node stuck at 71479 and another stuck at 68389, for the latter, when I tried restarting pocket, it crashed and corrupted the blockchain data

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.  Expected 7376BF29ECCFC41FDEEDB1CDD20A14CE9523C98FAEA3F0ECCB71720681CABB47, got E12EC5CF1308FA37148A893
B46DEE30A44AB7B81DD3D2ABA91515E149F92637C

On nodes that got stuck I see lots of these errors:

I[2022-09-23|11:27:06.579] Could not check tx                           module=mempool tx=67FA200C97C64C49B1F74F431D77CB11F6089E26E2363707526FF0215291BDAE err="mempool is full: number of txs 9000 (max: 9000), total txs bytes 7736607 (max: 1073741824)"

This is what I see before the service stops.

Sep 23 10:33:16 b2.dabble.run pocket[378613]: panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.  Expected 88BCA70707E668595338EE7F87489618C01E128BA24582B9CA9F933D7667C9D6, got 682E10813F8B9F00B6F9AF10EBEF5D3CB4303459514C4BBD09D54D9D914A8F22

Operating System or Platform: Please indicate the platform(s) where the bug is happening.

Additional context Community thread is here and sprinkled throughout the #node-runners channel September 22nd and 23rd.

This task is tracked internally with tickets T-1441, T-14354

Olshansk commented 2 years ago

@oten91 Do you have any insight on:

Why this happens?
How to avoid it in the past?
Workarounds in the meantime?

oten91 commented 2 years ago

1- Hey , Will try to summarize what I observed yesterday and some anecdotal information.

Around the time of the block 71479 a large node runner suddenly got some nodes offline taking around 9% of the voting power mid round (around the enterprecommit phase).

This last part seem to have created a problem for a large number of nodes returning this error :

panic: enterPrecommit: +2/3 prevoted for an invalid block: wrong Block.Header.AppHash.  Expected 88BCA70707E668595338EE7F87489618C01E128BA24582B9CA9F933D7667C9D6, got 682E10813F8B9F00B6F9AF10EBEF5D3CB4303459514C4BBD09D54D9D914A8F22

The error points out to a Majority voting for an invalid block , the assumption is that because the nodes that were down may have restarted and casted a different vote before the commit, there would be different version of the vote from those nodes being broadcasted (p2p) at that point in time in the network. Creating a Split "PreCommit state", Affected nodes caught in the different version of the state were not able to continue due to tendermint correctness properties hence the " +2/3 prevoted for an invalid block" error.
At this point and without a good rollback tool available, the data dir is corrupted and to restore the functioning of the node a backup is needed.

2- Avoiding is complex subject, some things happening on the network have to do more with misconfiguration and not fully understanding the shared cost of actions by individual node runners.

The current reality is that the existence of LeanPOKT and the careless use of it, makes it even harder for us to put barriers to avoid this kind of scenarios.

In the Past this type of issues were isolated to a small set of nodes as each node was its own process, In our new reality, as people are stacking nodes that share the same data (p2p/evidence/state) we are far more exposed to this kind of issues. Education is a good measure but Harsher on-chain penalties may work better on the long run.

Whats is worrying is that these events happened in the past but the scale and frequency at which these events are happening now is making the matter more serious. The previous report of this happening was at height 71273.

3- The tested and true "Restore from a backup" is the only recommended path.

Also the stuck nodes having the mempool full is expected as their p2p layer is not fully aware that consensus has stopped due to the panic recovery handling.

@oten91 Do you have any insight on:

Why this happens?

How to avoid it in the past?

Workarounds in the meantime?

tingiris commented 2 years ago

@oten91 I assume by "The current reality is that the existence of LeanPOKT and the careless use of it, makes it even harder for us to put barriers to avoid this kind of scenarios." - you mean that some node runners are using LeanPOKT to run a lot of validators on the same machine. Is that correct? Also, is it possible to identify the validators that suddenly went offline?

Olshansk commented 2 years ago

@oten91 Thank you for the detailed answer and explanation.

I have a few general questions to increase my personal understanding as someone who has spent less time working with the cosmo's implementation of tendermint BFT.

At this point and without a good rollback tool available, the data dir is corrupted and to restore the functioning of the node a backup is needed.

Q1: Does this mean the data dir gets irreversibly corrupted by uncommitted blocks?

Q2: Is this a result of the fact that our Tendermint fork is 2 years old or is it still the case today in the Cosmos SDK today?

Creating a Split "PreCommit state", Affected nodes caught in the different version of the state were not able to continue due to tendermint correctness properties hence the " +2/3 prevoted for an invalid block" error.

I read through Tendermint BFT Consensus Algorithm and am trying to understand what happened here.

There were sufficient votes being broadcast throughout the network, and +2/3 (enough to maintain liveness) voted in the same way, so why did that not becoming the "canonical chain"?

Q3: My understanding is that it is this point/reason (see the image below). Is this correct or am I missing something?

Education is a good measure but Harsher on-chain penalties may work better on the long run.

+1 to education.

However, personally, I believe that slashing is the simple, low-hanging, "crypto" approach to this. It'll obviously be up to the DAO to decide.

It's worth noting that this problem expands outside of just Pocket and LeanPOKT, across the entire crypto industry. This post summarizes it well IMO:

You can handle multiple validators with a single validator client. The only limit is the cpu and memory of the machine the software is running on

An attacker could maliciously control your client to incur penalizations, and much much worse, could manipulate your clients in a way that causes your validators to perform an illegal action, and thus be slashed. Penalization is generally a light thing, but slashing is a hardcore thing

@tingiris

you mean that some node runners are using LeanPOKT to run a lot of validators on the same machine. Is that correct?

Yes.

Also, is it possible to identify the validators that suddenly went offline?

I believe (haven't checked) it should be available via on-chain data if the blockchain continue to make progress while the validator was down, but otherwise, it would require timed inspection of the mempool when the error was happening.

Q4: Deferring to @oten91 if I missed something here?

We would probably need to increase slash_fraction_downtime to disincentivize aggregation of validators on a single machine but it's an "all-in" solution.

For reference, here's another reference from the slashing module:

oten91 commented 2 years ago

Answering here :

Q1: Does this mean the data dir gets irreversibly corrupted by uncommitted blocks?

Sadly yes, the block is not processed by the node but is stored. Modifying the expected apphash, after that happens a rollback is needed for the node to "forget" about the last round information as he was not expecting a block without all the required signatures (+2/3)

Q2: Is this a result of the fact that our Tendermint fork is 2 years old or is it still the case today in the Cosmos SDK today?

In this case for the stuck nodes the majority was wrong and this would have created a fork in other types of consensus , with tendermint "fork-resistant" qualities that should also be present in latest version, the result probably would be the same. One thing I believe is that this may be better handled on the newer versions, as they incorporated a "staging" state that they write to before writing to the actual one for this scenarios ,so it may not leave the db in a bad state even if it halts.

Q3: My understanding is that it is this point/reason (see the image below). Is this correct or am I missing something?

For this particular scenario and the perspective of the nodes that got stuck, the correct one is the last bullet point :

The block proposed was valid, and +2/3 of pre-votes were received for enough nodes, but +2/3 of the pre-commits for the proposed block were not received for enough validator nodes.

nodiesBlade commented 2 years ago

@oten91 Any chance this is a related issue? https://github.com/tendermint/tendermint/issues/5617

pokt-network / pocket-core

[BUG REPORT] Pocket nodes getting stuck during sync #1478