Checkpointing consensus failures

If a checkpointed reorg arrives while the current chain is relatively long and not checkpointed, some really strange behaviour follows, and nodes fail to switch to the checkpointed chain.

Last night the testnet was mining an alt chain separate from the chain that all the service nodes (SNs) were on. (This was caused by the difficulty bug, which doesn't seem to be solved!) So the situation was:

The non-SN chain was being mined without any checkpoints.
The chain with all the SNs was stalled at 86185 for several hours (nothing was mining on it).
There was a checkpoint at 86184 that was not on the non-SN chain being mined.

The SNs continually communicated this checkpoint to nodes, but the nodes apparently ignored it because:

Quorum state for height: 86184 was not cached in daemon!
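
One way to read that message is that the daemon keeps quorum state only for a recent window of heights, so a checkpoint for a height far behind the current tip cannot be verified and gets dropped. The sketch below models that reading in Python; it is not the daemon's C++ code. `QuorumCache`, `window`, and `accept_checkpoint` are hypothetical names, and the window size of 100 is an assumed value chosen purely for illustration.

```python
class QuorumCache:
    """Toy model: quorum state is kept only for the most recent `window` heights."""

    def __init__(self, window):
        self.window = window
        self.states = {}  # height -> quorum state (opaque in this model)

    def add(self, height, state):
        self.states[height] = state
        # Evict every height that has fallen out of the window.
        for old in [h for h in self.states if h <= height - self.window]:
            del self.states[old]

    def get(self, height):
        return self.states.get(height)


def accept_checkpoint(cache, checkpoint_height):
    """Return True only if quorum state for the checkpoint's height is still cached."""
    if cache.get(checkpoint_height) is None:
        print(f"Quorum state for height: {checkpoint_height} was not cached in daemon!")
        return False
    return True


cache = QuorumCache(window=100)        # assumed window size
for height in range(86000, 86352):     # node has followed the non-SN chain up to 86351
    cache.add(height, state=f"quorum@{height}")

print(accept_checkpoint(cache, 86184))  # False: 86184 has been evicted
print(accept_checkpoint(cache, 86300))  # True: still inside the window
```

Under this model, the further a node runs ahead on an un-checkpointed chain, the more checkpoints it becomes unable to evaluate, which matches the behaviour described here.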

At this point I started mining on the SN chain and restarted a node that was on the non-SN chain. It immediately reorg'ed from the non-SN chain to the SN chain:

###### REORGANIZE on height: 86184 of 86351, checkpoint is found in alternative chain on height 86184

and, because this was a large chain, it had to do a full SN rescan:

Recalculating service nodes list, scanning blockchain from height 3

then the weird stuff happens:

2019-07-13 13:36:07.107 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1953 ----- BLOCK ADDED AS ALTERNATIVE ON HEIGHT 86184
id: <d1ca09ae53d91ce38032febcaee00b88b121f4d10b8ed09a84239677901835dd>
PoW:    <563b2ac5fa2d2c3416a52aaa47e343456e8081a047c6e55ab6c2a6cddd4b0000>
difficulty: 91108
2019-07-13 13:36:07.118 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941     ###### REORGANIZE on height: 86184 of 86184 with cum_difficulty 8149790482
 alternative blockchain size: 2 with cum_difficulty 8149881320
2019-07-13 13:36:07.172 [P2P1]  ERROR   blockchain  src/cryptonote_core/blockchain.cpp:1898 insertion of new alternative block returned as it already exists
2019-07-13 13:36:07.172 [P2P1]  ERROR   blockchain  src/cryptonote_core/blockchain.cpp:1124 Failed to push ex-main chain blocks to alternative chain

Those errors do not look innocuous. The node then proceeds to reorg itself back to the non-SN chain, for some reason doing a one-block reorg for each block:

2019-07-13 13:36:07.191 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86184, new blockchain size: 86186
2019-07-13 13:36:07.202 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941 ###### REORGANIZE on height: 86186 of 86185 with cum_difficulty 8149881320
 alternative blockchain size: 1 with cum_difficulty 8149969482
2019-07-13 13:36:07.234 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86186, new blockchain size: 86187
2019-07-13 13:36:07.246 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941 ###### REORGANIZE on height: 86187 of 86186 with cum_difficulty 8149969482
 alternative blockchain size: 1 with cum_difficulty 8150053687
2019-07-13 13:36:07.278 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86187, new blockchain size: 86188
2019-07-13 13:36:07.290 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941 ###### REORGANIZE on height: 86188 of 86187 with cum_difficulty 8150053687
 alternative blockchain size: 1 with cum_difficulty 8150137620
2019-07-13 13:36:07.321 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86188, new blockchain size: 86189
2019-07-13 13:36:07.333 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941 ###### REORGANIZE on height: 86189 of 86188 with cum_difficulty 8150137620
 alternative blockchain size: 1 with cum_difficulty 8150217391

which continues all the way up to:

2019-07-13 13:36:14.737 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86350, new blockchain size: 86351
2019-07-13 13:36:14.749 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1941 ###### REORGANIZE on height: 86351 of 86350 with cum_difficulty 8157923766
 alternative blockchain size: 1 with cum_difficulty 8157944872
2019-07-13 13:36:14.783 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86351, new blockchain size: 86352
2019-07-13 13:36:14.803 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1144 REORGANIZE SUCCESS! on height: 86184, new blockchain size: 86352
2019-07-13 13:36:14.803 [P2P1]  WARNING checkpoints src/checkpoints/checkpoints.cpp:58  CHECKPOINT FAILED FOR HEIGHT 86184. EXPECTED HASH <7792aafa97f765db3a4f5a275f9fe1f81afb5d04358444c03adcf9a88a4c3bd2>GIVEN HASH: <d1ca09ae53d91ce38032febcaee00b88b121f4d10b8ed09a84239677901835dd>
2019-07-13 13:36:14.803 [P2P1]  ERROR   blockchain  src/cryptonote_core/blockchain.cpp:4367 Local blockchain failed to pass a checkpoint in: update_checkpoint, rolling back!
2019-07-13 13:36:15.652 [P2P1]  INFO    global  src/cryptonote_core/service_node_list.cpp:1928  Service node data loaded successfully, height: 86352
2019-07-13 13:36:15.652 [P2P1]  INFO    global  src/cryptonote_core/service_node_list.cpp:1929  25 nodes and 32 rollback events loaded.
2019-07-13 13:36:15.652 [P2P1]  WARNING service_nodes   src/cryptonote_core/service_node_list.cpp:85    Recalculating service nodes list, scanning blockchain from height 3
...
2019-07-13 13:36:18.695 [P2P1]  WARNING service_nodes   src/cryptonote_core/service_node_list.cpp:118   Done recalculating service nodes list

Then after the recalculation we get some other WARNINGs (which I am pointing out here because they might be related to the difficulty issue but I have not investigated):

2019-07-13 13:36:18.781 [RPC1]  WARNING blockchain.db.lmdb  src/blockchain_db/lmdb/db_lmdb.cpp:80   Attempt to get cumulative difficulty from height 86351 failed -- difficulty not in db
2019-07-13 13:36:18.781 [RPC0]  WARNING blockchain.db.lmdb  src/blockchain_db/lmdb/db_lmdb.cpp:80   Attempt to get cumulative difficulty from height 86351 failed -- difficulty not in db

and then we sync back to the alt chain again:

2019-07-13 13:36:27.251 [P2P4]  INFO    global  src/cryptonote_protocol/cryptonote_protocol_handler.inl:1483    Synced 86282/86352 (99%, 70 left)
2019-07-13 13:36:29.504 [P2P8]  INFO    global  src/cryptonote_protocol/cryptonote_protocol_handler.inl:1483    Synced 86352/86352
2019-07-13 13:36:29.504 [P2P8]  INFO    global  src/cryptonote_protocol/cryptonote_protocol_handler.inl:2140    SYNCHRONIZED OK
2019-07-13 13:41:03.041 [P2P1]  INFO    global  src/cryptonote_core/blockchain.cpp:1933 ###### REORGANIZE on height: 86184 of 86352, checkpoint is found in alternative chain on height 86184
...
2019-07-13 13:41:03.858 [P2P1]  WARNING service_nodes   src/cryptonote_core/service_node_list.cpp:85    Recalculating service nodes list, scanning blockchain from height 3

This same process keeps repeating over and over forever.
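
To check how often a node goes through this cycle, the log can be scanned for the three message types quoted above. The following Python sketch assumes the log format shown in this report; the function name `summarize` and the abbreviated sample lines are illustrative only.

```python
import re

# Reorg announcements as they appear in the daemon log. This pattern does not
# match the "REORGANIZE SUCCESS! on height" lines.
REORG_RE = re.compile(r"REORGANIZE on height: (\d+) of (\d+)")
CHECKPOINT_TEXT = "checkpoint is found in alternative chain"
RESCAN_TEXT = "Recalculating service nodes list"


def summarize(log_lines):
    """Return (checkpoint_reorgs, difficulty_reorgs, rescans) for an iterable of log lines."""
    checkpoint_reorgs = difficulty_reorgs = rescans = 0
    for line in log_lines:
        if REORG_RE.search(line):
            if CHECKPOINT_TEXT in line:
                checkpoint_reorgs += 1
            else:
                difficulty_reorgs += 1
        elif RESCAN_TEXT in line:
            rescans += 1
    return checkpoint_reorgs, difficulty_reorgs, rescans


# Abbreviated lines taken from the log in this report (timestamps and paths trimmed).
sample = [
    "###### REORGANIZE on height: 86184 of 86352, checkpoint is found in alternative chain on height 86184",
    "###### REORGANIZE on height: 86186 of 86185 with cum_difficulty 8149881320",
    "REORGANIZE SUCCESS! on height: 86186, new blockchain size: 86187",
    "Recalculating service nodes list, scanning blockchain from height 3",
]
print(summarize(sample))  # (1, 1, 1)
```

A steadily growing rescan count on a synced node would be a sign that it is stuck in the loop described here.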

The only thing that stopped this infinite loop was that the cumulative difficulty on the SN chain eventually overtook that of the non-SN chain, at which point everything reorged to the checkpointed SN chain.
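
The logs are consistent with a node that chooses chains by cumulative difficulty first and checks checkpoints only afterwards. A minimal Python model of that reading follows; it is an illustration, not the daemon's code. The function name, the SN chain's starting difficulty, and the growth per round are assumptions, while 8157944872 is the non-SN cumulative difficulty taken from the log.

```python
def rescan_rounds(sn_cum_diff, non_sn_cum_diff, sn_growth_per_round, max_rounds=10_000):
    """Count rescan cycles in a toy 'difficulty first, checkpoint afterwards' node.

    Each round stands for one cycle from the log: the checkpoint pulls the node
    onto the SN chain, the heavier non-SN chain pulls it back, the checkpoint
    check fails, and the node rolls back and rescans. The loop ends once the SN
    chain is the heavier one. The non-SN chain is held fixed for simplicity.
    """
    rounds = 0
    while sn_cum_diff <= non_sn_cum_diff and rounds < max_rounds:
        rounds += 1                          # one reorg/rollback/rescan cycle
        sn_cum_diff += sn_growth_per_round   # mining continues on the SN chain
    return rounds


NON_SN_CUM_DIFF = 8_157_944_872  # from the log, non-SN chain at height 86351
SN_CUM_DIFF = 8_149_881_320      # assumed starting point for the SN chain
GROWTH = 1_000_000               # assumed difficulty added to the SN chain per round

print(rescan_rounds(SN_CUM_DIFF, NON_SN_CUM_DIFF, GROWTH))  # 9
```

In this model nothing but raw difficulty ends the loop, so whoever controls the heavier chain controls how long the rescans continue.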

This all suggests some problems in the checkpointing implementation that need to be addressed:

Consensus rules are applied too late. When comparing two chains, it appears that cumulative difficulty alone decides whether a reorg goes ahead, and the result is checked against checkpoints only after the reorg. This is not correct: when evaluating an alt chain, the consensus rules need to compare the checkpoints on each chain before reorganizing. In particular, a node should never reorganize from a checkpointed chain to a non-checkpointed one. The pattern above looks as though checkpoints are not applied until after difficulty-based reorgs have completed. Worse, the current behaviour could let a 51% attacker DoS the SN network by trapping it in a perpetual full SN rescan loop.
SN rescans should not be triggerable by any network adversary. Essentially, rollbacks to the second-oldest checkpoint must always be possible without triggering a rescan of the SN status. This likely means changing how much rollback data is stored, so that it always reaches back to the second-oldest checkpoint on the current chain (and #740 needs to be fixed so that the second-oldest checkpoint can never get older).
Incoming checkpoints must be considered even if they are old, as long as they are not older than the second-oldest checkpoint along the current chain. Currently, if a node gets too far ahead on a chain without checkpoints, it rejects new checkpoints (because it doesn't have the corresponding quorum state in the cache). This is likely related to the above point and to #740.
Checkpoints along alt chains that start on or after the second-oldest checkpoint must be properly evaluated, even when the alt chain is long enough to have different quorum states. Essentially, we need a way to build an alternate SN status along an alt chain, so that checkpoints offered along that alt chain can be evaluated for validity and the alt chain can be switched to if it wins consensus (by having more valid checkpoints along it).
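
The first and third points can be sketched as a chain comparison that looks at checkpoints before cumulative difficulty, plus a rule for which incoming checkpoints are still in scope. This is a hedged illustration in Python, not the project's implementation. `Chain`, `checkpoints_on`, `should_switch`, and `checkpoint_in_scope` are hypothetical names, the hashes are abbreviated from the log, the SN chain's cumulative difficulty is assumed, and checkpoints are assumed to have already been validated against the right quorum.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    """Hypothetical summary of a chain from the fork point onward."""
    name: str
    cum_difficulty: int
    blocks: dict = field(default_factory=dict)  # height -> block hash


def checkpoints_on(chain, checkpoints):
    """Count validated (height, hash) checkpoints that match a block on `chain`."""
    return sum(1 for height, block_hash in checkpoints
               if chain.blocks.get(height) == block_hash)


def should_switch(current, candidate, checkpoints):
    """Compare checkpoints first; fall back to cumulative difficulty only on a tie."""
    cur = checkpoints_on(current, checkpoints)
    cand = checkpoints_on(candidate, checkpoints)
    if cand != cur:
        return cand > cur
    return candidate.cum_difficulty > current.cum_difficulty


def checkpoint_in_scope(height, current_checkpoint_heights):
    """Consider a checkpoint unless it is older than the second-oldest checkpoint
    on the current chain. With fewer than two checkpoints on the current chain,
    nothing counts as too old (an assumption of this sketch)."""
    heights = sorted(current_checkpoint_heights)
    if len(heights) < 2:
        return True
    return height >= heights[1]


# Scenario from this report: one checkpoint at 86184, present only on the SN chain.
checkpoints = {(86184, "7792aafa")}
sn = Chain("SN", 8_149_881_320, {86184: "7792aafa"})          # difficulty assumed
non_sn = Chain("non-SN", 8_157_944_872, {86184: "d1ca09ae"})  # difficulty from the log

print(should_switch(current=sn, candidate=non_sn, checkpoints=checkpoints))  # False
print(should_switch(current=non_sn, candidate=sn, checkpoints=checkpoints))  # True
print(checkpoint_in_scope(86184, []))  # True: the non-SN chain has no checkpoints
```

With this ordering the heavier non-SN chain never displaces the checkpointed chain, so the reorg, rollback, and rescan cycle cannot start from difficulty alone.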

oxen-io / oxen-core

Checkpointing consensus failures #742