solana-labs / solana


Validator behaves differently for hard-fork in ledger-tool create-snapshot vs. passed directly to the solana-validator #32117

Closed: steviez closed this issue 4 months ago

steviez commented 1 year ago

Problem

The author(s) of https://github.com/solana-labs/solana/pull/31957 reported hitting a failure in the test below that, at least on the surface, doesn't seem to be directly caused by the PR: https://github.com/solana-labs/solana/blob/dcd66534dda718160c80752711e6c3a8c6686992/local-cluster/tests/local_cluster.rs#L2217

In general, the test looks to be somewhat flaky, as the author mentioned seeing their branch fail on Linux but not on their Mac. I was able to reproduce the failure very reliably on a Linux machine and determined that it is failing because it hits this assertion: https://github.com/solana-labs/solana/blob/dcd66534dda718160c80752711e6c3a8c6686992/core/src/consensus.rs#L881-L884

This test starts a local-cluster, makes several roots, stops the cluster, chooses a restart slot beyond the latest fork, and then restarts the cluster through wait-for-supermajority. In this test, the two nodes are restarted in different ways:

These two approaches are supposed to be equivalent; however, I discovered there is a subtle difference:

A key difference to note here is that these nodes have different views of the root. Technically, the cluster has not rooted slot 40, but by virtue of it being picked as the restart slot, it will become rooted. At this point, Node A considers 40 the latest root, whereas Node B considers the latest root to be 16. This difference is important in this function https://github.com/solana-labs/solana/blob/077e29aa1eac997707a98a13e3cfb6000ba00ede/core/src/validator.rs#L1443-L1452 which then determines how post_process_restored_tower() behaves: https://github.com/solana-labs/solana/blob/077e29aa1eac997707a98a13e3cfb6000ba00ede/core/src/validator.rs#L1469-L1496
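For reference, here is a rough paraphrase of what that first linked check does; the function name, signature, and Slot alias below are approximations rather than the exact code:

```rust
// Approximate paraphrase of the linked validator.rs check; not the exact code.
type Slot = u64; // stand-in for solana_sdk::clock::Slot

fn maybe_cluster_restart_with_hard_fork(
    wait_for_supermajority: Option<Slot>,
    root_slot: Slot,
) -> Option<Slot> {
    // A cluster restart (hard fork) is detected indirectly: the operator passes
    // --wait-for-supermajority at the hard fork slot, and that slot equals the
    // local root only if the ledger/snapshot was prepared for the restart
    // (as ledger-tool create-snapshot does for Node A, but not for Node B).
    match wait_for_supermajority {
        Some(wait_slot_for_supermajority) if wait_slot_for_supermajority == root_slot => {
            Some(wait_slot_for_supermajority)
        }
        _ => None,
    }
}
```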

From the logs, I determined that Node A was entering the above if statement whereas Node B was not; this follows from the first snippet, where we only return Some(...) if wait_slot_for_supermajority == root_slot.

// Node A log, no such log for Node B
[2023-06-13T22:04:52.089584487Z ERROR solana_core::validator]
Rebuilding a new tower from the latest vote account due to failed tower restore: The tower is useless because of new hard fork: 40

As further proof, here is a comparison of the towers from the two nodes; note that Node B's tower contains votes past the hard fork.

// Node A
Tower {
    node_pubkey: AqEWUK8pdsfY2CTrBQLGS8w8ndMeuFcDpCkFwWaicaLL,
    threshold_depth: 8,
    threshold_size: 0.6666666666666666,
    vote_state: VoteState {
        node_pubkey: AqEWUK8pdsfY2CTrBQLGS8w8ndMeuFcDpCkFwWaicaLL,
        authorized_withdrawer: 8XdPBeEszdsPGBGU1sFPrQ6z4b7CqGYnbm5jFoQf3XPN,
        commission: 0,
        votes: [],
        root_slot: Some( 40, ),
        authorized_voters: AuthorizedVoters { authorized_voters: { 0: 8XdPBeEszdsPGBGU1sFPrQ6z4b7CqGYnbm5jFoQf3XPN, }, },
        prior_voters: CircBuf { ... },
        epoch_credits: [],
        last_timestamp: BlockTimestamp { slot: 0, timestamp: 0, },
    },
    last_vote: VoteStateUpdate(
        VoteStateUpdate {
            lockouts: [],
            root: None,
            hash: 11111111111111111111111111111111,
            timestamp: None,
        },
    ),
    last_vote_tx_blockhash: 11111111111111111111111111111111,
    last_timestamp: BlockTimestamp {
        slot: 0,
        timestamp: 0,
    },
    stray_restored_slot: None,
    last_switch_threshold_check: None,
},

// Node B
Tower {
    node_pubkey: Bz8byUe5bFKcQZWMdU1NQZuJ2GAN3vZvzkahcynXQi5S,
    threshold_depth: 8,
    threshold_size: 0.6666666666666666,
    vote_state: VoteState {
        node_pubkey: Bz8byUe5bFKcQZWMdU1NQZuJ2GAN3vZvzkahcynXQi5S,
        authorized_withdrawer: EsUBXKuWbFvPrkdchm68YfSwq91e5cFB45ZQmiF3fQV1,
        commission: 0,
        votes: [
            LandedVote { latency: 0, lockout: Lockout { slot: 17, confirmation_count: 31, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 18, confirmation_count: 30, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 19, confirmation_count: 29, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 20, confirmation_count: 28, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 21, confirmation_count: 27, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 22, confirmation_count: 26, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 23, confirmation_count: 25, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 24, confirmation_count: 24, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 25, confirmation_count: 23, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 26, confirmation_count: 22, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 27, confirmation_count: 21, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 28, confirmation_count: 20, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 29, confirmation_count: 19, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 30, confirmation_count: 18, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 31, confirmation_count: 17, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 32, confirmation_count: 16, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 33, confirmation_count: 15, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 34, confirmation_count: 14, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 35, confirmation_count: 13, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 36, confirmation_count: 12, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 37, confirmation_count: 11, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 38, confirmation_count: 10, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 39, confirmation_count: 9, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 40, confirmation_count: 8, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 41, confirmation_count: 7, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 42, confirmation_count: 6, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 43, confirmation_count: 5, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 44, confirmation_count: 4, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 45, confirmation_count: 3, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 46, confirmation_count: 2, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 47, confirmation_count: 1, }, },
        ],
        root_slot: Some( 16, ),
        authorized_voters: AuthorizedVoters { authorized_voters: { 0: EsUBXKuWbFvPrkdchm68YfSwq91e5cFB45ZQmiF3fQV1, }, },
        prior_voters: CircBuf { ... },
        epoch_credits: [ ( 0, 16, 0, ), ],
        last_timestamp: BlockTimestamp { slot: 0, timestamp: 0, },
    },
    last_vote: VoteStateUpdate(
        VoteStateUpdate {
            lockouts: [
                Lockout { slot: 17, confirmation_count: 31, },
                Lockout { slot: 18, confirmation_count: 30, },
                Lockout { slot: 19, confirmation_count: 29, },
                Lockout { slot: 20, confirmation_count: 28, },
                Lockout { slot: 21, confirmation_count: 27, },
                Lockout { slot: 22, confirmation_count: 26, },
                Lockout { slot: 23, confirmation_count: 25, },
                Lockout { slot: 24, confirmation_count: 24, },
                Lockout { slot: 25, confirmation_count: 23, },
                Lockout { slot: 26, confirmation_count: 22, },
                Lockout { slot: 27, confirmation_count: 21, },
                Lockout { slot: 28, confirmation_count: 20, },
                Lockout { slot: 29, confirmation_count: 19, },
                Lockout { slot: 30, confirmation_count: 18, },
                Lockout { slot: 31, confirmation_count: 17, },
                Lockout { slot: 32, confirmation_count: 16, },
                Lockout { slot: 33, confirmation_count: 15, },
                Lockout { slot: 34, confirmation_count: 14, },
                Lockout { slot: 35, confirmation_count: 13, },
                Lockout { slot: 36, confirmation_count: 12, },
                Lockout { slot: 37, confirmation_count: 11, },
                Lockout { slot: 38, confirmation_count: 10, },
                Lockout { slot: 39, confirmation_count: 9, },
                Lockout { slot: 40, confirmation_count: 8, },
                Lockout { slot: 41, confirmation_count: 7, },
                Lockout { slot: 42, confirmation_count: 6, },
                Lockout { slot: 43, confirmation_count: 5, },
                Lockout { slot: 44, confirmation_count: 4, },
                Lockout { slot: 45, confirmation_count: 3, },
                Lockout { slot: 46, confirmation_count: 2, },
                Lockout { slot: 47, confirmation_count: 1, },
            ],
            root: Some(
                16,
            ),
            hash: 2MLv8zvVZZYZ4FBGGbNeR1pj6DbhRDxktuCDAt83ochE,
            timestamp: Some(
                1686719604,
            ),
        },
    ),
    last_vote_tx_blockhash: 11111111111111111111111111111111,
    last_timestamp: BlockTimestamp {
        slot: 46,
        timestamp: 1686719604,
    },
    stray_restored_slot: Some(47),
    last_switch_threshold_check: None,
},

With all of that said, I also see that Node B does not enter that case when the test passes on the tip of master. I believe there may be a race condition in terms of which votes land / what state is updated when the local cluster is stopped to pick a restart slot.

Here are some logs from several runs:

Proposed Solution

I'm still digging; my suspicion is that some "cleanup" should occur on Node B but isn't, although that theory is currently contradicted by the logs from when the test succeeds on the tip of master. So, a few action items:

steviez commented 1 year ago

@carllin - Here are some collected thoughts from that item I spitballed to you earlier today

steviez commented 1 year ago

Following up on this, the PR authors determined that a consensus-breaking change was present in the PR. The issue has since been addressed; I elaborate a little more and tie it into some logs here. The TLDR of that comment is:

The change in bank hashes had a few effects on the two nodes that revealed an inconsistency in the validator. Again, for clarity, the PR made it such that the bank hashes changed for slots [1, 39].

Node A:

Node B:

All this being stated, the flaw itself is that the pre-restart tower is not invalidated. The extra digging was to check whether there was a logic bug in Tower's impl, which does not appear to be the case. So, the course of action is to ensure the tower gets properly invalidated when a hard fork is passed to the validator.
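To make the flaw concrete, here is a minimal, hypothetical sketch of the check that is effectively missing; the helper and its standalone form are illustrative only, not actual validator code:

```rust
type Slot = u64; // stand-in for solana_sdk::clock::Slot

/// Hypothetical helper: a saved tower must not be reused as-is if it records
/// votes beyond the hard fork slot, because those votes refer to pre-fork
/// banks that no longer exist in the post-fork ledger.
fn tower_is_stale_for_hard_fork(last_voted_slot: Option<Slot>, hard_fork_slot: Slot) -> bool {
    matches!(last_voted_slot, Some(slot) if slot > hard_fork_slot)
}

// For Node B above: last vote = 47, hard fork = 40 => stale, so the tower
// should have been invalidated (or truncated) rather than restored verbatim.
```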

carllin commented 1 year ago

The panic assert!(!last_vote_ancestors.contains(candidate_slot)) is essentially complaining that, given some last voted slot S and some candidate slot C, ancestors(S).contains(C).

However, for this to be true, it must also be true that descendants(C).contains(S) (see the sketch after this list), which is what this check https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/consensus.rs#L887-L895 should have caught, because:

  1. That check iterates through all the descendants
  2. Checks whether any of the descendants have the computed flag set, which should be true for all frozen banks after compute_bank_stats runs in replay_stage https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/replay_stage.rs#L3094 (it should run on the voted bank right after startup).
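To illustrate the ancestors/descendants relationship concretely, here is a small self-contained sketch over a toy fork graph; this is not the real BankForks API, and the slot numbers are just ones from this test:

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

// Toy fork graph: child -> parent (not the real BankForks representation).
fn ancestors(parents: &HashMap<Slot, Slot>, mut slot: Slot) -> HashSet<Slot> {
    let mut out = HashSet::new();
    while let Some(&parent) = parents.get(&slot) {
        out.insert(parent);
        slot = parent;
    }
    out
}

fn descendants(parents: &HashMap<Slot, Slot>, slot: Slot) -> HashSet<Slot> {
    parents
        .keys()
        .copied()
        .filter(|&child| ancestors(parents, child).contains(&slot))
        .collect()
}

fn main() {
    // Minimal chain resembling the test: 16 -> 40 -> 46 -> 47.
    let parents: HashMap<Slot, Slot> = HashMap::from([(47, 46), (46, 40), (40, 16)]);
    let (candidate_c, last_vote_s) = (40, 47);
    // If C is an ancestor of S, then S must be a descendant of C (and vice versa).
    assert_eq!(
        ancestors(&parents, last_vote_s).contains(&candidate_c),
        descendants(&parents, candidate_c).contains(&last_vote_s),
    );
}
```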

So, after reproducing the panic, I would:

  1. Print out the descendants of C in https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/consensus.rs#L887-L895, and check that S is in there
  2. If S is in the list of descendants, print out whether the computed flag is set
  3. Print out the list of slots that have set the computed flag here: https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/replay_stage.rs#L309

steviez commented 1 year ago

However, for this to be true, it must also be true that descendants(C).contains(S) ... should have caught, because:

Not sure if you read through my whole comment, but the third bullet ("With the PR, Node B also drops votes for slots ...") explains what happened (after doing some of the same steps that you suggested, like dumping descendants out). But in any case, addressing your specific items:

carllin commented 1 year ago

Not sure if you read through my whole comment, but the third bullet ("With the PR, Node B also drops votes for slots ...") explains what happened (after doing some of the same steps that you suggested, like dumping descendants out).

Ah I see, sorry, I missed the key point there. The problem is that 47 is the last vote in the tower, but it was:

  1. deleted from the ledger after the hard fork, but the preserved tower still says the last vote is `47`
  2. recreated and repropagated and thus not frozen in BankForks
  3. Triggers the assert

Hmm yeah the proper thing to do is probably to set the root to the hard fork slot in tower and clear everything in tower above that slot.
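A minimal sketch of that idea, using a simplified stand-in for the real Tower (the actual struct in consensus.rs has different fields and methods; this is illustrative only):

```rust
type Slot = u64;

// Simplified stand-in for core/src/consensus.rs Tower; illustrative only.
struct SimpleTower {
    root: Option<Slot>,
    voted_slots: Vec<Slot>, // oldest first
}

impl SimpleTower {
    /// Adopt the hard fork slot as the new root and drop every vote above it,
    /// since the post-fork ledger no longer contains those banks.
    fn apply_hard_fork(&mut self, hard_fork_slot: Slot) {
        self.voted_slots.retain(|&slot| slot <= hard_fork_slot);
        self.root = Some(hard_fork_slot);
    }
}

fn main() {
    // Node B's situation from above: root 16, votes 17..=47, hard fork at 40.
    let mut tower = SimpleTower { root: Some(16), voted_slots: (17..=47).collect() };
    tower.apply_hard_fork(40);
    assert_eq!(tower.root, Some(40));
    assert_eq!(tower.voted_slots.last(), Some(&40)); // votes 41..=47 are gone
}
```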

steviez commented 4 months ago

This one is still a problem AFAIK; I think there is another layer of brokenness when --hard-fork is passed directly to the validator. Will type up a new issue in the agave repo and cross-link.