solana-labs / solana


Validator behaves differently for hard-fork in ledger-tool create-snapshot vs. passed directly to the solana-validator #32117

Closed: steviez closed this issue 4 months ago

steviez commented 1 year ago

Problem

The author(s) of https://github.com/solana-labs/solana/pull/31957 reported hitting a failure in the test below that, at least on the surface, doesn't seem to be directly caused by the PR: https://github.com/solana-labs/solana/blob/dcd66534dda718160c80752711e6c3a8c6686992/local-cluster/tests/local_cluster.rs#L2217

In general, the test looks to be somewhat flaky, as the author mentioned seeing their branch fail on Linux but not on their Mac. I was able to reproduce the failure very reliably on a Linux machine and determined that it is failing because it hits this assertion: https://github.com/solana-labs/solana/blob/dcd66534dda718160c80752711e6c3a8c6686992/core/src/consensus.rs#L881-L884

This test starts a local-cluster, makes several roots, stops the cluster, chooses a restart slot beyond the latest fork, and then restarts the cluster through wait-for-supermajority. In this test, the two nodes are restarted in different ways:

These two approaches are supposed to be equivalent; however, I discovered there is a subtle difference:

A key difference to note here is that these nodes have different views of the root. Technically, the cluster has not rooted slot 40, but by virtue of it being picked as the restart slot, it will become rooted. At this point, Node A considers 40 the latest root, whereas Node B considers the latest root to be 16. This difference is important in this function https://github.com/solana-labs/solana/blob/077e29aa1eac997707a98a13e3cfb6000ba00ede/core/src/validator.rs#L1443-L1452 which then determines how post_process_restored_tower() behaves: https://github.com/solana-labs/solana/blob/077e29aa1eac997707a98a13e3cfb6000ba00ede/core/src/validator.rs#L1469-L1496
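For reference, here is a rough paraphrase of what that first linked check does; the function name, signature, and Slot alias below are approximations rather than the exact code:

```rust
// Approximate paraphrase of the linked validator.rs check; not the exact code.
type Slot = u64; // stand-in for solana_sdk::clock::Slot

fn maybe_cluster_restart_with_hard_fork(
    wait_for_supermajority: Option<Slot>,
    root_slot: Slot,
) -> Option<Slot> {
    // A cluster restart (hard fork) is detected indirectly: the operator passes
    // --wait-for-supermajority at the hard fork slot, and that slot equals the
    // local root only if the ledger/snapshot was prepared for the restart
    // (as ledger-tool create-snapshot does for Node A, but not for Node B).
    match wait_for_supermajority {
        Some(wait_slot_for_supermajority) if wait_slot_for_supermajority == root_slot => {
            Some(wait_slot_for_supermajority)
        }
        _ => None,
    }
}
```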

From the logs, I determined that Node A was entering the above if statement whereas Node B was not; this follows from the first snippet, where we only return Some(...) if wait_slot_for_supermajority == root_slot.

// Node A log, no such log for Node B
[2023-06-13T22:04:52.089584487Z ERROR solana_core::validator]
Rebuilding a new tower from the latest vote account due to failed tower restore: The tower is useless because of new hard fork: 40

As further proof, here is a comparison of the towers from the two nodes; note that Node B's tower contains votes past the hard fork.

// Node A
Tower {
    node_pubkey: AqEWUK8pdsfY2CTrBQLGS8w8ndMeuFcDpCkFwWaicaLL,
    threshold_depth: 8,
    threshold_size: 0.6666666666666666,
    vote_state: VoteState {
        node_pubkey: AqEWUK8pdsfY2CTrBQLGS8w8ndMeuFcDpCkFwWaicaLL,
        authorized_withdrawer: 8XdPBeEszdsPGBGU1sFPrQ6z4b7CqGYnbm5jFoQf3XPN,
        commission: 0,
        votes: [],
        root_slot: Some( 40, ),
        authorized_voters: AuthorizedVoters { authorized_voters: { 0: 8XdPBeEszdsPGBGU1sFPrQ6z4b7CqGYnbm5jFoQf3XPN, }, },
        prior_voters: CircBuf { ... },
        epoch_credits: [],
        last_timestamp: BlockTimestamp { slot: 0, timestamp: 0, },
    },
    last_vote: VoteStateUpdate(
        VoteStateUpdate {
            lockouts: [],
            root: None,
            hash: 11111111111111111111111111111111,
            timestamp: None,
        },
    ),
    last_vote_tx_blockhash: 11111111111111111111111111111111,
    last_timestamp: BlockTimestamp {
        slot: 0,
        timestamp: 0,
    },
    stray_restored_slot: None,
    last_switch_threshold_check: None,
},

// Node B
Tower {
    node_pubkey: Bz8byUe5bFKcQZWMdU1NQZuJ2GAN3vZvzkahcynXQi5S,
    threshold_depth: 8,
    threshold_size: 0.6666666666666666,
    vote_state: VoteState {
        node_pubkey: Bz8byUe5bFKcQZWMdU1NQZuJ2GAN3vZvzkahcynXQi5S,
        authorized_withdrawer: EsUBXKuWbFvPrkdchm68YfSwq91e5cFB45ZQmiF3fQV1,
        commission: 0,
        votes: [
            LandedVote { latency: 0, lockout: Lockout { slot: 17, confirmation_count: 31, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 18, confirmation_count: 30, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 19, confirmation_count: 29, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 20, confirmation_count: 28, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 21, confirmation_count: 27, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 22, confirmation_count: 26, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 23, confirmation_count: 25, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 24, confirmation_count: 24, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 25, confirmation_count: 23, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 26, confirmation_count: 22, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 27, confirmation_count: 21, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 28, confirmation_count: 20, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 29, confirmation_count: 19, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 30, confirmation_count: 18, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 31, confirmation_count: 17, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 32, confirmation_count: 16, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 33, confirmation_count: 15, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 34, confirmation_count: 14, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 35, confirmation_count: 13, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 36, confirmation_count: 12, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 37, confirmation_count: 11, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 38, confirmation_count: 10, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 39, confirmation_count: 9, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 40, confirmation_count: 8, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 41, confirmation_count: 7, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 42, confirmation_count: 6, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 43, confirmation_count: 5, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 44, confirmation_count: 4, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 45, confirmation_count: 3, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 46, confirmation_count: 2, }, },
            LandedVote { latency: 0, lockout: Lockout { slot: 47, confirmation_count: 1, }, },
        ],
        root_slot: Some( 16, ),
        authorized_voters: AuthorizedVoters { authorized_voters: { 0: EsUBXKuWbFvPrkdchm68YfSwq91e5cFB45ZQmiF3fQV1, }, },
        prior_voters: CircBuf { ... },
        epoch_credits: [ ( 0, 16, 0, ), ],
        last_timestamp: BlockTimestamp { slot: 0, timestamp: 0, },
    },
    last_vote: VoteStateUpdate(
        VoteStateUpdate {
            lockouts: [
                Lockout { slot: 17, confirmation_count: 31, },
                Lockout { slot: 18, confirmation_count: 30, },
                Lockout { slot: 19, confirmation_count: 29, },
                Lockout { slot: 20, confirmation_count: 28, },
                Lockout { slot: 21, confirmation_count: 27, },
                Lockout { slot: 22, confirmation_count: 26, },
                Lockout { slot: 23, confirmation_count: 25, },
                Lockout { slot: 24, confirmation_count: 24, },
                Lockout { slot: 25, confirmation_count: 23, },
                Lockout { slot: 26, confirmation_count: 22, },
                Lockout { slot: 27, confirmation_count: 21, },
                Lockout { slot: 28, confirmation_count: 20, },
                Lockout { slot: 29, confirmation_count: 19, },
                Lockout { slot: 30, confirmation_count: 18, },
                Lockout { slot: 31, confirmation_count: 17, },
                Lockout { slot: 32, confirmation_count: 16, },
                Lockout { slot: 33, confirmation_count: 15, },
                Lockout { slot: 34, confirmation_count: 14, },
                Lockout { slot: 35, confirmation_count: 13, },
                Lockout { slot: 36, confirmation_count: 12, },
                Lockout { slot: 37, confirmation_count: 11, },
                Lockout { slot: 38, confirmation_count: 10, },
                Lockout { slot: 39, confirmation_count: 9, },
                Lockout { slot: 40, confirmation_count: 8, },
                Lockout { slot: 41, confirmation_count: 7, },
                Lockout { slot: 42, confirmation_count: 6, },
                Lockout { slot: 43, confirmation_count: 5, },
                Lockout { slot: 44, confirmation_count: 4, },
                Lockout { slot: 45, confirmation_count: 3, },
                Lockout { slot: 46, confirmation_count: 2, },
                Lockout { slot: 47, confirmation_count: 1, },
            ],
            root: Some(
                16,
            ),
            hash: 2MLv8zvVZZYZ4FBGGbNeR1pj6DbhRDxktuCDAt83ochE,
            timestamp: Some(
                1686719604,
            ),
        },
    ),
    last_vote_tx_blockhash: 11111111111111111111111111111111,
    last_timestamp: BlockTimestamp {
        slot: 46,
        timestamp: 1686719604,
    },
    stray_restored_slot: Some(47),
    last_switch_threshold_check: None,
},

With all of that said, I also see that Node B does not enter that case when the test passes on the tip of master. I believe there may be a race condition in terms of which votes land / what state is updated when the local cluster is stopped to pick a restart slot.

Here are some logs from several runs:

Proposed Solution

I'm still digging; my suspicion is that some "cleanup" should occur on Node B but isn't, although that theory is currently contradicted by the logs from when the test succeeds on the tip of master. So, a few action items:

steviez commented 1 year ago

@carllin - Here are some collected thoughts from that item I spitballed to you earlier today

steviez commented 1 year ago

Following up on this, the PR authors determined that a consensus-breaking change was present in the PR. The issue has since been addressed; I elaborate a little more and tie it into some logs here. The TLDR of that comment is:

The change in bank hashes had a few effects on the two nodes that revealed an inconsistency in the validator. Again, for clarity, the PR made it such that the bank hashes changed for slots [1, 39].

Node A:

Node B:

All this being stated, the flaw itself is that the pre-restart tower is not invalidated. The extra digging was to check whether there was a logic bug in Tower's impl, which does not appear to be the case. So, the course of action is to ensure the tower gets properly invalidated when a hard fork is passed to the validator.
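To make the flaw concrete, here is a minimal, hypothetical sketch of the check that is effectively missing; the helper and its standalone form are illustrative only, not actual validator code:

```rust
type Slot = u64; // stand-in for solana_sdk::clock::Slot

/// Hypothetical helper: a saved tower must not be reused as-is if it records
/// votes beyond the hard fork slot, because those votes refer to pre-fork
/// banks that no longer exist in the post-fork ledger.
fn tower_is_stale_for_hard_fork(last_voted_slot: Option<Slot>, hard_fork_slot: Slot) -> bool {
    matches!(last_voted_slot, Some(slot) if slot > hard_fork_slot)
}

// For Node B above: last vote = 47, hard fork = 40 => stale, so the tower
// should have been invalidated (or truncated) rather than restored verbatim.
```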

carllin commented 1 year ago

The panic assert!(!last_vote_ancestors.contains(candidate_slot)) is essentially complaining that, given some last voted slot S and some candidate slot C, ancestors(S).contains(C).

However, for this to be true, it must also be true that descendants(C).contains(S) (see the sketch after this list), which is what this check https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/consensus.rs#L887-L895 should have caught, because:

  1. That check iterates through all the descendants
  2. Checks whether any of the descendants have the computed flag set, which should be true for all frozen banks after compute_bank_stats runs in replay_stage https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/replay_stage.rs#L3094 (it should run on the voted bank right after startup).
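To illustrate the ancestors/descendants relationship concretely, here is a small self-contained sketch over a toy fork graph; this is not the real BankForks API, and the slot numbers are just ones from this test:

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

// Toy fork graph: child -> parent (not the real BankForks representation).
fn ancestors(parents: &HashMap<Slot, Slot>, mut slot: Slot) -> HashSet<Slot> {
    let mut out = HashSet::new();
    while let Some(&parent) = parents.get(&slot) {
        out.insert(parent);
        slot = parent;
    }
    out
}

fn descendants(parents: &HashMap<Slot, Slot>, slot: Slot) -> HashSet<Slot> {
    parents
        .keys()
        .copied()
        .filter(|&child| ancestors(parents, child).contains(&slot))
        .collect()
}

fn main() {
    // Minimal chain resembling the test: 16 -> 40 -> 46 -> 47.
    let parents: HashMap<Slot, Slot> = HashMap::from([(47, 46), (46, 40), (40, 16)]);
    let (candidate_c, last_vote_s) = (40, 47);
    // If C is an ancestor of S, then S must be a descendant of C (and vice versa).
    assert_eq!(
        ancestors(&parents, last_vote_s).contains(&candidate_c),
        descendants(&parents, candidate_c).contains(&last_vote_s),
    );
}
```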

So, after reproducing the panic, I would:

  1. Print out the descendants of C in https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/consensus.rs#L887-L895, and check that S is in there
  2. If S is in the list of descendants, print out whether the computed flag is set
  3. Print out the list of slots that have set the computed flag here: https://github.com/AshwinSekar/solana/blob/24a3af8d81bd229ad7e8a7f82ccdf6f24256a206/core/src/replay_stage.rs#L309

steviez commented 1 year ago

However, for this to be true, it must also be true that descendants(C).contains(S) ... should have caught, because:

Not sure if you read through my whole comment, but the third bullet ("With the PR, Node B also drops votes for slots ...") explains what happened (after doing some of the same steps that you suggested, like dumping descendants out). But in any case, addressing your specific items:

carllin commented 1 year ago

Not sure if you read through my whole comment, but the third bullet ("With the PR, Node B also drops votes for slots ...") explains what happened (after doing some of the same steps that you suggested, like dumping descendants out).

Ah I see, sorry, I missed the key point there. The problem is that 47 is the last vote in the tower, but it was:

  1. deleted from the ledger after the hard fork, but the preserved tower still says the last vote is `47`
  2. recreated and repropagated and thus not frozen in BankForks
  3. Triggers the assert

Hmm yeah the proper thing to do is probably to set the root to the hard fork slot in tower and clear everything in tower above that slot.
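A minimal sketch of that idea, using a simplified stand-in for the real Tower (the actual struct in consensus.rs has different fields and methods; this is illustrative only):

```rust
type Slot = u64;

// Simplified stand-in for core/src/consensus.rs Tower; illustrative only.
struct SimpleTower {
    root: Option<Slot>,
    voted_slots: Vec<Slot>, // oldest first
}

impl SimpleTower {
    /// Adopt the hard fork slot as the new root and drop every vote above it,
    /// since the post-fork ledger no longer contains those banks.
    fn apply_hard_fork(&mut self, hard_fork_slot: Slot) {
        self.voted_slots.retain(|&slot| slot <= hard_fork_slot);
        self.root = Some(hard_fork_slot);
    }
}

fn main() {
    // Node B's situation from above: root 16, votes 17..=47, hard fork at 40.
    let mut tower = SimpleTower { root: Some(16), voted_slots: (17..=47).collect() };
    tower.apply_hard_fork(40);
    assert_eq!(tower.root, Some(40));
    assert_eq!(tower.voted_slots.last(), Some(&40)); // votes 41..=47 are gone
}
```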

steviez commented 4 months ago

This one is still a problem AFAIK; I think there is another layer of brokenness when --hard-fork is passed directly to the validator. Will type up a new issue in the agave repo and cross-link.