solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
13.35k stars 4.35k forks source link

vote hash mismatch in testnet with high tps #12670

Closed ryoqun closed 4 years ago

ryoqun commented 4 years ago

Problem

STR:

I haven't investigated well. but restoring snapshotted root bank succeeds. But immediately with the next of the bank, the bank hash starts to diverge like this:

[2020-10-05T09:31:31.679508257Z INFO  solana_runtime::bank] bank frozen: 801 hash: DgZKMR56GB2TyxoB2BVzHujMfWgShGhswAqpDXTa442 accounts_delta: 2GxJuagoNVHfrcbGQ1Qo9cTxAwcDfQasvM5x4QbsAPTi signature_count: 16119 last_blockhash: 34afVht5TSaJbZNM7ax1QhBeCL4KdY6Jh8CQ1p7q7wjT capitalization: 1000000808342242059
[2020-10-05T09:31:31.679546395Z INFO  solana_runtime::bank] accounts hash slot: 801 stats: BankHashStats { num_updated_accounts: 32248, num_removed_accounts: 0, num_lamports_stored: 36106811887435, total_data_len: 176296, num_executable_accounts: 0 }
[2020-10-05T09:33:37.253939880Z INFO  solana_runtime::bank] bank frozen: 801 hash: EQNias25stAcHYjEPcQW9dQPPJmLW8t9yvfgGo8kwnvN accounts_delta: 7p6svSrwTjgvPitEF5FS2aSFGeHatnyn7vyMhMf6jkbW signature_count: 16119 last_blockhash: 34afVht5TSaJbZNM7ax1QhBeCL4KdY6Jh8CQ1p7q7wjT capitalization: 1000000808342242059
[2020-10-05T09:33:37.253982797Z INFO  solana_runtime::bank] accounts hash slot: 801 stats: BankHashStats { num_updated_accounts: 32248, num_removed_accounts: 0, num_lamports_stored: 36106811887435, total_data_len: 176296, num_executable_accounts: 0 }

Proposed Solution

debug and fix.

also we should stop the validator if our computed bank hash is different than the rooted supermajority of the cluster.

ryoqun commented 4 years ago

here is the logs: https://drive.google.com/drive/u/1/folders/1coOQC_xa_tpiKiZtt4yJwjoQ5OsZPcnS

carllin commented 4 years ago

@ryoqun thanks for reporting, working on the repro!

carllin commented 4 years ago

@ryoqun So I think I finally have a repro with the ledgers saved. For context: 1) The "bad-ledger" was from a validator that was restarted from a snapshot of slot 1100 (that snapshot is included in the tar, call this version A), and encountered a slot mismatch at slot 1101 (expected GRfR7dRhm3Yv4D8NWg7srqHCwM1PikuHFBu1KW4nQCHn, got incorrect A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn. 2) The "good-ledger" was from a validator that was never restarted, and includes an earliest snapshot of slot 200 3) I was running commit 0a89a952c751891b04e395df71399465e5473fc8 on master

Things I tried:

1) I generated a snapshot of slot 1100 from the good-ledger using the snapshot of slot 200 (call this version B) and the hashes matched the version in the bad-ledger. I then tried to generate a snapshot of slot 1101 from version B from both ledgers, and both produced snapshots with the bad hash A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn

2) I tried generating a snapshot of slot 1101 from version A of snapshot 1100 using the good ledger, this also produced a snapshot with the bad hash A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn.

From 1 + 2, This seems to imply it's not an issue with the ledger (i.e. both ledgers should have the same entries)

3) I directly generated a snapshot of slot 1101 from the good-ledger using slot 200, which produced the good hash GRfR7dRhm3Yv4D8NWg7srqHCwM1PikuHFBu1KW4nQCHn.

From 3) I suspect maybe those earlier slots/epochs need to be replayed in order to generate the correct hash, not too sure why yet, still investigating.

good-ledger: https://drive.google.com/file/d/1DH5obH-b3M_5cdU1qywjEwVThClyn-ph/view?usp=sharing bad-ledger: https://drive.google.com/file/d/1g9sKsStwhgQ-ATKfkr-BhSp0FY9ShWDv/view?usp=sharing

ryoqun commented 4 years ago

@carllin what a great progress. :)

carllin commented 4 years ago

@ryoqun thanks!

After parsing out the accounts, the only two different account hashes seem to be:

(Format is (hash, lamports))

Value mismatch for pubkey: SysvarFees111111111111111111111111111111111, good_value: (FJLca7SFSjjKPnEbdSmSFePm5o3rKHhmfdB8iWfKaMJa, 0), bad_value: (7WRzSJpDFotssrZ1CpRFdLkyZ9BAK15sDwUdT7AoqdLK, 0)

Value mismatch for pubkey: SysvarRecentB1ockHashes11111111111111111111, good_value: (3AXcmb4imKGJC1KBHbaBZ3bnf5P1hXSrmNAWhSB8hC2N, 0), bad_value: (CybeG5ZygNVrzsbYGc6dSRTHomiFeYPvRBiQ6exSxhJQ, 0)
carllin commented 4 years ago

@ryoqun I think I found the issue, the FeeRateGovernor.lamports_per_signature is not serialized: https://github.com/solana-labs/solana/blob/master/sdk/src/fee_calculator.rs#L70, so the deserialized bank has the default for that field, zero.

From deserialized snapshot bank, Bank::new_from_fields()

new_from_fields(), slot: 1100, fee_rate_governor: FeeRateGovernor { lamports_per_signature: 0, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }

From non-snapshot bank:

FeeRateGovernor { lamports_per_signature: 6500, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }, 6000

This means future children banks will have the wrong value as well, and leads to update_recent_blockhashes_locked() storing the wrong value into the sysvar::recent_blockhashes account

ryoqun commented 4 years ago

@carllin well, I just noticed the ci was actually detecting this vote mismatch:

https://buildkite.com/solana-labs/solana/builds/32181#2df29301-adee-48d1-8220-a86a085c2fbd/7304-7640

[2020-10-05T23:09:17.407616139Z WARN  solana_vote_program::vote_state] BpAfqgTSKtzCTfAS2kcnxmaTaDYyVEjn82CkyJbD43aY dropped vote Vote { slots: [0], hash: 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L, timestamp: Some(1601939353) } failed to match hash 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L 24kbXHfH6apPAnQb11hppaeHok61Qqe9MwYDvVotBtFZ
ryoqun commented 4 years ago

Hmm, I think this is different mismatched hash issue, I guess. I bisected quickly:

bad Add Builtin AbiExample
@mvines
mvines committed 8 days ago
https://buildkite.com/solana-labs/solana/builds/31717#bd52287b-8b7c-49b9-bd15-d70508ce5286/7167-7522

good chore: bump rocksdb from 0.14.0 to 0.15.0
@dependabot
@mvines
dependabot authored and mvines committed 8 days ago
https://buildkite.com/solana-labs/solana/builds/31705#b92d877d-a46d-4a4e-b5df-824165a7f25d/7068-7455

Maybe native loader/programs are stored differentially when replayed since genesis....

mvines commented 4 years ago

Fixed by https://github.com/solana-labs/solana/pull/12697

ryoqun commented 4 years ago

@carllin well, I just noticed the ci was actually detecting this vote mismatch:

https://buildkite.com/solana-labs/solana/builds/32181#2df29301-adee-48d1-8220-a86a085c2fbd/7304-7640

[2020-10-05T23:09:17.407616139Z WARN  solana_vote_program::vote_state] BpAfqgTSKtzCTfAS2kcnxmaTaDYyVEjn82CkyJbD43aY dropped vote Vote { slots: [0], hash: 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L, timestamp: Some(1601939353) } failed to match hash 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L 24kbXHfH6apPAnQb11hppaeHok61Qqe9MwYDvVotBtFZ

Hmm, I think this is different mismatched hash issue, I guess. I bisected quickly:

bad Add Builtin AbiExample
@mvines
mvines committed 8 days ago
https://buildkite.com/solana-labs/solana/builds/31717#bd52287b-8b7c-49b9-bd15-d70508ce5286/7167-7522

good chore: bump rocksdb from 0.14.0 to 0.15.0
@dependabot
@mvines
dependabot authored and mvines committed 8 days ago
https://buildkite.com/solana-labs/solana/builds/31705#b92d877d-a46d-4a4e-b5df-824165a7f25d/7068-7455

Maybe native loader/programs are stored differentially when replayed since genesis....

These are addressed at #12759