ryoqun closed this issue 4 years ago.
@ryoqun thanks for reporting, working on the repro!
@ryoqun So I think I finally have a repro with the ledgers saved. For context:
1) The "bad-ledger" was from a validator that was restarted from a snapshot of slot 1100 (that snapshot is included in the tar; call this version A), and encountered a hash mismatch at slot 1101 (expected GRfR7dRhm3Yv4D8NWg7srqHCwM1PikuHFBu1KW4nQCHn, got incorrect A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn).
2) The "good-ledger" was from a validator that was never restarted, and includes an earliest snapshot of slot 200.
3) I was running commit 0a89a952c751891b04e395df71399465e5473fc8 on master.
Things I tried:
1) I generated a snapshot of slot 1100 from the good-ledger using the snapshot of slot 200 (call this version B), and the hashes matched the version in the bad-ledger. I then tried to generate a snapshot of slot 1101 from version B using both ledgers, and both produced snapshots with the bad hash A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn.
2) I tried generating a snapshot of slot 1101 from version A of snapshot 1100 using the good-ledger; this also produced a snapshot with the bad hash A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn.
From 1) + 2), this seems to imply it's not an issue with the ledger (i.e. both ledgers should have the same entries).
3) I directly generated a snapshot of slot 1101 from the good-ledger using the slot 200 snapshot, which produced the good hash GRfR7dRhm3Yv4D8NWg7srqHCwM1PikuHFBu1KW4nQCHn.
From 3), I suspect those earlier slots/epochs may need to be replayed in order to generate the correct hash; not too sure why yet, still investigating.
good-ledger: https://drive.google.com/file/d/1DH5obH-b3M_5cdU1qywjEwVThClyn-ph/view?usp=sharing bad-ledger: https://drive.google.com/file/d/1g9sKsStwhgQ-ATKfkr-BhSp0FY9ShWDv/view?usp=sharing
@carllin what great progress! :)
@ryoqun thanks!
After parsing out the accounts, the only two different account hashes seem to be:
(Format is (hash, lamports))
Value mismatch for pubkey: SysvarFees111111111111111111111111111111111, good_value: (FJLca7SFSjjKPnEbdSmSFePm5o3rKHhmfdB8iWfKaMJa, 0), bad_value: (7WRzSJpDFotssrZ1CpRFdLkyZ9BAK15sDwUdT7AoqdLK, 0)
Value mismatch for pubkey: SysvarRecentB1ockHashes11111111111111111111, good_value: (3AXcmb4imKGJC1KBHbaBZ3bnf5P1hXSrmNAWhSB8hC2N, 0), bad_value: (CybeG5ZygNVrzsbYGc6dSRTHomiFeYPvRBiQ6exSxhJQ, 0)
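The account diff above can be reproduced with a simple map comparison. This is a hypothetical standalone sketch (not the actual tooling used in the investigation), assuming each snapshot's accounts have already been parsed into a `pubkey -> (hash, lamports)` map:

```rust
use std::collections::BTreeMap;

// (account hash, lamports) as parsed from each snapshot
type AccountInfo = (String, u64);

// Report every pubkey whose (hash, lamports) pair differs between the
// good and bad snapshots, in the same format as the log lines above.
fn diff_accounts(
    good: &BTreeMap<String, AccountInfo>,
    bad: &BTreeMap<String, AccountInfo>,
) -> Vec<String> {
    let mut mismatches = Vec::new();
    for (pubkey, good_value) in good {
        if let Some(bad_value) = bad.get(pubkey) {
            if good_value != bad_value {
                mismatches.push(format!(
                    "Value mismatch for pubkey: {}, good_value: {:?}, bad_value: {:?}",
                    pubkey, good_value, bad_value
                ));
            }
        }
    }
    mismatches
}

fn main() {
    let mut good = BTreeMap::new();
    let mut bad = BTreeMap::new();
    // One of the two mismatching accounts from the logs above.
    good.insert(
        "SysvarFees111111111111111111111111111111111".to_string(),
        ("FJLca7SFSjjKPnEbdSmSFePm5o3rKHhmfdB8iWfKaMJa".to_string(), 0),
    );
    bad.insert(
        "SysvarFees111111111111111111111111111111111".to_string(),
        ("7WRzSJpDFotssrZ1CpRFdLkyZ9BAK15sDwUdT7AoqdLK".to_string(), 0),
    );
    for line in diff_accounts(&good, &bad) {
        println!("{}", line);
    }
}
```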
@ryoqun I think I found the issue: FeeRateGovernor.lamports_per_signature is not serialized (https://github.com/solana-labs/solana/blob/master/sdk/src/fee_calculator.rs#L70), so the deserialized bank gets the default for that field, zero.
From the deserialized snapshot bank, Bank::new_from_fields():
new_from_fields(), slot: 1100, fee_rate_governor: FeeRateGovernor { lamports_per_signature: 0, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }
From the non-snapshot bank:
FeeRateGovernor { lamports_per_signature: 6500, target_lamports_per_signature: 10000, target_signatures_per_slot: 20000, min_lamports_per_signature: 5000, max_lamports_per_signature: 100000, burn_percent: 50 }, 6000
This means future child banks will inherit the wrong value as well, which leads to update_recent_blockhashes_locked() storing the wrong value into the sysvar::recent_blockhashes account.
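A minimal sketch of the failure mode. The real struct skips the field via a serde attribute; here a hand-rolled serialize/deserialize pair (not Solana's actual code) stands in for that, showing how a field that never crosses the wire silently becomes its default on the restoring side:

```rust
// Simplified stand-in for the real FeeRateGovernor (fields reduced).
#[derive(Debug, PartialEq)]
struct FeeRateGovernor {
    lamports_per_signature: u64, // the field the real serializer skips
    target_lamports_per_signature: u64,
    burn_percent: u64,
}

// Mimics the skip: everything EXCEPT lamports_per_signature is written out.
fn serialize(g: &FeeRateGovernor) -> Vec<u64> {
    vec![g.target_lamports_per_signature, g.burn_percent]
}

// On the deserializing side, the skipped field falls back to Default (zero).
fn deserialize(fields: &[u64]) -> FeeRateGovernor {
    FeeRateGovernor {
        lamports_per_signature: u64::default(), // 0: the bug
        target_lamports_per_signature: fields[0],
        burn_percent: fields[1],
    }
}

fn main() {
    let live = FeeRateGovernor {
        lamports_per_signature: 6500,
        target_lamports_per_signature: 10_000,
        burn_percent: 50,
    };
    let restored = deserialize(&serialize(&live));
    // The restored bank computes fees (and the fees sysvar) from 0 instead of
    // 6500, so its hash diverges from a bank that replayed the full ledger.
    assert_eq!(restored.lamports_per_signature, 0);
    assert_ne!(restored, live);
    println!("live: {:?}\nrestored: {:?}", live, restored);
}
```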
@carllin well, I just noticed the ci was actually detecting this vote mismatch:
https://buildkite.com/solana-labs/solana/builds/32181#2df29301-adee-48d1-8220-a86a085c2fbd/7304-7640
[2020-10-05T23:09:17.407616139Z WARN solana_vote_program::vote_state] BpAfqgTSKtzCTfAS2kcnxmaTaDYyVEjn82CkyJbD43aY dropped vote Vote { slots: [0], hash: 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L, timestamp: Some(1601939353) } failed to match hash 9geL56EBzUb1Kv96Nx5WpkdXaQcNiAU43Avb4eeLtB5L 24kbXHfH6apPAnQb11hppaeHok61Qqe9MwYDvVotBtFZ
Hmm, I think this is a different mismatched-hash issue. I bisected quickly:
bad: "Add Builtin AbiExample" (committed by @mvines 8 days ago)
https://buildkite.com/solana-labs/solana/builds/31717#bd52287b-8b7c-49b9-bd15-d70508ce5286/7167-7522
good: "chore: bump rocksdb from 0.14.0 to 0.15.0" (authored by @dependabot, committed by @mvines 8 days ago)
https://buildkite.com/solana-labs/solana/builds/31705#b92d877d-a46d-4a4e-b5df-824165a7f25d/7068-7455
Maybe native loader/programs are stored differently when replayed since genesis....
These are addressed in #12759.
Problem
STR:
./net/gce.sh create -n 4 -c 1 -p <prefix>
./net/net.sh start
gcloud beta compute ssh --zone "us-west1-b" <validator node> -- sudo pkill -f solana-validator
I haven't investigated thoroughly, but restoring the snapshotted root bank succeeds. Then, immediately with the next bank, the bank hash starts to diverge like this:
Proposed Solution
Debug and fix.
Also, we should stop the validator if our computed bank hash differs from the hash rooted by a supermajority of the cluster.
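The proposed safety check could be sketched as follows. This is a hypothetical helper, not an existing Solana API: given how much rooted stake observed each bank hash for a slot, a validator halts when a >2/3 supermajority rooted a hash different from its own:

```rust
use std::collections::HashMap;

// Halt if more than 2/3 of total stake rooted a bank hash that differs
// from the hash this validator computed for the same slot.
// (Hypothetical sketch; names and signature are assumptions.)
fn should_halt(
    my_hash: &str,
    rooted_stake_by_hash: &HashMap<String, u64>,
    total_stake: u64,
) -> bool {
    rooted_stake_by_hash
        .iter()
        .any(|(hash, stake)| stake * 3 > total_stake * 2 && hash.as_str() != my_hash)
}

fn main() {
    // Hashes taken from this issue's logs, used here purely for illustration.
    let good = "GRfR7dRhm3Yv4D8NWg7srqHCwM1PikuHFBu1KW4nQCHn";
    let bad = "A4ANoYrpjSxSck74cpm6zmFnkUR4FGGQxuo8P5YSpTkn";

    let mut votes = HashMap::new();
    votes.insert(good.to_string(), 75u64); // 75% of stake rooted the good hash
    votes.insert(bad.to_string(), 25u64);

    // A validator that computed the bad hash should stop instead of
    // continuing to build on divergent state.
    assert!(should_halt(bad, &votes, 100));
    // A validator matching the supermajority keeps running.
    assert!(!should_halt(good, &votes, 100));
    println!("halt checks ok");
}
```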