solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
13.35k stars 4.35k forks source link

Consensus (bank hash) mismatch between v1.8.10 and tip of master/v1.9.1 (mainnet) #21986

Closed steviez closed 2 years ago

steviez commented 2 years ago

Problem

A node running master (commit 94aa9e568a04cf66fadd05275c5b9728ea6b55d1) against mainnet was found to have a consensus failure; this occurred in slot 111883235:

[2021-12-17T19:36:41.237049780Z INFO  solana_ledger_tool] solana-ledger-tool 1.8.10 (src:f97d28ca; feat:52865992)
...
[2021-12-17T19:38:18.800288320Z INFO  solana_runtime::bank]
bank frozen: 111883235
hash: 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g 
accounts_delta: 396ZSKNKJhth9CmSte7Lyf8HS5vsd3TJLNcEost6VHZX
signature_count: 753
last_blockhash: H9uGK6rZVsXG8J6akmJ5oyXczUR22rRnLu4UGBw1wfJ9
capitalization: 511214148638373382
[2021-12-17T19:38:18.800317947Z INFO  solana_runtime::bank]
accounts hash slot: 111883235
stats: BankHashStats {
    num_updated_accounts: 3466,
    num_removed_accounts: 8,
    num_lamports_stored: 11203702455212782,
    total_data_len: 115749862,
    num_executable_accounts: 1 }

[2021-12-17T19:47:53.521219209Z INFO  solana_ledger_tool] solana-ledger-tool 1.10.0 (src:3dab1e71; feat:3834082194)
[2021-12-17T19:49:11.369273370Z INFO  solana_runtime::bank]
bank frozen: 111883235
hash: CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
accounts_delta: 7mrWSpzRjRuDWXp84iu37CptBAZGVxJbzznJg64dJEM3
signature_count: 753
last_blockhash: H9uGK6rZVsXG8J6akmJ5oyXczUR22rRnLu4UGBw1wfJ9
capitalization: 511214148638373382
[2021-12-17T19:49:11.369305502Z INFO  solana_runtime::bank]
accounts hash slot: 111883235
stats: BankHashStats {
    num_updated_accounts: 3463,
    num_removed_accounts: 7,
    num_lamports_stored: 11203689947327182,
    total_data_len: 113953143,
    num_executable_accounts: 0

Here is the offending transaction:

  Entry 629 - num_hashes: 4225, hash: 4ypLNZyp1k3DLtfGCwb616RHiUSdLH2mvBcpgixiM1nd, transactions: 1
    Transaction 0
      Recent Blockhash: 6MVtZba5uYYcsmmnYv3Yht4XVQy43H7X285wbLXe66ge
      Signature 0: t4nmKNdkLSQTMvxGj9JRV29Ff7stD1okMGiens9gH3wEkNVzRB3NmwSFs6FCbX9dWuxUZU6wrKfV3UKptiZbFNU
      Account 0: srw- 3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9 (fee payer)
      Account 1: -rw- 65dX1iH5k59ganpRL1sdgVdGQSyVVXwPES1voMuoRG3x
      Account 2: -rw- 4eaHV6yMLTv2oj1zq64H9JcVaZyUu8xDiSpRuF5p2BAa
      Account 3: -rw- 7ek4EpFaoForoFzzx8wBVQzN9oYP5reNK9JkYHMnTxo9
      Account 4: -rwx wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb
      Account 5: -r-- yKkPofCpS7TapAnJxGYMan8QtNAurzXeisYKg2rhJWp
      Account 6: -r-- DHyAcRbFpRWTkcsAsfwQpbABXvtjs6bQ1dq5ScNhRDoQ
      Account 7: -r-- SysvarRent111111111111111111111111111111111
      Account 8: -r-- SysvarC1ock11111111111111111111111111111111
      Account 9: -r-- BPFLoaderUpgradeab1e11111111111111111111111
      Account 10: -r-- 11111111111111111111111111111111
      Instruction 0
        Program:   wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb (4)
        Account 0: 3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9 (0)
        Account 1: yKkPofCpS7TapAnJxGYMan8QtNAurzXeisYKg2rhJWp (5)
        Account 2: 65dX1iH5k59ganpRL1sdgVdGQSyVVXwPES1voMuoRG3x (1)
        Account 3: DHyAcRbFpRWTkcsAsfwQpbABXvtjs6bQ1dq5ScNhRDoQ (6)
        Account 4: 3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9 (0)
        Account 5: 4eaHV6yMLTv2oj1zq64H9JcVaZyUu8xDiSpRuF5p2BAa (2)
        Account 6: 7ek4EpFaoForoFzzx8wBVQzN9oYP5reNK9JkYHMnTxo9 (3)
        Account 7: wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb (4)
        Account 8: SysvarRent111111111111111111111111111111111 (7)
        Account 9: SysvarC1ock11111111111111111111111111111111 (8)
        Account 10: BPFLoaderUpgradeab1e11111111111111111111111 (9)
        Account 11: 11111111111111111111111111111111 (10)
        Data: [8]

And some logging output about this transaction:

[2021-12-17T20:19:42.381621999Z DEBUG solana_program_runtime::stable_log] Program wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb invoke [1]
[2021-12-17T20:19:42.381782384Z DEBUG solana_program_runtime::stable_log] Program 11111111111111111111111111111111 invoke [2]
[2021-12-17T20:19:42.381795339Z DEBUG solana_program_runtime::stable_log] Program 11111111111111111111111111111111 success
[2021-12-17T20:19:42.381840094Z DEBUG solana_program_runtime::invoke_context] wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb's writable privilege escalated
[2021-12-17T20:19:42.381844743Z DEBUG solana_bpf_loader_program] Program wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb consumed 48494 of 200000 compute units
[2021-12-17T20:19:42.381850574Z DEBUG solana_program_runtime::stable_log] Program wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb failed: Cross-program invocation with unauthorized signer or writable account
[2021-12-17T20:19:42.382633506Z DEBUG solana_runtime::bank] check: 1us load: 578us execute: 20111us txs_len=1
[2021-12-17T20:19:42.382638044Z DEBUG solana_runtime::bank] tx error: Err(InstructionError(0, PrivilegeEscalation))
SanitizedTransaction {
    message: Legacy(Message {
        header: MessageHeader {
            num_required_signatures: 1,
            num_readonly_signed_accounts: 0,
            num_readonly_unsigned_accounts: 6 },
        account_keys: [3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9,
                       65dX1iH5k59ganpRL1sdgVdGQSyVVXwPES1voMuoRG3x,
                       4eaHV6yMLTv2oj1zq64H9JcVaZyUu8xDiSpRuF5p2BAa,
                       7ek4EpFaoForoFzzx8wBVQzN9oYP5reNK9JkYHMnTxo9,
                       wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb,
                       yKkPofCpS7TapAnJxGYMan8QtNAurzXeisYKg2rhJWp,
                       DHyAcRbFpRWTkcsAsfwQpbABXvtjs6bQ1dq5ScNhRDoQ,
                       SysvarRent111111111111111111111111111111111,
                       SysvarC1ock11111111111111111111111111111111,
                       BPFLoaderUpgradeab1e11111111111111111111111,
                       11111111111111111111111111111111],
        recent_blockhash: 6MVtZba5uYYcsmmnYv3Yht4XVQy43H7X285wbLXe66ge,
        instructions: [CompiledInstruction {
            program_id_index: 4,
            accounts: [0, 5, 1, 6, 0, 2, 3, 4, 7, 8, 9, 10], data: [8] }] }),
            message_hash: ArogkuF3r2wci1GHDLfHVihJAj9tAxUAExznE7UfcFmn,
            is_simple_vote_tx: false,
            signatures: [t4nmKNdkLSQTMvxGj9JRV29Ff7stD1okMGiens9gH3wEkNVzRB3NmwSFs6FCbX9dWuxUZU6wrKfV3UKptiZbFNU] }
[2021-12-17T20:19:42.382679443Z DEBUG solana_runtime::bank] 1 errors of 1 txs

Steps to Reproduce

1) Pull down and expand this ledger 2) Run solana-ledger-tool verify --ledger ~/21986_ledger and note the bank frozen: 111883235 ... line of output

jeffwashington commented 2 years ago

this occurred on at least 6 machines nearly simultaneously:

[2021-12-16T23:54:12.765055816Z INFO solana_core::replay_stage] new root 111883207          tTvENthWk6UXNjRzaq6Uzd5nVELaMNjr5nyG3hhpZUz
[2021-12-16T23:54:12.140488994Z INFO solana_core::replay_stage] new root 111883205          38VjntUzeof8BSfbUyCnfXytVebd36L9HvoaTKisYYV8
[2021-12-16T23:54:12.424963130Z INFO solana_core::replay_stage] new root 111883206          FwQpCUn3bmB27bfhmt9Cepeyn6UPBAXCYWCwkYj5qUd9
[2021-12-16T23:54:12.743621818Z INFO solana_core::replay_stage] new root 111883207          3Ex1LtnZa5vwyZgAWC7WXy2aXcn1V1P2bpvK9Z4ZWrHj
[2021-12-16T23:54:12.889492962Z INFO solana_core::replay_stage] new root 111883207          8XWpJyTb3dFgEa1nVkTKoTY3ivg1CXN5Wr6149dXTpVP
[2021-12-16T23:54:12.824509029Z INFO solana_core::replay_stage] new root 111883207          2YfHpT4grkn2U5rrgmBL4NUArxYoHRYi8s5sfxHeCKqd
jeffwashington commented 2 years ago

one of these machines was running a custom build build off this: f4308bdb64436b0afab90434e99672ac5a353a3c

others were running a master image from around the same time. I believe both had:

commit 509bcd2e742e158ae14bface4aeefbb37130ec8f
Author: Jack May <jack@solana.com>
Date:   Tue Dec 14 07:51:23 2021 -0800

    Bump rbpf to v0.2.19 (#21880)
steviez commented 2 years ago

The transaction error was a PrivilegeEscalation; we recently saw a something similar in https://github.com/solana-labs/solana/issues/21397

jeffwashington commented 2 years ago

I believe I've seen this before: My metrics that are easy to read go back about a month. I saw something similar on 2 nodes simultaneously near this date utc: 2021/12/07 20:00. Multiple other nodes were running and did not fail on that slot. I do not have good enough records to know what was running on what machine. It is possible the other nodes were running older code and the 2 that failed were running newer code. fyi @Lichtso since I edited this comment a few times. image

jeffwashington commented 2 years ago

the violent, synchronous failure: image

Lichtso commented 2 years ago

Hmm, 3 days ago, could also be caused by: #21882 Edit: Seems to be older than that.

steviez commented 2 years ago

v1.9 shows the issue as well so we'll need to backport any fix there

Lichtso commented 2 years ago

I haven't found the exact commits, only investigated ranges so far. But, I can already tell that the hash for 111883235 changed at least twice. Both ranges (from 19469 to 19102, and from 18869 to 18068) should be bisected further.

PR,Git,111883234,111883235
#19469,88c1b8f047d52a2936c4b26fa018974eb4ede02b,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19102,85befbc8cdde5699d8f86012a17c17b4d2927f11,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18869,0a63f65c036676c4a800107a13c04d189cf8048c,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18068,8d5c04e25704df0788d7a9808fd9d7594cd3ca8f,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
#17898,7462c27d076d0f259b25361b24ac7c15ca88a5cd,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
steviez commented 2 years ago

I haven't found the exact commits, only investigated ranges so far.

Awesome, thanks for kicking this off.

from 19469 to 19102

Bisection between 85befbc8cdde5699d8f86012a17c17b4d2927f11 (#19102) and 88c1b8f047d52a2936c4b26fa018974eb4ede02b (#19469) showed that decec3cd8b27dca05814ac884f6d61c23acd3131 (#19593) ~is the breaking commit.~ Edit: see https://github.com/solana-labs/solana/issues/21986#issuecomment-997270762

https://github.com/solana-labs/solana/pull/19633 (which is a backport of https://github.com/solana-labs/solana/pull/19593 to 1.7/v1.8) was a breaking change on v1.8 as well. However, https://github.com/solana-labs/solana/pull/20295/ (which is a backport of https://github.com/solana-labs/solana/pull/20265) gets 1.8 back to working for the transaction in the linked ledger. Unfortunately, https://github.com/solana-labs/solana/pull/20265 doesn't get master back to working. Additionally, the bank hashes on master for both before and after the "fix commit" differ from those in v1.8.

That is:

commit (pr / note), 111883235_bank_hash

v1.7 (v1.8):
2f2948f9987adde089fb82198d4910765bc65446 (commit right before #20295), CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
a005a6b8164b12e335012ba03bbdbd37a04b8a62 (#20295), 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g

master:
30bce9ddbcf35453f61f984ab7342b6659562894 (commit immediately before #20265), 76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx
2cd9dc99b660fd555e1975f06aceba8da6a59480 (#20265), HX3kqFzT42voYuLxM4GFLaMzVAnAKvGzDaKDcc75CqdN
94aa9e568a04cf66fadd05275c5b9728ea6b55d1 (tip of master), CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU

So, as @Lichtso pointed out above, looks like more than one commit causing the issue.

steviez commented 2 years ago

Per suggestion from @CriesofCarrots, I did a sanity check and confirmed that running with master immediately before https://github.com/solana-labs/solana/pull/21949 still shows the issue

CriesofCarrots commented 2 years ago

I wonder if it might have something to do with MappedMessage. The gymnastics around that struct was the biggest difference between #19593 and #19633. I'll look at the hashes before and after those two commits on their respective branches tomorrow, if Steve doesn't beat me to it. (spoiler alert) Steve beat me to it. I'll take a close look at #19593 and following commits touching MappedMessage

steviez commented 2 years ago

Result on master with 7578db7ee351a2f1ed4a454b2496e1cc74c2b3bf (parent commit of decec3cd8b27dca05814ac884f6d61c23acd3131 == #19593): 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g Result on v1.8 with 2e4a2c15be6be13d643fa645a60099effec7938c (parent commit of fcda5d4a7d0a60f75240cde560f54a9eaa4b1562 == #19633): 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g

And referencing above, 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g is same as what v1.8.10 produces.

CriesofCarrots commented 2 years ago

I'm now reasonably confident that https://github.com/solana-labs/solana/commit/decec3cd8b27dca05814ac884f6d61c23acd3131 is not the problem, as that commit produces the expected no-self-upgrades hash CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU. The problem seems to be between that commit and https://github.com/solana-labs/solana/commit/30bce9ddbcf35453f61f984ab7342b6659562894 Will bisect further

steviez commented 2 years ago

I'm now reasonably confident that decec3c is not the problem, as that commit produces the expected no-self-upgrades hash CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU.

Agreed; I'll add an edit to my above comment

Will bisect further

👍

CriesofCarrots commented 2 years ago

Next commit-of-interest is https://github.com/solana-labs/solana/commit/df6905c3a67e46c17d66a282d200741508c9e5f0

commit (PR): bank hash at 111883235
9542bae56e4a0511b023f21e70968ff7e9ada85c (#20223): CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
df6905c3a67e46c17d66a282d200741508c9e5f0 (#20165): 76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx

Can you take a look at that one @Lichtso ?

It appears that, on this commit, this transaction fails with a Custom InstructionError instead of failing on PrivilegeEscalation, as would be expected before #20265:

[2021-12-18T20:58:14.982671021Z DEBUG solana_runtime::bank] tx error: Err(InstructionError(0, Custom(0))) 
SanitizedTransaction {
    message: Legacy(Message {
        header: MessageHeader {
            num_required_signatures: 1,
            num_readonly_signed_accounts: 0,
            num_readonly_unsigned_accounts: 6
        },
        account_keys: [
            3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9,
            65dX1iH5k59ganpRL1sdgVdGQSyVVXwPES1voMuoRG3x,
            4eaHV6yMLTv2oj1zq64H9JcVaZyUu8xDiSpRuF5p2BAa,
            7ek4EpFaoForoFzzx8wBVQzN9oYP5reNK9JkYHMnTxo9,
            wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb,
            yKkPofCpS7TapAnJxGYMan8QtNAurzXeisYKg2rhJWp,
            DHyAcRbFpRWTkcsAsfwQpbABXvtjs6bQ1dq5ScNhRDoQ,
            SysvarRent111111111111111111111111111111111,
            SysvarC1ock11111111111111111111111111111111,
            BPFLoaderUpgradeab1e11111111111111111111111,
            11111111111111111111111111111111
        ],
        recent_blockhash: 6MVtZba5uYYcsmmnYv3Yht4XVQy43H7X285wbLXe66ge,
        instructions: [CompiledInstruction { program_id_index: 4, accounts: [0, 5, 1, 6, 0, 2, 3, 4, 7, 8, 9, 10], data: [8] }]
    }),
    message_hash: ArogkuF3r2wci1GHDLfHVihJAj9tAxUAExznE7UfcFmn,
    signatures: [t4nmKNdkLSQTMvxGj9JRV29Ff7stD1okMGiens9gH3wEkNVzRB3NmwSFs6FCbX9dWuxUZU6wrKfV3UKptiZbFNU] }
Lichtso commented 2 years ago

Here are more samples:

PR,Git Commit Hash,Hash of Block 111883235
#20785,97c2732d02f57e4588bb5f62af922dd0f10947c5,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20669,bbc0bb4f0214644cc52c2542ddeda95323835651,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20448,4e65487d2f9c312cbf3b95a3a69461359c390b76,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20529,a6a4cfda89ed4862c006876375e29972a27667a9,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20308,57c8abf4991916e7d2f6de0ad891a3d8ed87ffa9,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20301,491877de3d2fab50053e609526c1abc8e355696a,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20165,df6905c3a67e46c17d66a282d200741508c9e5f0,76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx
#19593,decec3cd8b27dca05814ac884f6d61c23acd3131,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19469,88c1b8f047d52a2936c4b26fa018974eb4ede02b,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19102,85befbc8cdde5699d8f86012a17c17b4d2927f11,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18869,0a63f65c036676c4a800107a13c04d189cf8048c,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18068,8d5c04e25704df0788d7a9808fd9d7594cd3ca8f,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
#17898,7462c27d076d0f259b25361b24ac7c15ca88a5cd,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h

It is not #20165, as that was reverted in #20301 and fixed in #20308. Instead it is #20448 again (just as we originally suspected because this PR has the same error message as #21397).

Lichtso commented 2 years ago

Update, I think I understand what happened:

  1. 20448 introduced a bug

  2. 20604 modifies a test incorrectly, because of the existing bug. When applied the the commit (a6a4cfda89ed4862c006876375e29972a27667a9) immediately before #20448 it fails:

    thread test_program_bpf_instruction_introspection panicked at 'assertion failed: result.is_ok()', tests/programs.rs:1463:5 test test_program_bpf_instruction_introspection ... FAILED

  3. 21397 discovers the bug

  4. 21535 fixes the bug, but is incomplete / incorrect, as it makes a wrong assumption about the intended behavior of test_program_bpf_instruction_introspection, which ironically was only caused by the bug it tried to fix.

So both #20604 and #21535 need to be fixed (again).

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.