Closed steviez closed 2 years ago
this occurred on at least 6 machines nearly simultaneously:
[2021-12-16T23:54:12.765055816Z INFO solana_core::replay_stage] new root 111883207 tTvENthWk6UXNjRzaq6Uzd5nVELaMNjr5nyG3hhpZUz
[2021-12-16T23:54:12.140488994Z INFO solana_core::replay_stage] new root 111883205 38VjntUzeof8BSfbUyCnfXytVebd36L9HvoaTKisYYV8
[2021-12-16T23:54:12.424963130Z INFO solana_core::replay_stage] new root 111883206 FwQpCUn3bmB27bfhmt9Cepeyn6UPBAXCYWCwkYj5qUd9
[2021-12-16T23:54:12.743621818Z INFO solana_core::replay_stage] new root 111883207 3Ex1LtnZa5vwyZgAWC7WXy2aXcn1V1P2bpvK9Z4ZWrHj
[2021-12-16T23:54:12.889492962Z INFO solana_core::replay_stage] new root 111883207 8XWpJyTb3dFgEa1nVkTKoTY3ivg1CXN5Wr6149dXTpVP
[2021-12-16T23:54:12.824509029Z INFO solana_core::replay_stage] new root 111883207 2YfHpT4grkn2U5rrgmBL4NUArxYoHRYi8s5sfxHeCKqd
one of these machines was running a custom build build off this: f4308bdb64436b0afab90434e99672ac5a353a3c
others were running a master image from around the same time. I believe both had:
commit 509bcd2e742e158ae14bface4aeefbb37130ec8f
Author: Jack May <jack@solana.com>
Date: Tue Dec 14 07:51:23 2021 -0800
Bump rbpf to v0.2.19 (#21880)
The transaction error was a PrivilegeEscalation
; we recently saw a something similar in https://github.com/solana-labs/solana/issues/21397
I believe I've seen this before: My metrics that are easy to read go back about a month. I saw something similar on 2 nodes simultaneously near this date utc: 2021/12/07 20:00. Multiple other nodes were running and did not fail on that slot. I do not have good enough records to know what was running on what machine. It is possible the other nodes were running older code and the 2 that failed were running newer code. fyi @Lichtso since I edited this comment a few times.
the violent, synchronous failure:
Hmm, 3 days ago, could also be caused by: #21882 Edit: Seems to be older than that.
v1.9
shows the issue as well so we'll need to backport any fix there
I haven't found the exact commits, only investigated ranges so far. But, I can already tell that the hash for 111883235 changed at least twice. Both ranges (from 19469 to 19102, and from 18869 to 18068) should be bisected further.
PR,Git,111883234,111883235
#19469,88c1b8f047d52a2936c4b26fa018974eb4ede02b,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19102,85befbc8cdde5699d8f86012a17c17b4d2927f11,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18869,0a63f65c036676c4a800107a13c04d189cf8048c,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18068,8d5c04e25704df0788d7a9808fd9d7594cd3ca8f,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
#17898,7462c27d076d0f259b25361b24ac7c15ca88a5cd,6JdyMivxvv6WrSjsNe2qJ3cJBLzV4pQRchr19RbE4Zfk,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
I haven't found the exact commits, only investigated ranges so far.
Awesome, thanks for kicking this off.
from 19469 to 19102
Bisection between 85befbc8cdde5699d8f86012a17c17b4d2927f11 (#19102) and 88c1b8f047d52a2936c4b26fa018974eb4ede02b (#19469) showed that decec3cd8b27dca05814ac884f6d61c23acd3131 (#19593) ~is the breaking commit.~ Edit: see https://github.com/solana-labs/solana/issues/21986#issuecomment-997270762
https://github.com/solana-labs/solana/pull/19633 (which is a backport of https://github.com/solana-labs/solana/pull/19593 to 1.7/v1.8
) was a breaking change on v1.8
as well. However, https://github.com/solana-labs/solana/pull/20295/ (which is a backport of https://github.com/solana-labs/solana/pull/20265) gets 1.8
back to working for the transaction in the linked ledger. Unfortunately, https://github.com/solana-labs/solana/pull/20265 doesn't get master
back to working. Additionally, the bank hashes on master for both before and after the "fix commit" differ from those in v1.8
.
That is:
commit (pr / note), 111883235_bank_hash
v1.7 (v1.8):
2f2948f9987adde089fb82198d4910765bc65446 (commit right before #20295), CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
a005a6b8164b12e335012ba03bbdbd37a04b8a62 (#20295), 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
master:
30bce9ddbcf35453f61f984ab7342b6659562894 (commit immediately before #20265), 76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx
2cd9dc99b660fd555e1975f06aceba8da6a59480 (#20265), HX3kqFzT42voYuLxM4GFLaMzVAnAKvGzDaKDcc75CqdN
94aa9e568a04cf66fadd05275c5b9728ea6b55d1 (tip of master), CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
So, as @Lichtso pointed out above, looks like more than one commit causing the issue.
Per suggestion from @CriesofCarrots, I did a sanity check and confirmed that running with master
immediately before https://github.com/solana-labs/solana/pull/21949 still shows the issue
I wonder if it might have something to do with MappedMessage. The gymnastics around that struct was the biggest difference between #19593 and #19633. I'll look at the hashes before and after those two commits on their respective branches tomorrow, if Steve doesn't beat me to it. (spoiler alert) Steve beat me to it. I'll take a close look at #19593 and following commits touching MappedMessage
Result on master
with 7578db7ee351a2f1ed4a454b2496e1cc74c2b3bf (parent commit of decec3cd8b27dca05814ac884f6d61c23acd3131 == #19593): 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
Result on v1.8
with 2e4a2c15be6be13d643fa645a60099effec7938c (parent commit of fcda5d4a7d0a60f75240cde560f54a9eaa4b1562 == #19633): 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
And referencing above, 4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
is same as what v1.8.10
produces.
I'm now reasonably confident that https://github.com/solana-labs/solana/commit/decec3cd8b27dca05814ac884f6d61c23acd3131 is not the problem, as that commit produces the expected no-self-upgrades hash CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
.
The problem seems to be between that commit and https://github.com/solana-labs/solana/commit/30bce9ddbcf35453f61f984ab7342b6659562894
Will bisect further
I'm now reasonably confident that decec3c is not the problem, as that commit produces the expected no-self-upgrades hash
CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
.
Agreed; I'll add an edit to my above comment
Will bisect further
👍
Next commit-of-interest is https://github.com/solana-labs/solana/commit/df6905c3a67e46c17d66a282d200741508c9e5f0
commit (PR): bank hash at 111883235
9542bae56e4a0511b023f21e70968ff7e9ada85c (#20223): CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
df6905c3a67e46c17d66a282d200741508c9e5f0 (#20165): 76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx
Can you take a look at that one @Lichtso ?
It appears that, on this commit, this transaction fails with a Custom InstructionError instead of failing on PrivilegeEscalation, as would be expected before #20265:
[2021-12-18T20:58:14.982671021Z DEBUG solana_runtime::bank] tx error: Err(InstructionError(0, Custom(0)))
SanitizedTransaction {
message: Legacy(Message {
header: MessageHeader {
num_required_signatures: 1,
num_readonly_signed_accounts: 0,
num_readonly_unsigned_accounts: 6
},
account_keys: [
3rWmJZgB2wtSsYCBYouL4mYRBZhH9V363BVs7HBaJUy9,
65dX1iH5k59ganpRL1sdgVdGQSyVVXwPES1voMuoRG3x,
4eaHV6yMLTv2oj1zq64H9JcVaZyUu8xDiSpRuF5p2BAa,
7ek4EpFaoForoFzzx8wBVQzN9oYP5reNK9JkYHMnTxo9,
wormDTUJ6AWPNvk59vGQbDvGJmqbDTdgWgAqcLBCgUb,
yKkPofCpS7TapAnJxGYMan8QtNAurzXeisYKg2rhJWp,
DHyAcRbFpRWTkcsAsfwQpbABXvtjs6bQ1dq5ScNhRDoQ,
SysvarRent111111111111111111111111111111111,
SysvarC1ock11111111111111111111111111111111,
BPFLoaderUpgradeab1e11111111111111111111111,
11111111111111111111111111111111
],
recent_blockhash: 6MVtZba5uYYcsmmnYv3Yht4XVQy43H7X285wbLXe66ge,
instructions: [CompiledInstruction { program_id_index: 4, accounts: [0, 5, 1, 6, 0, 2, 3, 4, 7, 8, 9, 10], data: [8] }]
}),
message_hash: ArogkuF3r2wci1GHDLfHVihJAj9tAxUAExznE7UfcFmn,
signatures: [t4nmKNdkLSQTMvxGj9JRV29Ff7stD1okMGiens9gH3wEkNVzRB3NmwSFs6FCbX9dWuxUZU6wrKfV3UKptiZbFNU] }
Here are more samples:
PR,Git Commit Hash,Hash of Block 111883235
#20785,97c2732d02f57e4588bb5f62af922dd0f10947c5,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20669,bbc0bb4f0214644cc52c2542ddeda95323835651,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20448,4e65487d2f9c312cbf3b95a3a69461359c390b76,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#20529,a6a4cfda89ed4862c006876375e29972a27667a9,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20308,57c8abf4991916e7d2f6de0ad891a3d8ed87ffa9,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20301,491877de3d2fab50053e609526c1abc8e355696a,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#20165,df6905c3a67e46c17d66a282d200741508c9e5f0,76C9HpJeekK2e2dTvPv53cmu9s1efAtw4U8XyNrvz5Zx
#19593,decec3cd8b27dca05814ac884f6d61c23acd3131,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19469,88c1b8f047d52a2936c4b26fa018974eb4ede02b,CYvAB1HULgZQHmKbYAnRrCEMJ2jRju3UsSq7U6GQZ3FU
#19102,85befbc8cdde5699d8f86012a17c17b4d2927f11,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18869,0a63f65c036676c4a800107a13c04d189cf8048c,4JTffyL2mXd17GXRDCBbUSg5FD2h6BUsCBobf7ZUer9g
#18068,8d5c04e25704df0788d7a9808fd9d7594cd3ca8f,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
#17898,7462c27d076d0f259b25361b24ac7c15ca88a5cd,2PTLvuvmgZ41MmMNcs9KnfWYuMtwFVERM4pFJzsh755h
It is not #20165, as that was reverted in #20301 and fixed in #20308. Instead it is #20448 again (just as we originally suspected because this PR has the same error message as #21397).
Update, I think I understand what happened:
thread test_program_bpf_instruction_introspection
panicked at 'assertion failed: result.is_ok()', tests/programs.rs:1463:5
test test_program_bpf_instruction_introspection
... FAILED
test_program_bpf_instruction_introspection
, which ironically was only caused by the bug it tried to fix.So both #20604 and #21535 need to be fixed (again).
This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.
Problem
A node running
master
(commit 94aa9e568a04cf66fadd05275c5b9728ea6b55d1) against mainnet was found to have a consensus failure; this occurred in slot111883235
:Here is the offending transaction:
And some logging output about this transaction:
Steps to Reproduce
1) Pull down and expand this ledger 2) Run
solana-ledger-tool verify --ledger ~/21986_ledger
and note thebank frozen: 111883235 ...
line of output