Closed willhickey closed 1 year ago
The mce2 log contained 11_177_732 instances of this warning:
[2023-08-21T17:15:56.846190843Z WARN solana_vote_program::vote_state] DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp dropped vote VoteStateUpdate { lockouts: [Lockout { slot: 212791598, confirmation_count: 31 }, Lockout { slot: 212791599, confirmation_count: 30 }, Lockout { slot: 212791600, confirmation_count: 29 }, Lockout { slot: 212791601, confirmation_count: 28 }, Lockout { slot: 212791602, confirmation_count: 27 }, Lockout { slot: 212791603, confirmation_count: 26 }, Lockout { slot: 212791604, confirmation_count: 25 }, Lockout { slot: 212791605, confirmation_count: 24 }, Lockout { slot: 212791606, confirmation_count: 23 }, Lockout { slot: 212791607, confirmation_count: 22 }, Lockout { slot: 212791608, confirmation_count: 21 }, Lockout { slot: 212791609, confirmation_count: 20 }, Lockout { slot: 212791610, confirmation_count: 19 }, Lockout { slot: 212791611, confirmation_count: 18 }, Lockout { slot: 212791612, confirmation_count: 17 }, Lockout { slot: 212791613, confirmation_count: 16 }, Lockout { slot: 212791614, confirmation_count: 15 }, Lockout { slot: 212791615, confirmation_count: 14 }, Lockout { slot: 212791616, confirmation_count: 13 }, Lockout { slot: 212791617, confirmation_count: 12 }, Lockout { slot: 212791618, confirmation_count: 11 }, Lockout { slot: 212791619, confirmation_count: 10 }, Lockout { slot: 212791620, confirmation_count: 9 }, Lockout { slot: 212791621, confirmation_count: 8 }, Lockout { slot: 212791622, confirmation_count: 7 }, Lockout { slot: 212791623, confirmation_count: 6 }, Lockout { slot: 212791624, confirmation_count: 5 }, Lockout { slot: 212791625, confirmation_count: 4 }, Lockout { slot: 212791626, confirmation_count: 3 }, Lockout { slot: 212791627, confirmation_count: 2 }, Lockout { slot: 212791628, confirmation_count: 1 }], root: Some(212791597), hash: 7jVS4LkriqepyuB6v8Mms4rdWi7n9WgcZpNytHJBBCJ9, timestamp: Some(1692637784) } failed to match hash 7jVS4LkriqepyuB6v8Mms4rdWi7n9WgcZpNytHJBBCJ9 
GEpSV5y4CqxUm9DJNMaTW1yesKuCBbJjGBzLMJoeuVnG
I've saved the log in mce2:~/logs/segv_gpf_2023-08-22/
Aug 29 14:26:04 LuckyStake kernel: traps: solBstoreProc00[4924] general protection fault ip:55fbc2817361 sp:7ef2a8603968 error:0 in solana-validator[55fbc0597000+2335000]
Aug 29 18:01:11 LuckyStake kernel: traps: solBstoreProc07[9059] general protection fault ip:55fe4f64e361 sp:7f39fa272428 error:0 in solana-validator[55fe4d3ce000+2335000]
@Timoon21 Can you add the results of running:
uname -srv
too, please?
uname -srv
sol@LuckyStake:~$ uname -srv
Linux 5.15.0-67-lowlatency #74-Ubuntu SMP PREEMPT Wed Feb 22 15:27:12 UTC 2023
The mce2 log contained 11_177_732 instances of this warning:
[2023-08-21T17:15:56.846190843Z WARN solana_vote_program::vote_state] DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp dropped vote VoteStateUpdate ...
I've saved the log in
mce2:~/logs/segv_gpf_2023-08-22/
This warning with this incidence rate typically indicates that the node deviated from cluster consensus. This node panicked not terribly long after:
[2023-08-21T17:18:03.091054115Z ERROR solana_metrics::metrics] datapoint:
panic program="validator"
thread="solReplayStage"
one=1i
message="panicked at 'We have tried to repair duplicate slot: 212792212 more than 10 times and are unable to freeze a block with bankhash ACD7cLcvWgAP5jeAN158VXx8hs4aLbYkssRqU6aeFHzq, instead we have a block with bankhash Some(9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1360:25" location="core/src/replay_stage.rs:1360:25" version="1.17.0 (src:6bbf514e; feat:2067127942, client:SolanaLabs)"
Pre-panic, we see the corresponding message:
[2023-08-21T17:17:52.163034212Z WARN solana_core::repair::cluster_slot_state_verifier]
Cluster duplicate confirmed slot 212792212 with hash ACD7cLcvWgAP5jeAN158VXx8hs4aLbYkssRqU6aeFHzq,
but our version has hash 9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe
The node restarted shortly before:
[2023-08-21T17:11:34.853702597Z INFO solana_validator] Starting validator with: ArgsOs {
And it looks like it finished initialization before it started dropping votes; init finished around 17:14:16 whereas the first vote was dropped at 17:15:56.
[2023-08-21T17:14:16.504022290Z INFO solana_metrics::metrics] datapoint: validator-new id="mce2CKApCodefxDBUDWXCdBkqoh2dg1vpWJJX2qfuvV" version="1.17.0 (src:6bbf514e; feat:2067127942, client:SolanaLabs)" cluster_type=1i
The node did output a debug file:
[2023-08-21T17:17:52.177245862Z INFO solana_runtime::bank::bank_hash_details] writing details of bank 212792212 to /home/sol/ledger/bank_hash_details/212792212-9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe.json
We could have compared this file to one generated from replaying the same slot 212792212 to figure out what deviated. However, it looks like the ledger was completely wiped on that machine, as that file/directory is gone. We could still replay that slot and compare higher-level numbers (i.e. `capitalization`) to see if we get lucky and spot something like a single bit flip, but if the difference came from an account hash, we don't really have any recourse.
As for the actual fault, the node restarted shortly before:
[2023-08-21T17:44:27.666706415Z INFO solana_validator] Starting validator with: ArgsOs {
And the fault happened here:
Aug 22 17:45:03 canary-am-2 kernel: traps: solBstoreProc02[666122] general protection fault ip:555881b65d94 sp:7f7df77f4b98 error:0 in solana-validator[55587f8f7000+22e5000]
However, the node would not have started blockstore processing yet:
// Intent to unpack snapshot
[2023-08-21T17:44:36.042722209Z INFO solana_runtime::snapshot_bank_utils] Loading bank from full snapshot archive: /home/sol/ledger/snapshot-212775262-5js42qM5CG8bSBK5Gk7aDYytsyPsQinyAd3P3Yof9baB.tar.zst, and incremental snapshot archive: Some("/home/sol/ledger/incremental-snapshot-212775262-212791574-3pvhEdKRxSbwbTbkcifkHYgZR7ghPp3BhDBdycbEK3xJ.tar.zst")
// fault
// Freeze bank for snapshot slot
[2023-08-21T17:47:43.647564994Z INFO solana_runtime::bank] bank frozen: 212791574 hash: 4cVQ8wd67TNQp4rm5q7Z5o99sVKANmRuFxhXRg2oFpVB accounts_delta: 6bQKJZVeNMmZuMboLvqKFy4YbApxSsQAwR33CgsrRKty signature_count: 2715 last_blockhash: 9qLLMFVSMgLgcruMyf3AaEaUnptiVDSUrHCcoB2LsveA capitalization: 555464624110280265, stats: BankHashStats { num_updated_accounts: 7383, num_removed_accounts: 21, num_lamports_stored: 1085459649670384, total_data_len: 25926274, num_executable_accounts: 0 }
// Begin local ledger replay
[2023-08-21T17:47:44.465208236Z INFO solana_ledger::blockstore_processor] Processing ledger from slot 212791574...
The threads in this pool are only used to perform transaction processing within blockstore_processor.rs:
https://github.com/solana-labs/solana/blob/0f417199186042028269830401687efe23e0ae18/ledger/src/blockstore_processor.rs#L90-L96
Here is the only place this thread pool is used:
https://github.com/solana-labs/solana/blob/0f417199186042028269830401687efe23e0ae18/ledger/src/blockstore_processor.rs#L226-L239
Given that the node had not started processing blocks yet, these threads would not have had any work sent to them. So it would seem that some other thread corrupted state, which led to the solBstoreProcXY threads faulting.
wow, segvs :)
Maybe just run solana-validator via gdb or lldb and hope we get a thread backtrace? Then it would be clear why the thread segv-ed even if it wasn't processing txes...
maybe just run solana-validator via gdb or lldb?
Unfortunately, I don't think we can reliably reproduce at the moment. Additionally, I think running with a debugger attached would slow the process down enough that the node would be unable to keep up, which would obfuscate debugging efforts. A core dump might be an option though; make sure core dumps are enabled and compile with something like this (this is what I have used for memory profiling):
RUSTFLAGS='-g -C force-frame-pointers=yes' cargo build --release --bin solana-validator
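To make sure a core file is actually produced when the fault hits, something along these lines in the shell that launches the validator (the core_pattern path is illustrative, and distros running systemd-coredump manage kernel.core_pattern themselves):

```shell
# Allow unlimited-size core dumps in this shell (inherited by the validator)
ulimit -c unlimited
ulimit -c   # prints "unlimited" if it took effect

# Then point cores somewhere predictable, e.g. (requires root; skip if
# systemd-coredump already owns kernel.core_pattern on your distro):
#   sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
```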
I've had pretty good success reproducing this. I compiled a binary w debug symbols and expect to be able to repro in the next day or so. Once I have a core dump w debug symbols it's GG.
Reposting from discord -- was able to replicate w the debug build
bt
#0 0x00005564907f3faa in _rjem_je_arena_tcache_fill_small (tsdn=
it seems the program crashed inside of jemalloc
Am running it again so we have even more data
It's also interesting that in Timoon's discord post the code crashes 3x at the same exact offset.
It's possible we're corrupting jemalloc's memory in a way that manifests on that specific instruction
it seems to just be updating stats... one of those pointers seems to have gone bad https://github.com/jemalloc/jemalloc/blob/5.2.1/src/arena.c#L1435
I have the same offset too, and 384 GB RAM.
Summarizing the latest from the Discord channel (https://discord.com/channels/428295358100013066/1146546152309801142):
Sounds like we're leaning more towards logic bug as opposed to HW level issue given we're seeing crashes happen in the exact same place w/ same backtrace (would expect random bit flips to result in more random symptoms). But we still see some system level affinity - e.g. Ben can repro easily while other operators have never seen this issue.
Seems like an outsized number of Ubuntu 22.04 systems have seen the problem, but we've also observed it on 20.04 and across different kernel versions. We've observed it on both Labs and Jito clients. Only observed on v1.16. Unable to repro when replaying the same slot we previously died on, so it's not some straightforward logic bug. We've also seen a crash during startup before we'd even started replaying blocks (still in a blockstore processor thread, still unpacking the snapshot; not clear if we can trust log timestamps to say this for sure).
We are crashing inside of malloc. The layout is correct and the memory metrics look fine --> mem corruption. We crashed while updating jemalloc metrics. Seems likely jemalloc memory was corrupted by some other thread. No jemalloc changes between v1.14 and v1.16.
With jemalloc's --enable-debug, the variable / stack region that gets corrupted is "sopts" (static options) in the __malloc_default call. It's always that same region of memory that gets corrupted, so something is deterministically writing out of bounds.
Current suspicion is something in the runtime. Alessandro has a speculative fix for possible heap corruption: https://github.com/alessandrod/rbpf/commit/c9b2a6a53af829ac8cfe803e02c537d5d4d98c7f
Running multiple machines against MNB that have seen the GPF in an attempt to reproduce. Also asking Ben to run w/ Alessandro's speculative fix to see if he can still repro
fixed in 1.16.14
Several nodes have gotten a GPF while running v1.16 on mainnet-beta. Debugging discussion is on Discord in the debug-gpf-1_16 channel. The faults are caused by one of the solBstoreProcXY threads in this threadpool; this pool is used solely for transaction replay. Here is a table tracking occurrences:

mcb7 on 2023-07-03:
https://discord.com/channels/428295358100013066/1027231858565586985/1125509913746083961

mce2 on 2023-08-22:
mce2 was running 9d83bb2a when solana-validator received SEGV at 2023-08-22 17:45:12 UTC.
/var/log/kern.log:
/var/log/apport.log:
addr2line -e /home/sol/.local/share/solana/install/active_release/bin/solana-validator --functions --demangle 0x22e5000

meyebro on 2023-09-11
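One note on the addr2line step: the address to resolve should be an offset relative to the mapping base, which can be derived from the trap line's ip and mapping fields. A quick sketch using the canary-am-2 values quoted earlier (the mapping is reported as base+size):

```python
# Values from:
# "general protection fault ip:555881b65d94 ... in solana-validator[55587f8f7000+22e5000]"
ip = 0x555881B65D94    # faulting instruction pointer
base = 0x55587F8F7000  # start of the solana-validator text mapping
size = 0x22E5000       # mapping length

offset = ip - base
assert 0 <= offset < size  # sanity check: the fault lies inside the mapping
print(hex(offset))         # -> 0x226ed94; feed this offset to addr2line
```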