Closed willhickey closed 1 year ago
The mce2 log contained 11_177_732 instances of this warning:
[2023-08-21T17:15:56.846190843Z WARN solana_vote_program::vote_state] DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp dropped vote VoteStateUpdate { lockouts: [Lockout { slot: 212791598, confirmation_count: 31 }, Lockout { slot: 212791599, confirmation_count: 30 }, Lockout { slot: 212791600, confirmation_count: 29 }, Lockout { slot: 212791601, confirmation_count: 28 }, Lockout { slot: 212791602, confirmation_count: 27 }, Lockout { slot: 212791603, confirmation_count: 26 }, Lockout { slot: 212791604, confirmation_count: 25 }, Lockout { slot: 212791605, confirmation_count: 24 }, Lockout { slot: 212791606, confirmation_count: 23 }, Lockout { slot: 212791607, confirmation_count: 22 }, Lockout { slot: 212791608, confirmation_count: 21 }, Lockout { slot: 212791609, confirmation_count: 20 }, Lockout { slot: 212791610, confirmation_count: 19 }, Lockout { slot: 212791611, confirmation_count: 18 }, Lockout { slot: 212791612, confirmation_count: 17 }, Lockout { slot: 212791613, confirmation_count: 16 }, Lockout { slot: 212791614, confirmation_count: 15 }, Lockout { slot: 212791615, confirmation_count: 14 }, Lockout { slot: 212791616, confirmation_count: 13 }, Lockout { slot: 212791617, confirmation_count: 12 }, Lockout { slot: 212791618, confirmation_count: 11 }, Lockout { slot: 212791619, confirmation_count: 10 }, Lockout { slot: 212791620, confirmation_count: 9 }, Lockout { slot: 212791621, confirmation_count: 8 }, Lockout { slot: 212791622, confirmation_count: 7 }, Lockout { slot: 212791623, confirmation_count: 6 }, Lockout { slot: 212791624, confirmation_count: 5 }, Lockout { slot: 212791625, confirmation_count: 4 }, Lockout { slot: 212791626, confirmation_count: 3 }, Lockout { slot: 212791627, confirmation_count: 2 }, Lockout { slot: 212791628, confirmation_count: 1 }], root: Some(212791597), hash: 7jVS4LkriqepyuB6v8Mms4rdWi7n9WgcZpNytHJBBCJ9, timestamp: Some(1692637784) } failed to match hash 7jVS4LkriqepyuB6v8Mms4rdWi7n9WgcZpNytHJBBCJ9 
GEpSV5y4CqxUm9DJNMaTW1yesKuCBbJjGBzLMJoeuVnG
I've saved the log in mce2:~/logs/segv_gpf_2023-08-22/
Aug 29 14:26:04 LuckyStake kernel: traps: solBstoreProc00[4924] general protection fault ip:55fbc2817361 sp:7ef2a8603968 error:0 in solana-validator[55fbc0597000+2335000]
Aug 29 18:01:11 LuckyStake kernel: traps: solBstoreProc07[9059] general protection fault ip:55fe4f64e361 sp:7f39fa272428 error:0 in solana-validator[55fe4d3ce000+2335000]
@Timoon21 Can you add the results of running:
uname -srv
too, please?
uname -srv
sol@LuckyStake:~$ uname -srv
Linux 5.15.0-67-lowlatency #74-Ubuntu SMP PREEMPT Wed Feb 22 15:27:12 UTC 2023
The mce2 log contained 11_177_732 instances of this warning:
[2023-08-21T17:15:56.846190843Z WARN solana_vote_program::vote_state] DDnAqxJVFo2GVTujibHt5cjevHMSE9bo8HJaydHoshdp dropped vote VoteStateUpdate ...
I've saved the log in
mce2:~/logs/segv_gpf_2023-08-22/
This warning with this incidence rate typically indicates that the node deviated from cluster consensus. This node panicked not terribly long after:
[2023-08-21T17:18:03.091054115Z ERROR solana_metrics::metrics] datapoint:
panic program="validator"
thread="solReplayStage"
one=1i
message="panicked at 'We have tried to repair duplicate slot: 212792212 more than 10 times and are unable to freeze a block with bankhash ACD7cLcvWgAP5jeAN158VXx8hs4aLbYkssRqU6aeFHzq, instead we have a block with bankhash Some(9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1360:25" location="core/src/replay_stage.rs:1360:25" version="1.17.0 (src:6bbf514e; feat:2067127942, client:SolanaLabs)"
Pre-panic, we see the corresponding message:
[2023-08-21T17:17:52.163034212Z WARN solana_core::repair::cluster_slot_state_verifier]
Cluster duplicate confirmed slot 212792212 with hash ACD7cLcvWgAP5jeAN158VXx8hs4aLbYkssRqU6aeFHzq,
but our version has hash 9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe
The node restarted shortly before:
[2023-08-21T17:11:34.853702597Z INFO solana_validator] Starting validator with: ArgsOs {
And it looks like it finished initialization before it started dropping votes; init finished around 17:14:16 whereas the first vote was dropped at 17:15:56.
[2023-08-21T17:14:16.504022290Z INFO solana_metrics::metrics] datapoint: validator-new id="mce2CKApCodefxDBUDWXCdBkqoh2dg1vpWJJX2qfuvV" version="1.17.0 (src:6bbf514e; feat:2067127942, client:SolanaLabs)" cluster_type=1i
The node did output a debug file:
[2023-08-21T17:17:52.177245862Z INFO solana_runtime::bank::bank_hash_details] writing details of bank 212792212 to /home/sol/ledger/bank_hash_details/212792212-9HJuQ55VrSaKCMDUZfyHpXhgN2ZkxCXF7KJbiWka6Txe.json
We could have compared this file to one generated from replaying the same slot 212792212 to figure out what deviated. However, it looks like the ledger was completely wiped on that machine, as that file/directory is gone. We could still replay that slot and compare higher-level numbers (i.e. `capitalization`) to see if we get lucky and spot something like a single bit flip, but if the difference came from an account hash, we don't really have any recourse.
As for the actual fault, the node restarted shortly before:
[2023-08-21T17:44:27.666706415Z INFO solana_validator] Starting validator with: ArgsOs {
And the fault happened here:
Aug 22 17:45:03 canary-am-2 kernel: traps: solBstoreProc02[666122] general protection fault ip:555881b65d94 sp:7f7df77f4b98 error:0 in solana-validator[55587f8f7000+22e5000]
However, the node would not have started blockstore processing yet:
// Intent to unpack snapshot
[2023-08-21T17:44:36.042722209Z INFO solana_runtime::snapshot_bank_utils] Loading bank from full snapshot archive: /home/sol/ledger/snapshot-212775262-5js42qM5CG8bSBK5Gk7aDYytsyPsQinyAd3P3Yof9baB.tar.zst, and incremental snapshot archive: Some("/home/sol/ledger/incremental-snapshot-212775262-212791574-3pvhEdKRxSbwbTbkcifkHYgZR7ghPp3BhDBdycbEK3xJ.tar.zst")
// fault
// Freeze bank for snapshot slot
[2023-08-21T17:47:43.647564994Z INFO solana_runtime::bank] bank frozen: 212791574 hash: 4cVQ8wd67TNQp4rm5q7Z5o99sVKANmRuFxhXRg2oFpVB accounts_delta: 6bQKJZVeNMmZuMboLvqKFy4YbApxSsQAwR33CgsrRKty signature_count: 2715 last_blockhash: 9qLLMFVSMgLgcruMyf3AaEaUnptiVDSUrHCcoB2LsveA capitalization: 555464624110280265, stats: BankHashStats { num_updated_accounts: 7383, num_removed_accounts: 21, num_lamports_stored: 1085459649670384, total_data_len: 25926274, num_executable_accounts: 0 }
// Begin local ledger replay
[2023-08-21T17:47:44.465208236Z INFO solana_ledger::blockstore_processor] Processing ledger from slot 212791574...
The threads in this pool are only used to perform transaction processing within blockstore_processor.rs:
https://github.com/solana-labs/solana/blob/0f417199186042028269830401687efe23e0ae18/ledger/src/blockstore_processor.rs#L90-L96
Here is the only place this thread pool is used:
https://github.com/solana-labs/solana/blob/0f417199186042028269830401687efe23e0ae18/ledger/src/blockstore_processor.rs#L226-L239
Given that the node had not started processing blocks yet, these threads would not have had any work sent to them. So it would seem that some other thread corrupted state, which led to the solBstoreProcXY threads faulting.
wow, segvs :)
Maybe just run solana-validator via gdb or lldb and hope we get a thread backtrace? Then it would be clear why the thread segv-ed even if it wasn't processing txes...
maybe just run solana-validator via gdb or lldb?
Unfortunately, I don't think we can reliably reproduce at the moment. Additionally, I think running with a debugger attached would slow the process down enough that the node would be unable to keep up, which would obfuscate debugging efforts. A core dump might be an option though; make sure core dumps are enabled and compile with something like this (this is what I have used for memory profiling):
RUSTFLAGS='-g -C force-frame-pointers=yes' cargo build --release --bin solana-validator
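To make sure a core file is actually produced when the fault hits, something along these lines in the shell that launches the validator (the core_pattern path is illustrative, and distros running systemd-coredump manage kernel.core_pattern themselves):

```shell
# Allow unlimited-size core dumps in this shell (inherited by the validator)
ulimit -c unlimited
ulimit -c   # prints "unlimited" if it took effect

# Then point cores somewhere predictable, e.g. (requires root; skip if
# systemd-coredump already owns kernel.core_pattern on your distro):
#   sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
```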
I've had pretty good success reproducing this. I compiled a binary w debug symbols and expect to be able to repro in the next day or so. Once I have a core dump w debug symbols it's GG.
Reposting from discord -- was able to replicate w the debug build
bt
#0 0x00005564907f3faa in _rjem_je_arena_tcache_fill_small (tsdn=
it seems the program crashed inside of jemalloc
Am running it again so we have even more data
It's also interesting that in Timoon's discord post the code crashes 3x at the same exact offset.
It's possible we're corrupting jemalloc's memory in a way that manifests on that specific instruction
it seems to just be updating stats... one of those pointers seems to have gone bad https://github.com/jemalloc/jemalloc/blob/5.2.1/src/arena.c#L1435
I have the same offset too, and 384 GB RAM.
Summarizing the latest from the Discord channel (https://discord.com/channels/428295358100013066/1146546152309801142):
Sounds like we're leaning more towards logic bug as opposed to HW level issue given we're seeing crashes happen in the exact same place w/ same backtrace (would expect random bit flips to result in more random symptoms). But we still see some system level affinity - e.g. Ben can repro easily while other operators have never seen this issue.
Seems like an outsized number of Ubuntu 22.04 systems have seen the problem, but we've also observed it on 20.04 and across different kernel versions. We've observed it on both Labs and Jito clients. Only observed on v1.16. Unable to repro when replaying the same slot we previously died on, so it's not some straightforward logic bug. We've also seen a crash during startup before we'd even started replaying blocks (still in a blockstore processor thread, still unpacking the snapshot; not clear if we can trust log timestamps to say this for sure).
We are crashing inside of malloc. The layout is correct and the memory metrics look fine --> mem corruption. We crashed while updating jemalloc metrics. Seems likely jemalloc memory was corrupted by some other thread. No jemalloc changes between v1.14 and v1.16.
With jemalloc's --enable-debug, the variable / stack region that gets corrupted is "sopts" (static options) in the __malloc_default call. It's always that same region of memory that gets corrupted, so something is deterministically writing out of bounds.
Current suspicion is something in the runtime. Alessandro has a speculative fix for possible heap corruption: https://github.com/alessandrod/rbpf/commit/c9b2a6a53af829ac8cfe803e02c537d5d4d98c7f
Running multiple machines against MNB that have seen the GPF in an attempt to reproduce. Also asking Ben to run w/ Alessandro's speculative fix to see if he can still repro
fixed in 1.16.14
Several nodes have gotten a GPF while running v1.16 on mainnet-beta. Debugging discussion is on Discord in the debug-gpf-1_16 channel. The faults are caused by one of the solBstoreProcXY threads in this threadpool; this pool is used solely for transaction replay. Here is a table tracking occurrences:

mcb7 on 2023-07-03:
https://discord.com/channels/428295358100013066/1027231858565586985/1125509913746083961

mce2 on 2023-08-22:
mce2 was running 9d83bb2a when solana-validator received SEGV at 2023-08-22 17:45:12 UTC.
/var/log/kern.log:
/var/log/apport.log:
addr2line -e /home/sol/.local/share/solana/install/active_release/bin/solana-validator --functions --demangle 0x22e5000

meyebro on 2023-09-11
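One note on the addr2line step: the address to resolve should be an offset relative to the mapping base, which can be derived from the trap line's ip and mapping fields. A quick sketch using the canary-am-2 values quoted earlier (the mapping is reported as base+size):

```python
# Values from:
# "general protection fault ip:555881b65d94 ... in solana-validator[55587f8f7000+22e5000]"
ip = 0x555881B65D94    # faulting instruction pointer
base = 0x55587F8F7000  # start of the solana-validator text mapping
size = 0x22E5000       # mapping length

offset = ip - base
assert 0 <= offset < size  # sanity check: the fault lies inside the mapping
print(hex(offset))         # -> 0x226ed94; feed this offset to addr2line
```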