Closed brooksprumo closed 3 years ago
I have seen similar crashes in my validator several times.
[2467628.738572] blockstore_proc[25701]: segfault at 7fd763a42782 ip 00007fd763a42782 sp 00007fdd2a7311b0 error 14 [2467628.738577] Code: Bad RIP value.
I reproduced on one of my validators running against testnet
, and believe I was able to pinpoint the transaction. Here are the steps to reproduce:
~/21278_ledger/
$ ledger-tool verify -l ~/21278_ledger
I found the bad tx is in slot 103354509
, so I created a snapshot at 103354508
. In order to pinpoint which tx was causing the segfault, I serialized the blockstore_processor::execute_batches function to check results and report progress batch by batch. Doing so, this looks to be the problematic tx.
$ ledger-tool slot 103354509 -l ~/21278_ledger/
...
Entry 4 - num_hashes: 2945, hashes: 6EuG8igRQbSjKohwwWnEpC6A1fSazoXZZf9tRFPjZQqd, transactions: 1
Transaction 0
Recent Blockhash: 66ogGDnjCEwFjbo1sz4iUhvxxk1HF4PxZSnJXAfeyCBV
Signature 0: 45cfShbqK2AQ9Posh4j56AvziWmEw8ZzPzNn7q7otaMtuRAJG7MvPoRGrjRNSNDXLcPG64YdTXetM8scihkVCYDY
Signature 1: 29s8eSRWHuCkQ3z3rdFtu89dxeNf2Toc815yz8zy4tj5wUw971euaepUBAZPfAxAnsefYFxLBeKy7vLjPzHZrxZY
Account 0: srw- J75jd3kjsABQSDrEdywcyhmbq8eHDowfW9xtEWsVALy9 (fee payer)
Account 1: srw- CjDiyTgSFRbCxzfmj7NAikHScDrgBvEYWCfJLRpysvZV
Account 2: -rw- FuEm7UMaCYHThzKaf9DcJ7MdM4t4SALfeNnYQq46foVv
Account 3: -rw- Hsj5SnZTDvNis3dZCtzpj5NCy8JzBHRXKrgdSSS26cnr
Account 4: -r-- 11111111111111111111111111111111
Account 5: -r-x 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk
Instruction 0
Program: 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk (5)
Account 0: J75jd3kjsABQSDrEdywcyhmbq8eHDowfW9xtEWsVALy9 (0)
Account 1: FuEm7UMaCYHThzKaf9DcJ7MdM4t4SALfeNnYQq46foVv (2)
Account 2: Hsj5SnZTDvNis3dZCtzpj5NCy8JzBHRXKrgdSSS26cnr (3)
Account 3: CjDiyTgSFRbCxzfmj7NAikHScDrgBvEYWCfJLRpysvZV (1)
Account 4: 11111111111111111111111111111111 (4)
Data: [0, 44, 13, 146, 97, 0, 0, 0, 0, 164, 13, 146, 97, 0, 0, 0, 0, 0, 225, 245, 5, 0, 0, 0, 0]
Status: Unavailable
Here is the diff if anyone wants to apply the single-thread change locally; basically, just moving the logic out of the thread-pool.
diff --git a/ledger/src/blockstore_processor.rs b/ledger/src/blockstore_processor.rs
index 50ee8ace5..fb3a5b913 100644
--- a/ledger/src/blockstore_processor.rs
+++ b/ledger/src/blockstore_processor.rs
@@ -253,6 +253,7 @@ fn execute_batches(
cost_capacity_meter: Arc<RwLock<BlockCostCapacityMeter>>,
) -> Result<()> {
inc_new_counter_debug!("bank-par_execute_entries-count", batches.len());
+ /*
let (results, new_timings): (Vec<Result<()>>, Vec<ExecuteTimings>) =
PAR_THREAD_POOL.with(|thread_pool| {
thread_pool.borrow().install(|| {
@@ -276,12 +277,36 @@ fn execute_batches(
.unzip()
})
});
+ */
+ let mut results = vec![];
+ for (batch_idx, batch) in batches.iter().enumerate() {
+ trace!("Batch {}, batch contents: {:#?}", batch_idx, batch.sanitized_transactions());
+ let mut timings = ExecuteTimings::default();
+ let result = execute_batch(
+ batch,
+ bank,
+ transaction_status_sender,
+ replay_vote_sender,
+ &mut timings,
+ cost_capacity_meter.clone(),
+ );
+ if let Some(entry_callback) = entry_callback {
+ entry_callback(bank);
+ }
+ results.push(result);
+ }
timings.total_batches_len += batches.len();
timings.num_execute_batches += 1;
+ /*
for timing in new_timings {
timings.accumulate(&timing);
}
+ */
first_err(&results)
}
I confirmed that things were fine after the iteration for batch_idx == 2
, and went south on batch_idx == 3
. Skipping batch_idx == 3
iteration allowed the rest of the slot to process without segfault; the above tx was the only one in batch_idx == 3
batch.
I managed to trace the program and it looks like the problem comes from #18981 not #21041.
Running 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk
0 [0000000000000000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000] 4602: mov64 r2, r1
1 [0000000000000000, 0000000400000000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000] 4603: mov64 r1, r10
2 [0000000000000000, 0000000200001000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000] 4604: add64 r1, -112
3 [0000000000000000, 0000000200000F90, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000] 4605: call function_7400
[...]
2369 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 4138A38000000000, 000000000018A380, 0000000300007F10, 0000000300007FA0, 0000000200007000] 16615: exit
2370 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000] 445: stxdw [r10-0x288], r0
2371 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000] 446: mov64 r1, r10
2372 [000000000018A380, 0000000200005000, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000] 447: add64 r1, -112
2373 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000] 448: call function_18434
2374 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18463: mov64 r6, r1
2375 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18464: mov64 r7, 0
2376 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18465: stxdw [r10-0x8], r7
2377 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18466: mov64 r1, r10
2378 [000000000018A380, 0000000200007000, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18467: add64 r1, -8
2379 [000000000018A380, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18468: syscall sol_get_fees_sysvar
It crashes here and should instead continue with:
2380 [0000000000000000, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18469: jeq r0, 0, lbb_18447
2381 [0000000000000000, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18476: ldxdw r1, [r10-0x8]
2382 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18477: stxdw [r6+0x8], r1
2383 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18478: stxdw [r6+0x0], r7
2384 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18479: exit
[...]
35867 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200007000] 12496: exit
35868 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200005000] 13222: exit
35869 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200003000] 12384: exit
35870 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 0000000000000000, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200001000] 12468: mov64 r0, r6
35871 [0000000000000000, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 0000000000000000, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200001000] 12469: exit
Program 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk consumed 41482 of 200000 compute units
Based on that trace, my current understanding is:
SyscallGetFeesSysvar
syscall by not registering it, because the feature gate condition is not inverted and I presume it should be because the matching bind_feature_gated_syscall_context_object()
has an inverted condition:if invoke_context.is_feature_active(&disable_fees_sysvar::id())
vs let is_fee_sysvar_via_syscall_active = !invoke_context.is_feature_active(&disable_fees_sysvar::id());
EDIT: Now tracked in #21297
2379 [000000000018A380, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18468: syscall sol_get_fees_sysvar
.vm::REPORT_UNRESOLVED_SYMBOL_INDEX
, which was broken from Rust 1.56.0 onward, until I addressed it in https://github.com/solana-labs/rbpf/pull/216/commits/5394026f4b9ec0d26125880b042c70fcf5974ee7I tried updating to solana_rbpf=0.2.15
which picked up the fix of vm::REPORT_UNRESOLVED_SYMBOL_INDEX
and it still crashes, so there might be more to the story.
EDIT: Seems that fetching v0.2.15
is different from using path = "../../solana_rbpf"
in Cargo.toml
. Maybe build.rs
does not work if fetched from crates.io ?
Second half should be fixed with: https://github.com/solana-labs/rbpf/pull/231 However, until RBPF v0.2.16 is released this issue should stay open.
This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.
Problem
Running a validator against mnb crashes. There is no information in the validator's log file, but dmesg will say:
While trying to narrow the cause down, I found that this crash started at this commit:
53358ab1aa619e9c475c6220cff8fb6d89d9eb5d
The previous commit (
7a41b2c5d1fd148ad8bec634b9323c687f8c3c23
) did not have this crash, but instead had a different panic, which was fixed by https://github.com/solana-labs/solana/pull/21041. It may be that this PR also introduced this crash, but I have not been able to confirm/deny.Unfortunately, core dumps from debug builds do not provide any useful information. I believe the stack is getting borked, so the frame pointers are not helpful:
I can provide the core dumps too. They are ~100 GB, which is why I haven't attached them to this issue directly.
How to Reproduce
To reproduce, run a validator built from
53358ab1aa619e9c475c6220cff8fb6d89d9eb5d
or later against mnb. It may take a very long time to crash (I've seen it take as long as ~10 hours).