solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Crash due to bad RIP value #21278

Closed: brooksprumo closed this issue 3 years ago

brooksprumo commented 3 years ago

Problem

Running a validator against mnb (mainnet-beta) crashes. There is no information in the validator's log file, but dmesg will say:

[474846.462052] blockstore_proc[19703]: segfault at 7f27c453c79f ip 00007f27c453c79f sp 00007f3742f421d0 error 14
[474846.472198] Code: Bad RIP value.
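For context: the `error 14` in that line appears to be the x86 page-fault error code printed in hex (0x14), i.e. a user-mode instruction fetch from a non-present page, which matches "Bad RIP value" (execution jumped to an unmapped address). A minimal sketch of filtering the kernel log for these lines; a hypothetical sample stands in for live `dmesg` output, which may require root:

```shell
# On a live machine this would be: dmesg | grep -E 'segfault|Bad RIP'
# Sample lines (one unrelated) stand in here so the filter can be demonstrated.
sample='[474846.462052] blockstore_proc[19703]: segfault at 7f27c453c79f ip 00007f27c453c79f sp 00007f3742f421d0 error 14
[474846.472198] Code: Bad RIP value.
[474846.500000] usb 1-1: new high-speed USB device'
printf '%s\n' "$sample" | grep -E 'segfault|Bad RIP'
```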

While trying to narrow the cause down, I found that this crash started at this commit: 53358ab1aa619e9c475c6220cff8fb6d89d9eb5d

The previous commit (7a41b2c5d1fd148ad8bec634b9323c687f8c3c23) did not have this crash, but instead had a different panic, which was fixed by https://github.com/solana-labs/solana/pull/21041. It may be that this PR also introduced this crash, but I have not been able to confirm/deny.

Unfortunately, core dumps from debug builds do not provide any useful information. I believe the stack is getting borked, so the frame pointers are not helpful:

(lldb) bt
* thread #1, name = 'solana-validato', stop reason = signal SIGSEGV
  * frame #0: 0x00007fcb81f5c6de
    frame #1: 0xfffffffffffffff8

I can provide the core dumps too. They are ~100 GB, which is why I haven't attached them to this issue directly.

How to Reproduce

To reproduce, run a validator built from 53358ab1aa619e9c475c6220cff8fb6d89d9eb5d or later against mnb. It may take a very long time to crash (I've seen it take as long as ~10 hours).

lijunwangs commented 3 years ago

I have seen similar crashes in my validator several times.

lijunwangs commented 3 years ago

[2467628.738572] blockstore_proc[25701]: segfault at 7fd763a42782 ip 00007fd763a42782 sp 00007fdd2a7311b0 error 14
[2467628.738577] Code: Bad RIP value.

steviez commented 3 years ago

I reproduced this on one of my validators running against testnet and believe I was able to pinpoint the transaction. Here are the steps to reproduce:

I found the bad tx in slot 103354509, so I created a snapshot at 103354508. To pinpoint which tx was causing the segfault, I made the blockstore_processor::execute_batches function run its batches serially, checking results and reporting progress batch by batch. Doing so, this looks to be the problematic tx:

$ ledger-tool slot 103354509 -l ~/21278_ledger/
...
  Entry 4 - num_hashes: 2945, hashes: 6EuG8igRQbSjKohwwWnEpC6A1fSazoXZZf9tRFPjZQqd, transactions: 1
    Transaction 0
      Recent Blockhash: 66ogGDnjCEwFjbo1sz4iUhvxxk1HF4PxZSnJXAfeyCBV
      Signature 0: 45cfShbqK2AQ9Posh4j56AvziWmEw8ZzPzNn7q7otaMtuRAJG7MvPoRGrjRNSNDXLcPG64YdTXetM8scihkVCYDY
      Signature 1: 29s8eSRWHuCkQ3z3rdFtu89dxeNf2Toc815yz8zy4tj5wUw971euaepUBAZPfAxAnsefYFxLBeKy7vLjPzHZrxZY
      Account 0: srw- J75jd3kjsABQSDrEdywcyhmbq8eHDowfW9xtEWsVALy9 (fee payer)
      Account 1: srw- CjDiyTgSFRbCxzfmj7NAikHScDrgBvEYWCfJLRpysvZV
      Account 2: -rw- FuEm7UMaCYHThzKaf9DcJ7MdM4t4SALfeNnYQq46foVv
      Account 3: -rw- Hsj5SnZTDvNis3dZCtzpj5NCy8JzBHRXKrgdSSS26cnr
      Account 4: -r-- 11111111111111111111111111111111
      Account 5: -r-x 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk
      Instruction 0
        Program:   7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk (5)
        Account 0: J75jd3kjsABQSDrEdywcyhmbq8eHDowfW9xtEWsVALy9 (0)
        Account 1: FuEm7UMaCYHThzKaf9DcJ7MdM4t4SALfeNnYQq46foVv (2)
        Account 2: Hsj5SnZTDvNis3dZCtzpj5NCy8JzBHRXKrgdSSS26cnr (3)
        Account 3: CjDiyTgSFRbCxzfmj7NAikHScDrgBvEYWCfJLRpysvZV (1)
        Account 4: 11111111111111111111111111111111 (4)
        Data: [0, 44, 13, 146, 97, 0, 0, 0, 0, 164, 13, 146, 97, 0, 0, 0, 0, 0, 225, 245, 5, 0, 0, 0, 0]
      Status: Unavailable
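As an aside, the instruction data above is 25 bytes, which is consistent with a [u8 tag][u64 LE][u64 LE][u64 LE] layout; that layout is an assumption, since the on-chain program's real schema is unknown, so treat this decoding purely as a reading aid:

```rust
// Hypothetical decoder for the instruction data shown above, assuming a
// [u8 tag][u64 LE][u64 LE][u64 LE] layout (an assumption, not the program's
// documented schema).
use std::convert::TryInto;

fn decode(data: &[u8]) -> (u8, Vec<u64>) {
    let words = data[1..]
        .chunks_exact(8)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
        .collect();
    (data[0], words)
}

fn main() {
    let data = [
        0u8, 44, 13, 146, 97, 0, 0, 0, 0, 164, 13, 146, 97, 0, 0, 0, 0, 0,
        225, 245, 5, 0, 0, 0, 0,
    ];
    let (tag, words) = decode(&data);
    // The first two u64s decode to Unix timestamps in November 2021 (around
    // when this issue was filed) and the third to 100_000_000 (0.1 SOL in
    // lamports, if that reading is right).
    println!("tag={} words={:?}", tag, words);
    // tag=0 words=[1636961580, 1636961700, 100000000]
}
```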

Here is the diff if anyone wants to apply the single-thread change locally; it basically just moves the logic out of the thread pool.

diff --git a/ledger/src/blockstore_processor.rs b/ledger/src/blockstore_processor.rs
index 50ee8ace5..fb3a5b913 100644
--- a/ledger/src/blockstore_processor.rs
+++ b/ledger/src/blockstore_processor.rs
@@ -253,6 +253,7 @@ fn execute_batches(
     cost_capacity_meter: Arc<RwLock<BlockCostCapacityMeter>>,
 ) -> Result<()> {
     inc_new_counter_debug!("bank-par_execute_entries-count", batches.len());
+    /*
     let (results, new_timings): (Vec<Result<()>>, Vec<ExecuteTimings>) =
         PAR_THREAD_POOL.with(|thread_pool| {
             thread_pool.borrow().install(|| {
@@ -276,12 +277,36 @@ fn execute_batches(
                     .unzip()
             })
         });
+    */
+    let mut results = vec![];
+    for (batch_idx, batch) in batches.iter().enumerate() {
+        trace!("Batch {}, batch contents: {:#?}", batch_idx, batch.sanitized_transactions());
+        let mut timings = ExecuteTimings::default();
+        let result = execute_batch(
+            batch,
+            bank,
+            transaction_status_sender,
+            replay_vote_sender,
+            &mut timings,
+            cost_capacity_meter.clone(),
+        );
+        if let Some(entry_callback) = entry_callback {
+            entry_callback(bank);
+        }
+        results.push(result);
+    }

     timings.total_batches_len += batches.len();
     timings.num_execute_batches += 1;
+    /*
     for timing in new_timings {
         timings.accumulate(&timing);
     }
+    */

     first_err(&results)
 }

I confirmed that things were fine through the batch_idx == 2 iteration and went south on batch_idx == 3. Skipping the batch_idx == 3 iteration allowed the rest of the slot to process without a segfault; the above tx was the only one in the batch_idx == 3 batch.
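The skip experiment can be sketched in isolation; this is a stand-in harness (run_batches and the String error type are illustrative, not validator code), showing how replacing the suspect batch's execution with an Ok result lets the remaining batches run:

```rust
// Stand-in for the "skip one batch" experiment: execute every batch except
// the one at skip_idx, which is replaced with a synthetic Ok result.
fn run_batches<F>(
    num_batches: usize,
    skip_idx: Option<usize>,
    mut execute: F,
) -> Vec<Result<(), String>>
where
    F: FnMut(usize) -> Result<(), String>,
{
    (0..num_batches)
        .map(|idx| match skip_idx {
            Some(skip) if skip == idx => Ok(()), // pretend the batch succeeded
            _ => execute(idx),
        })
        .collect()
}

fn main() {
    // Batch 3 is the one that would segfault; skipping it lets 0, 1, 2, 4 run.
    let results = run_batches(5, Some(3), |idx| {
        assert_ne!(idx, 3, "batch 3 would crash the process");
        Ok(())
    });
    assert!(results.iter().all(|r| r.is_ok()));
}
```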

Lichtso commented 3 years ago

https://explorer.solana.com/tx/45cfShbqK2AQ9Posh4j56AvziWmEw8ZzPzNn7q7otaMtuRAJG7MvPoRGrjRNSNDXLcPG64YdTXetM8scihkVCYDY?cluster=testnet

Lichtso commented 3 years ago

I managed to trace the program, and it looks like the problem comes from #18981, not #21041.

Running 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk
        0 [0000000000000000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000]  4602: mov64 r2, r1
        1 [0000000000000000, 0000000400000000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000]  4603: mov64 r1, r10
        2 [0000000000000000, 0000000200001000, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000]  4604: add64 r1, -112
        3 [0000000000000000, 0000000200000F90, 0000000400000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000200001000]  4605: call function_7400
[...]
     2369 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 4138A38000000000, 000000000018A380, 0000000300007F10, 0000000300007FA0, 0000000200007000] 16615: exit
     2370 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000]   445: stxdw [r10-0x288], r0
     2371 [000000000018A380, FFFFFFFFFFFFFFFF, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000]   446: mov64 r1, r10
     2372 [000000000018A380, 0000000200005000, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000]   447: add64 r1, -112
     2373 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200005000]   448: call function_18434
     2374 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004E88, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18463: mov64 r6, r1
     2375 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000300007F40, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18464: mov64 r7, 0
     2376 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18465: stxdw [r10-0x8], r7
     2377 [000000000018A380, 0000000200004F90, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18466: mov64 r1, r10
     2378 [000000000018A380, 0000000200007000, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18467: add64 r1, -8
     2379 [000000000018A380, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18468: syscall sol_get_fees_sysvar

It crashes here and should instead continue with:

     2380 [0000000000000000, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18469: jeq r0, 0, lbb_18447
     2381 [0000000000000000, 0000000200006FF8, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18476: ldxdw r1, [r10-0x8]
     2382 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18477: stxdw [r6+0x8], r1
     2383 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18478: stxdw [r6+0x0], r7
     2384 [0000000000000000, 0000000000001388, 43EFFFFFFFFFFFFF, 4128A38000000000, 0000000000000000, 7FFFFFFFFFFFFFFF, 0000000200004F90, 0000000000000000, 0000000300007F10, 0000000300007FA0, 0000000200007000] 18479: exit
[...]
    35867 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200007000] 12496: exit
    35868 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200005000] 13222: exit
    35869 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 00000000000000F0, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200003000] 12384: exit
    35870 [0000000300007F10, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 0000000000000000, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200001000] 12468: mov64 r0, r6
    35871 [0000000000000000, 0000000300007F10, 00000000000000F0, 0000000000000008, 0000000000000000, 0000000000000003, 0000000000000000, 0000000200000F78, 000000040000CA08, 0000000000000000, 0000000200001000] 12469: exit
Program 7FNWTfCo3AyRBFCvr49daqKHehdn2GjNgpjuTsqy5twk consumed 41482 of 200000 compute units

Based on that trace, my current understanding is:

I tried updating to solana_rbpf = 0.2.15, which picked up the fix for vm::REPORT_UNRESOLVED_SYMBOL_INDEX, and it still crashes, so there might be more to the story. EDIT: It seems that fetching v0.2.15 from crates.io behaves differently from using path = "../../solana_rbpf" in Cargo.toml. Maybe build.rs does not work when the crate is fetched from crates.io?
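For anyone comparing the two setups mentioned in that EDIT, here is a hedged sketch of the two ways to point a workspace at a local solana_rbpf checkout; the "../../solana_rbpf" path comes from the comment above, everything else is illustrative:

```toml
# Option A: direct path dependency in the consuming crate's Cargo.toml
[dependencies]
solana_rbpf = { path = "../../solana_rbpf" }

# Option B: workspace-wide [patch] override, which keeps the crates.io
# version requirement in place but substitutes the local sources
[patch.crates-io]
solana_rbpf = { path = "../../solana_rbpf" }
```

A path dependency compiles the local sources (including its build.rs) directly, which may explain why the two setups behaved differently.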

Lichtso commented 3 years ago

The second half should be fixed with https://github.com/solana-labs/rbpf/pull/231. However, until RBPF v0.2.16 is released, this issue should stay open.

github-actions[bot] commented 2 years ago

This issue has been automatically locked since there has not been any activity in past 7 days after it was closed. Please open a new issue for related bugs.