sakridge opened this issue 4 years ago
Linked rocksdb issue: https://github.com/facebook/rocksdb/issues/6435
The --no-rocksdb-compaction flag resulted in delinquent validators for Pete -- LunaNova and me. We are both running without it.
I've not heard any reports of this issue recently, closing for now
actually, there are sporadic reports of this (search our discord server)...
I've been holding off to rule out the possibility that our unusual manual compaction codepath is causing this. But even after the write stall fix was well rolled out, there are still reports. So, I dug in a bit:
$ diff -U3 <(xxd slot-88733596-shreds-good) <(xxd slot-88733596-shreds-bad)
--- /dev/fd/63 2021-08-11 22:53:03.943777438 +0900
+++ /dev/fd/62 2021-08-11 22:53:03.943777438 +0900
@@ -78745,7 +78745,7 @@
00133980: 3033 4145 4532 3846 4630 4238 3838 4442 03AEE28FF0B888DB
00133990: 3746 3146 4331 4534 3939 3030 4238 4331 7F1FC1E49900B8C1
001339a0: 3633 3930 4637 3842 3234 4132 4134 3630 6390F78B24A2A460
-001339b0: 3841 4531 3644 3033 4331 4231 3743 3330 8AE16D03C1B17C30
+001339b0: 3941 4531 3644 3033 4331 4231 3743 3330 9AE16D03C1B17C30
001339c0: 3641 3744 3531 3731 3932 4630 4141 4643 6A7D517192F0AAFC
001339d0: 3646 3236 3545 3346 4237 3743 4337 4144 6F265E3FB77CC7AD
001339e0: 4138 3243 3532 3944 3042 4533 4231 3336 A82C529D0BE3B136
making a fun debugging story short, this boils down to: rocksdb actually caught a single bit of rot (a one-bit flip) from the underlying hardware.
So, our next action here would be to properly handle the error, show a message, and abort solana-validator / solana-ledger-tool:
Unfortunately, RocksDB detected data corruption. This usually means your ledger integrity is irrevocably compromised due to an underlying hardware problem like silent data corruption, and the validator can't continue to operate. If possible, file a bug report to https://github.com/solana-labs/solana/issues/9009 with the mentioned SST file attached. After that, remove your current ledger to restore operation, or restore it using your backup system. Original RocksDB error: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1822493296, computed = 125871199 in /home/solana/validator-ledger/rocksdb/005714.sst offset 2076534 size 4631" })
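For illustration, here's a minimal sketch (not the actual validator code; the helper name and wiring are hypothetical) of how such a corruption error could be surfaced with a friendly message and a clean abort instead of a bare panic, assuming the rust `rocksdb` crate's error type:

```rust
use std::process;

// Hypothetical helper: inspect a blockstore open/read result and, if RocksDB
// reports corruption, print operator guidance before aborting.
fn exit_on_corruption<T>(result: Result<T, rocksdb::Error>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            let msg = err.to_string();
            if msg.contains("Corruption") {
                eprintln!(
                    "Unfortunately, RocksDB detected data corruption. Your ledger is likely \
                     compromised by an underlying hardware problem and the validator can't \
                     continue. Please report the mentioned SST file to \
                     https://github.com/solana-labs/solana/issues/9009, then remove the \
                     ledger or restore it from backup.\nOriginal RocksDB error: {msg}"
                );
                process::exit(1);
            }
            // Non-corruption errors keep the existing panic behavior.
            panic!("unexpected blockstore error: {msg}");
        }
    }
}
```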
@ryoqun @sakridge As requested, a crash report. We experienced it on 1.10.27:
Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.250867140Z INFO solana_ledger::blockstore_processor] processing ledger from slot 138536385...
Jun 25 21:53:12 testnet testnet.sh[499257]: thread 'main' panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1 in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271" })', ledger/src/blockstore_processor.rs:677:14
Jun 25 21:53:12 testnet testnet.sh[499257]: stack backtrace:
Jun 25 21:53:12 testnet testnet.sh[499257]: 0: rust_begin_unwind
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 25 21:53:12 testnet testnet.sh[499257]: 1: core::panicking::panic_fmt
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 25 21:53:12 testnet testnet.sh[499257]: 2: core::result::unwrap_failed
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:1690:5
Jun 25 21:53:12 testnet testnet.sh[499257]: 3: solana_ledger::blockstore_processor::process_blockstore_from_root
Jun 25 21:53:12 testnet testnet.sh[499257]: 4: solana_core::validator::Validator::new
Jun 25 21:53:12 testnet testnet.sh[499257]: 5: solana_validator::main
Jun 25 21:53:12 testnet testnet.sh[499257]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.258596408Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="main" one=1i message="panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: \"Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1 in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271\" })', ledger/src/blockstore_processor.rs:677:14" location="ledger/src/blockstore_processor.rs:677:14" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""
File: https://github.com/calypso-it/sol-rocksdb-panic/blob/main/011080.sst
@yhchiang-sol could you take a look at that report? Is there a way we can recover from a checksum mismatch, maybe by removing the corrupted data and re-downloading it?
> @yhchiang-sol could you take a look at that report? Is there a way we can recover from a checksum mismatch, maybe by removing the corrupted data and re-downloading it?
Should be something doable, but we need to write some code to make it happen.
Since the corruption happens in one or more blocks inside an sst file, we might still be able to read that sst file's metadata to obtain its key range. If so, then we might be able to recover using the following steps:
- obtain the key range of the file via https://github.com/facebook/rocksdb/pull/10207

Note that the assumption here is that we are still able to read the metadata of the sst file that has one or more corrupted blocks.
Note that there would be some subtle differences between the recovered db and the db before the corruption, as the recovered entries will become newer than they used to be, but as long as we always read the db without snapshots, we should be fine.
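For what it's worth, here's a minimal sketch (assuming the rust `rocksdb` crate; not the actual WIP ledger-tool code or the rocksdb PR above) of one way to enumerate each live SST file's key range from the file metadata:

```rust
use rocksdb::{Options, DB};

// A rough sketch: list every live SST file's column family and key range.
// Assumes the blockstore can still be opened read-only despite block-level
// corruption in some data blocks.
fn print_sst_key_ranges(rocksdb_path: &str) -> Result<(), rocksdb::Error> {
    let opts = Options::default();
    // Open all existing column families read-only so live_files() covers them.
    let cf_names = DB::list_cf(&opts, rocksdb_path)?;
    let db = DB::open_cf_for_read_only(&opts, rocksdb_path, &cf_names, false)?;
    for f in db.live_files()? {
        println!(
            "cf: {}, file: {}, start_key: {:?}, end_key: {:?}, entries: {}",
            f.column_family_name, f.name, f.start_key, f.end_key, f.num_entries
        );
    }
    Ok(())
}
```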
Btw, in case the corrupted instance is still there, can I have a copy of that instance for debug purposes?
@yhchiang-sol What files do you mean by "corrupted instance"? The issue re-appears frequently, so I can save them on the next crash.
> @yhchiang-sol What files do you mean by "corrupted instance"?
Ideally, the entire rocksdb database, but that will definitely be too big to keep. And thanks for attaching the corrupted sst file; I just saw it in your previous message. I will do some debugging.
> The issue re-appears frequently, so I can save them on the next crash
Thank you. Can you save the sst file as well on the next crash?
Btw, can I know whether this happens frequently to all your validators or just one validator if you happen to have multiple validators?
(i think this is another data corruption. let me revive the last investigation patch...)
@yhchiang-sol @sakridge
so, this time it wasn't that the root cause was a single bit flip; rather, there was an oddly zeroed range in the reported sst file:
$ xxd /home/ryoqun/Downloads/011080.sst
...
0233be70: 0000 0000 0000 d953 4108 0000 0000 1f00 .......SA.......
0233be80: 0000 da53 4108 0000 0000 1e00 0000 db53 ...SA..........S
0233be90: 4108 0000 0000 1d00 0000 dc53 4108 0000 A..........SA...
0233bea0: 0000 1c00 0000 ec53 4108 0000 0000 1b00 .......SA.......
0233beb0: 0000 ed53 4108 0000 0000 1a00 0000 ee53 ...SA..........S
0233bec0: 4108 0000 0000 1900 0000 ef53 4108 0000 A..........SA...
0233bed0: 0000 1800 0000 f053 4108 0000 0000 1700 .......SA.......
0233bee0: 0000 f153 4108 0000 0000 1600 0000 f853 ...SA..........S
0233bef0: 4108 0000 0000 1500 0000 0854 4108 0000 A..........TA...
0233bf00: 0000 1400 0000 0954 4108 0000 0000 1300 .......TA.......
0233bf10: 0000 1054 4108 0000 0000 1200 0000 1154 ...TA..........T
0233bf20: 4108 0000 0000 1100 0000 1254 4108 0000 A..........TA...
0233bf30: 0000 1000 0000 1354 4108 0000 0000 0f00 .......TA.......
0233bf40: 0000 1454 4108 0000 0000 0e00 0000 1554 ...TA..........T
0233bf50: 4108 0000 0000 0d00 0000 1c54 4108 0000 A..........TA...
0233bf60: 0000 0c00 0000 1d54 4108 0000 0000 0b00 .......TA.......
0233bf70: 0000 2454 4108 0000 0000 0a00 0000 2554 ..$TA.........%T
0233bf80: 4108 0000 0000 0900 0000 2654 4108 0000 A.........&TA...
0233bf90: 0000 0800 0000 2754 4108 0000 0000 0700 ......'TA.......
0233bfa0: 0000 3d54 4108 0000 0000 0600 0000 4454 ..=TA.........DT
0233bfb0: 4108 0000 0000 0500 0000 5054 4108 0000 A.........PTA...
0233bfc0: 0000 0400 0000 5154 4108 0000 0000 0300 ......QTA.......
0233bfd0: 0000 5254 4108 0000 0000 0200 0000 5354 ..RTA.........ST
0233bfe0: 4108 0000 0000 010f 098b 06e3 019d 93db A...............
0233bff0: 1700 0000 ff5f 9d41 8d71 076a 2217 cb9a ....._.A.q.j"...
0233c000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
...
0235bfc0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bfd0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bfe0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bff0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235c000: 0104 27b7 6200 0000 0001 1686 15da 7295 ..'.b.........r.
[1] pry(main)> "%0x" % (0x0235c000 - 0x0233c000)
=> "20000"
as far as I understand, sst file serialization isn't aligned that way. Also, sst files won't waste file space with zeros like that under usual operation.
so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.
... so rocksdb is innocent for this case as well. :)
Thanks for adding some data points, @ryoqun.
Btw, in case the underlying filesystem/hardware is the root cause, are we still interested in providing the recovery tool to recover the sst file?
@calypso-it hi, did you experience this panic at testnet or mainnet-beta?
@ryoqun
> hi, did you experience this panic at testnet or mainnet-beta?
Testnet
> so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.
That was my suspicion, too. I will change the NVMe and see if the issue persists. Nonetheless, auto-recovery / repair would be a nice feature, if repairable.
> Nonetheless, auto-recovery / repair would be a nice feature, if repairable
Good to hear that. I will invest some time into this.
Aghh, a bit of a showstopper for my previously proposed solution based on my WIP PR (https://github.com/solana-labs/solana/pull/26790): we are not able to obtain the key range of the corrupted sst file even if the corruption only happens in a data block.
Output from a normal sst file:
[2022-07-27T10:33:05.218841891Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
cf_name: "transaction_status", start_slot: 143304188, end_slot: 143306321
[2022-07-27T10:33:05.286706397Z INFO solana_ledger_tool] ledger tool took 4.8s
Output from a corrupted sst file, where the tool is not able to obtain the sst file metadata. We probably need to change the rocksdb code in order to bypass the check.
[2022-07-27T10:43:59.747644892Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
Failed to find the specified file "006024.sst"
[2022-07-27T10:43:59.814122652Z INFO solana_ledger_tool] ledger tool took 4.8s
> Aghh, a bit of a showstopper for my previously proposed solution based on my WIP PR (https://github.com/solana-labs/solana/pull/26790): we are not able to obtain the key range of the corrupted sst file even if the corruption only happens in a data block.
Please ignore my previous comment; it turns out that xxd was somehow padding one extra byte, which made the file unreadable to rocksdb.
If the corruption only happens in the data block of an sst file without changing the file size, then we are able to obtain the key range of that corrupted file! This opens the possibility that we might be able to replace the corrupted file by copying uncorrupted data of that key range from a healthy validator instance.
[2022-07-27T12:19:46.258790292Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.4s
cf_name: "code_shred", start_key: 143342061, end_key: 143342160
[2022-07-27T12:19:46.302937014Z INFO solana_ledger_tool] ledger tool took 4.5s
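As a rough illustration of the "copy the key range from a healthy instance" idea, here's a hedged sketch assuming a recent rust `rocksdb` crate API; it presumes the corrupted file's column family and key range were already obtained from the metadata step above and that the damaged db can still be opened for writes. The actual recovery tool (#26813) may work quite differently:

```rust
use rocksdb::{Direction, IteratorMode, Options, DB};

// Sketch: copy raw key/value pairs of one column family's key range from a
// healthy blockstore into the one being repaired.
fn copy_key_range(
    healthy_path: &str,
    repairing_path: &str,
    cf_name: &str,
    start_key: &[u8],
    end_key: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let opts = Options::default();
    let cf_names = DB::list_cf(&opts, healthy_path)?;
    let src = DB::open_cf_for_read_only(&opts, healthy_path, &cf_names, false)?;
    let dst = DB::open_cf(&opts, repairing_path, &cf_names)?;

    let src_cf = src.cf_handle(cf_name).ok_or("missing cf in source")?;
    let dst_cf = dst.cf_handle(cf_name).ok_or("missing cf in destination")?;

    // Walk the healthy db from the corrupted file's start key up to its end key.
    for item in src.iterator_cf(src_cf, IteratorMode::From(start_key, Direction::Forward)) {
        let (key, value) = item?;
        if key.as_ref() > end_key {
            break; // past the corrupted file's key range
        }
        dst.put_cf(dst_cf, &key, &value)?;
    }
    Ok(())
}
```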
Created a separate issue for recovery tools: #26813
Problem
Seen corruption messages like the ones quoted above from rocksdb on many validators.
Proposed Solution
Debug and fix.