sakridge opened this issue 4 years ago
Linked rocksdb issue: https://github.com/facebook/rocksdb/issues/6435
The --no-rocksdb-compaction flag resulted in delinquent validators for Pete -- LunaNova and me. We are both running without it.
I've not heard any reports of this issue recently, closing for now
actually, there are sporadic reports of this (search our discord server)...
I've been holding off to rule out the possibility that our unusual manual compaction codepath is causing this. But even after the write stall fix was well rolled out, there are still reports. So, I dug in a bit:
$ diff -U3 <(xxd slot-88733596-shreds-good) <(xxd slot-88733596-shreds-bad)
--- /dev/fd/63 2021-08-11 22:53:03.943777438 +0900
+++ /dev/fd/62 2021-08-11 22:53:03.943777438 +0900
@@ -78745,7 +78745,7 @@
00133980: 3033 4145 4532 3846 4630 4238 3838 4442 03AEE28FF0B888DB
00133990: 3746 3146 4331 4534 3939 3030 4238 4331 7F1FC1E49900B8C1
001339a0: 3633 3930 4637 3842 3234 4132 4134 3630 6390F78B24A2A460
-001339b0: 3841 4531 3644 3033 4331 4231 3743 3330 8AE16D03C1B17C30
+001339b0: 3941 4531 3644 3033 4331 4231 3743 3330 9AE16D03C1B17C30
001339c0: 3641 3744 3531 3731 3932 4630 4141 4643 6A7D517192F0AAFC
001339d0: 3646 3236 3545 3346 4237 3743 4337 4144 6F265E3FB77CC7AD
001339e0: 4138 3243 3532 3944 3042 4533 4231 3336 A82C529D0BE3B136
making a fun debugging story short, this boils down to: rocksdb actually caught a single bit of rot (a one-bit flip) from the underlying hardware.
So, our next action here would be to properly handle the error, show a message, and abort solana-validator / solana-ledger-tool:
Unfortunately, RocksDB detected data corruption. This usually means your ledger integrity is irrevocably compromised due to an underlying hardware problem like silent data corruption, and the validator can't continue to operate. If possible, file a bug report to https://github.com/solana-labs/solana/issues/9009 with the mentioned SST file attached. After that, remove your current ledger to restore operation, or restore it using your backup system. Original RocksDB error: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 1822493296, computed = 125871199 in /home/solana/validator-ledger/rocksdb/005714.sst offset 2076534 size 4631" })
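For illustration, here's a minimal sketch (not the actual validator code; the helper name and wiring are hypothetical) of how such a corruption error could be surfaced with a friendly message and a clean abort instead of a bare panic, assuming the rust `rocksdb` crate's error type:

```rust
use std::process;

// Hypothetical helper: inspect a blockstore open/read result and, if RocksDB
// reports corruption, print operator guidance before aborting.
fn exit_on_corruption<T>(result: Result<T, rocksdb::Error>) -> T {
    match result {
        Ok(value) => value,
        Err(err) => {
            let msg = err.to_string();
            if msg.contains("Corruption") {
                eprintln!(
                    "Unfortunately, RocksDB detected data corruption. Your ledger is likely \
                     compromised by an underlying hardware problem and the validator can't \
                     continue. Please report the mentioned SST file to \
                     https://github.com/solana-labs/solana/issues/9009, then remove the \
                     ledger or restore it from backup.\nOriginal RocksDB error: {msg}"
                );
                process::exit(1);
            }
            // Non-corruption errors keep the existing panic behavior.
            panic!("unexpected blockstore error: {msg}");
        }
    }
}
```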
@ryoqun @sakridge As requested, a crash report. We experienced it on 1.10.27:
Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.250867140Z INFO solana_ledger::blockstore_processor] processing ledger from slot 138536385...
Jun 25 21:53:12 testnet testnet.sh[499257]: thread 'main' panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: "Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1 in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271" })', ledger/src/blockstore_processor.rs:677:14
Jun 25 21:53:12 testnet testnet.sh[499257]: stack backtrace:
Jun 25 21:53:12 testnet testnet.sh[499257]: 0: rust_begin_unwind
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
Jun 25 21:53:12 testnet testnet.sh[499257]: 1: core::panicking::panic_fmt
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
Jun 25 21:53:12 testnet testnet.sh[499257]: 2: core::result::unwrap_failed
Jun 25 21:53:12 testnet testnet.sh[499257]: at ./rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/result.rs:1690:5
Jun 25 21:53:12 testnet testnet.sh[499257]: 3: solana_ledger::blockstore_processor::process_blockstore_from_root
Jun 25 21:53:12 testnet testnet.sh[499257]: 4: solana_core::validator::Validator::new
Jun 25 21:53:12 testnet testnet.sh[499257]: 5: solana_validator::main
Jun 25 21:53:12 testnet testnet.sh[499257]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 25 21:53:12 testnet testnet.sh[499257]: [2022-06-25T21:53:12.258596408Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="main" one=1i message="panicked at 'Couldn't set root slot on startup: RocksDb(Error { message: \"Corruption: block checksum mismatch: stored = 2324967102, computed = 2123576394, type = 1 in /storage/validator-ledger/rocksdb/011080.sst offset 36943576 size 4271\" })', ledger/src/blockstore_processor.rs:677:14" location="ledger/src/blockstore_processor.rs:677:14" version="\"1.10.27 (src:8a9f5962; feat:1425680972)\""
File: https://github.com/calypso-it/sol-rocksdb-panic/blob/main/011080.sst
@yhchiang-sol could you take a look at that report? Is there a way we can recover from a checksum mismatch, maybe by removing the corrupted data and re-downloading it?
> @yhchiang-sol could you take a look at that report? Is there a way we can recover from a checksum mismatch, maybe by removing the corrupted data and re-downloading it?
Should be something doable, but we need to write some code to make it happen.
Since the corruption happens in one or more blocks inside an sst file, we might still be able to read that sst file's metadata to obtain its key range. If so, then we might be able to recover using the following steps:
- obtain the key range of the file via https://github.com/facebook/rocksdb/pull/10207

Note that the assumption here is that we are still able to read the metadata of the sst file that has one or more corrupted blocks.
Note that there would be some subtle differences between the recovered db and the db before the corruption, as the recovered entries will become newer than they used to be, but as long as we always read the db without snapshots, we should be fine.
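For what it's worth, here's a minimal sketch (assuming the rust `rocksdb` crate; not the actual WIP ledger-tool code or the rocksdb PR above) of one way to enumerate each live SST file's key range from the file metadata:

```rust
use rocksdb::{Options, DB};

// A rough sketch: list every live SST file's column family and key range.
// Assumes the blockstore can still be opened read-only despite block-level
// corruption in some data blocks.
fn print_sst_key_ranges(rocksdb_path: &str) -> Result<(), rocksdb::Error> {
    let opts = Options::default();
    // Open all existing column families read-only so live_files() covers them.
    let cf_names = DB::list_cf(&opts, rocksdb_path)?;
    let db = DB::open_cf_for_read_only(&opts, rocksdb_path, &cf_names, false)?;
    for f in db.live_files()? {
        println!(
            "cf: {}, file: {}, start_key: {:?}, end_key: {:?}, entries: {}",
            f.column_family_name, f.name, f.start_key, f.end_key, f.num_entries
        );
    }
    Ok(())
}
```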
Btw, in case the corrupted instance is still there, can I have a copy of that instance for debug purposes?
@yhchiang-sol What files do you mean by "corrupted instance"? The issue re-appears frequently, so I can save them on the next crash.
> @yhchiang-sol What files do you mean by "corrupted instance"?
Ideally, the entire rocksdb database, but that will definitely be too big to keep. And thanks for attaching the corrupted sst file; I just saw it in your previous message. I will do some debugging.
> The issue re-appears frequently, so I can save them on the next crash
Thank you. Can you save the sst file as well on the next crash?
Btw, can I know whether this happens frequently to all your validators or just one validator if you happen to have multiple validators?
(i think this is another data corruption. let me revive the last investigation patch...)
@yhchiang-sol @sakridge
so, this time it wasn't that the root cause was a single bit flip; rather, there was an oddly zeroed range in the reported sst file:
$ xxd /home/ryoqun/Downloads/011080.sst
...
0233be70: 0000 0000 0000 d953 4108 0000 0000 1f00 .......SA.......
0233be80: 0000 da53 4108 0000 0000 1e00 0000 db53 ...SA..........S
0233be90: 4108 0000 0000 1d00 0000 dc53 4108 0000 A..........SA...
0233bea0: 0000 1c00 0000 ec53 4108 0000 0000 1b00 .......SA.......
0233beb0: 0000 ed53 4108 0000 0000 1a00 0000 ee53 ...SA..........S
0233bec0: 4108 0000 0000 1900 0000 ef53 4108 0000 A..........SA...
0233bed0: 0000 1800 0000 f053 4108 0000 0000 1700 .......SA.......
0233bee0: 0000 f153 4108 0000 0000 1600 0000 f853 ...SA..........S
0233bef0: 4108 0000 0000 1500 0000 0854 4108 0000 A..........TA...
0233bf00: 0000 1400 0000 0954 4108 0000 0000 1300 .......TA.......
0233bf10: 0000 1054 4108 0000 0000 1200 0000 1154 ...TA..........T
0233bf20: 4108 0000 0000 1100 0000 1254 4108 0000 A..........TA...
0233bf30: 0000 1000 0000 1354 4108 0000 0000 0f00 .......TA.......
0233bf40: 0000 1454 4108 0000 0000 0e00 0000 1554 ...TA..........T
0233bf50: 4108 0000 0000 0d00 0000 1c54 4108 0000 A..........TA...
0233bf60: 0000 0c00 0000 1d54 4108 0000 0000 0b00 .......TA.......
0233bf70: 0000 2454 4108 0000 0000 0a00 0000 2554 ..$TA.........%T
0233bf80: 4108 0000 0000 0900 0000 2654 4108 0000 A.........&TA...
0233bf90: 0000 0800 0000 2754 4108 0000 0000 0700 ......'TA.......
0233bfa0: 0000 3d54 4108 0000 0000 0600 0000 4454 ..=TA.........DT
0233bfb0: 4108 0000 0000 0500 0000 5054 4108 0000 A.........PTA...
0233bfc0: 0000 0400 0000 5154 4108 0000 0000 0300 ......QTA.......
0233bfd0: 0000 5254 4108 0000 0000 0200 0000 5354 ..RTA.........ST
0233bfe0: 4108 0000 0000 010f 098b 06e3 019d 93db A...............
0233bff0: 1700 0000 ff5f 9d41 8d71 076a 2217 cb9a ....._.A.q.j"...
0233c000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0233c070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
...
0235bfc0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bfd0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bfe0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235bff0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0235c000: 0104 27b7 6200 0000 0001 1686 15da 7295 ..'.b.........r.
[1] pry(main)> "%0x" % (0x0235c000 - 0x0233c000)
=> "20000"
as far as I understand, sst file serialization isn't aligned that way. Also, sst files won't waste file space with zeros like that under usual operation.
so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.
... so rocksdb is innocent for this case as well. :)
Thanks for adding some data points, @ryoqun.
Btw, in case the underlying filesystem/hardware is the root cause, are we still interested in providing the recovery tool to recover the sst file?
@calypso-it hi, did you experience this panic at testnet or mainnet-beta?
@ryoqun
> hi, did you experience this panic at testnet or mainnet-beta?
Testnet
> so, i highly suspect the underlying filesystem/hardware wiped the nicely-hex-round block range due to some failure.
That was my suspicion, too. I will change the NVMe and see if the issue persists. Nonetheless, auto-recovery / repair would be a nice feature, if repairable.
> Nonetheless, auto-recovery / repair would be a nice feature, if repairable
Good to hear that. I will invest some time into this.
Aghh, a bit of a showstopper for my previously proposed solution based on my WIP PR (https://github.com/solana-labs/solana/pull/26790): we are not able to obtain the key range of the corrupted sst file even if the corruption only happens in a data block.
Output from a normal sst file:
[2022-07-27T10:33:05.218841891Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
cf_name: "transaction_status", start_slot: 143304188, end_slot: 143306321
[2022-07-27T10:33:05.286706397Z INFO solana_ledger_tool] ledger tool took 4.8s
Output from a corrupted sst file, where the tool is not able to obtain the sst file metadata. We probably need to change the rocksdb code in order to bypass the check.
[2022-07-27T10:43:59.747644892Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.7s
Failed to find the specified file "006024.sst"
[2022-07-27T10:43:59.814122652Z INFO solana_ledger_tool] ledger tool took 4.8s
> Aghh, a bit of a showstopper for my previously proposed solution based on my WIP PR (https://github.com/solana-labs/solana/pull/26790): we are not able to obtain the key range of the corrupted sst file even if the corruption only happens in a data block.
Please ignore my previous comment; it turns out that xxd was somehow padding one extra byte, which made the file unreadable to rocksdb.
If the corruption only happens in the data block of an sst file without changing the file size, then we are able to obtain the key range of that corrupted file! This opens the possibility that we might be able to replace the corrupted file by copying uncorrupted data of that key range from a healthy validator instance.
[2022-07-27T12:19:46.258790292Z INFO solana_ledger::blockstore] "/mnt/disks/ledger/mainnet/rocksdb" open took 4.4s
cf_name: "code_shred", start_key: 143342061, end_key: 143342160
[2022-07-27T12:19:46.302937014Z INFO solana_ledger_tool] ledger tool took 4.5s
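As a rough illustration of the "copy the key range from a healthy instance" idea, here's a hedged sketch assuming a recent rust `rocksdb` crate API; it presumes the corrupted file's column family and key range were already obtained from the metadata step above and that the damaged db can still be opened for writes. The actual recovery tool (#26813) may work quite differently:

```rust
use rocksdb::{Direction, IteratorMode, Options, DB};

// Sketch: copy raw key/value pairs of one column family's key range from a
// healthy blockstore into the one being repaired.
fn copy_key_range(
    healthy_path: &str,
    repairing_path: &str,
    cf_name: &str,
    start_key: &[u8],
    end_key: &[u8],
) -> Result<(), Box<dyn std::error::Error>> {
    let opts = Options::default();
    let cf_names = DB::list_cf(&opts, healthy_path)?;
    let src = DB::open_cf_for_read_only(&opts, healthy_path, &cf_names, false)?;
    let dst = DB::open_cf(&opts, repairing_path, &cf_names)?;

    let src_cf = src.cf_handle(cf_name).ok_or("missing cf in source")?;
    let dst_cf = dst.cf_handle(cf_name).ok_or("missing cf in destination")?;

    // Walk the healthy db from the corrupted file's start key up to its end key.
    for item in src.iterator_cf(src_cf, IteratorMode::From(start_key, Direction::Forward)) {
        let (key, value) = item?;
        if key.as_ref() > end_key {
            break; // past the corrupted file's key range
        }
        dst.put_cf(dst_cf, &key, &value)?;
    }
    Ok(())
}
```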
Created a separate issue for recovery tools: #26813
Problem
Seen corruption messages like the ones quoted above from rocksdb on many validators.
Proposed Solution
Debug and fix.