solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

RPC: Blocks go missing #17000

Closed linuskendall closed 3 years ago

linuskendall commented 3 years ago

Problem

We've noticed on several occasions that blocks go missing from RPC. This typically happens when the RPC node becomes overwhelmed (not necessarily crashing, but becoming unresponsive for some time), or after an abrupt reboot or a similar event.

In these cases, minimumLedgerSlot will report an old value, but there will be holes in the confirmed blocks between minimumLedgerSlot and the current slot. The RPC service will report that "Slot xxx was skipped, or missing due to ledger jump to recent snapshot". However, this occurs even when we haven't pulled a snapshot. Also, if we had pulled a snapshot, wouldn't we expect minimumLedgerSlot to be updated?

We've typically seen between 300 and 900 blocks missing. Anecdotally, the most common figure seems to be around 300 blocks.

Sample output from a node that was missing blocks:

$ curl http://hk1.rpcpool.com:8899 -X POST  -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75140922, {"encoding": "json","transactionDetails":"full","rewards":false}]}'
{"jsonrpc":"2.0","error":{"code":-32007,"message":"Slot 75140922 was skipped, or missing due to ledger jump to recent snapshot"},"id":1}

However, the slot clearly exists in the ledger directory as solana ledger tool can access it:

solana@hk1:~$ solana-ledger-tool -l /solana/ledger slot 75140922 >75140922.bak.slot
[2021-04-25T19:28:45.836799091Z INFO  solana_ledger_tool] solana-ledger-tool 1.6.6 (src:fe775a97; feat:3714435735)
[2021-04-25T19:28:45.836851189Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 750000
[2021-04-25T19:28:45.836867639Z INFO  solana_ledger::blockstore] Opening database at "/solana/ledger/rocksdb"
[2021-04-25T19:28:45.862167110Z WARN  solana_ledger::blockstore_db] Error when opening as primary: IO error: While lock file: /solana/ledger/rocksdb/LOCK: Resource temporarily unavailable
[2021-04-25T19:28:45.862181937Z WARN  solana_ledger::blockstore_db] Trying as secondary at : "/solana/ledger/rocksdb/solana-secondary"
[2021-04-25T19:28:45.862189672Z WARN  solana_ledger::blockstore_db] This active secondary db use may temporarily cause the performance of another db use (like by validator) to degrade
[2021-04-25T19:29:08.644763501Z INFO  solana_ledger::blockstore] "/solana/ledger/rocksdb" open took 22.8s

solana-ledger-tool-75140922.slot.txt

We have one node where we have backed up the broken ledger, which has about 900 missing slots. Blocks were downloaded with getConfirmedBlocks between the epoch start and the current slot in the epoch (175 at the time). minimumLedgerSlot was in epoch 174 for all nodes at the time, so all nodes should have had a contiguous history. All nodes agreed except one (in this case fra1).

$ diff hk1-175.slots sg1-175.slots | wc -l
0
$ diff fra1-175.slots sg1-175.slots | wc -l
926

When we request any of the 926 missing blocks we get the following error from the RPC service:

$ curl fra1.rpcpool.com:8899  -X POST -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75634639, {"encoding": "json","transactionDetails":"full","rewards":false}]}'
{"jsonrpc":"2.0","error":{"code":-32007,"message":"Slot 75634639 was skipped, or missing due to ledger jump to recent snapshot"},"id":1}

On a node without missing blocks we get the data we expect:

$ curl ams1.rpcpool.com:8899  -X POST -H "Content-Type: application/json" -d '{"jsonrpc": "2.0","id":1,"method":"getConfirmedBlock","params":[75634639, {"encoding": "json","transactionDetails":"full","rewards":false}]}'

ams1-75634639.json.txt

Looking at the ledger on the broken node, we see that the data is present in the ledger:

$ solana-ledger-tool  slot 75634639 > fra1-75634639.txt

fra1-75634639.txt

This pattern repeats for all "missing blocks". The issue persists, and the "hole" in the history will continue to exist until minimumLedgerSlot advances beyond it. We can currently identify the problem by calling getConfirmedBlocks and comparing the result to other nodes: if the total slots stored (currentSlot - minimumLedgerSlot) are comparable on two nodes, we can check getConfirmedBlocks to see whether one of them is missing blocks.
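
A rough sketch of that check, assuming jq is installed; the slot range is illustrative and the node URLs follow the naming used in this issue:

#!/usr/bin/env bash
# Dump the confirmed-block list for two nodes over the same slot range and
# count the differences, mirroring the diff shown above.

START_SLOT=75600000   # e.g. first slot of the epoch being checked (illustrative)
END_SLOT=75640000     # e.g. current slot at the time of the check (illustrative)

dump_blocks() {
  # Print one confirmed slot number per line for the given RPC endpoint.
  curl -s "$1" -X POST -H "Content-Type: application/json" -d \
    "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"getConfirmedBlocks\",\"params\":[$START_SLOT,$END_SLOT]}" \
    | jq -r '.result[]'
}

dump_blocks http://sg1.rpcpool.com:8899  > sg1.slots    # reference node
dump_blocks http://fra1.rpcpool.com:8899 > fra1.slots   # node under suspicion

# A non-zero count means one node is missing blocks that the other has.
diff sg1.slots fra1.slots | wc -l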

Proposed solution

A temporary fix would be for the RPC node to reach out to BigTable to fetch missing blocks when it cannot find them in the current ledger. Currently it seems that BigTable is only used when minimumLedgerSlot is higher than the requested slot, meaning that we get these missing-slot errors in the case of ledger holes.
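
As a stop-gap, a client can approximate this fallback itself. A minimal sketch, assuming jq is installed; the endpoints are illustrative and the second node is simply assumed to be BigTable-backed:

#!/usr/bin/env bash
# If the primary node answers -32007 ("skipped, or missing due to ledger
# jump to recent snapshot"), retry the same request against a second node.
set -euo pipefail

SLOT=${1:?usage: $0 <slot>}
PRIMARY=http://fra1.rpcpool.com:8899     # illustrative
FALLBACK=http://ams1.rpcpool.com:8899    # illustrative, assumed BigTable-backed

request() {
  curl -s "$1" -X POST -H "Content-Type: application/json" -d \
    "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"getConfirmedBlock\",\"params\":[$SLOT,{\"encoding\":\"json\",\"transactionDetails\":\"full\",\"rewards\":false}]}"
}

resp=$(request "$PRIMARY")
if [ "$(echo "$resp" | jq -r '.error.code // empty')" = "-32007" ]; then
  resp=$(request "$FALLBACK")
fi
echo "$resp"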

Another option would be a way to easily correct the issue, i.e. a "repair" that would fix a broken ledger and make those blocks visible again. We could then monitor for holes, take a node out of the pool, and repair it, if we had some way to do so.

This is an issue that has cropped up for app developers; I have heard reports of people having problems with the "Slot xxx was skipped or missing" error message. Ideally, an RPC node configured with BigTable should never return this result.

mvines commented 3 years ago

@CriesofCarrots - can you take this please

ryoqun commented 3 years ago

@linuskendall what a detailed report. Thanks for the efforts.

To preclude any rocksdb compaction/corruption possibilities (#14586), could you provide more details?:

tjbaker commented 3 years ago

We're also experiencing this issue on 1.5.19, so it is not just a 1.6-branch issue. RPC calls report "error":{"code":-32007} while solana-ledger-tool slot XXX is able to retrieve the slot info without issue.

CriesofCarrots commented 3 years ago

@ryoqun , there are some Stalling messages in the rocksdb/LOG* files in the ledger Linus set aside for me. However, those messages are from 2021-04-26, and the missing ranges are (according to solana block-time) from 2021-04-27 and 2021-04-28.

The file system type appears to be ext4.

I'll let @linuskendall chime in on the rest; I don't have details.

CriesofCarrots commented 3 years ago

The proximate cause of these blocks appearing "missing" is that they are no longer recorded as roots in the blockstore Root CF. I saw a similar issue on one of our warehouse-node snapshots; in that case, however, on attempting to open RocksDB I received an error that the .sst was too short to be an sstable. In the case of this RPC node, no errors are generated when reading the ledger. @ryoqun can you point me toward anything to inspect in the ssts or LOG that might indicate what is happening to this Root data?

I'll ponder the Bigtable fallback (proposal 1). We would most likely need to add Bigtable tracking for the latest uploaded slot. The issue here is that some partners (cc @tjbaker) rely on knowing whether a slot was explicitly skipped. Because the Root column/chain is affected, these blocks meet the Blockstore criteria for explicitly skipped. If we checked Bigtable for any block not found, we wouldn't know whether the block was skipped or simply hasn't been uploaded to Bigtable yet.

In the meantime, I believe I can write a root-repair tool (proposal 2) fairly easily. It seems like a bandaid, but better than nothing.

CriesofCarrots commented 3 years ago

@linuskendall I ran this tool on the ledger you set aside for me with the command solana-ledger-tool repair-roots --before 75634760 --until 75634607, targeting one of the two missing ranges. While I haven't diffed the actual block data against a healthy ledger, all the blocks are being returned. Do you have any additional tests we should run to verify?

Unfortunately, you do have to stop the node to run the command, as secondary db access is read-only.
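
One cheap check, sketched below, is to walk the repaired range once the node is back up and confirm that getConfirmedBlock no longer returns -32007 (assumes jq is installed and RPC on localhost:8899; the slot bounds are the ones from the repair-roots invocation above):

#!/usr/bin/env bash
# Report any slot in the repaired range that still returns an RPC error.
for slot in $(seq 75634607 75634760); do
  code=$(curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" -d \
    "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"getConfirmedBlock\",\"params\":[$slot,{\"encoding\":\"json\",\"transactionDetails\":\"full\",\"rewards\":false}]}" \
    | jq -r '.error.code // "ok"')
  # Caveat: genuinely skipped slots also report -32007, so any stragglers
  # should still be cross-checked against a healthy node as in the diff above.
  [ "$code" != "ok" ] && echo "slot $slot still reports error $code"
done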

linuskendall commented 3 years ago

@ryoqun There have definitely been occasions where we saw this after an OOM or a crash. However, we've also seen it when the machine didn't crash but started swapping (so it was really slow) and we had to go in and stop the solana process; after this, the missing blocks would appear. We also saw it once when we did multiple reboots of a machine in short succession, so that the solana service started/stopped repeatedly (but not uncleanly).

We don't have an exact pattern of events/activities that triggers this issue. It definitely happens at times when the machine is under heavy load, performing poorly and/or crashing.

linuskendall commented 3 years ago

@CriesofCarrots Stopping the node and running this wouldn't be the end of the world as a workaround. I think we have a node with this problem right now, so I'm going to check if this helps us fix it.

ryoqun commented 3 years ago

@linuskendall thanks for the detailed report!

The file system type appears to be ext4.

I suspect ext4 can do bad things in general after an unclean OS shutdown. I routinely see unordered writes and NULs in files with both of the common modes, data=writeback and data=ordered; only the rare and deadly-slow data=journal avoids this. So I wonder whether the more rocksdb-friendly btrfs or xfs could fix the issue.

we had to go in and stop the solana process.

Are you issuing kill -KILL or kill -TERM for that? I hope the latter.. ;)

@CriesofCarrots Anyway, I suspect an unclean solana-validator process shutdown could leave rocksdb in a somewhat inconsistent state. I think that's the reason we blindly recommend some scary options (--wal-recovery-mode skip_any_corrupted_record)? Alternatively, I thought the persisted tower restoration code meddles with the root column family when restarting, in the case of inconsistent state between the tower and rocksdb (CC: @carllin). That could be doing something bad (I'll check this later), but it's unlikely it could leave a 900-slot-wide range unrooted.

Lastly, given kill -TERM, does our process termination code close RocksDB cleanly, I wonder?

ryoqun commented 3 years ago

@ryoqun can you point me toward anything to inspect in the ssts or LOG that might indicate what is happening to this Root data?

Hmm, I'm merely a compaction-stall killer as a rocksdb technician... lol. I don't have nice pointers or greps to hunt with. Could you just share the LOGs as a whole in some way? :)

CriesofCarrots commented 3 years ago

Hmm, I'm merely a compaction-stall killer as a rocksdb technician... lol. I don't have nice pointers or greps to hunt with. Could you just share the LOGs as a whole in some way? :)

Hehe, I thought you knew everything! Here are the rocksdb logs from the fra1 node's set-aside ledger: https://drive.google.com/file/d/1Zi1Vbwq6xlFQ5bV3fwxlBQu-GzVE4o0s/view?usp=sharing

linuskendall commented 3 years ago

@ryoqun We use TERM (systemctl stop ...). So is that a recommendation that we should switch our standard to XFS? I don't particularly trust btrfs. Do you have some good docs on the differences between xfs and ext4 with rocksdb to point to? I'll look into it as well, but if you have something to dig into, that would be great.

CriesofCarrots commented 3 years ago

@CriesofCarrots Stopping the node and running this wouldn't be the end of the world as a workaround. I think we have a node with this problem right now, so I'm going to check if this helps us fix it.

@linuskendall, did you have a chance to try this?

ryoqun commented 3 years ago

Here are the rocksdb logs from the fra1 node's set-aside ledger: https://drive.google.com/file/d/1Zi1Vbwq6xlFQ5bV3fwxlBQu-GzVE4o0s/view?usp=sharing

Thanks for sharing! But I couldn't find anything odd. Sadly, rocksdb logs times in the system's local timezone... Is this machine's tz PDT (UTC-7), or UTC?

So is that a recommendation we should switch standard to XFS?

This isn't a recommendation, rather an experiment, unfortunately. I don't think we have collected enough information about ext4 vs xfs with regard to rocksdb within our validator community, and googling doesn't reveal enough about the general differences either. At least I thought XFS is better than ext4 with respect to write-ordering behavior (which is kind of notorious). Personally, I haven't used XFS much, but it's popular among performance-oriented server-side engineers. :)

I don't particularly trust btrfs.

FYI, facebook loves btrfs. So, I thought it could be a nice bet. Also, I'm in the btrfs camp. ;)

I think that's reason we are giving some scary options as blind (--wal-recovery-mode skip_any_corrupted_record)?

Leaving aside the fs religion, I think our currently recommended recovery option is very bad in terms of consistency: https://github.com/facebook/rocksdb/wiki/WAL-Recovery-Modes#kskipanycorruptedrecords

I think this recovery mode breaks cross-column-family consistency, which should be guaranteed by the write-ahead log....

Reading a little of our history, it seems that this recovery option started being used in https://github.com/solana-labs/solana/issues/10015 to fix the Corruption: SST file is ahead of WALs error, and rocksdb recently merged a potential bugfix for it: https://github.com/facebook/rocksdb/commit/09a9ec3ac04418ad3426ccc94d5cb9430a9d23f9

We can't use the version with the fix (v6.20.3) immediately. But I think we can revisit the scary recovery option, at least for rpc usage, and revert to kPointInTimeRecovery or stricter variants in the near future....
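
For illustration only, the change under discussion would amount to swapping the startup flag value; a sketch with other validator arguments omitted and hypothetical paths (the point_in_time CLI value is assumed to correspond to kPointInTimeRecovery):

# Today's recommendation, which can drop WAL records on recovery:
#   solana-validator ... --wal-recovery-mode skip_any_corrupted_record

# Stricter alternative being discussed, at least for RPC nodes:
solana-validator \
  --identity /home/solana/validator-keypair.json \
  --ledger /solana/ledger \
  --wal-recovery-mode point_in_time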

Alternatively, I thought persisted tower restoration code do some meddles with root column family when restarting in the case of inconsistent state between the tower and rocksdb (CC: @carllin ).

I re-read the relevant code (reconcile_blockstore_roots_with_tower), but it seems to be innocent: it only marks slots as rooted to fill gaps; it doesn't revert rooted slots to unrooted...

Okay, my long dull murmuring is over... As a bitter conclusion, I think any inconsistency could happen after any unclean hw/os/fs/SIGKILL restart; only a SIGTERM process restart is excluded for now. So the safest options would be to reset your ledger or to use the approach in #17045. The odd part is that only the root cf is corrupted....

linuskendall commented 3 years ago

On the file system front, we have quite a lot of nodes running now with pretty much identical setups and hardware, so we could quite easily vary the file system across them; I would even be willing to try btrfs as part of it. I'll need some regular way of keeping track of the missing-blocks metric, but I might be able to pull that regularly from our prometheus, and then we could see whether we observe any difference between filesystems in production. Unfortunately, there's more than the FS layer to contend with, as the drive setup might also have an impact, so the data may be pretty noisy, but we could see if we can eke out some kind of operational recommendation from it.

On FS, I don't care much for religion; I'm just happy if we can find a solution that works as well as possible. I recognise that I'm used to thinking about durability first and foremost when it comes to filesystems, so I need to put aside some of my usual assumptions, since here we really care more about performance than durability.

linuskendall commented 3 years ago

@CriesofCarrots Stopping the node and running this wouldn't be the end of the world as a workaround. I think we have a node with this problem right now, so I'm going to check if this helps us fix it.

@linuskendall, did you have a chance to try this?

I can confirm that the repair tool works for us on a production node: stop the node, run the repair tool, then restart. Afterwards, the blocks are accessible over RPC again. So solution 2 is available to us now.
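
For the record, the sequence boils down to the following sketch; the systemd unit name and the slot bounds are illustrative, and the ledger path matches the one used earlier in this issue:

# Stop the validator so repair-roots can open RocksDB as primary.
systemctl stop solana
# Repair the root column family over the affected range, then restart.
solana-ledger-tool -l /solana/ledger repair-roots --before 75634760 --until 75634607
systemctl start solana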

CriesofCarrots commented 3 years ago

Thanks to @carllin (https://github.com/solana-labs/solana/pull/17402#issuecomment-846379439), I was just able to deliberately reproduce this issue on a mnb node, and also to confirm that #17402 prevents the issue.