solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
12.94k stars 4.15k forks source link

Bigtable Upload Service Loops on Same Block Repeatedly #33831

Closed McSim85 closed 10 months ago

McSim85 commented 10 months ago

We have the same issue on two warehouse nodes. image this happens from 1.14.17 (around 19 September) and still happens. Currently, on 1.16.17, but still happens.

I will add more details shortly.

version was running

solana -V
solana-cli 1.16.17 (src:667bc163; feat:4033350765, client:SolanaLabs)

bounds of the ledger

solana@sol-warehouse-jfk-a01:~$ solana-ledger-tool -l  ./ledger/ bounds
[2023-10-24T11:27:20.845573653Z INFO  solana_ledger_tool] solana-ledger-tool 1.16.17 (src:667bc163; feat:4033350765, client:SolanaLabs)
[2023-10-24T11:27:20.845664206Z INFO  solana_ledger::blockstore] Maximum open file descriptors: 1048576
[2023-10-24T11:27:20.845672926Z INFO  solana_ledger::blockstore] Opening database at "/home/solana/ledger/rocksdb"
[2023-10-24T11:27:20.848092279Z INFO  solana_ledger::blockstore_db] Opening Rocks with secondary (read only) access at: "/home/solana/ledger/rocksdb/solana-secondary"
[2023-10-24T11:27:20.848102869Z INFO  solana_ledger::blockstore_db] This secondary access could temporarily degrade other accesses, such as by solana-validator
[2023-10-24T11:28:19.255389546Z INFO  solana_ledger::blockstore_db] Rocks's automatic compactions are disabled due to Secondary access
[2023-10-24T11:28:19.255506870Z INFO  solana_ledger::blockstore] "/home/solana/ledger/rocksdb" open took 58.4s
Ledger has data for 1240156 slots 224411444 to 225683930
  with 1219130 rooted slots from 224411444 to 225683898
  and 32 slots past the last root

[2023-10-24T11:28:21.276443965Z INFO  solana_ledger_tool] ledger tool took 60.4s

bounds of bigtable

all data from the genesys

when you saw this

Mostly happens after uprade\restart.

relevant issue

          Yeah, I think that could happen, depending on node settings.

I'm going to close this for now, but please re-open if you see it again. I'll also ponder how if bigtable-upload should support fragmented ledgers.

Originally posted by @CriesofCarrots in https://github.com/solana-labs/solana/issues/27732#issuecomment-1244684489

McSim85 commented 10 months ago

Right now, the server is in the loop, so it's an excellent change to troubleshoot.

CriesofCarrots commented 10 months ago

Sure, if you can tell me what the missing ranges in your ledger are, I can take a look at the uploader logic. Probably it should emit an error and halt rather than continually looping.

steviez commented 10 months ago

@CriesofCarrots - In regards to looping on the same slot, I think we can attribute that to this comment https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L44-L46 and this line here https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L141

From the logs posted above in the issue description, bigtable_upload::upload_confirmed_blocks() is getting called with start_slot=225163166, end_slot=225163358. https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload_service.rs#L110-L120

The logs then indicate that the blockstore only has the slot 225163166 available within [225163166, 225163358]:

Found 1 slots in the range (225163166, 225163166)

And that the slot has already been uploaded:

No blocks between 225163166 and 225163358 need to be uploaded to bigtable

so there is no work to do and we hit this early return with last_blockstore_slot = 225163166 https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L136-L141

Looking at the snippet from BigTableUploadService above, start_slot will remain at 225163166: https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload_service.rs#L120

And thus we're stuck since this is a gap and our node will never repair/replay the slots immediately after 225163166. We can seemingly avoid getting stuck by returning ending_slot like we do in the other early return case: https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L69-L72

PS: @McSim85 - When posting logs in the future, please post the text in between triple ` quotes instead of pasting an image; it makes it easier for us to copy/paste/search/etc the text.