Closed McSim85 closed 10 months ago
Right now, the server is stuck in the loop, so it's an excellent chance to troubleshoot.
Sure, if you can tell me what the missing ranges in your ledger are, I can take a look at the uploader logic. Probably it should emit an error and halt rather than continually looping.
@CriesofCarrots - In regards to looping on the same slot, I think we can attribute that to this comment https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L44-L46 and this line here https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L141
From the logs posted above in the issue description, `bigtable_upload::upload_confirmed_blocks()` is getting called with `start_slot=225163166, end_slot=225163358`.
https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload_service.rs#L110-L120
The logs then indicate that the blockstore only has the slot `225163166` available within `[225163166, 225163358]`:
Found 1 slots in the range (225163166, 225163166)
And that the slot has already been uploaded:
No blocks between 225163166 and 225163358 need to be uploaded to bigtable
so there is no work to do and we hit this early return with last_blockstore_slot = 225163166
https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L136-L141
Looking at the snippet from `BigTableUploadService` above, `start_slot` will remain at `225163166`:
https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload_service.rs#L120
And thus we're stuck, since this is a gap and our node will never repair/replay the slots immediately after `225163166`. We can seemingly avoid getting stuck by returning `ending_slot` like we do in the other early return case:
https://github.com/solana-labs/solana/blob/a3b0348649db5788e59237b5778405b2554704b2/ledger/src/bigtable_upload.rs#L69-L72
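To make the stuck state concrete, here is a minimal, self-contained model of the loop described above. It is a sketch, not the actual Solana code: `EarlyReturn`, `upload_confirmed_blocks_model`, and `run_service_model` are hypothetical names, the blockstore and bigtable are stand-in slices of slot numbers, the second blockstore slot (225163400) is invented to represent the far side of the gap, and the fixed 192-slot window is an assumption picked only to match the `start_slot`/`end_slot` values in the logs. It also assumes the "other early return case" is the one where no slots are found in the range.

```rust
/// Which slot the "nothing to upload" early return reports back to the caller.
#[derive(Clone, Copy)]
enum EarlyReturn {
    LastBlockstoreSlot, // behavior described in the analysis above
    EndingSlot,         // proposed change
}

/// Hypothetical stand-in for `upload_confirmed_blocks`.
/// `blockstore` holds the slots available locally (sorted ascending);
/// `uploaded` holds the slots already present in bigtable.
/// Returns the slot the caller should resume from.
fn upload_confirmed_blocks_model(
    blockstore: &[u64],
    uploaded: &mut Vec<u64>,
    starting_slot: u64,
    ending_slot: u64,
    policy: EarlyReturn,
) -> u64 {
    let in_range: Vec<u64> = blockstore
        .iter()
        .copied()
        .filter(|s| (starting_slot..=ending_slot).contains(s))
        .collect();

    // Assumed shape of the other early return: no slots at all in the range.
    let last_blockstore_slot = match in_range.last() {
        Some(&slot) => slot,
        None => return ending_slot,
    };

    let pending: Vec<u64> = in_range
        .iter()
        .copied()
        .filter(|s| !uploaded.contains(s))
        .collect();

    if pending.is_empty() {
        // The early return discussed above: everything found is already uploaded.
        return match policy {
            EarlyReturn::LastBlockstoreSlot => last_blockstore_slot,
            EarlyReturn::EndingSlot => ending_slot,
        };
    }

    // The real upload is elided; the model just records the slots as uploaded.
    uploaded.extend(pending);
    last_blockstore_slot
}

/// Hypothetical stand-in for the service loop: the returned slot becomes the
/// next `start_slot`. Returns `start_slot` after `iterations` passes.
fn run_service_model(policy: EarlyReturn, iterations: usize) -> u64 {
    // One slot already uploaded, then a gap, then one slot past the first window.
    let blockstore = [225_163_166u64, 225_163_400];
    let mut uploaded = vec![225_163_166u64];
    let mut start_slot = 225_163_166u64;
    for _ in 0..iterations {
        let end_slot = start_slot + 192; // assumed fixed window, see lead-in
        start_slot = upload_confirmed_blocks_model(
            &blockstore,
            &mut uploaded,
            start_slot,
            end_slot,
            policy,
        );
    }
    start_slot
}
```

In this model, `run_service_model(EarlyReturn::LastBlockstoreSlot, n)` stays at 225163166 for any `n`, which mirrors the looping in the logs, while `run_service_model(EarlyReturn::EndingSlot, 2)` steps over the gap and reaches the next available slot. One trade-off to keep in mind: returning `ending_slot` means the skipped range is not revisited, so slots that show up in the gap later would not be uploaded by this loop.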
PS: @McSim85 - When posting logs in the future, please post the text in between triple ` quotes instead of pasting an image; it makes it easier for us to copy/paste/search/etc the text.
We have the same issue on two warehouse nodes. It started on 1.14.17 (around 19 September) and still happens; we are currently on 1.16.17.
I will add more details shortly.
version was running
bounds of the ledger
bounds of bigtable
all data from the genesis
when you saw this
Mostly happens after upgrade/restart.
relevant issue
I'm going to close this for now, but please re-open if you see it again. I'll also ponder how/if bigtable-upload should support fragmented ledgers.
Originally posted by @CriesofCarrots in https://github.com/solana-labs/solana/issues/27732#issuecomment-1244684489