solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
12.96k stars 4.16k forks source link

Ledger Tool Create Snapshot Loads Starting Slot Past Snapshot Slot #27325

Open buffalu opened 2 years ago

buffalu commented 2 years ago

Problem

I'm trying to create a snapshot of the last slot in the epoch:

~/jito-solana/target/release/solana-ledger-tool \
    -l /mnt/ledger/ \
    create-snapshot \
    146972255 ~/146972255/

i added printing around the load_bank_forks and see this:

[2022-08-23T05:28:23.759624090Z INFO  solana_ledger_tool] Creating snapshot of slot 146972255 in /home/jito/146972255/
[2022-08-23T05:28:23.760173745Z INFO  solana_ledger_tool] full_snapshot_slot: 146958936 incremental_snapshot_slot: 146984605
[2022-08-23T05:28:23.760182582Z INFO  solana_ledger_tool] starting slot: 146984605

It is trying to start at a slot past my snapshot because of incremental snapshots.

Proposed Solution

The starting_slot should never be larger than the process_options.halt_at_slot. I think there's a bug in the loading from snapshot where it attempts to load incremental snapshots past the starting slot.

I ended up deleting all of our incremental snapshots and waiting for it from 146958936 to 146972255.

steviez commented 2 years ago

I think the behavior that you're alluding to can be considered for snapshots in general (both full and incremental). For example, if your two full snapshots were at slots 146958936 and 146984605, then the later full snapshot would be picked and you'd encounter same issue of starting_slot being later than halt_at_slot

I think there's a bug in the loading from snapshot where it attempts to load incremental snapshots past the starting slot.

As you've found, current behavior just greedily picks the latest snapshots: https://github.com/solana-labs/solana/blob/12693e83df4aa2c08ad31c4f1d8036d60ca858b4/ledger/src/bank_forks_utils.rs#L197 That call pick the latest full snapshot and the latest corresponding incremental snapshot (if there are any).

Hypothetically, we could sweep through all available snapshots to find one that is less than halt_at_slot. However, that would be a deviation from what validator does which I'm hesitant about since solana-ledger-tool runs are typically used to mimic validator behavior. Maybe we could throw out a warn! or error! if snapshot creation slot is less than the snapshot load slot

buffalu commented 2 years ago

this process is going to be very frequent for what we're building, so a little frustrating that it needs this much hand holding. but it seems like we're just working outside the bounds of what it's currently capable of, so it seems like it'd be best to just write our own version that works as expected :)

steviez commented 2 years ago

this process is going to be very frequent for what we're building, so a little frustrating that it needs this much hand holding. but it seems like we're just working outside the bounds of what it's currently capable of, so it seems like it'd be best to just write our own version that works as expected :)

Let me clarify, I'm hesitant to change default behavior away from what validator does. I could see adding a flag that a user opts into.

steviez commented 2 years ago

Some Discord discussion on the matter: https://discord.com/channels/428295358100013066/439194979856809985/1014284768281645106

I would recommend updating ledger-tool to use snapshot_utils::get_full_snapshot_archives()/get_incremental_snapshot_archives(), and then passing the correct pair (based on desired slot) to snapshot_utils::bank_from_snapshot_archives().

This way bank_from_latest_snapshot_archives() retains its intended purpose, and bank_from_snapshot_archives() is used when specific/non-latest snapshot archives are desired.

This sounds fine to me; however, a little more plumbing will be necessary as ledger-tool doesn't call bank_from_*() directly; rather, ledger-tool calls bank_forks_utils::load_bank_forks() which has the call to bank_from_latest_snapshot_archives() under the hood. bank_forks_utils::load_bank_forks() is also called by validator::load_blockstore() so we will either another function or a flag for alternate snapshot selection strategy

brooksprumo commented 2 years ago

I think it would make sense for ledger-tool to have its own pub fn on bank_fork_utils that is calls (i.e. bank_fork_utils::load_bank_forks_for_ledger_tool()). It could have the halt-at-slot param, then use bank_from_snapshot_archives(). The rest of the impl can (hopefully) be the same as bank_fork_utils::load_bank_forks().

brooksprumo commented 2 years ago

Also probably need to change/move the code that purges old snapshots. I'd argue that ledger-tool create-snapshot should never delete any existing snapshots.

steviez commented 2 years ago

I'd argue that ledger-tool create-snapshot should never delete any existing snapshots.

Completely agree

steviez commented 2 years ago

Also probably need to change/move the code that purges old snapshots. I'd argue that ledger-tool create-snapshot should never delete any existing snapshots.

Burned myself on this one today (luckily only lost a snapshot that was downloaded from bucket instead of one that I had to replay several 1000 slots to get to). In any case, https://github.com/solana-labs/solana/pull/27788 makes ledger-tool no longer purge snapshots by default

steviez commented 9 months ago

Reopening this one for consideration per some feedback from planned testnet restart - I'm still not sure we want to change behavior to differ from solana-validator, but could definitely see a flag that allows loading from non-newest snapshot

buffalu commented 9 months ago

there's some code in jito-solana that handles this fwiw