Actually, the only problem this fixes is that startup takes longer because the node might have to download a large snapshot. Now, if the node has the large 'full' snapshot, then it just needs to download the correct incremental one to apply on top of that. The snapshot extraction speed would be exactly the same, maybe worse, because it now has to combine both snapshots, and the amount of data it is ingesting will be the same at the end of the computation.
Incremental snapshot would be on top of a full snapshot. The idea is that a validator creates a full snapshot maybe every 100,000 slots, and then creates an incremental snapshot every 100 slots. A node joining the network from nothing would have to download the full snapshot and then the latest incremental that applies on top of it.
Some solution steps/ideas:
how often should incremental snapshots be created? 100 slots is what we try to create snapshots at now, I think that's a fine start. Although today, it takes longer than 100 slots to create a snapshot on mainnet-beta.
does it matter when incremental snapshots would be made? Like after/before certain cleanup code? I think the current place in accounts_background_service is fine.
should incremental snapshots only exist locally, or should they also be sent to new nodes? Yes, definitely want to share with other nodes; that's one of the main benefits.
what goes in the incremental snapshot? Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot; this would collect slots from last_full_snapshot_slot to incremental_snapshot_slot (see the sketch after this list).
should full snapshots be created less/same/more frequently now? Less, way less. Once we have some data about how big the incremental grows, then we can better tell.
There's a couple strategies here. One is a fixed interval, and have all nodes in the network try to create at the same block height. That is nice because then a node can get a full snapshot from one node and then it can likely find an incremental one later from another node which used the same full snapshot slot. Incremental snapshots with different parents will obviously be incompatible.
Another could be to create it dynamically based on how quickly the incremental grows. It's expected the incremental will keep getting larger and larger, potentially as large as the full snapshot. At that point, or maybe at 50% of the size of the full, you roll up the state and just create a new full snapshot. This may be necessary if the fixed interval doesn't work well. If a node ends up downloading a full snapshot and an incremental one just as big, then that's not great.
still need full snapshots for a new node joining the network? Yes, hopefully the node has one locally. Since they don't update as often, hopefully that will be the case.
what tests are needed? Obviously a test to make sure it works - yes. Ensure fallback to the full snapshot works if an incremental snapshot is borked - yes; maybe it could just download a new incremental?
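To make "the same data, just a different set of updates" concrete, here is a minimal sketch (hypothetical types and names, not the actual AccountsDb API) of collecting only the storages in the incremental range:

```rust
use std::collections::BTreeMap;

type Slot = u64;
type AppendVecId = usize; // stand-in for the real storage handle

/// Collect the storages that would go into an incremental snapshot:
/// only rooted slots in (last_full_snapshot_slot, incremental_snapshot_slot].
fn incremental_snapshot_storages(
    storages_by_slot: &BTreeMap<Slot, Vec<AppendVecId>>,
    last_full_snapshot_slot: Slot,
    incremental_snapshot_slot: Slot,
) -> Vec<(Slot, Vec<AppendVecId>)> {
    storages_by_slot
        .range(last_full_snapshot_slot + 1..=incremental_snapshot_slot)
        .map(|(slot, storages)| (*slot, storages.clone()))
        .collect()
}
```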
Another idea is that we re-store part of the account state on each slot to keep the append-vec number down to under the number of slots in an epoch. This isn't great, because it will then bloat the size of the incremental snapshots, since those re-stores will look like new updates to the accounts system. I think it would be good to remove this and then support combining accounts from different slots into a single append-vec.
This is the account store code as part of the rent collection: https://github.com/solana-labs/solana/blob/fa86a335b0fd793c3ea2835834fb395a74dcc7fc/runtime/src/bank.rs#L3571
* what goes in the incremental snapshot? Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot, this would collect slots from `last_full_snapshot_slot` to `incremental_snapshot_slot`
@sakridge Are you envisioning just a single incremental snapshot between full snapshots (a new incremental snapshot replaces the existing one), or multiple? I was assuming multiple, but it doesn't need to be that way.

Multiple

slot 1,000: full snapshot (A)
slot 1,100: incremental snapshot (B)
slot 1,200: incremental snapshot (C)
...
slot 1,x00: incremental snapshot (S)
...
slot 2,000: full snapshot (V)

Incremental snapshot B is the diff from A to B, incremental snapshot C is the diff from B to C, etc.

Single

Same picture as above, but when incremental snapshot C is created, it is the diff from A to C, and then B is deleted. Same for incremental snapshot S, which is the diff from A to S, and all other incremental snapshots are removed.
I was thinking the node would just have a single incremental which would replace all previous incremental ones, but I'm open to arguments for the multiple design (or others?)
I just think with the multiple you will have a lot of duplicated states updated in the subsequent snapshots, and only 1 is useful. So there will be a lot of overlap. I think there are some small set of accounts updated a lot, like once per slot, and there are many more accounts which are only updated very infrequently, like once in a million or more slots.
Hopefully we capture those infrequently updated accounts with the full snapshot, and the incremental captures the frequently updated ones.
edit: Maybe 2 incremental kind of like we have today, so if the newest one is bad, you can fallback to the old one.
I was thinking the node would just have a single incremental which would replace all previous incremental ones,
Sounds good!
Maybe 2 incremental kind of like we have today, so if the newest one is bad, you can fallback to the old one.
I like it.
I was thinking about the interval as well, and thought maybe if feasible it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple that are at 200, then 400, 800, ... etc.
You could even have different threads packaging these at different intervals, or maybe different validators package snapshots at these varying intervals. I.e. some package at a faster rate than others.
The benefit here is I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up. This might also be useful for nodes trying to catch up if they can fast sync small incremental snapshots from near the tip of the network.
I was thinking about the interval as well, and thought maybe if feasible it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple that are at 200, then 400, 800, ... etc.
The benefit here is I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up.
@carllin This design sounds like it falls under the "Multiple" category of number-of-incremental-snapshots-between-full-snapshots. Is that right?
Another way to do it:
Split the snapshot across a range of slots, then keep track of every instance of clean_accounts and shrink_slots, and keep a dirty set of stores so that the new snapshot slices only need to be regenerated for those slots that had changed data.
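A rough sketch of that bookkeeping (hypothetical structure; clean_accounts and shrink_slots are the operations being tracked, everything else here is illustrative):

```rust
use std::collections::BTreeSet;

type Slot = u64;

/// Tracks which slots had their stores touched by clean/shrink since the last
/// snapshot slices were generated, so only those slices need regenerating.
#[derive(Default)]
struct DirtyStores {
    dirty_slots: BTreeSet<Slot>,
}

impl DirtyStores {
    /// Called whenever clean_accounts or shrink_slots rewrites a slot's stores.
    fn mark_dirty(&mut self, slot: Slot) {
        self.dirty_slots.insert(slot);
    }

    /// Drain the set of slots whose snapshot slices must be regenerated.
    fn take_dirty(&mut self) -> BTreeSet<Slot> {
        std::mem::take(&mut self.dirty_slots)
    }
}
```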
I think one of the nice things about how we've organized storage entries by slot currently is that the set of storage entries forms a natural diff tracker.
So for instance if we were taking a snapshot every 100 slots:
Slot 100, Slot 200, Slot 300
Then as long as you guarantee clean doesn't progress past the slot while you're grabbing a copy of the storage entries (max_clean_root acts as the guard 😃), the diff from 100 to 200 would just be the storage entries for slots in the range (100, 200], the diff from 200 to 300 would just be the storage entries in the range (200, 300].
And this can be expanded to arbitrary intervals, 200, 400, etc. Also while you're generating the 100 slot diff lets say from (200, 300], you could use that to then generate the 200 slot diff from (100, 300] ( because they utilize the same set of storage entries in the overlapping range), which could be the "accumulator" incremental snapshot Stephen described earlier.
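For illustration, a small sketch (assumed shapes, not actual Solana code) of how the wider (100, 300] diff could be assembled from the same per-slot storage entries already gathered for the (100, 200] and (200, 300] diffs:

```rust
use std::collections::BTreeMap;

type Slot = u64;
type Storages = Vec<usize>; // stand-in for one slot's storage entries

/// Union two diffs built from disjoint, adjacent slot ranges, e.g.
/// (100, 200] and (200, 300], into the wider (100, 300] diff.
fn union_adjacent_diffs(
    earlier: &BTreeMap<Slot, Storages>,
    later: &BTreeMap<Slot, Storages>,
) -> BTreeMap<Slot, Storages> {
    earlier
        .iter()
        .chain(later.iter())
        .map(|(slot, storages)| (*slot, storages.clone()))
        .collect()
}
```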
For v1 one approach is to just focus on using a fixed, configurable interval N. Using this, I could imagine a tiered system where certain validators are generating snapshots at varying intervals to provide better coverage of the space. For instance let's say we had one set of validators generating every 100 slots, another set 1000 slots, and another 10,000 slots. Then once the 100 interval validators have generated 10 such diffs, they can start dropping the earliest one since they know that range is now covered by the validators who generated the 1000 slot interval. And the validators who generated the 1000 slot intervals could do the same by relying on the validator with 10,000 slots coverage.
hey, late for the party. :)
I thought on this a bit, and I may have some more color to add.
Still inheriting the delta slot interval design discussed so far, I think we can forgo generating those delta snapshots at some guessed intervals. Then we can realize yet faster startup? My idea is to reuse the accounts dir across reboots and a rather dumb (and hopefully easy-to-secure) ondemand delta snapshot api endpoint like ./snapshot/accounts/1111X00-1111Y00.zst, with little system load for trusted validators?
This design can come later, but this mostly overlaps the delta snapshot archive generation and download code.
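As a tiny illustration of the proposed endpoint (hypothetical helper; only the path format is from the proposal above), the booting node would build the request from its own local root and the tip slot seen in gossip:

```rust
type Slot = u64;

/// Build the ondemand delta snapshot path, e.g. "/snapshot/accounts/111110000-111120000.zst".
/// The start slot comes from the locally deserialized rooted bank; the end slot comes from gossip.
fn ondemand_delta_snapshot_path(local_root_slot: Slot, gossip_tip_slot: Slot) -> String {
    format!("/snapshot/accounts/{}-{}.zst", local_root_slot, gossip_tip_slot)
}
```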
the new restart flow:
- fetch /snapshot/accounts/1111XX00-1111YY00.zst, where XX is from the deserialized rooted bank and YY is from gossip.
- grab Arc<AppendVec> like the snapshot generation process.

my assumptions and observations:
OK, looks like there are a few different ways that I could go to implement incremental snapshots (or whatever we call it). I don't know how to quantify what way is the right way to go though. Who should be the one to pick? Or, are there additional data points that I can gather that would make the decision clear?
@carllin @sakridge could you share initial thoughts on my alternative idea?: https://github.com/solana-labs/solana/issues/17088#issuecomment-839905418
@ryoqun
serialize root bank and the whole index into ledger dir (yeah, would be 10-15G) (this could write secondary index as well) and prune unrooted older slots? then finally mark the dir as successfully finalized for next reboot
As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.
Some sanity range check against too large slot range. Grab Arc<AppendVec> like snapshot generation process. Read those appendvecs in order while removing updated accounts (this is to compress well the final delta snapshot); for this, we can reuse AccountsIndex reconstruction code needing relatively small memory. For zero-lamport accounts, I think we need to pause clean a bit? Then create final sorted and shrunken appendvec and stream it to the booting node via zstd compression
From my understanding, this is an on demand streaming of AccountsDb storages. However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank etc.) for nodes who crashed/corrupted their serialized/saved state.
As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.
For instance, if we have some nodes packaging incremental snapshots every 100 slots, and the packaging process takes 500 slots, then: 1) the on demand snapshot will be 500 slots behind by the time the snapshot arrives, 2) if you just took the last incremental snapshot, it would only be 100 slots behind.
This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting.
As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.
thanks for confirming! yeah, hopefully restarting by itself should make this synchronization requirement easy. :)
However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank etc.)
Yeah, but I think this should be rare.
as for the delta snapshot, I think status cache and bank should be small compared to the accounts storage. So, I omitted to mention them. Maybe the ondemand snapshot endpoint processing needs to briefly grab the root bank and stash those binaries and include them in the delta snapshot archive.
As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I. ....
Thanks for great analysis. :)
Firstly, I just noticed that avoidance of purging accounts and recomputing the index can be realized for both incremental and ondemand snapshotting if both just restart with proper coordination, like saving the incremental snapshot locally before exiting (CC: @brooksprumo). Originally, I thought this was only possible because ondemand can freely specify the start slot for fetching the snapshot dynamically. Maybe you can select and download an incremental snapshot which slightly overlaps with the local root to take the same optimization?
This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting
Oh, I'm getting a clue. The "different varying intervals" part. So, are you assuming a newly-restarted validator usually needs to fetch 1 incremental snapshot in the normal case? I was originally concerned about wasted bandwidth for the case of multiple incremental snapshot downloads (I blindly thought about 2-3 deltas, hard-coded 100 slot interval). The wasted bandwidth comes from the fact that multiple incremental snapshots would contain a substantial amount of outdated (duplicated) account state between them. (ondemand snapshot tries to hard de-duplicate it)
I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.
yeah, that's true. However, I thought we could offset that delay with vastly reduced network bandwidth by de-duplicating data across equivalent multiple incremental snapshots. As said above, if we can realize 1 delta snapshot download per restart in the normal case, the merit of ondemand snapshots is small, though. :)
Also, I don't think ondemand snapshotting takes so long. Recent appendvecs should generally be in the pagecache and index creation is basically linear processing, so it is bound by disk read bandwidth (unlike an incremental snapshot there is no need to write an archive; it only needs to grab the mmap).
Handling the zero-lamport account updates sounds hard to me for @ryoqun's idea. The target node would have to have the same clean state as the source node, or some way to reconcile it. I think initially starting from a known state is an easier solution. I think these on-demand ideas might be good to explore once we have the basic mechanism working.
Thanks for the input, everyone. I'm going ahead with the single incremental snapshot that will be set at an interval of 100 slots, and full snapshots at 100,000 slots.
One of the implementation details is that I'll need to pass around a slightly different set of parameters for an incremental snapshot vs a full snapshot. I was thinking I could do this by either:
or
There's some pros and cons to both ways, but also neither seems like the clear better way. Given your knowledge of the codebase, and also Rust, does one way sound better than the other?
Also, thinking about this a bit more: since the current snapshot logic is based on multiples of the Account Hash Interval, I could also piggyback on it and do an incremental snapshot for every account hash interval.
The existing snapshot functions would need to have parameters added for being able to handle incremental snapshots. Then AccountsHashVerifier would need some more logic to do either an incremental snapshot or a full snapshot based on the snapshot interval.
Is that easier/better than the other two ways?
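For illustration only, and not necessarily either of the two options alluded to above, one hypothetical way to thread the full-vs-incremental distinction through shared snapshot functions is a small enum parameter:

```rust
type Slot = u64;

#[derive(Clone, Copy, Debug)]
enum SnapshotKind {
    Full,
    /// An incremental snapshot is always relative to the last full snapshot slot.
    Incremental { last_full_snapshot_slot: Slot },
}

/// Which rooted slots a snapshot of the given kind would collect storages for.
fn slots_to_collect(target_slot: Slot, kind: SnapshotKind) -> std::ops::RangeInclusive<Slot> {
    match kind {
        // Full snapshot: everything rooted up to the target slot.
        SnapshotKind::Full => 0..=target_slot,
        // Incremental snapshot: only the slots after the last full snapshot.
        SnapshotKind::Incremental { last_full_snapshot_slot } => {
            assert!(last_full_snapshot_slot < target_slot);
            (last_full_snapshot_slot + 1)..=target_slot
        }
    }
}
```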
Very late to the party here : ) Last week Brooks asked me about the gossip part of this and I just got the chance to take a look how this is impacting gossip.
From my understanding, this will require new values to be gossiped between nodes. So we will save some time+bandwidth when starting a validator, but then adding to gossip bandwidth usage (which is already a problem) by continuously sending new values across entire cluster each time a node has a new incremental snapshot.
So a one-time payload between only 2 nodes is replaced by continuous gossip traffic across all nodes all the time. Is it still given that the trade-off is positive here? I am in particular worried about how much this is going to make gossip worse.
The incremental snapshot values would be updated at roughly the same speed as snapshot values today. The regular snapshot rate will be reduced significantly.
Can we consider this alternative implementation of incremental snapshots:
- Nodes take snapshots at multiples of 100,000 slots as well as recent slots.
- Say node a has a snapshot at slot 812,345. It keeps the snapshot at 800,000 as well (last multiple of 100k).
- When node b wants to start and it already has some snapshot at a multiple of 100k, it tells a what that slot is (and its respective hash).
- If b tells a: "I already have snapshot at 800k with this hash" and it matches what a has, then a will compute the diff between its latest snapshot (i.e. 812,345) and 800k and only return that 812,345 diff 800k.
- If b does not have any 100k snapshots, a will send back both the 800k snapshot and the 812,345 diff 800k (so 2 files but same total bytes as before).

So, effectively the difference is that:

Now, to save disk space, under the hood a may store 812,345 as a diff off 800k, but that is only an optional internal optimization and not exposed outside of the snapshotting code. If node a chose to do so, it will also speed things up when responding to node b.
How does the node know that whatever data it got for the diff of slot 812,345 -> 800,000 is good and matches the rest of the network?
How does the node know that whatever data it got for the diff of slot 812,345 -> 800,000 is good and matches the rest of the network?
I do not think we need to worry about the hash of 812,345 diff 800k. As in mainnet today, we just use and verify hashes at "full" snapshots:

- b knows what the hash of node a's snapshot at 812,345 is, and if it considers that correct compared to the rest of the network.
- Once b fetches 812,345 diff 800k from a, it reconstructs the snapshot at 812,345 and checks if the computed hash for 812,345 is what it was supposed to be. If so, then the diff it obtained is good.

As in mainnet today, we just use and verify hashes at "full" snapshots
Because that's the only case that exists.
check if the computed hash for 812,345 is what was supposed to be. If so then the diff it obtained is good
It has no way to check that other nodes would have created the same hash at that slot as well so the data could be incorrect or malicious.
I do not understand the problem here.

- b knows the correct snapshot hash at slot 812,345 from the cluster.
- b obtains the snapshot at 812,345 in whatever ways. Doesn't matter how so.
- Then computes the hash for snapshot at slot 812,345 and validates against cluster.
- There is no need for the hash of 812,345 diff 800k or anything else.

What is the problem with above?
Where does this step come from:
Then computes the hash for snapshot at slot 812,345 and validates against cluster.
Where in the cluster is the hash for slot 812,345 broadcast?
Where in the cluster is the hash for slot 812,345 broadcast?
Isn't that already part of gossip? https://github.com/solana-labs/solana/blob/f558b9b6b/gossip/src/crds_value.rs#L87
Isn't that already part of gossip? https://github.com/solana-labs/solana/blob/f558b9b6b/gossip/src/crds_value.rs#L87
Is that snapshot hash from here? https://github.com/solana-labs/solana/blob/master/core/src/snapshot_packager_service.rs#L55
If so, then that hash is only pushed out with the (full) snapshot, i.e. every 100k slots in this example.
At the same time, AccountsHashVerifier also pushes out a hash every 100-slots-by-default (IIRC). Maybe that hash can be used? And maybe that means incremental snapshots should only occur on multiples of the AccountsHashVerifier interval?
https://github.com/solana-labs/solana/blob/master/core/src/accounts_hash_verifier.rs#L162
I am suggesting not to change SnapshotHash (or anything else in gossip) from how they work today.
Limit the concept of "incremental snapshot" just to where snapshot send/receive code is implemented, as described here:
https://github.com/solana-labs/solana/issues/17088#issuecomment-861721192
With that design the concept of "incremental snapshot" is encapsulated just to that piece of code which sends/receives the snapshot. Every where else in the code snapshot means "full snapshot", and no other part of the runtime (and in particular gossip) need to be updated to support incremental snapshots.
At the same time, AccountsHashVerifier also pushes out a hash every 100-slots-by-default (IIRC). Maybe that hash can be used?
The snapshot may not be completed for that 100-slot interval and there is a gap between that hash being broadcast for the verifier service and the snapshot being generated and available for download.
Potentially we could re-use the Hash value itself to save some bandwidth but have a signal that indicates that slot
I am suggesting not to change SnapshotHash (or anything else in gossip) from how they work today. Limit the concept of "incremental snapshot" just to where snapshot send/receive code is implemented, as described here: #17088 (comment)
Well, the snapshot download logic would like to know what the parent slot of the incremental snapshot is so it knows if it can use it after downloading it or not.
It could get that information in an out-of-band way I suppose, like an RPC call to ask the node more information about the snapshot.
The only thing that will change from today's mainnet is that: when node b wants to obtain a snapshot from node a:

- b tells a that it already has the snapshot at 800,000 with this hash.
- If that matches what a has, then, instead of returning a full snapshot, a just computes the diff from the 800k snapshot and sends the diff.
- b reconstructs the full snapshot by applying the diff and does the exact same verifications on the full snapshot as it does today.

No other component in the validator runtime needs to know or care how b obtained or constructed the full snapshot.
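A minimal sketch of that exchange (hypothetical types; nothing here is existing code): node b advertises the full snapshot it already has, and node a answers with either just the diff or the full snapshot plus the diff:

```rust
type Slot = u64;
type SnapshotHash = [u8; 32];

/// What node b sends when requesting a snapshot from node a.
struct SnapshotRequest {
    /// The full snapshot b already has locally, if any (slot and its hash).
    have_full_snapshot: Option<(Slot, SnapshotHash)>,
}

/// What node a sends back.
enum SnapshotResponse {
    /// b's full snapshot matches what a has: only the diff archive is returned.
    DiffOnly {
        base_slot: Slot,
        target_slot: Slot,
        diff_archive: Vec<u8>,
    },
    /// No usable base on b's side: return the full snapshot plus the diff on top of it.
    FullPlusDiff {
        full_slot: Slot,
        full_archive: Vec<u8>,
        target_slot: Slot,
        diff_archive: Vec<u8>,
    },
}
```

Either way, b ends up reconstructing the full snapshot at the target slot and verifying its hash exactly as it does today.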
The only thing that will change from today's mainnet is that [...]
This sounds reasonable to me, and seems to only change some specifics around gossip and RPC. A node can still create incremental snapshots at a fixed interval, and serve them when queried by another node.
My takeaways are:
- No changes w.r.t. what an incremental snapshot is
- No changes w.r.t. how/when a node creates incremental snapshots
- No changes w.r.t. loading from an ISS at startup
- (new) Do not change gossip
- (new) RPC will be extended to enable a node to query more info about snapshots, such that node b can tell node a what its last full snapshot is, and then node a can send back the incremental snapshot

@behzadnouri Is my understanding correct? @sakridge Does this sound good to you?
RPC is optional and not recommended for a staked node, so I don't think we should require it for this. One option is using the HTTP port where the snapshot is hosted and putting an endpoint there to ask about the incremental snapshot parent.
RPC is optional and not recommended for a staked node, so I don't think we should require it for this.
I don't think we are, right? This is just for the case when a node wants to fetch a snapshot from another node. The --no-snapshot-fetch path would still be RPC free.
RPC is optional and not recommended for a staked node, so I don't think we should require it for this.
I don't think we are, right? This is just for the case when a node wants to fetch a snapshot from another node. The --no-snapshot-fetch path would still be RPC free.
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Na, currently a node that hosts snapshots needs to run with no less than --minimal-rpc-api
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Na, currently a node that hosts snapshots needs to run with no less than --minimal-rpc-api
Well yea, that's the change I'm talking about. If we are ok with that, then it's ok.
I think it's fine for now. A snapshot server that's not a full node can always be built over --minimal-rpc-api when we want to.
DONE!
Problem
Startup time for nodes is slow when having to download a large snapshot.
Proposed Solution
Incremental snapshots! Instead of always having to download large, full snapshots, download a full snapshot once/less often, then download a small incremental snapshot. The expectation/hope/thought is that only a small number of accounts are touched often, so use incremental snapshots to optimize for that behavior. At startup, now a node with an existing full snapshot only needs to download a new, small incremental snapshot.
SLOTS_PER_EPOCH / 2 - 1?
Example
ISS C = diff(A, C), and so on.
Details
Storing an Incremental Snapshot
Loading from an Incremental Snapshot
Validator
Background Services
AccountsDb
clean_accounts() to add a new parameter, last_full_snapshot_slot, to not clean zero-lamport accounts above the last FSS slot
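A simplified sketch of that constraint only (hypothetical helper, not the real clean_accounts signature): zero-lamport cleanup must not pass the last full snapshot slot, otherwise an incremental snapshot on top of that full snapshot could miss the deletion and resurrect stale account state:

```rust
type Slot = u64;

/// Upper bound for cleaning zero-lamport accounts.
fn max_slot_for_zero_lamport_clean(
    max_clean_root: Slot,
    last_full_snapshot_slot: Option<Slot>,
) -> Slot {
    match last_full_snapshot_slot {
        // Never clean zero-lamport accounts beyond the last full snapshot slot.
        Some(full_snapshot_slot) => max_clean_root.min(full_snapshot_slot),
        // No full snapshot yet: the normal clean bound applies.
        None => max_clean_root,
    }
}
```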
Ledger Tool
RPC
Gossip
CrdsData
Bootstrap
Testing
Unit Tests
snapshot_utils: roundtrip bank to snapshot to bank for FSS
snapshot_utils: roundtrip bank to snapshot to bank for ISS
snapshot_utils: cleanup zero-lamport accounts in slots after FSS
Integration Tests
core/tests/snapshots.rs
local_cluster
Questions
Related Work
Original Snapshot Work
Future Work
Tasks