Actually, the only problem this fixes is that startup takes longer because the node might have to download a large snapshot. Now, if the node has the large 'full' snapshot, then it just needs to download the correct incremental one to apply on top of that. The snapshot extraction speed would be exactly the same, maybe worse, because it now has to combine both snapshots, and the amount of data it is ingesting will be the same at the end of the computation.
Incremental snapshot would be on top of a full snapshot. The idea is that a validator creates a full snapshot maybe every 100,000 slots, and then creates an incremental snapshot every 100 slots. A node joining the network from nothing would have to download the full snapshot and then the latest incremental that applies on top of it.
Some solution steps/ideas:
how often should incremental snapshots be created? 100 slots is what we try to create snapshots at now, I think that's a fine start. Although today, it takes longer than 100 slots to create a snapshot on mainnet-beta.
does it matter when incremental snapshots would be made? Like after/before certain cleanup code? I think the current place in accounts_background_service is fine.
should incremental snapshots only exist locally, or should they also be sent to new nodes? Yes, definitely want to share with other nodes; that's one of the main benefits.
what goes in the incremental snapshot? Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot; this would collect slots from last_full_snapshot_slot to incremental_snapshot_slot (see the sketch after this list).
should full snapshots be created less/same/more frequently now? Less, way less. Once we have some data about how big the incremental grows, then we can better tell.
There's a couple strategies here. One is a fixed interval, and have all nodes in the network try to create at the same block height. That is nice because then a node can get a full snapshot from one node and then it can likely find an incremental one later from another node which used the same full snapshot slot. Incremental snapshots with different parents will obviously be incompatible.
Another could be to create it dynamically based on how quickly the incremental grows. It's expected the incremental will keep getting larger and larger, potentially as large as the full snapshot. At that point, or maybe at 50% of the size of the full, you roll up the state and just create a new full snapshot. This may be necessary if the fixed interval doesn't work well. If a node ends up downloading a full snapshot and an incremental one just as big, then that's not great.
still need full snapshots for a new node joining the network? Yes, hopefully the node has one locally. Since they don't update as often, hopefully that will be the case.
what tests are needed? Obviously a test to make sure it works - yes. Ensure fallback to the full snapshot works if an incremental snapshot is borked - yes; maybe it could just download a new incremental?
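To make "the same data, just a different set of updates" concrete, here is a minimal sketch (hypothetical types and names, not the actual AccountsDb API) of collecting only the storages in the incremental range:

```rust
use std::collections::BTreeMap;

type Slot = u64;
type AppendVecId = usize; // stand-in for the real storage handle

/// Collect the storages that would go into an incremental snapshot:
/// only rooted slots in (last_full_snapshot_slot, incremental_snapshot_slot].
fn incremental_snapshot_storages(
    storages_by_slot: &BTreeMap<Slot, Vec<AppendVecId>>,
    last_full_snapshot_slot: Slot,
    incremental_snapshot_slot: Slot,
) -> Vec<(Slot, Vec<AppendVecId>)> {
    storages_by_slot
        .range(last_full_snapshot_slot + 1..=incremental_snapshot_slot)
        .map(|(slot, storages)| (*slot, storages.clone()))
        .collect()
}
```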
Another idea is that we re-store part of the account state on each slot to keep the append-vec number down to under the number of slots in an epoch. This isn't great, because it will then bloat the size of the incremental snapshots, since those re-stores will look like new updates to the accounts system. I think it would be good to remove this and then support combining accounts from different slots into a single append-vec.
This is the account store code as part of the rent collection: https://github.com/solana-labs/solana/blob/fa86a335b0fd793c3ea2835834fb395a74dcc7fc/runtime/src/bank.rs#L3571
* what goes in the incremental snapshot? Yes, I would say the same data, just a different set of updates. The current snapshot code collects append-vecs for all rooted slots below the snapshot target slot, this would collect slots from `last_full_snapshot_slot` to `incremental_snapshot_slot`
@sakridge Are you envisioning just a single incremental snapshot between full snapshots (a new incremental snapshot replaces the existing one), or multiple? I was assuming multiple, but it doesn't need to be that way.

Multiple

slot 1,000: full snapshot (A)
slot 1,100: incremental snapshot (B)
slot 1,200: incremental snapshot (C)
...
slot 1,x00: incremental snapshot (S)
...
slot 2,000: full snapshot (V)

Incremental snapshot B is the diff from A to B, incremental snapshot C is the diff from B to C, etc.

Single

Same picture as above, but when incremental snapshot C is created, it is the diff from A to C, and then B is deleted. Same for incremental snapshot S, which is the diff from A to S, and all other incremental snapshots are removed.
I was thinking the node would just have a single incremental which would replace all previous incremental ones, but I'm open to arguments for the multiple design (or others?)
I just think with the multiple you will have a lot of duplicated states updated in the subsequent snapshots, and only 1 is useful. So there will be a lot of overlap. I think there are some small set of accounts updated a lot, like once per slot, and there are many more accounts which are only updated very infrequently, like once in a million or more slots.
Hopefully we capture those infrequently updated accounts with the full snapshot, and the incremental captures the frequently updated ones.
edit: Maybe 2 incremental kind of like we have today, so if the newest one is bad, you can fallback to the old one.
I was thinking the node would just have a single incremental which would replace all previous incremental ones,
Sounds good!
Maybe 2 incremental kind of like we have today, so if the newest one is bad, you can fallback to the old one.
I like it.
I was thinking about the interval as well, and thought maybe if feasible it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple that are at 200, then 400, 800, ... etc.
You could even have different threads packaging these at different intervals, or maybe different validators package snapshots at these varying intervals. I.e. some package at a faster rate than others.
The benefit here is I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up. This might also be useful for nodes trying to catch up if they can fast sync small incremental snapshots from near the tip of the network.
I was thinking about the interval as well, and thought maybe if feasible it would be cool to have snapshots cascading into different intervals. For instance, a couple that are 100 from the tip, a couple that are at 200, then 400, 800, ... etc.
The benefit here is I think most validators who shut down recently don't need to download a large incremental snapshot that's 1/2 the size of the full snapshot, and can grab a few smaller ones to catch up.
@carllin This design sounds like it falls under the "Multiple" category of number-of-incremental-snapshots-between-full-snapshots. Is that right?
Another way to do it:
Split the snapshot across a range of slots, then keep track of every instance of clean_accounts and shrink_slots, and keep a dirty set of stores so that the new snapshot slices only need to be regenerated for those slots that had changed data.
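A rough sketch of that bookkeeping (hypothetical structure; clean_accounts and shrink_slots are the operations being tracked, everything else here is illustrative):

```rust
use std::collections::BTreeSet;

type Slot = u64;

/// Tracks which slots had their stores touched by clean/shrink since the last
/// snapshot slices were generated, so only those slices need regenerating.
#[derive(Default)]
struct DirtyStores {
    dirty_slots: BTreeSet<Slot>,
}

impl DirtyStores {
    /// Called whenever clean_accounts or shrink_slots rewrites a slot's stores.
    fn mark_dirty(&mut self, slot: Slot) {
        self.dirty_slots.insert(slot);
    }

    /// Drain the set of slots whose snapshot slices must be regenerated.
    fn take_dirty(&mut self) -> BTreeSet<Slot> {
        std::mem::take(&mut self.dirty_slots)
    }
}
```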
I think one of the nice things about how we've organized storage entries by slot currently is that the set of storage entries forms a natural diff tracker.
So for instance if we were taking a snapshot every 100 slots:
Slot 100, Slot 200, Slot 300
Then as long as you guarantee clean doesn't progress past the slot while you're grabbing a copy of the storage entries (max_clean_root acts as the guard 😃), the diff from 100 to 200 would just be the storage entries for slots in the range (100, 200], the diff from 200 to 300 would just be the storage entries in the range (200, 300].
And this can be expanded to arbitrary intervals, 200, 400, etc. Also while you're generating the 100 slot diff lets say from (200, 300], you could use that to then generate the 200 slot diff from (100, 300] ( because they utilize the same set of storage entries in the overlapping range), which could be the "accumulator" incremental snapshot Stephen described earlier.
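For illustration, a small sketch (assumed shapes, not actual Solana code) of how the wider (100, 300] diff could be assembled from the same per-slot storage entries already gathered for the (100, 200] and (200, 300] diffs:

```rust
use std::collections::BTreeMap;

type Slot = u64;
type Storages = Vec<usize>; // stand-in for one slot's storage entries

/// Union two diffs built from disjoint, adjacent slot ranges, e.g.
/// (100, 200] and (200, 300], into the wider (100, 300] diff.
fn union_adjacent_diffs(
    earlier: &BTreeMap<Slot, Storages>,
    later: &BTreeMap<Slot, Storages>,
) -> BTreeMap<Slot, Storages> {
    earlier
        .iter()
        .chain(later.iter())
        .map(|(slot, storages)| (*slot, storages.clone()))
        .collect()
}
```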
For v1 one approach is to just focus on using a fixed, configurable interval N. Using this, I could imagine a tiered system where certain validators are generating snapshots at varying intervals to provide better coverage of the space. For instance let's say we had one set of validators generating every 100 slots, another set 1000 slots, and another 10,000 slots. Then once the 100 interval validators have generated 10 such diffs, they can start dropping the earliest one since they know that range is now covered by the validators who generated the 1000 slot interval. And the validators who generated the 1000 slot intervals could do the same by relying on the validator with 10,000 slots coverage.
hey, late for the party. :)
I thought on this a bit, and I may have some more color to add.
Still inheriting the delta slot interval design discussed so far, I think we can forgo generating those delta snapshots at some guessed intervals. Then we can realize yet faster startup? My idea is to reuse the accounts dir across reboots and a rather dumb (and hopefully easy-to-secure) ondemand delta snapshot api endpoint like ./snapshot/accounts/1111X00-1111Y00.zst, with little system load for trusted validators?
This design can come later, but this mostly overlaps the delta snapshot archive generation and download code.
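As a tiny illustration of the proposed endpoint (hypothetical helper; only the path format is from the proposal above), the booting node would build the request from its own local root and the tip slot seen in gossip:

```rust
type Slot = u64;

/// Build the ondemand delta snapshot path, e.g. "/snapshot/accounts/111110000-111120000.zst".
/// The start slot comes from the locally deserialized rooted bank; the end slot comes from gossip.
fn ondemand_delta_snapshot_path(local_root_slot: Slot, gossip_tip_slot: Slot) -> String {
    format!("/snapshot/accounts/{}-{}.zst", local_root_slot, gossip_tip_slot)
}
```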
the new restart flow:
- fetch /snapshot/accounts/1111XX00-1111YY00.zst, where XX is from the deserialized rooted bank and YY is from gossip.
- grab Arc<AppendVec> like the snapshot generation process.

my assumptions and observations:
OK, looks like there are a few different ways that I could go to implement incremental snapshots (or whatever we call it). I don't know how to quantify what way is the right way to go though. Who should be the one to pick? Or, are there additional data points that I can gather that would make the decision clear?
@carllin @sakridge could you share initial thoughts on my alternative idea?: https://github.com/solana-labs/solana/issues/17088#issuecomment-839905418
@ryoqun
serialize root bank and the whole index into ledger dir (yeah, would be 10-15G) (this could write secondary index as well) and prune unrooted older slots? then finally mark the dir as successfully finalized for next reboot
As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.
Some sanity range check against too large slot range. Grab Arc<AppendVec> like snapshot generation process. Read those appendvecs in order while removing updated accounts (this is to compress well the final delta snapshot); for this, we can reuse AccountsIndex reconstruction code needing relatively small memory. For zero-lamport accounts, I think we need to pause clean a bit? Then create final sorted and shrunken appendvec and stream it to the booting node via zstd compression
From my understanding, this is an on demand streaming of AccountsDb storages. However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank etc.) for nodes who crashed/corrupted their serialized/saved state.
As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.
For instance, if we have some nodes packaging incremental snapshots every 100 slots, and the packaging process takes 500 slots, then: 1) the on demand snapshot will be 500 slots behind by the time the snapshot arrives, 2) if you just took the last incremental snapshot, it would only be 100 slots behind.
This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting.
As long as you build in a mechanism to ensure this serialization happens at a consistent point in time between clean/shrink, this should work.
thanks for confirming! yeah, hopefully restarting by itself should make this synchronization requirement easy. :)
However, it seems like you'll still need to support a full snapshot fetch (not just the accounts storage, but also the status cache, bank etc.)
Yeah, but I think this should be rare.
as for the delta snapshot, I think status cache and bank should be small compared to the accounts storage. So, I omitted to mention them. Maybe the ondemand snapshot endpoint processing needs to briefly grab the root bank and stash those binaries and include them in the delta snapshot archive.
As for how fast on demand snapshotting is compared to incremental snapshots that happen at regular intervals, I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I. ....
Thanks for great analysis. :)
Firstly, I just noticed that avoidance of purging accounts and recomputing the index can be realized for both incremental and ondemand snapshotting if both just restart with proper coordination, like saving the incremental snapshot locally before exiting (CC: @brooksprumo). Originally, I thought this was only possible because ondemand can freely specify the start slot for fetching the snapshot dynamically. Maybe you can select and download an incremental snapshot which slightly overlaps with the local root to take the same optimization?
This implies to me that having some nodes doing regular snapshotting at different varying intervals I might be better for catchup than on demand snapshotting
Oh, I'm getting a clue. The "different varying intervals" part. So, are you assuming a newly-restarted validator usually needs to fetch 1 incremental snapshot in the normal case? I was originally concerned about wasted bandwidth for the case of multiple incremental snapshot downloads (I blindly thought about 2-3 deltas, hard-coded 100 slot interval). The wasted bandwidth comes from the fact that multiple incremental snapshots would contain a substantial amount of outdated (duplicated) account state between them. (ondemand snapshot tries to hard de-duplicate it)
I think on demand snapshotting is strictly worse if the time to take a snapshot X is greater than the time for the interval I.
yeah, that's true. However, I thought we could offset that delay with vastly reduced network bandwidth by de-duplicating data across equivalent multiple incremental snapshots. As said above, if we can realize 1 delta snapshot download per restart in the normal case, the merit of ondemand snapshots is small, though. :)
Also, I don't think ondemand snapshotting takes so long. Recent appendvecs should generally be in the pagecache and index creation is basically linear processing, so it is bound by disk read bandwidth (unlike an incremental snapshot there is no need to write an archive; it only needs to grab the mmap).
Handling the zero-lamport account updates sounds hard to me for @ryoqun's idea. The target node would have to have the same clean state as the source node, or some way to reconcile it. I think initially starting from a known state is an easier solution. I think these on-demand ideas might be good to explore once we have the basic mechanism working.
Thanks for the input, everyone. I'm going ahead with the single incremental snapshot that will be set at an interval of 100 slots, and full snapshots at 100,000 slots.
One of the implementation details is that I'll need to pass around a slightly different set of parameters for an incremental snapshot vs a full snapshot. I was thinking I could do this by either:
or
There's some pros and cons to both ways, but also neither seems like the clear better way. Given your knowledge of the codebase, and also Rust, does one way sound better than the other?
Also, thinking about this a bit more: since the current snapshot logic is based on multiples of the Account Hash Interval, I could also piggyback on it and do an incremental snapshot for every account hash interval.
The existing snapshot functions would need to have parameters added for being able to handle incremental snapshots. Then AccountsHashVerifier would need some more logic to do either an incremental snapshot or a full snapshot based on the snapshot interval.
Is that easier/better than the other two ways?
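For illustration only, and not necessarily either of the two options alluded to above, one hypothetical way to thread the full-vs-incremental distinction through shared snapshot functions is a small enum parameter:

```rust
type Slot = u64;

#[derive(Clone, Copy, Debug)]
enum SnapshotKind {
    Full,
    /// An incremental snapshot is always relative to the last full snapshot slot.
    Incremental { last_full_snapshot_slot: Slot },
}

/// Which rooted slots a snapshot of the given kind would collect storages for.
fn slots_to_collect(target_slot: Slot, kind: SnapshotKind) -> std::ops::RangeInclusive<Slot> {
    match kind {
        // Full snapshot: everything rooted up to the target slot.
        SnapshotKind::Full => 0..=target_slot,
        // Incremental snapshot: only the slots after the last full snapshot.
        SnapshotKind::Incremental { last_full_snapshot_slot } => {
            assert!(last_full_snapshot_slot < target_slot);
            (last_full_snapshot_slot + 1)..=target_slot
        }
    }
}
```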
Very late to the party here : ) Last week Brooks asked me about the gossip part of this and I just got the chance to take a look how this is impacting gossip.
From my understanding, this will require new values to be gossiped between nodes. So we will save some time+bandwidth when starting a validator, but then adding to gossip bandwidth usage (which is already a problem) by continuously sending new values across entire cluster each time a node has a new incremental snapshot.
So a one-time payload between only 2 nodes is replaced by continuous gossip traffic across all nodes all the time. Is it still given that the trade-off is positive here? I am in particular worried about how much this is going to make gossip worse.
The incremental snapshot values would be updated at roughly the same speed as snapshot values today. The regular snapshot rate will be reduced significantly.
Can we consider this alternative implementation of incremental snapshots:
- Nodes take snapshots at multiples of 100,000 slots as well as recent slots.
- Say node a has a snapshot at slot 812,345. It keeps the snapshot at 800,000 as well (last multiple of 100k).
- When node b wants to start and it already has some snapshot at a multiple of 100k, it tells a what that slot is (and its respective hash).
- If b tells a: "I already have snapshot at 800k with this hash" and it matches what a has, then a will compute the diff between its latest snapshot (i.e. 812,345) and 800k and only return that 812,345 diff 800k.
- If b does not have any 100k snapshots, a will send back both the 800k snapshot and the 812,345 diff 800k (so 2 files but same total bytes as before).

So, effectively the difference is that:

Now, to save disk space, under the hood a may store 812,345 as a diff off 800k, but that is only an optional internal optimization and not exposed outside of the snapshotting code. If node a chose to do so, it will also speed things up when responding to node b.
How does the node know that whatever data it got for the diff of slot 812,345 -> 800,000 is good and matches the rest of the network?
How does the node know that whatever data it got for the diff of slot 812,345 -> 800,000 is good and matches the rest of the network?
I do not think we need to worry about the hash of 812,345 diff 800k. As in mainnet today, we just use and verify hashes at "full" snapshots:

- b knows what the hash of node a's snapshot at 812,345 is, and if it considers that correct compared to the rest of the network.
- Once b fetches 812,345 diff 800k from a, it reconstructs the snapshot at 812,345 and checks if the computed hash for 812,345 is what it was supposed to be. If so, then the diff it obtained is good.

As in mainnet today, we just use and verify hashes at "full" snapshots
Because that's the only case that exists.
check if the computed hash for 812,345 is what was supposed to be. If so then the diff it obtained is good
It has no way to check that other nodes would have created the same hash at that slot as well so the data could be incorrect or malicious.
I do not understand the problem here.

- b knows the correct snapshot hash at slot 812,345 from the cluster.
- b obtains the snapshot at 812,345 in whatever ways. Doesn't matter how so.
- Then computes the hash for snapshot at slot 812,345 and validates against cluster.
- There is no need for the hash of 812,345 diff 800k or anything else.

What is the problem with above?
Where does this step come from:
Then computes the hash for snapshot at slot 812,345 and validates against cluster.
Where in the cluster is the hash for slot 812,345 broadcast?
Where in the cluster is the hash for slot 812,345 broadcast?
Isn't that already part of gossip? https://github.com/solana-labs/solana/blob/f558b9b6b/gossip/src/crds_value.rs#L87
Isn't that already part of gossip? https://github.com/solana-labs/solana/blob/f558b9b6b/gossip/src/crds_value.rs#L87
Is that snapshot hash from here? https://github.com/solana-labs/solana/blob/master/core/src/snapshot_packager_service.rs#L55
If so, then that hash is only pushed out with the (full) snapshot, i.e. every 100k slots in this example.
At the same time, AccountsHashVerifier also pushes out a hash every 100-slots-by-default (IIRC). Maybe that hash can be used? And maybe that means incremental snapshots should only occur on multiples of the AccountsHashVerifier interval?
https://github.com/solana-labs/solana/blob/master/core/src/accounts_hash_verifier.rs#L162
I am suggesting not to change SnapshotHash (or anything else in gossip) from how they work today.
Limit the concept of "incremental snapshot" just to where snapshot send/receive code is implemented, as described here:
https://github.com/solana-labs/solana/issues/17088#issuecomment-861721192
With that design the concept of "incremental snapshot" is encapsulated just to that piece of code which sends/receives the snapshot. Every where else in the code snapshot means "full snapshot", and no other part of the runtime (and in particular gossip) need to be updated to support incremental snapshots.
At the same time, AccountsHashVerifier also pushes out a hash every 100-slots-by-default (IIRC). Maybe that hash can be used?
The snapshot may not be completed for that 100-slot interval and there is a gap between that hash being broadcast for the verifier service and the snapshot being generated and available for download.
Potentially we could re-use the Hash value itself to save some bandwidth but have a signal that indicates that slot
I am suggesting not to change SnapshotHash (or anything else in gossip) from how they work today. Limit the concept of "incremental snapshot" just to where snapshot send/receive code is implemented, as described here: #17088 (comment)
Well, the snapshot download logic would like to know what the parent slot of the incremental snapshot is so it knows if it can use it after downloading it or not.
It could get that information in an out-of-band way I suppose, like an RPC call to ask the node more information about the snapshot.
The only thing that will change from today's mainnet is that: when node b wants to obtain a snapshot from node a:

- b tells a that it already has the snapshot at 800,000 with this hash.
- If that matches what a has, then, instead of returning a full snapshot, a just computes the diff from the 800k snapshot and sends the diff.
- b reconstructs the full snapshot by applying the diff and does the exact same verifications on the full snapshot as it does today.

No other component in the validator runtime needs to know or care how b obtained or constructed the full snapshot.
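A minimal sketch of that exchange (hypothetical types; nothing here is existing code): node b advertises the full snapshot it already has, and node a answers with either just the diff or the full snapshot plus the diff:

```rust
type Slot = u64;
type SnapshotHash = [u8; 32];

/// What node b sends when requesting a snapshot from node a.
struct SnapshotRequest {
    /// The full snapshot b already has locally, if any (slot and its hash).
    have_full_snapshot: Option<(Slot, SnapshotHash)>,
}

/// What node a sends back.
enum SnapshotResponse {
    /// b's full snapshot matches what a has: only the diff archive is returned.
    DiffOnly {
        base_slot: Slot,
        target_slot: Slot,
        diff_archive: Vec<u8>,
    },
    /// No usable base on b's side: return the full snapshot plus the diff on top of it.
    FullPlusDiff {
        full_slot: Slot,
        full_archive: Vec<u8>,
        target_slot: Slot,
        diff_archive: Vec<u8>,
    },
}
```

Either way, b ends up reconstructing the full snapshot at the target slot and verifying its hash exactly as it does today.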
The only thing that will change from today's mainnet is that [...]
This sounds reasonable to me, and seems to only change some specifics around gossip and RPC. A node can still create incremental snapshots at a fixed interval, and serve them when queried by another node.
My takeaways are:
- No changes w.r.t. what an incremental snapshot is
- No changes w.r.t. how/when a node creates incremental snapshots
- No changes w.r.t. loading from an ISS at startup
- (new) Do not change gossip
- (new) RPC will be extended to enable a node to query more info about snapshots, such that node b can tell node a what its last full snapshot is, and then node a can send back the incremental snapshot

@behzadnouri Is my understanding correct? @sakridge Does this sound good to you?
RPC is optional and not recommended for a staked node, so I don't think we should require it for this. One option is using the HTTP port where the snapshot is hosted and putting an endpoint there to ask about the incremental snapshot parent.
RPC is optional and not recommended for a staked node, so I don't think we should require it for this.
I don't think we are, right? This is just for the case when a node wants to fetch a snapshot from another node. The --no-snapshot-fetch path would still be RPC free.
RPC is optional and not recommended for a staked node, so I don't think we should require it for this.
I don't think we are, right? This is just for the case when a node wants to fetch a snapshot from another node. The --no-snapshot-fetch path would still be RPC free.
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Na, currently a node that hosts snapshots needs to run with no less than --minimal-rpc-api
Any node that hosts snapshots would have to enable RPC as well where they didn't have to before.
Na, currently a node that hosts snapshots needs to run with no less than --minimal-rpc-api
Well yea, that's the change I'm talking about. If we are ok with that, then it's ok.
I think it's fine for now. A snapshot server that's not a full node can always be built over --minimal-rpc-api when we want to.
DONE!
Problem
Startup time for nodes is slow when having to download a large snapshot.
Proposed Solution
Incremental snapshots! Instead of always having to download large, full snapshots, download a full snapshot once/less often, then download a small incremental snapshot. The expectation/hope/thought is that only a small number of accounts are touched often, so use incremental snapshots to optimize for that behavior. At startup, now a node with an existing full snapshot only needs to download a new, small incremental snapshot.
SLOTS_PER_EPOCH / 2 - 1?
Example
ISS C = diff(A, C), and so on.
Details
Storing an Incremental Snapshot
Loading from an Incremental Snapshot
Validator
Background Services
AccountsDb
clean_accounts() to add a new parameter, last_full_snapshot_slot, to not clean zero-lamport accounts above the last FSS slot
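A simplified sketch of that constraint only (hypothetical helper, not the real clean_accounts signature): zero-lamport cleanup must not pass the last full snapshot slot, otherwise an incremental snapshot on top of that full snapshot could miss the deletion and resurrect stale account state:

```rust
type Slot = u64;

/// Upper bound for cleaning zero-lamport accounts.
fn max_slot_for_zero_lamport_clean(
    max_clean_root: Slot,
    last_full_snapshot_slot: Option<Slot>,
) -> Slot {
    match last_full_snapshot_slot {
        // Never clean zero-lamport accounts beyond the last full snapshot slot.
        Some(full_snapshot_slot) => max_clean_root.min(full_snapshot_slot),
        // No full snapshot yet: the normal clean bound applies.
        None => max_clean_root,
    }
}
```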
Ledger Tool
RPC
Gossip
CrdsData
Bootstrap
Testing
Unit Tests
snapshot_utils: roundtrip bank to snapshot to bank for FSS
snapshot_utils: roundtrip bank to snapshot to bank for ISS
snapshot_utils: cleanup zero-lamport accounts in slots after FSS
Integration Tests
core/tests/snapshots.rs
local_cluster
Questions
Related Work
Original Snapshot Work
Future Work
Tasks