feat(state-sync): sync to the current epoch instead of the previous

marcelo-gonzalez commented 1 month ago

When a node processes a block that’s the first block of epoch T, and it realizes that it will need to track shards that it doesn’t currently track in epoch T+1, it syncs state to the end of epoch T-1 and then applies chunks until it’s caught up. We want to change this so that it syncs state to epoch T instead, so that the integration of state sync/catchup and resharding will be simpler.

In this PR, this is done by keeping most of the state sync logic unchanged, but changing the “sync_hash” that’s used to identify what point in the chain we want to sync to. Before, “sync_hash” was set to the first block of an epoch, and the existing state sync logic would have us sync the state as of two chunks before this hash. So here we change the sync hash to be the hash of the first block for which at least two new chunks have been seen for each shard in its epoch. This allows us to sync state to epoch T with minimal modifications, because the old logic is still valid.

Note that this PR does not implement support for this new way of syncing for nodes that have fallen several epochs behind the chain, rather than nodes that need to catchup for an upcoming epoch. This can be done in a future PR

marcelo-gonzalez commented 1 month ago

Btw, for reviewers, this PR is not quite ready to be submitted because it is only minimally tested on localnet, where I just checked that it syncs properly. You can try it with this:

diff --git a/core/primitives/src/epoch_manager.rs b/core/primitives/src/epoch_manager.rs
index 281be4399..b73b9348a 100644
--- a/core/primitives/src/epoch_manager.rs
+++ b/core/primitives/src/epoch_manager.rs
@@ -166,6 +166,8 @@ impl AllEpochConfig {
     pub fn generate_epoch_config(&self, protocol_version: ProtocolVersion) -> EpochConfig {
         let mut config = self.genesis_epoch_config.clone();

+        config.validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true;
+
         Self::config_mocknet(&mut config, &self.chain_id);

         if !self.use_production_config {

Then if you run the transactions.py pytest, it should be able to finish after nodes sync state in the new way (you might have to comment out some asserts in that test that fail sometimes, looks unrelated but will check it)

So before submitting, I need to try this with more meaningful state and traffic/receipts, probably on forknet. Also would be good to add some integration tests, and fix whichever integration tests or pytests might have been broken by this. Also the FIXME comment in this PR needs to be fixed before I can submit this. But in any case, it should mostly be ready for review

One thing to be decided in this PR review is whether the gating via current protocol version I put in there looks okay. It feels kind of ugly to me, but it might be the easiest way to go

marcelo-gonzalez commented 1 month ago

I actually just removed the test_mock_node_basic() test, since I think it's kind of outdated anyway. hopefully nobody objects to that... I kind of have some plans to just delete the hacky part of the mock-node code that generates home dirs in favor of just providing the home dirs from mainnet or testnet up-front anyway

marcelo-gonzalez commented 1 month ago

It may be easier to review and merge this if you can split it down into smaller PRs. Not sure if it makes sense but something to consider.

Actually here I tried to make the individual commits inside this PR somewhat well contained on their own, so hopefully those are easier to look at?

marcelo-gonzalez commented 1 month ago

So added a DB migration to store a new cleaner version of the StateSyncInfo struct. I added a state_sync_version field to it currently just set to 0, but I think it might be helpful if we want to change things about the meaning of the sync hash in the future. We really should only need to wait for one new chunk per shard, and the reason we wait for two in this PR is because it's easier to implement a first version doing that. But if we want to make that change then we'll need to indicate what we mean by the sync hash somehow.

Also see this commit heheh... https://github.com/near/nearcore/pull/12102/commits/78d6a756d5dbb5347b1d57c78cb1170dd356f832 . I removed the part that says "before submitting", because I think it's probably okay to submit it as-is for now, since it's mostly going to be fine as long as we make sure to fix it before this makes it into a release. Fixing that seems like a pretty well-scoped individual PR that would be a little easier to review on its own

let me know what ppl think

marcelo-gonzalez commented 1 month ago

also there's a test here to make sure the migration from the previous StateSyncInfo to the new works okay: https://github.com/marcelo-gonzalez/nearcore/blob/catchup-migration-test/pytest/tests/sanity/state_sync_info_migration.py

don't think that's really worth adding here though, but good to sanity check it once

marcelo-gonzalez commented 4 weeks ago

aghh, just noticed the nightly test is failing... will fix it tmr

marcelo-gonzalez commented 1 week ago

@wacban updated with fixed tests and more commits addressing comments, PTAL. There's a decent amount changed from last time, but I tried to keep the commits small, so I think it will be much easier to review if you just look at the individual commits from last time you looked. Also haven't merged it with master for now because I heard there were some state sync problems on master, but will do that later

marcelo-gonzalez commented 1 week ago

fyi, @wacban added a couple more commits to fix tests. Note that in 9991c89c9295874d853de15b48abff1a30d4e072 I'm returning early from get_sync_hash() for the genesis block and in 99f98a5f64edabb9dfb4bc461f95da93c4569d5e just not even checking what happens with the genesis block in tests because I don't think we ever want to be using the genesis block here in any case. ClientActorInner::find_sync_hash() even has an assertion against it. Let me know if you have objections though :)

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 77.20930% with 98 lines in your changes missing coverage. Please review.

Project coverage is 71.24%. Comparing base (10463b2) to head (286693b). Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
chain/client/src/client.rs	80.00%	14 Missing and 12 partials :warning:
chain/chain/src/chain.rs	81.30%	8 Missing and 12 partials :warning:
core/store/src/migrations.rs	0.00%	19 Missing :warning:
tools/state-viewer/src/state_parts.rs	0.00%	12 Missing :warning:
core/store/src/lib.rs	0.00%	8 Missing :warning:
chain/client/src/client_actor.rs	53.33%	5 Missing and 2 partials :warning:
chain/client/src/test_utils/client.rs	0.00%	3 Missing :warning:
chain/chain/src/store_validator/validate.rs	0.00%	2 Missing :warning:
nearcore/src/migrations.rs	0.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #12102 +/- ## ========================================== - Coverage 71.54% 71.24% -0.31% ========================================== Files 838 838 Lines 168748 168911 +163 Branches 168748 168911 +163 ========================================== - Hits 120725 120334 -391 - Misses 42767 43338 +571 + Partials 5256 5239 -17 ``` | [Flag](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | Coverage Δ | | |---|---|---| | [backward-compatibility](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `0.16% <0.00%> (-0.01%)` | :arrow_down: | | [db-migration](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `0.16% <0.00%> (-0.01%)` | :arrow_down: | | [genesis-check](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `1.23% <0.00%> (-0.01%)` | :arrow_down: | | [integration-tests](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `39.05% <77.20%> (+0.14%)` | :arrow_up: | | [linux](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `70.68% <48.37%> (-0.39%)` | :arrow_down: | | [linux-nightly](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `70.83% <77.20%> (-0.29%)` | :arrow_down: | | [macos](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `50.34% <8.06%> (-3.81%)` | :arrow_down: | | [pytests](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `1.54% <0.00%> (-0.01%)` | :arrow_down: | | [sanity-checks](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `1.35% <0.00%> (-0.01%)` | :arrow_down: | | [unittests](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `64.17% <10.48%> (-1.14%)` | :arrow_down: | | [upgradability](https://app.codecov.io/gh/near/nearcore/pull/12102/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near) | `0.21% <0.00%> (?)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=near#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

near / nearcore

feat(state-sync): sync to the current epoch instead of the previous #12102

Codecov Report