paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

`warp_sync` hanging on `AwaitPeers` if `warp_sync_params` is None #14

Closed · brunopgalvao closed 1 month ago

brunopgalvao commented 1 year ago

Description of bug

If warp_sync_params is set to None:

warp_sync_params: None

warp_sync will hang on AwaitingPeers: https://github.com/paritytech/substrate/blob/53c58fbc3dc6060a2f611a8d7adfdeea655fe324/client/network/sync/src/lib.rs#L516-L517

Steps to reproduce

Set warp_sync_params here to None.

arkpar commented 1 year ago

After https://github.com/paritytech/substrate/pull/12761, if a parachain does not configure warp_sync_params but the user still passes SyncMode::Warp, the sync ends up in the AwaitingPeers state forever, which is indeed incorrect. I can propose two possible solutions here (sketched after the list below):

  1. Move warp_sync_params into the SyncMode::Warp enum variant, so that it is always set when warp sync is enabled.
  2. Fall back to SyncMode::Full when warp_sync_params is missing, and maybe print a misconfiguration warning.
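
Both options can be illustrated with a minimal, self-contained sketch (the types below are stand-ins, not the actual Substrate definitions): option 1 carries the parameters inside the Warp variant so warp sync cannot be requested without them, and option 2 degrades to full sync with a warning when they are missing.

// Stand-in for `WarpSyncParams` from `sc-network-common`.
pub struct WarpSyncParams;

// Option 1: the parameters live inside the `Warp` variant, so enabling warp
// sync without them is not representable.
pub enum SyncMode {
    Full,
    Warp { params: WarpSyncParams },
}

// Option 2: fall back to full sync (with a warning) when warp sync is
// requested but no parameters were configured.
pub fn effective_sync_mode(warp_requested: bool, params: Option<WarpSyncParams>) -> SyncMode {
    match (warp_requested, params) {
        (true, Some(params)) => SyncMode::Warp { params },
        (true, None) => {
            eprintln!("warp sync requested but warp_sync_params is None; falling back to full sync");
            SyncMode::Full
        }
        (false, _) => SyncMode::Full,
    }
}

fn main() {
    // Instead of hanging in `AwaitingPeers`, the misconfiguration degrades to full sync.
    let _mode = effective_sync_mode(true, None);
}
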
shunsukew commented 1 year ago

Actually, in my testing warp_sync_params is set to Some(WarpSyncParams::WaitForTarget(target_block)): https://github.com/AstarNetwork/Astar/pull/901/files and https://github.com/paritytech/cumulus/blob/d6eef144421ef5c3f339f681484d06bb729dfa82/client/service/src/lib.rs#L303.

It downloads state for a certain period of time, but then gets halted at AwaitingPeers. Restarting the process resumes the state download.

bkchr commented 1 year ago

@shunsukew how many peers are you connected to?

shunsukew commented 1 year ago

@bkchr 41 peers last time I reported https://substrate.stackexchange.com/questions/8078/parachain-warp-sync-get-stuck-at-awaitingpeers-status.

bkchr commented 1 year ago

@shunsukew your problem is something different! If you run the warp sync, you will see that it prints out an error when it finishes downloading the state. The warp sync is then stuck, as you are seeing.

The first thing you need to do is add this "fix":

diff --git a/bin/collator/src/parachain/shell_upgrade.rs b/bin/collator/src/parachain/shell_upgrade.rs
index acf7a3cc..b0858139 100644
--- a/bin/collator/src/parachain/shell_upgrade.rs
+++ b/bin/collator/src/parachain/shell_upgrade.rs
@@ -21,7 +21,7 @@
 use cumulus_client_consensus_common::{ParachainCandidate, ParachainConsensus};
 use cumulus_primitives_core::relay_chain::{Hash as PHash, PersistedValidationData};
 use futures::lock::Mutex;
-use sc_consensus::{import_queue::Verifier as VerifierT, BlockImportParams};
+use sc_consensus::{import_queue::Verifier as VerifierT, BlockImportParams, ForkChoiceStrategy};
 use sp_api::ApiExt;
 use sp_consensus::CacheKeyId;
 use sp_consensus_aura::{sr25519::AuthorityId as AuraId, AuraApi};
@@ -113,7 +113,7 @@ where
 {
     async fn verify(
         &mut self,
-        block_import: BlockImportParams<Block, ()>,
+        mut block_import: BlockImportParams<Block, ()>,
     ) -> Result<
         (
             BlockImportParams<Block, ()>,
@@ -121,6 +121,18 @@ where
         ),
         String,
     > {
+        // Skip checks that include execution, if being told so or when importing only state.
+        //
+        // This is done for example when gap syncing and it is expected that the block after the gap
+        // was checked/chosen properly, e.g. by warp syncing to this block using a finality proof.
+        // Or when we are importing state only and can not verify the seal.
+        if block_import.with_state() || block_import.state_action.skip_execution_checks() {
+            // When we are importing only the state of a block, it will be the best block.
+            block_import.fork_choice = Some(ForkChoiceStrategy::Custom(block_import.with_state()));
+
+            return Ok((block_import, None));
+        }
+
         let block_hash = *block_import.header.parent_hash();

         if self

This will skip the verification when we import state changes directly. However, you will still see an issue about the storage root being "invalid". The problem here is the following PR: https://github.com/AstarNetwork/Astar/pull/567

It sets state_version: 1 without any migration having been run. This means that the state you download is partly state_version = 0 and partly state_version = 1, which then leads to a state root that does not match the one stored in the header. You will need to add the state migration pallet to your runtime and let the entire state migrate to state_version = 1; then warp sync should work with the patch I posted above.
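
To make the wrong-state-root point concrete, here is a toy, std-only sketch (this is not the real trie code; the actual behavioural difference is that state version 1 references values of 33 bytes or more by hash instead of inlining them): encoding the same key/value set with mixed versions yields a different root than the pure state_version = 1 root recorded in the header.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Encode a single (key, value) "node" under the given state version.
fn encode_node(key: &[u8], value: &[u8], version: u8) -> Vec<u8> {
    let mut out = key.to_vec();
    if version == 1 && value.len() >= 33 {
        // v1: large values are referenced by hash instead of being inlined.
        out.extend_from_slice(&hash_bytes(value).to_le_bytes());
    } else {
        // v0 (or small values in v1): the value is inlined.
        out.extend_from_slice(value);
    }
    out
}

// Toy "root": a hash over the concatenated node encodings.
fn root(entries: &[(&[u8], &[u8])], versions: &[u8]) -> u64 {
    let mut all = Vec::new();
    for (&(key, value), version) in entries.iter().zip(versions) {
        all.extend(encode_node(key, value, *version));
    }
    hash_bytes(&all)
}

fn main() {
    let big = vec![7u8; 64];
    let entries = [(b"small".as_slice(), b"x".as_slice()), (b"big".as_slice(), big.as_slice())];

    // The header commits to a pure v1 root, but the big value was written
    // before the switch to v1 and is still encoded as v0.
    let pure_v1 = root(&entries, &[1, 1]);
    let hybrid = root(&entries, &[1, 0]);
    assert_ne!(pure_v1, hybrid, "mixed state versions yield a different root");
    println!("pure v1 root: {pure_v1:#x}");
    println!("hybrid root:  {hybrid:#x}");
}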

@cheme should be able to provide you some guide on how to setup the state migration pallet for parachains.

CC @akru

shunsukew commented 1 year ago

Thank you so much for checking and letting us know the required changes. We will conduct the state_version migration. It would probably be better to continue my topic in Astar's repo, since we have a different problem from this GitHub issue; let me mention you there: https://github.com/AstarNetwork/Astar/pull/901#issuecomment-1519369812

As for the Verifier fix: this Verifier is a wrapper around AuraVerifier and RelayChainVerifier, and inside AuraVerifier's verify function it already performs the check below, as you suggested.

if block.with_state() || block.state_action.skip_execution_checks() {
    // When we are importing only the state of a block, it will be the best block.
    block.fork_choice = Some(ForkChoiceStrategy::Custom(block.with_state()));

    return Ok((block, Default::default()))
}

So, considering that Astar has had the Aura runtime API from the beginning, I think this shouldn't be directly related to the problem we have. (The check is necessary for Shiden, but I'm only testing Astar warp sync.)

cheme commented 1 year ago

@shunsukew, a small write-up, as the info may be a bit sparse.

If the chain was not using state_version at first and switched to version 1 afterward, then it is very likely that the state is in a "hybrid" mode: some trie nodes encoded with state version 0 and some with state version 1.

To check that, the first step would be to run this RPC: state.trieMigrationStatus

The RPC just iterates over all of the chain's trie nodes and looks for state version 0 entries that should be migrated to state version 1 (https://github.com/paritytech/cumulus/pull/1424).
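
As a hedged sketch, the status check can also be done programmatically with jsonrpsee (the JSON-RPC method name state_trieMigrationStatus, the port, and the shape of the result are assumptions based on the RPC mentioned above; adjust them to the RPC your node actually exposes):

// Cargo deps (assumed): jsonrpsee with the "http-client" feature, tokio with
// "macros"/"rt-multi-thread", and serde_json.
use jsonrpsee::core::client::ClientT;
use jsonrpsee::http_client::HttpClientBuilder;
use jsonrpsee::rpc_params;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed local HTTP RPC endpoint of the node being checked.
    let client = HttpClientBuilder::default().build("http://127.0.0.1:9933")?;

    // The result reports how many top/child trie items still need migrating.
    let status: serde_json::Value = client
        .request("state_trieMigrationStatus", rpc_params![])
        .await?;

    println!("trie migration status: {status}");
    Ok(())
}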

If there are items to migrate in the result, then those nodes indeed need to be migrated before warp sync can be used.

Changing a node from state v0 to v1 just requires rewriting the value.

That is what the migration pallet does, see: https://hackmd.io/JagpUd8tTjuKf9HQtpvHIQ. Note that the migration rewrites all state values (we cannot tell whether a node is v0 or v1 from the runtime).

There are two scenarios: the manual (signed extrinsic) migration and the automatic migration.

It should be noted that the manual process for a parachain requires specific rights for a parachain account and an external process that emits the extrinsics. This is needed to ensure the migration does not get stuck on an oversized block (for instance, if the cursor processes 10 values per batch and the first nine bring the block close to the size limit while the 10th pushes it over, then the 10th value will always end up in the PoV and every PoV will be oversized; when doing it manually, the blocks with the oversized extrinsic simply cannot be produced and an extrinsic with fewer values can be submitted instead).
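
The stuck-batch reasoning can be shown with a small self-contained toy (all numbers are made up): an automatic batch that always advances by a fixed number of items keeps including the oversized value and never fits the PoV budget, while a manually submitted extrinsic covering fewer values does.

const POV_LIMIT: usize = 100; // size budget per block, arbitrary units

// Total size of a batch migrating the first `count` values.
fn batch_size(values: &[usize], count: usize) -> usize {
    values.iter().take(count).sum()
}

fn main() {
    // Nine small values close to the budget, plus one oversized value.
    let mut values = vec![10usize; 9];
    values.push(50);

    // Automatic migration: the cursor always takes 10 items per block, so the
    // oversized 10th value is always included and every PoV is too big.
    let auto = batch_size(&values, 10);
    assert!(auto > POV_LIMIT, "the automatic batch never fits");

    // Manual migration: the operator submits an extrinsic covering fewer
    // values, which fits within the block.
    let manual = batch_size(&values, 9);
    assert!(manual <= POV_LIMIT, "a smaller manual batch fits");

    println!("auto batch = {auto}, manual batch = {manual}, limit = {POV_LIMIT}");
}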

Personally, if using the manual migration is a no-go (for instance, if giving a given account the right to emit these extrinsics is a no-go), I would run the automatic migration with a modification:

This obviously only works if there are not too many of these big values, and if the chain will never create new keys with such big values (they could not be skipped if unknown when we include the pallet). These changes to the pallet are not too difficult, I think.

Regarding testing the migration, I would suggest doing so with try_runtime.

shunsukew commented 1 year ago

Thank you so much for the detailed info.