tmpolaczyk closed 5 days ago
```diff
@@ Coverage Diff @@
##           master   tomasz-refactor-db-removal-restart    +/-  ##
===================================================================
+ Coverage   67.10%                               67.25%   +0.15%
+ Files         253                                  255       +2
+ Lines       44082                                44273     +191
===================================================================
+ Hits        29577                                29772     +195
- Misses      14505                                14501       -4
```

| Files Changed | Coverage | |
|---|---|---|
| /node/src/container_chain_spawner.rs | 44.89% (-0.45%) | 🔽 |
| /runtime/dancebox/tests/common/xcm/core_buyer.rs | 100.00% (+9.16%) | 🔼 |

Coverage generated Mon Jul 15 11:40:07 UTC 2024
Several improvements to the `spawn` function based on what we have seen on testnet, plus some refactors to make the code cleaner. Enabling "Hide whitespace" in the Files changed view makes this easier to review.
- Change `select_sync_mode` to return warp sync for an existing database. Even if we return warp sync it will still use full sync. (undid this change, see [1])
- Move `try_spawn` logic to a function instead of using a closure.
- Add `ContainerChainSpawnParams` to make cloning and passing to `try_state` easier.
- Change `spawn` to be a regular `async fn` instead of an fn that returns a boxed future.

Fix #486
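As a rough illustration of the `async fn` refactor, here is a minimal sketch of both signatures side by side. The `Spawner` type, the method bodies, and the tiny `block_on` executor are all hypothetical, for illustration only; the real `spawn` lives in `container_chain_spawner.rs` and does far more.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Spawner;

impl Spawner {
    // Before: an fn that returns a boxed future, allocated at every call site.
    fn spawn_boxed(&self, n: u32) -> Pin<Box<dyn Future<Output = u32> + Send + '_>> {
        Box::pin(async move { n + 1 })
    }

    // After: a regular async fn; the compiler generates the future type,
    // so no boxing is needed and the body reads like ordinary code.
    async fn spawn(&self, n: u32) -> u32 {
        n + 1
    }
}

// Tiny single-future executor, sufficient for futures that never pend.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    // Both shapes behave identically from the caller's point of view.
    assert_eq!(block_on(Spawner.spawn_boxed(1)), 2);
    assert_eq!(block_on(Spawner.spawn(1)), 2);
    println!("ok");
}
```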
Edit:
[1]: Because warp sync has some bugs and can get stuck, we decided to default to full sync if a database exists. Combined with the "Do not remove db if the block number is 0" change, this means that by default, if warp sync fails to sync within one session, or if the node is stopped manually while warp sync is in progress, the chain will use full sync when it restarts. By "warp sync" I mean the state sync part of warp sync; if the node is stopped while the block history download is in progress, nothing breaks.
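The fallback decision described in [1] can be sketched as follows. This is a reduced illustration under stated assumptions: `select_sync_mode` is the real function name, but its actual signature takes more context than a single boolean, and `SyncMode`, `keep_db_on_restart`, and the log text here are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum SyncMode {
    Warp,
    Full,
}

/// If a database already exists (e.g. a previous warp sync was interrupted
/// or failed to finish within one session), default to full sync instead of
/// retrying warp sync, which can get stuck.
fn select_sync_mode(db_exists: bool) -> SyncMode {
    if db_exists {
        // Full sync can be very slow for chains with a big state, so warn
        // collators loudly that they may not be able to sync in time.
        eprintln!(
            "error: existing database found, falling back to full sync; \
             remove the database manually to retry warp sync"
        );
        SyncMode::Full
    } else {
        SyncMode::Warp
    }
}

/// The companion "Do not remove db if the block number is 0" change: a
/// database whose best block is still 0 (warp state sync never completed)
/// is kept on restart, so the node resumes with full sync instead of
/// deleting the database and warping again.
fn keep_db_on_restart(best_block_number: u32) -> bool {
    best_block_number == 0
}

fn main() {
    assert_eq!(select_sync_mode(false), SyncMode::Warp);
    assert_eq!(select_sync_mode(true), SyncMode::Full);
    assert!(keep_db_on_restart(0));
    assert!(!keep_db_on_restart(5));
    println!("ok");
}
```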
Since using full sync can be very slow for chains with a big state, I added an error-level log so that collators are aware that they may not be able to sync in time. They can retry warp sync by stopping the node and manually removing the database, but if warp sync got stuck the first time, it will probably get stuck the second time as well. It is not clear why warp sync gets stuck, but it seems related to bootnodes banning collators with the error "Same state request multiple times". More investigation is needed.