Make `SyncingStrategy` abstract and allow developers to customize it

nazar-pc commented 4 weeks ago

Is there an existing issue?

[X] I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

[X] This is not a support question.

Motivation

Right now SyncingStrategy is created in the constructor of SyncingEngine and the whole thing assumes there are just 3 syncing strategies that transition like this: https://github.com/paritytech/polkadot-sdk/blob/6619277b74ed7227ed9015e4a72b8579a0786808/substrate/client/network/sync/src/strategy.rs#L520-L522

This is a problem for chains that both don't have/use Warp sync and/or have custom sync strategies (like Subspace/Autonomys does, we have two custom sync strategies already).

Assumption of existence of certain strategies makes it impossible to implement custom ones without forking Substrate and injecting custom logic in strategic places, which is hard to maintain and fragile long-term. It also frequently results in unexpected side-effects like https://github.com/paritytech/polkadot-sdk/issues/4607, https://github.com/paritytech/polkadot-sdk/issues/4407 and more.

Specific examples of strategies we have already implemented with hacks in Subspace/Autonomys:

Sync from DSN (Distributed Storage Network): we download and decode most of the blocks from archival history, import them sequentially before switching back to regular chain sync strategy
Snap sync: we download the first block of the last segment of archived history from DSN, download and verify corresponding state from one of the nodes, import block with corresponding state before switching back to regular chain sync (similar to Warp sync, except we do not intend to ever sync the gap afterwards)

Request

Make it possible to define custom sync strategies and transition sequence without forking

Solution

Probably the first step would be to extract a trait out of existing SyncingStrategy, then probably extract different strategies out of SyncingStrategy and make them composable, finally expose a way to configure custom syncing strategy instead of what is used by default.

Not sure what to do with build_network, it contains significant amount of boilerplate and while could be copy-pasted and modified in chain-specific codebase, it feels like there should be a better solution.

Open to suggestions on how to approach this so we can make progress and reduce the diff with upstream in our fork.

Examples of hacks that we had to introduce downstream to implement custom sync protocols:

Temporarily pausing Substrate's sync so it doesn't attempt to catch up while we do progress towards the same objective with our custom sync protocol, avoiding unnecessary network usage, errors and eventually banning due to https://github.com/paritytech/polkadot-sdk/issues/531 / https://github.com/paritytech/polkadot-sdk/issues/1915: https://github.com/autonomys/polkadot-sdk/commit/4670a2c4ae4e373bbf840f3b79f8aff728058ddd
API for sync strategy restart: https://github.com/autonomys/polkadot-sdk/commit/9897ffa7bbd74e82cab99f4388972208b2c4b78c
API for force-clearing block gap: https://github.com/autonomys/polkadot-sdk/commit/d5375125ca7c59fcc1ac72dc1b2ac251cbc80323
Expose state sync internals in order to be able to make the same requests ourselves: https://github.com/autonomys/polkadot-sdk/commit/de6a7e3963132b68e2c75f1d81e970f58e0e2e8d

These are all fragile workarounds that we'd really like to get rid of sooner rather than later.

cc @shamil-gadelshin

Are you willing to help with this request?

Yes!

nazar-pc commented 3 weeks ago

@dmitry-markin @skunert any opinion here?

dmitry-markin commented 3 weeks ago

Thanks for bringing this up! Actually, I was also thinking about composing syncing strategies instead of hardcoding transitions in proceed_to_next when I was working on this. This is something we need anyway when we introduce more syncing strategies after extracting gap sync and adding Sync 2.0. So, I would really appreciate your help here.

One thing to keep in mind is that some strategies should be able to run in parallel. This is the case for gap sync we have now (it was planned we extract it from ChainSync as a parallel sync startegy, and some prep refactoring was already done). This is also the case for planned polkadot-sdk Sync 2.0 strategy, which should run in parallel to ChainSync (Sync 1.0) until we completely phase out Sync 1.0.

Another thing that should be dealt with along the way is that currently the strategies use custom actions to communicate with protocols. This should be replaced with a possibility for a strategy to communicate with "its" protocols using opaque requests/responses and subscribing to opaque notifications.

Your proposed solution seems sound to me.

As for build_network issue, I don't have any thoughts right now, but we can plan this refactoring as well.

Overall, this is definitely the thing that should be implemented.

nazar-pc commented 3 weeks ago

Okay, so we are thinking in about the same direction then, which is good.

The first issue I see for composition is that everything is tied together through layers instead of being somewhat independent. Specifically different strategies already have their own implementations, but the actions they produce are proxied through SyncingStragegy into SyncingEngine and then results back through a bunch of purpose-built methods.

What I think would solve that is to have each strategy their own async fn run() -> FinishResult and make them own things like NetworkServiceHandle or ImportQueueService that they need for their operation. Then SyncingStragegy could become that first aggregator that simply drives a list of provided strategies to their completion and one one of them completes it can make a decision to exit or to start another strategy, etc.

Essentially individual strategies should become more or less opaque to the caller, caller should understand their internal operation or forward their requests/responses back and forth.

This is essentially how our custom strategies in Subspace protocol work, they own handles to their dependencies and interact with networking or other things as they see fit without Substrate having any idea about it. We'd be able to wrap those in structs and implement strategy trait in the future.

Does it make sense?

dmitry-markin commented 3 weeks ago

Let me think if we can make strategies completely opaque. May be we will need to manage strategies using the same protocol in some smart way. For example, if different strategies request the same blocks via block request protocol, the peer will be banned.

We also tried to issue max 1 in-flight request at a time to every peer at some point, but I haven't found this being enforced now. Probably we lifted this constraint, and don't need to take it into account anymore.

Do you have any thoughts what else (except async fn run()) should be in the strategy trait?

nazar-pc commented 3 weeks ago

Methods like is_major_syncing, num_peers, probably status, report_metrics, etc.

Not yet sure if there needs to be some more direct communication between lower and higher level strategies, but at least on the top level I don't think too many are necessary.

Well, I guess it might be more like spawning a worker for strategy and having a handle to interact with it, kind of like SyncingEngine vs SyncingService already split the API between them.

conr2d commented 3 weeks ago

I'm also interested in this issue regarding the potential implementation of a PoW chain using Substrate. Since there's no finalized state in a PoW context, the existing fast sync mechanism may not work effectively. However, if an abstraction of SyncingStrategy is introduced, it could enable syncing a PoW chain to a specific block number, similar to how Eth1 handles synchronization with a pivot point.

nazar-pc commented 3 weeks ago

I have tried different approaches and found something that is somewhat decent for the first step.

After https://github.com/paritytech/polkadot-sdk/pull/5410 I have another step before the actual abstraction.

New trait is introduced and initially implemented for both SyncingStrategy and ChainSync (other sync implementations are not really complete sync strategies and can't be used on their own, but ChainSync can be!).

There is still plenty of things to untangle, but I need to get reviews and submit things in sequence, so will appreciate reasonably quick turnaround with PRs if possible.

nazar-pc commented 2 weeks ago

I'd like to make more progress here, but can't open PRs until https://github.com/paritytech/polkadot-sdk/pull/5364, https://github.com/paritytech/polkadot-sdk/pull/5410 and https://github.com/paritytech/polkadot-sdk/pull/5431 are in. Let me know if there is anything I can do to move things forward faster.

nazar-pc commented 1 week ago

Another PR with initial SyncingStrategy trait introduction: https://github.com/paritytech/polkadot-sdk/pull/5469

nazar-pc commented 5 days ago

Have another set of changes that allows to use custom syncing strategies on top of https://github.com/paritytech/polkadot-sdk/pull/5469 that I can't open as a PR before https://github.com/paritytech/polkadot-sdk/pull/5469 is in.

With that syncing_strategy is an argument of build_network and any strategy can be used instead of the default one, all of the protocols are added by the strategy constructor, so they are under full control of the developer now. Also StrategyKey is an opaque type instead of an enum with a finite set of options, so custom strategies can define their own keys if they need to without modifying Substrate.

The next step will be to make SyncingEngine unaware of different sync types by consolidating different request and response handling methods. For example to remove SyncingStrategy::on_warp_proof_response() since it is too specific and have something generic instead (especially since we already have StrategyKey to distinguish what purpose each response has), but I'm not there yet and will wait for other changes to go in first.

nazar-pc commented 5 days ago

I think I finally found a good pattern to remove block/state/warp requests/responses handling from SyncingStrategy public interface. Existing SyncingAction::CancelRequest will simply be accompanied by SyncingAction::StartRequest and responses are all generic and up to strategy to decode/interpret. I quite like how things are looking so far.

paritytech / polkadot-sdk