paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Polkadot Doppelganger #4230

Open bkchr opened 2 months ago

bkchr commented 2 months ago

This issue is for presenting and collecting ideas for creating a so-called "Polkadot Doppelganger". This should be a network that clones the real network and can then be used to test runtime/node interactions while staying as close as possible to the real network.

Introduction

We want to increase the number of tests we run as part of releasing runtimes. We already have try-runtime, which runs all runtime upgrades against the latest state of a chain. The problem with this approach is that you don't know if Polkadot is still able to produce blocks afterwards, or if parachains continue to work. Thus, we need a more sophisticated solution. Similar to try-runtime, these tests should run against the latest state, or better, fork off from the latest on-chain state. These tests should be longer running; ideally they would run for at least one full era to ensure that session switching works, elections work, etc. It should also be possible to have multiple nodes running. This is important to ensure that things like parachains work as intended. In the end, the environment should be as close as possible to the real-world network.

Forking off

To fork off, the easiest solution would probably be to first warp sync to the tip of the chain. This would need to be done by only one node. The other nodes of the test network would then warp sync from this "primary node". As we need to change some of the state after the warp sync, doing it any other way would require coordination between all these nodes, which we get for "free" by letting the other nodes sync from the primary node. The best approach would be to introduce a custom BlockImport. Something in the direction of:

struct DoppelgangerImport<Inner> {
    inner: Inner,
}

impl<B: BlockT, Inner: BlockImport<B>> BlockImport<B> for DoppelgangerImport<Inner> {
    fn import_block(&mut self, block: BlockImportParams<B>) {
        // This means the state import of warp sync is ready.
        if block.with_state() {
            // This function doesn't exist (yet).
            let mut state = block.take_state();

            // Override all important keys in the state,
            // at least BABE, Grandpa, parachains etc.
            state.override_keys();

            // We probably also need to touch staking to ensure
            // that our fake keys get re-elected etc.
            // Generally, anything that needs to be fixed up should be done here.

            // Don't forget to change the storage root of our faked block.

            self.inner.import_block(state.to_block())
        } else {
            self.inner.import_block(block)
        }
    }
}

This BlockImport needs to be put in front of Babe and Grandpa to ensure that they also fetch the correct authorities from the on chain state.
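The layering can be illustrated with a toy decorator chain (plain Rust, not the actual Substrate `BlockImport` trait; all names below are invented for illustration): the outermost import sees the block first, so placing the Doppelganger wrapper in front of the Babe and Grandpa imports lets it rewrite the state before either of them reads it.

```rust
// Toy model of a block import chain; the real sc_consensus::BlockImport
// trait is async and far richer. This only illustrates the call order.
trait ToyImport {
    fn import(&mut self, trace: &mut Vec<String>);
}

// The innermost layer, standing in for the client that persists the block.
struct Leaf;
impl ToyImport for Leaf {
    fn import(&mut self, trace: &mut Vec<String>) {
        trace.push("client".into());
    }
}

// A named wrapper that records its tag before delegating to the inner import.
struct Wrap<I> {
    tag: &'static str,
    inner: I,
}
impl<I: ToyImport> ToyImport for Wrap<I> {
    fn import(&mut self, trace: &mut Vec<String>) {
        trace.push(self.tag.into());
        self.inner.import(trace);
    }
}

fn main() {
    // Doppelganger wraps Babe, which wraps Grandpa, which wraps the client:
    // the Doppelganger layer therefore sees (and may rewrite) the block first.
    let mut chain = Wrap {
        tag: "doppelganger",
        inner: Wrap { tag: "babe", inner: Wrap { tag: "grandpa", inner: Leaf } },
    };
    let mut trace = Vec::new();
    chain.import(&mut trace);
    assert_eq!(trace, ["doppelganger", "babe", "grandpa", "client"]);
}
```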

After the state is imported, the node will fail to import other blocks from the network. This is expected, as we changed the storage root, block hash, etc. The easiest solution after this is done is probably to let the node internally stop the "real Polkadot node" and switch to our Doppelganger implementation. The main point of this is that we don't want the other nodes in the test network to connect to real nodes in the Polkadot network. One simple approach would be to let the node use the default p2p port at startup, but tell all the other nodes the Doppelganger port. Thus, the other nodes will only be able to connect once the primary node has done the switch. To detect when to do the switch, we can probably just listen for the block import event in the node.

Doppelganger Service

To have a Doppelganger node, we probably only need to override the Crypto host functions. These host functions are used to validate signatures in the runtime. As we want to be able to control any account, we should override them to always return true:

#[runtime_interface]
pub trait Crypto {
    fn ed25519_verify(sig: &ed25519::Signature, msg: &[u8], pub_key: &ed25519::Public) -> bool {
        true
    }

    #[version(2)]
    fn sr25519_verify(sig: &sr25519::Signature, msg: &[u8], pub_key: &sr25519::Public) -> bool {
        true
    }

    #[version(2)]
    fn ecdsa_verify(sig: &ecdsa::Signature, msg: &[u8], pub_key: &ecdsa::Public) -> bool {
        true
    }
}

When declaring the Executor in the service, we need to put our custom host functions last to make them override the default ones:

WasmExecutor<(sp_io::SubstrateHostFunctions, custom::HostFunctions)>,
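The override mechanism can be mimicked with a plain registry where a later registration under the same name shadows an earlier one (a toy sketch in plain Rust, not the actual `WasmExecutor` machinery; `build_registry`, `real_verify`, and `fake_verify` are invented names):

```rust
use std::collections::HashMap;

// Toy host-function registry: function sets are registered in tuple order,
// and a later registration with the same name shadows an earlier one.
// This mirrors how the custom Crypto functions replace the defaults.
fn build_registry(
    sets: &[&[(&'static str, fn() -> bool)]],
) -> HashMap<&'static str, fn() -> bool> {
    let mut registry = HashMap::new();
    for set in sets {
        for (name, f) in *set {
            registry.insert(*name, *f); // later insert wins
        }
    }
    registry
}

fn real_verify() -> bool { false } // stands in for real signature checking
fn fake_verify() -> bool { true }  // the Doppelganger override

fn main() {
    let defaults: &[(&'static str, fn() -> bool)] = &[("sr25519_verify", real_verify)];
    let custom: &[(&'static str, fn() -> bool)] = &[("sr25519_verify", fake_verify)];
    // The custom set comes last, so it overrides the default:
    let registry = build_registry(&[defaults, custom]);
    assert!(registry["sr25519_verify"]());
}
```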

For the rest of the service we should be able to reuse what is being used for Polkadot.

How to use it?

When the network is running, we should be able to send transactions from any account. This will be required, for example, to pass some OpenGov proposals by letting a lot of DOTs vote. The best approach for this is probably to scrape the on-chain state and, when it comes to casting votes, work from the biggest bags down to the smallest.
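Picking voters "biggest bags first" is just a descending sort over the scraped balances, taking accounts until enough voting power is gathered. A minimal sketch (account names, balances, and the threshold are made up; `pick_voters` is an invented helper):

```rust
// Given scraped (account, balance) pairs, select the largest holders first
// until the accumulated balance crosses the support we want for the proposal.
fn pick_voters(mut accounts: Vec<(String, u128)>, needed: u128) -> Vec<(String, u128)> {
    accounts.sort_by(|a, b| b.1.cmp(&a.1)); // biggest bags first
    let mut picked = Vec::new();
    let mut total = 0u128;
    for (who, balance) in accounts {
        if total >= needed {
            break;
        }
        total += balance;
        picked.push((who, balance));
    }
    picked
}

fn main() {
    let scraped = vec![
        ("alice".to_string(), 5_000u128),
        ("bob".to_string(), 120_000),
        ("charlie".to_string(), 40_000),
    ];
    // bob (120k) + charlie (40k) already cross 150k, so alice is not needed.
    let voters = pick_voters(scraped, 150_000);
    assert_eq!(voters.len(), 2);
    assert_eq!(voters[0].0, "bob");
}
```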

Parachains

For parachains we will need a similar setup. But it should be less involved, as we will only need one collator running to produce the blocks. However, things like overwriting the keys will also be required, and we also want to be able to send transactions from each account. Then we can fake Fellowship proposals to whitelist things and get them approved quite fast.

To start with, only having the system chain running should be good enough. We only want to test that the runtime works together with the node as expected.

Faking time

Generally, it would probably be nice to fake time as well, especially as the test network would otherwise need to run for X days until some OpenGov proposal in the test network is approved. We can build blocks as fast as we want if we feed in some faked time. As we control all the nodes, this isn't a problem. However, the problem is that the parachain consensus is not able to keep up with this, and we want to stay as close to reality as possible. But maybe we can just "warp" across certain time frames to speed up OpenGov; during this time no parachain would be progressing. We could also override the on-chain runtime with a custom one that has certain delays reduced. However, this would probably make the integration into CI more complicated, as you would need to figure out which runtime code corresponds to your on-chain runtime, etc.
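The "warp" idea boils down to a clock that the test harness can jump forward while the nodes keep reading it as if it were wall-clock time, e.g. to feed the timestamp inherent. A minimal sketch (the `FakeClock` type and its methods are invented for illustration, not an existing Substrate API):

```rust
use std::time::Duration;

// A clock that starts from a fixed timestamp and can be warped forward,
// so blocks can be authored for future slots without actually waiting.
struct FakeClock {
    now_ms: u64,
}

impl FakeClock {
    fn new(start_ms: u64) -> Self {
        Self { now_ms: start_ms }
    }

    // What a test node would feed into the timestamp inherent.
    fn now_ms(&self) -> u64 {
        self.now_ms
    }

    // Jump across a quiet period, e.g. an OpenGov voting window.
    fn warp(&mut self, by: Duration) {
        self.now_ms += by.as_millis() as u64;
    }
}

fn main() {
    let mut clock = FakeClock::new(1_700_000_000_000);
    clock.warp(Duration::from_secs(7 * 24 * 60 * 60)); // skip a week
    assert_eq!(clock.now_ms(), 1_700_604_800_000);
}
```

Since all nodes are under our control, they can all read the same warped clock; the hard part the text mentions, keeping parachain consensus consistent across the jump, is not addressed by this sketch.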

Why not regenesis?

We could also do this as a regenesis. However, this would require resetting the block number and other things. It is closer to the real world if we can continue from where we forked off.

What else?

This issue is probably missing quite a lot and should be used as a starting point. It gives some ideas for things we need to work around to make the "fake network" work. There are probably more things that I forgot to write down here and that we will discover while solving this. Please leave comments on what should be added or what is missing.

This Doppelganger should not be a solution for all testing purposes. I still think that for testing XCM across the entire network, the work laid out here is better suited. It is way easier to just fork off all the runtimes with chopsticks and let it progress the runtimes individually, instead of running a full-blown node for every chain.

burdges commented 2 months ago

Ain't clear if "doppelganger" really makes sense: Who runs the node? How much do differences matter? Does minimizing differences represent the adversarial environment of the real polkadot? etc.

We should maybe discuss the real testing priorities: I suppose this issue concerns the testing priority of minimizing polkadot downtime, but extreme costs or overconfidence vs adversarial environments matter too.

Also, is it maybe simpler to ask in what ways kusama is not an ideal production-test hybrid? And whether those ways can be improved?

All that said..

Idea 1.

As in the community testnet, we could have a testnet that scales down the approval checks, but pulls in real parachain blocks from polkadot collators.

We regenesis this testnet whenever we like, but keep polkadot's genesis hash, and simply fake the validator set transition somewhere. We run the same number of parachain cores as polkadot, run fake collators that pull in all real backed polkadot parachain candidates, and then do our own backing & approval checks on them, but we just do many fewer approvals for them.

Good: Avoids asking parachain teams to do anything. It's cheap in CPU time.

Bad: It's maybe a lot of engineering to make it work. Also, the whole testnet dies quickly once some parachain candidates cannot be approved, which makes it expensive in devops time. A problem like #4226 would likely mean devops doing repeated regenesis. Worse, it only tests weirdness coming from real parachains, not weirdness coming from validators.

Idea 2.

I think idea 1 sounds too expensive, but there are maybe simpler schemes which still cover only weirdness coming from real parachains, maybe still expensive in engineering, but not so expensive in devops. We do not even need a full testnet; maybe just run a bunch of random backing/approval jobs on the new runtime/host? That's maybe not good enough.

Idea 3.

After (1) we're happier with approvals performance, (2) we complete the multi relay chain analysis, (3) we push polkadot upwards to 768-1024 validators, (4) we add better randomness, and (5) there is enough relay traffic to demand more parachain cores, then we could (6) migrate to multiple relay chains. At that point, we could consider whether upgrades could be rolled out in a staged way at different times on different relay chains. If relay chains A and B both have 1024 validators but A has 300 busy parachain cores of work, while B only has 50 cores of work, then we might choose to roll out non-messaging upgrades first on B and only later on A, although validators jump back and forth between A and B randomly each era.

Anything like this is obviously way off.

seunlanlege commented 2 months ago

All of this is already implemented in polytope-labs/sc-simnode

bkchr commented 2 months ago

Also, is it maybe simpler to ask in what ways kusama is not an ideal production-test hybrid? And whether those ways can be improved?

This was triggered on Kusama as well and was fixed, but not 100% correctly. While Kusama is "expect chaos", not all people like it this way. I mean, it is fine to test stuff out there, especially stuff that requires playing around with economic incentives. However, even for the economic stuff we have seen that people don't care that much lately; think about validators not upgrading and still getting nominations, etc. Stuff that can be tested before it gets to Kusama should be tested. Kusama is then more like a real-world, non-uniform test environment, but we should prevent bringing Kusama down ;)

Generally, the idea is to be as close as possible to the real environment to test this. We already have test networks and could also use them to do more weird shit, but in the end you never know if they are in the same state as the real network. Even if you do what I wrote here and run your tests, you are still not 100% sure, but you are closer to the real network. You probably still miss issues that arise from network latency, etc. Nothing is perfect; we can just move closer to it.

Idea 3.

After (1) we're happier with approvals performance, (2) we complete the multi relay chain analysis, (3) we push polkadot upwards to 768-1024 validators, (4) we add better randomness, and (5) there is enough relay traffic to demand more parachain cores, then we could (6) migrate to multiple relay chains.

This is an interesting idea and would actually be cool. However, it requires a lot of overhead, starting with bridging, etc. Nothing impossible, but it takes way longer. Then the question also arises which relay chain you want to upgrade first :D And you could still run into issues, because there are differences between all these chains that could trigger different behaviors/bugs.

bkchr commented 2 months ago

All of this is already implemented in polytope-labs/sc-simnode

* We don't need to mess with block import. We simply fake the required consensus digests.

* Signature verification host functions are overridden

* Block production can be controlled over RPC, so you can create a lot of blocks to "fast-forward" the chain or revert blocks to rewind the chain.

* It's compatible with both Polkadot and parachains. (It was originally built for Polkadot)

Yes, I know that this exists. However, it doesn't really fulfill the needs required here. The first problem is that the parachain consensus (I know, confusing naming; I don't mean the logic for a parachain to build blocks, but the logic running on the relay chain side that backs/approves/disputes etc.) is not covered. It is also quite big, and it would require rewriting some parts of it to accept incorrect signatures, but then we again have some custom code path. Generally, the entire machinery should be as close as possible to the real network. We want to test the interactions between runtime and node. This should be done by mocking as little as possible. I mean, we already have test nets etc., but they don't always help in discovering all the bugs, so we should go as close as possible to the real world. Simnode has its use cases and I'm aware of it. I don't think it really fits here, because it is built for testing on-chain logic, like voting etc., but here we want to test the entire "machinery".

Polkadot-Forum commented 2 months ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2024-04-21-polkadot-parachains-stalled-until-next-session/7526/1

pepoviola commented 2 months ago

Hi @bkchr, sounds great, and I think we need some kind of fork-off tool. We have touched base about this with zombienet several times, but the idea of building a node that bypasses the signature validation was a stopper at that time. I think we can help with the whole orchestration from zombienet, following the process you described :)

burdges commented 2 months ago

This is an interesting idea and would actually be cool.

Alright another idea..

We create a special test class of parachain on kusama & polkadot themselves, which works exactly like other parachains, including messaging, except we only receive notifications about test parachain behavior, like dispute attempts or whatever, and test parachains cannot halt relay chain progress, meaning:

We need a security boundary here, like either

We do staged roll outs in which we first enable new behavior for test parachains, spin up some glutton test parachains, and inspect how they behave.

We'd see the new host & runtime behavior under real network conditions this way, including real validators. I'd expect treasury buys test parachain coretime, probably at some reduced rate.

I've not yet thought through what can & cannot be tested in this way, but intuitively test parachains should provide fairly good coverage.


As for downsides, we do close one door by testing in this way..

We typically have both old & new code existing together in the runtimes & hosts now, so that's not really new, although now this transition logic must respect whatever security boundary we choose here, maybe easy if purely governance.

In principle though, one could envision thinner migration strategies where runtime A hands off to runtime AtoB that then hands off to runtime B. Relay chains cannot employ migration strategies like this, because both A and B must co-exist in the same relay chain runtime so that test parachains can use B while real parachains use A.

rzadp commented 1 month ago

Hey @bkchr, I'm trying to fully understand this issue. I have two questions:

  1. We're talking about faking time and sending votes in order to pass OpenGov proposals. But since we're bypassing signatures, would it make sense to sudo an extrinsic instead - directly execute what would normally happen upon enactment of the referendum? I realize that it takes us a step away from the real environment, but so does faking time.

  2. What should be under test?

You speak of tests as part of releasing runtimes. If I get it right, it boils down to: runtime release candidate => fork off real chain => apply runtime upgrade => make sure it works

But a runtime is one artifact of the releases; the binaries are another.

Should we also be considering as part of this issue:

binary release candidate => fork off real chain using new binaries => make sure it works?

Based on my (limited) understanding, this kind of test does not exist at the moment (with real nodes, so discounting chopsticks) - please correct me if I'm wrong.

bkchr commented 1 month ago
  1. But since we're bypassing signatures, would it make sense to sudo an extrinsic instead

These chains don't have any sudo and thus, we cannot use it.

But a runtime is one artifact of the releases; the binaries are another.

The idea as presented above tests both the runtime and the node, with as few changes as possible to the node.

burdges commented 1 month ago

Do we know how much time gets spent in Crypto vs hashing?

We cannot just return true in hashing, since runtimes compare results. This'll be true of future crypto too, but that's not necessarily a problem if this is meant to test the relay chain and a few core system parachains.

bkchr commented 1 month ago

Returning true for signature checking is only done to be able to fake any kind of extrinsic. This enables you to control any account, which is quite handy ;)

pepoviola commented 1 month ago

Hi @bkchr, I want to start working on this, but first I want to ask a couple of questions to be sure that I'm on the right path:

  1. The goal is to produce a new node that can handle the process of:

    • Sync the network (warp sync) and, once it is synced, shut down the node.
    • Start a new process (node) with the Doppelganger logic implemented, which includes a custom block builder that manages the state key overrides and bypasses the crypto signature checks.

    Should all this be handled as inner process logic? Or

  2. Could we detach the steps? That means: start a regular node to sync from the network and, once it is synced, just shut it down and start a new doppelganger node (using the same database directory)?

Thanks!!

bkchr commented 1 month ago

2. Could we detach the steps? That means: start a regular node to sync from the network and, once it is synced, just shut it down and start a new doppelganger node (using the same database directory)?

You can also detach the steps.

xlc commented 1 month ago

with remote-externalities, we don't need to sync?

bkchr commented 1 month ago

Yeah good idea! This should also make it easier to override some of the keys in the state.

However, we would still need to "trick" the consensus systems into assuming we have done a warp sync, so that they set up their state based on the latest block. We could maybe let the node import some "fake block", as happens when we import the finalized state after a warp sync. All in all, I'm not sure remote externalities make it that much easier, but it's worth a look for sure. Remote externalities would also not give us the ability to restart a node, for example, as they keep everything in memory.