paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

The road to the native runtime free world #62

Open bkchr opened 2 years ago

bkchr commented 2 years ago

The Vision

We want to get rid of the native runtime. Initially, the native runtime was added to overcome the performance "problems" of using wasmi for runtime execution. However, with wasmtime we are only around 2x slower than native. While this still isn't perfect, it is far faster than wasmi, which is around 10x slower (all depending on the workload etc.). While performance plays an important role here, there are also other things to consider, like ensuring that the native and wasm runtimes are deterministic. We have already run into these determinism problems with Polkadot. While we can ensure that the exact same compiler version is used for building the Polkadot binary in our release process, we cannot ensure that everyone else uses the same compiler version as we used for the wasm build.

There is also a nice write up and some discussions here: https://github.com/paritytech/substrate/issues/7288

The Plan

To remove the native runtime, we first need to fix a few things:

  1. Make the wasm memory growable. We have already run into an OOM situation. This happened because we only allow a fixed amount of memory per wasm instance. We temporarily solved the problem by running the native runtime, which can just use the entire system memory, and afterwards we released a new Polkadot node with an increased memory budget. So, we were lucky that the native runtime saved our ass, but if we remove it, it will not be there anymore. Therefore, we should support growing the wasm memory. It can still allocate at most 4 GiB in the end, but that should be more than enough for all our use cases, and combined with proper monitoring we should be informed before we hit this upper limit.
  2. Remove the runtime api requirement of statically checking that a runtime api is implemented. Currently, when trying to use a runtime api, it is required to tell the compiler that a certain type implements that runtime api. This was done on purpose to give people better hints at compile time when they have not implemented a certain api, but it only works by checking the native runtime (because we cannot tell the compiler to check the wasm file). So, if we want to remove the native runtime, we also need to remove this requirement of having the runtime api implemented for a type. This can be solved by implementing the runtime apis for a different trait that will then be able to call into the runtime. We will also need to change some other things, like checking that a runtime api is implemented on startup instead of failing later while the node is running. https://github.com/paritytech/polkadot-sdk/issues/27
  3. Genesis is currently created using the native runtime. We will also need to fix this, maybe with an extra tool. https://github.com/paritytech/polkadot-sdk/issues/25
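The growth policy in point 1 can be sketched roughly as follows. This is a hypothetical illustration (names and numbers invented, not the actual executor code), mirroring the semantics of wasm's `memory.grow` together with the 4 GiB wasm32 cap mentioned above:

```rust
// Hypothetical sketch of a growable-memory policy (illustrative names and
// numbers, not the actual executor code). Wasm32 memory comes in 64 KiB
// pages and is capped at 4 GiB of address space.
const PAGE_SIZE: u64 = 64 * 1024;
const MAX_PAGES: u64 = (4u64 << 30) / PAGE_SIZE; // 65_536 pages = 4 GiB

/// Tracks the allocated pages of one runtime instance.
struct InstanceMemory {
    pages: u64,
}

impl InstanceMemory {
    /// Grow by `additional` pages, mirroring `memory.grow` semantics:
    /// returns the previous page count, or an error at the 4 GiB cap.
    fn grow(&mut self, additional: u64) -> Result<u64, &'static str> {
        let new_pages = self.pages + additional;
        if new_pages > MAX_PAGES {
            return Err("would exceed the 4 GiB wasm32 address space");
        }
        let old = self.pages;
        self.pages = new_pages;
        Ok(old)
    }
}

fn main() {
    // Start from a modest budget and grow on demand instead of hitting OOM.
    let mut mem = InstanceMemory { pages: 2048 }; // 128 MiB initial budget
    assert_eq!(mem.grow(2048), Ok(2048)); // doubled to 256 MiB
    // The hard wasm32 limit still applies, so monitoring remains important.
    assert!(mem.grow(MAX_PAGES).is_err());
}
```

The point of the sketch: instances start small and grow on demand, so an unexpected allocation spike degrades into a recoverable grow instead of an immediate OOM, while the 4 GiB ceiling is still enforced.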

Open Questions

If you want to help us out and contribute to this issue, this section contains open questions and tasks where we would appreciate any input.

There may still be other things to consider and fix along the way, like only compiling the runtime for wasm instead of the current native & wasm dual build.


Here you can find the board with specific sub-tasks to this milestone: https://github.com/orgs/paritytech/projects/18/views/9

burdges commented 2 years ago

As an aside, not strictly relevant here: we do envision having "runtime libraries" eventually, like SPREEs or perhaps large verifier keys à la halo2, which might be implemented in another way, like a flat file of data, or, if they undergo closer auditing, eBPF or native code loaded via dlopen, or a native generator for a verifier key. Afaik all nodes should use an identical deployment scheme, so we'd never walk back the consensus improvement described here.

bkchr commented 2 years ago

Okay, sounds interesting. I always thought that SPREE would be different wasm files that are just linked at runtime?

Do we really want to load native code? Doesn't this destroy the entire advantage of using wasm? I mean, then we could have done that from the beginning and could have distributed the runtime as a native library.

burdges commented 2 years ago

I've too little grasp about what "wasm files linked at runtime" means, but the SPREE would require process-like separation from the main PVF, and the SPREE invokes hostcalls that modify the SPREE state, for which the main PVF lacks permission.

At least in my mind that's impossible with "linked at runtime" so instead the SPREE exposes some crate linked at build time, but then itself creates new hostcalls or more SPREE-calls available to the whole PVF including its own part.

Yes, SPREEs could be WASM but maybe 2x matters enough that a common enough SPREE would be native and well audited, or maybe doing those native builds limits our hardware choices too much? I donno..

Do we really want to load native code?

I donno, it's plausible we'd never want native code loaded, but much here remains unclear to me, like the performance hit eBPF takes.

I only raised the comment in case anything of this impacts how you remove the native runtime. :)

bkchr commented 2 years ago

I've too little grasp about what "wasm files linked at runtime" means, but the SPREE would require process-like separation from the main PVF, and the SPREE invokes hostcalls that modify the SPREE state, for which the main PVF lacks permission.

Yeah, that makes sense and I did not think about it, but this could just be a host call to communicate with the SPREE, and the SPREE could be anything.

Aka we could have SPREEs in wasm by default and could provide natively compiled SPREEs that ship with the node. If the requested SPREE version is not supported by the node, we could just run the wasm version. However, I think whatever we do, it is orthogonal to removing the native runtime.

I only raised the comment in case anything of this impacts how you remove the native runtime. :)

That is always a good idea nevertheless :)

Wizdave97 commented 2 years ago

@bkchr Would host functions be removed entirely alongside the native runtime? Or would there be a workaround for that?

bkchr commented 2 years ago

Host functions can not be removed and will not be removed.

Garandor commented 2 years ago

A question on this popped up at https://github.com/Manta-Network/Manta/pull/701

The gist of it is: There already exists a commandline flag to force WASM execution: --execution wasm

While stopping validators on a nondeterminism issue is certainly bad and should be avoided (e.g. by using that same CLI flag), it might be tolerable for other full-node network participants, as downtime of their functions (serving RPC, in our case) is not nearly as critical to the network as a chain stall.

We might actually want to optimize non-critical nodes for the happy case where everything works, and have the 2x speedup in runtime execution result in a higher number of RPCs served per node. (I have no numbers on the actual performance impact yet; I'm using the 2x number from @bkchr's initial description here. The actual gain will, I guess, also depend on the weight used per block on average.)

"Native runtime free", as on the tin, would remove this option of tolerating stability issues for a performance gain, so IMHO it's not desirable, since you can already force the WASM runtime with the above CLI flag.

bkchr commented 2 years ago

Did you benchmark the execution? And do you have that many calls into the runtime on your RPC nodes?

Garandor commented 2 years ago

Did you benchmark the execution? And do you have that many calls into the runtime on your RPC nodes?

No benchmarks yet. I can only intuit at this point that, in the case of full-ish blocks and benchmark-scale machines, a 2x performance penalty would hurt RPC throughput.

We do expect high RPC load on our nodes through custom wallet syncing logic in the near future. I realize these RPC serving nodes can be easily scaled horizontally, but what I'm arguing here is that this change would condemn non-critical network participants to waste resources for a gain in stability which may not be strictly necessary.

A proposal could be to force --execution wasm on validators but leave --execution native_else_wasm as default on full nodes, leaving them to opt into wasm execution with --execution wasm.

tomaka commented 2 years ago

A "2x" slowdown estimate in the runtime execution speed does not in any way translate into a 2x slowdown in the number of RPC requests that can be served. The slowdown obviously depends on the kind of requests you get, but I would expect it to be basically 0. Until recently, PolkadotJS didn't even expose the possibility of calling into the runtime through the RPC layer.

got no numbers on the actual performance impact yet

I don't think it is reasonable to argue that the performance hit is not worth the removal of the native runtime without having any number to back this up.

Garandor commented 2 years ago

I don't think it is reasonable to argue that the performance hit is not worth the removal of the native runtime without having any number to back this up.

I agree, which is why I'm pointing to the general problem of wasting CPU cycles here. Wallet RPC is just our use case. Other people will be running indexing (e.g. subsquid) on their full nodes; they're also impacted by this change. How much so, I do not know.

but I would expect it to be basically 0.

Great if true, and then I have no more issue with this. My basis for argument is the assumption that a node executing block weight close to the benchmark limit will have a high CPU load if it's exactly on spec. That would leave little free resources to handle RPCs into the runtime. If that same node now had ~50% idle CPU because native execution finishes faster, it'd have plenty of capacity for "extra" RPC work. Please correct me if this assumption is wrong.

tomaka commented 2 years ago

If the runtime execution speed was suddenly divided by 2, then the weights would be multiplied by 2 in return.

The CPU usage problem that you describe is indeed correct, but it's independent of the time it takes to execute the runtime. No matter whether it takes 5 microseconds or 5 milliseconds to verify for example a balance transfer, the weights are adjusted so that a node running on the reference machine will use 100% CPU if the chain is at maximum capacity.

The time it takes for the runtime to execute influences the maximum number of transactions per second of a chain, but not (or very little) the CPU usage of the nodes.
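The argument above can be made concrete with toy numbers (all figures hypothetical, not real Polkadot benchmarks): if execution becomes 2x slower, the benchmarked weights double, so a full block fills the same time budget either way; only the maximum throughput changes.

```rust
// Toy illustration (all numbers hypothetical): weights track measured
// execution time, so a slower executor means fewer transactions per block,
// not higher CPU usage at full capacity.
fn txs_per_block(budget_ns: u64, cost_per_tx_ns: u64) -> u64 {
    budget_ns / cost_per_tx_ns
}

fn main() {
    let budget_ns: u64 = 2_000_000_000;         // e.g. 2 s of execution per block
    let native_cost_ns: u64 = 5_000;            // hypothetical cost of one transfer
    let wasm_cost_ns: u64 = 2 * native_cost_ns; // ~2x slower under wasm

    let native_txs = txs_per_block(budget_ns, native_cost_ns);
    let wasm_txs = txs_per_block(budget_ns, wasm_cost_ns);

    // Throughput halves under wasm...
    assert_eq!(wasm_txs * 2, native_txs);
    // ...but a full block consumes the whole budget (~100% of one core) either way.
    assert_eq!(native_txs * native_cost_ns, budget_ns);
    assert_eq!(wasm_txs * wasm_cost_ns, budget_ns);
}
```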

Garandor commented 2 years ago

No matter whether it takes 5 microseconds or 5 milliseconds to verify for example a balance transfer, the weights are adjusted so that a node running on the reference machine will use 100% CPU if the chain is at maximum capacity.

There's some subtlety here. Weights are always benchmarked with the WASM runtime, as it's obviously important for validators - who always use it - to not overflow execution time.

Full nodes executing the native runtime thus use weights that were created with the WASM runtime, but execute them 2x faster => lower CPU usage on full nodes executing native assuming reference hardware.

wigy-opensource-developer commented 2 years ago

My basis for argument is the assumption that a node that's executing block weight close to the benchmark limit will have a high CPU load if it's exactly on spec. That would leave little free resources to handle RPCs into the runtime. If that same node now had ~50% idle CPU because of native execution finishing faster, it'd have plenty capacity for "extra" RPC work. Please correct me if this assumption is wrong

You also need to consider how much of the RPC requests are querying recent state and how much of them are interested in "archive" data. The current native approach only helps if most of the calls are querying recent data, and the node is upgraded soon after an on-chain runtime upgrade. But yeah, we need hard benchmarking data to see how much extra resources are needed to avoid this native runtime optimization.

tomaka commented 2 years ago

You also need to consider how much of the RPC requests are querying recent state and how much of them are interested in "archive" data. The current native approach only helps if most of the calls are querying recent data,

There's no relationship between the runtime and querying data through RPC requests. When an RPC requests wants to read a storage value, it just reads directly from the database. The runtime is completely unrelated to this.

arkpar commented 2 years ago

It is also worth mentioning that RPC throughput scales well with the number of CPU cores. Whilst block import is serialized at the moment. Weight calculations do not account for any extra cores, that may be used for serving RPC requests.

Garandor commented 2 years ago

It is also worth mentioning that RPC throughput scales well with the number of CPU cores. Whilst block import is serialized at the moment. Weight calculations do not account for any extra cores, that may be used for serving RPC requests.

Thanks. Considering that the 100% CPU load from block execution is one-core-only, combined with the knowledge that RPC reads from the DB instead of adding load to the runtime thread, my concern is fixed.

kianenigma commented 1 year ago

For the purpose of the master tutorial, what I would love to have is a fixed node that can ingest different runtimes (one at a time, of course) and run any of them. This should ideally be as simple as changing a single line in Cargo.toml, or simply a --runtime wasm.wasm flag.

Will this be feasible once this is done?

I think this is already feasible using the first ugly approach. Assuming we only do wasm execution, we just need a glue trait that would convey some types like OpaqueBlock, and a few other types needed for building the RPC extensions (AccountId, Nonce) and the GenesisConfig.

As far as I know, genesis state will be built with wasm after this, and we are moving away from custom RPCs as well. So I think the amount of information that the client needs to know about the runtime is actually very very small.

bin/minimal/node is a good small example demonstrating this.

Alternatively, these concrete types that the client needs to know about the runtime in order to build genesis, RPC and such can be auto-generated by construct_runtime. I tried to hint at this in mod prelude which is mentioned in this idea.

arkpar commented 1 year ago

@kianenigma You can do that already with a custom chain spec. Generic substrate node should be able to sync any chain with substrate supported consensus. E.g. cargo run --release --chain polkadot.json in the substrate repo can sync polkadot.

kianenigma commented 1 year ago

@kianenigma You can do that already with a custom chain spec. Generic substrate node should be able to sync any chain with substrate supported consensus. E.g. cargo run --release --chain polkadot.json in the substrate repo can sync polkadot.

That's a valuable new avenue that I had totally forgotten about, thanks!

But -- in order to create a chain-spec, you still need some kind of fully-functioning client integrated with a wasm runtime. My ideal scenario is to have a runtime that is itself capable of generating its initial state and can be loaded into an omni-client that can run it with minimal configuration.

of course, as long as the consensus in the runtime and client are matching. For the sake of simplicity, assume manual-seal is being used.

For the tutorial, what I can do as the tutorial creator is to pre-generate all chain specs and let each step of the tutorial have a .json file.

This is much better, but I am still hoping that we can (optionally) reduce the dependency of the client on the runtime even further.

bkchr commented 1 year ago

My ideal scenario is to have a runtime that is itself capable of generating its initial state

This will be doable after: https://github.com/paritytech/polkadot-sdk/issues/25

The wasm file will have an API to generate the genesis state/initial state.

omni-client that can run it with minimal configuration.

This will not be possible after this pr. I mean, yes, we will move closer to this once the native runtime is gone. However, a lot of runtime api calls etc. require that both the node and the runtime know the types involved. This could also be changed, but would require changing a lot of things (not impossible, for sure).

But yeah, for your use case not that important.

kianenigma commented 1 year ago

cargo run --release --chain polkadot.json

Actually this does not seem to work. The genesis config of the linked node's native runtime and the chain-spec need to have matching pallets for this approach to work. Otherwise I get an error that the provided chain spec has either missing pallets or too many of them.

Given this, I am somewhat baffled about why cargo run --release -- --chain polkadot.json does indeed work. 🤔

bkchr commented 1 year ago

The genesis config of the linked node's native runtime and the chain-spec need to have matching pallets for this approach to work.

The polkadot.json is a raw genesis config and this doesn't require the native runtime to load it. So, not sure what you have done there.

kianenigma commented 1 year ago

Then it must be because I have not used a --raw chain spec, and now I see the point of it! :))

crystalin commented 1 year ago

I don't know if I shared my opinion before, but there are 2 important points to using the native runtime:

  • Speed (running 1000 tests using native is significantly faster, even when using interpreted wasm, which doesn't spend time compiling at boot)
  • Debugging (so far, I wasn't able to get proper debugging of the wasm runtime; I tried building with debug symbols and telling wasmtime to debug it, but no success)

kianenigma commented 1 year ago

This will not be possible after this pr. I mean yes we will move closer to this as the native runtime is gone. However, a lot of runtime api calls etc require that the node and the runtime know the type. This could also be changed, but would require to change a lot of things. (not impossible for sure)

I was just thinking about this again. I accept that it is not possible to decouple the two 100%; that's not what I am asking for. What I am asking for is to make the interface between the two clear.

  1. As it stands now, the client of a Substrate-based chain is expected to arbitrarily import the runtime crate as a dependency to use its types, values, and items.

  2. Moreover, the runtime code is assumed to know exactly which types it should declare as public in order to satisfy a client. In reality, it does not, and it is guesswork.

I think both of the above are a bad situation. I don't know if it was left like this because it is not possible, or out of ignorance? The solution seems to be a trait to connect the two in a clear manner.

/// The client's representation of a runtime.
trait Runtime {
    type RuntimeApis;
    type OpaqueBlock;
    type GenesisConfig; // This will not be needed anymore, but you get the point ;)
}

If you see any point in this, happy to move it over to a new issue.
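A minimal sketch of how such a glue trait could decouple the client from the concrete runtime crate; all names here are hypothetical, not an existing Substrate API:

```rust
// Hypothetical sketch (illustrative names, not an existing Substrate API):
// client components are written against a small description trait instead of
// importing the concrete runtime crate for its types.
use std::fmt::Debug;

trait RuntimeDescription {
    type AccountId: Debug;
    type Nonce: Copy + Into<u64>;
}

/// A client-side helper that needs only the associated types, e.g. for
/// building RPC extensions, never the runtime crate itself.
fn nonce_as_u64<R: RuntimeDescription>(nonce: R::Nonce) -> u64 {
    nonce.into()
}

/// Any runtime (or a test mock) supplies the concrete types.
struct MockRuntime;

impl RuntimeDescription for MockRuntime {
    type AccountId = [u8; 32];
    type Nonce = u32;
}

fn main() {
    // The client code is generic; swapping runtimes means swapping one impl.
    assert_eq!(nonce_as_u64::<MockRuntime>(7u32), 7u64);
}
```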

bkchr commented 1 year ago

Not sure we need such a trait. Neither GenesisConfig nor RuntimeApi will be required in the future. I would even go as far as to say that you won't even need to import the runtime in your node anymore.

kianenigma commented 1 year ago

I would even go that far and say that you don't even need to import the runtime in your node anymore.

Hell yeah is all I have to say to that then, yes none of what I said is needed.

liamaharon commented 1 year ago

Bumping @crystalin's comment here:

I don't know if I shared my opinion before but there are 2 important points to using Native runtime:

  • Speed (running 1000 tests using native is significantly faster, even when using interpreted wasm, which doesn't spend time compiling at boot)
  • Debugging (so far, I wasn't able to get proper debugging of the wasm runtime; I tried building with debug symbols and telling wasmtime to debug it, but no success)

I've also found debugging of the wasm runtime very hit-or-miss, with lots of variables being logged as just empty strings.

As try-runtime-cli develops and more debugging features are added, I'm wondering if there could be a benefit to using it with a native runtime.

bkchr commented 1 year ago

The native runtime can already no longer be used.

with lots of variables being logged as just empty strings.

This is just because we have implemented a lot of Debug impls to do exactly this. Generally, this is something we should revisit, and we should check the impact on the runtime. When things are not being used, the linker should strip them.

athei commented 1 year ago

This is just because we have implemented a lot of Debug impls to do this. Generally this is something that we should revisit and to check the impact on the runtime. Generally, when things are not being used, the linker should strip them.

Log messages are probably the place that pulls them into the runtime image. The solution would be to disable the lower log levels at compile time by default.

bkchr commented 1 year ago

Solution would be to disable the lower log levels at compile time by default.

We have already been doing this for on-chain builds for quite some time. We actually disable all logging. https://github.com/paritytech/polkadot/blob/c46d7426445ba64cf3ef3158e50e1c7665345a3f/runtime/polkadot/Cargo.toml#L300
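As an illustration of why compile-time disabling makes the empty-string Debug output cheap, here is a toy sketch (not the actual sp-api mechanism; the constant is a stand-in for a crate feature flag): with the flag constant-folded, the formatting branch is dead code and the linker can strip the unused Debug machinery.

```rust
// Toy sketch (not the actual sp-api mechanism): a compile-time flag gates
// log formatting, so disabled builds never touch the Debug impls and the
// linker can eliminate them from the wasm image.
const LOGGING_ENABLED: bool = false; // stand-in for a disabled logging feature

fn log_value<T: std::fmt::Debug>(value: &T) -> String {
    if LOGGING_ENABLED {
        format!("{:?}", value) // only reachable when logging is compiled in
    } else {
        String::new() // logged variables show up as empty strings
    }
}

fn main() {
    // With logging compiled out, nothing is formatted at all.
    assert_eq!(log_value(&42u32), "");
}
```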

athei commented 1 year ago

I guess it would then be an interesting experiment to see by how much the size of a runtime goes up if we replace all RuntimeDebug with Debug. We could compare whether there is a substantial difference when sp-api/disable-logging is set. With LTO enabled, the linker should be able to eliminate the unused impls.

bkchr commented 1 year ago

Yep

nazar-pc commented 1 year ago

I tried RuntimeDebug once and noticed that it somehow produced more code than Debug, and I have never used RuntimeDebug in my code since. I didn't bother checking why that was the case, though.

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-release-analysis-v1-0-0/3585/4

michalkucharczyk commented 1 year ago

When it comes to generating chain-spec files, I have some concerns about where the generating code should be placed and how it should work.

In the past, the generation of the chain-spec was simply done with a command (assuming the kusama-runtime feature is enabled in polkadot):

polkadot build-spec --chain kusama-staging --raw

Once the support for the native runtime is removed, and with the work done in https://github.com/paritytech/polkadot-sdk/pull/1256, we will need the following building blocks to create a chain-spec file:

When it comes to tooling for kusama/polkadot/westend/... development chain-spec creation, I see the following options:

  1. Keep support in polkadot-node, w/o dependencies on runtime crates. This requires the build-spec command to be extended with a path to the wasm runtime file:

    polkadot build-spec --chain kusama-staging --runtime /path/to/kusama-runtime.wasm --raw 

    We would also keep all the functions generating the patch in polkadot/node/service/src/chain_spec.rs.

  2. Keep support in polkadot-node, with dependencies on runtime crates (only to get the right wasm blob, e.g. polkadot::WASM_BINARY).

    # nothing changed here:
    polkadot build-spec --chain kusama-staging --raw 

    In this case we also keep all the functions generating the patch in polkadot/node/service/src/chain_spec.rs.

  3. Move the build-spec subcommand to a new tool (chain-spec-builder, which, with the work done in #1256, already allows some generic chain-spec operations). The tool could be moved to a new repo (like try-runtime).

    chain-spec-builder runtime -r ./kusama-runtime.wasm --raw-storage  --chain-spec kusama-staging

    In this scenario, the functions for generating the patches would be removed from the polkadot-sdk repo and moved to the chain-spec-builder code. It would also be possible to remove the functions generating the JSON patches and keep the patches as files (one for each runtime and variant), but that could be inconvenient, as one may still need a way to easily modify the chain-spec.

crystalin commented 1 year ago

Giving my 2 cents: I would prefer to keep it within the polkadot binary.

bkchr commented 1 year ago

I would favor options 1 and 3. Option 2 would bring us back to the old model of including the runtime crate, which blows up compile time again.

The tool proposed in 3 will be really good and can be used by things like zombienet that patches the chain specs anyway.

One other thing that has been lingering in my head for some time: we could support pointing to a Cargo.toml file via the CLI. When the node or the independent tool is pointed at a Cargo.toml file, we would build the crate and use the resulting wasm binary as the input binary. I can see this bringing nice devex when you want to easily test changes to the runtime. It would then only require restarting the node, which would also recompile the crate.

node --chain my-runtime/Cargo.toml

crystalin commented 1 year ago

I think Wasm or Cargo.toml are both good. Building wasm is fairly easy and straightforward, so either works.

Having another tool makes me worry about maintaining it, and about it being hard to discover (for new people) or to follow progress on. (Right now, polkadot -h gives you a lot of useful info and commands.)

michalkucharczyk commented 1 year ago

To reiterate my previous thoughts on chain-specs in the native-runtime-free world:

Removing RuntimeGenesisConfig struct is not enough to decouple the node/client side from the native-runtime entirely.

There are still some leftovers required on the node side to build the chain-spec. Examples of the types (or consts) that one needs to know are as follows:

These types need to be known at the time of creating the JSON chain-spec. One needs to know what the types are to properly put them into the JSON format of the runtime genesis config.

To overcome this problem, I propose to move the creation of the JSON representation of the runtime genesis config into the runtime itself and to expose multiple configurations through the GenesisBuilder API. The GenesisBuilder implementation may encapsulate any number of predefined, named configurations (e.g. "dev", "testnet", "staging", "alice") and make them accessible via the runtime API. A named configuration within the runtime would be provided in the form of a JSON patch (meaning that only the customized fields would be included).

The GenesisBuilder API could be updated as follows (draft):

fn list_patches_names() -> Vec<String>;
fn get_patch(name: String) -> serde_json::Value;
fn get_default_config() -> serde_json::Value;
fn build_storage(value: serde_json::Value);

The ChainSpecBuilder (introduced in #1256) would be extended with one additional method, with_named_genesis_config_patch, allowing chain-spec construction with the following code:

type RococoChainSpec = ChainSpecBuilder<(), Extensions>;

RococoChainSpec::builder(rococo::WASM_BINARY, Extensions::default()) 
   .with_name("Rococo Staging Testnet")
   .with_id("rococo_staging_testnet")
   .with_chain_type(ChainType::Live)
   .with_named_genesis_config_patch("staging")
   .with_telemetry_endpoints(...)
   .with_boot_nodes(...)
   .with_protocol_id(DEFAULT_PROTOCOL_ID)
   .build()

The above sequence interacts with the GenesisBuilder API of provided runtime, and builds the chain-spec.

[diagram: genesis-builder-00] (cc: @kianenigma)

bkchr commented 1 year ago

fn get_patch(name: String) -> serde_json::Value;
fn get_default_config() -> serde_json::Value;

I don't see a reason why these are two functions?

michalkucharczyk commented 1 year ago

fn get_patch(name: String) -> serde_json::Value;
fn get_default_config() -> serde_json::Value;

I don't see a reason why these are two functions?

get_default_config() is a wrapper around RuntimeGenesisConfig::default (we need the default config to apply a patch [runtime-builtin or user-provided] onto), while get_patch(name: String) returns a predefined builtin patch which shall be applied to the default value. (Having a patch allows for better flexibility, as we can put the patch into the chain spec.)
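The default-plus-patch mechanism can be sketched like this; the enum below is a toy stand-in for serde_json::Value, and the field names are illustrative, not the actual implementation:

```rust
// Toy sketch of the "default config + named patch" idea (illustrative field
// names; the enum is a stand-in for serde_json::Value, not the real API).
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
enum Value {
    Num(u64),
    Str(String),
    Obj(BTreeMap<String, Value>),
}

/// Recursively overlay `patch` onto `base`: objects merge key by key, while
/// any other patch value replaces the default outright.
fn apply_patch(base: &mut Value, patch: &Value) {
    match (base, patch) {
        (Value::Obj(b), Value::Obj(p)) => {
            for (key, val) in p {
                match b.get_mut(key) {
                    Some(slot) => apply_patch(slot, val),
                    None => {
                        b.insert(key.clone(), val.clone());
                    }
                }
            }
        }
        (slot, other) => *slot = other.clone(),
    }
}

fn main() {
    // `get_default_config()` would return something like this...
    let mut config = Value::Obj(BTreeMap::from([
        ("epoch_length".to_string(), Value::Num(600)),
        ("sudo_key".to_string(), Value::Str(String::new())),
    ]));
    // ...and `get_patch("staging")` only the customized fields.
    let staging_patch = Value::Obj(BTreeMap::from([
        ("sudo_key".to_string(), Value::Str("5StagingKey".to_string())),
    ]));
    apply_patch(&mut config, &staging_patch);
    if let Value::Obj(cfg) = &config {
        assert_eq!(cfg.get("epoch_length"), Some(&Value::Num(600))); // default kept
        assert_eq!(cfg.get("sudo_key"), Some(&Value::Str("5StagingKey".to_string())));
    }
}
```

Storing only the patch in the chain-spec keeps it small and readable, and lets the runtime's defaults evolve without every chain-spec having to repeat them.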

bkchr commented 1 year ago

But why return a patch in the first place if you can directly return the chain spec?

michalkucharczyk commented 1 year ago

But why return a patch in the first place if you can directly return the chain spec?

bkchr commented 1 year ago

Okay, thank you for the explanation! Sounds reasonable. But maybe we can come up with a better name than "patch". Maybe just rename it to "config"? I mean, it is a config, just not the whole one :thinking: Okay, this actually brings me back to making the parameter of get_config optional: if it is None, it could return the default config.