bkchr opened this issue 2 years ago
As an aside not strictly relevant here: we do envision having "runtime libraries" eventually, like SPREEs or perhaps large verifier keys à la halo2. These might be implemented in another way, like a flat file of data, or, if they undergo closer auditing, as eBPF or native code loaded via dlopen, or a native generator for a verifier key. Afaik all nodes should use an identical deployment scheme, so we'd never walk back the consensus improvement described here.
Okay, sounds interesting. I always thought that SPREE would be different wasm files that are just linked at runtime?
Do we really want to load native code? Doesn't this destroy the entire advantage of using wasm? I mean, then we could have done that from the beginning and distributed the runtime as a native library.
I've too little grasp of what "wasm files linked at runtime" means, but the SPREE would require process-like separation from the main PVF, and the SPREE invokes hostcalls that modify the SPREE state, for which the main PVF lacks permission.
At least in my mind that's impossible with "linked at runtime", so instead the SPREE exposes some crate linked at build time, but then itself creates new hostcalls or more SPREE-calls available to the whole PVF, including its own part.
Yes, SPREEs could be WASM but maybe 2x matters enough that a common enough SPREE would be native and well audited, or maybe doing those native builds limits our hardware choices too much? I donno..
Do we really want to load native code?
I donno, it's plausible we'd never want native code loaded, but much here remains unclear to me, like the performance hit eBPF takes.
I only raised the comment in case anything of this impacts how you remove the native runtime. :)
I've too little grasp of what "wasm files linked at runtime" means, but the SPREE would require process-like separation from the main PVF, and the SPREE invokes hostcalls that modify the SPREE state, for which the main PVF lacks permission.
Yeah, that makes sense and I had not thought about it, but this could just be a host call to communicate with the SPREE, and the SPREE could be anything.
Aka we could have SPREE in wasm by default and provide natively compiled SPREEs that ship with the node. If the requested SPREE version is not supported by the node, we could just run the wasm version. However, I think whatever we do, it is orthogonal to removing the native runtime.
I only raised the comment in case anything of this impacts how you remove the native runtime. :)
That is always a good idea nevertheless :)
@bkchr Would host functions be removed entirely along side the native runtime? Or would there be a workaround for that?
Host functions cannot be removed and will not be removed.
A question on this popped up at https://github.com/Manta-Network/Manta/pull/701
The gist of it is:
There already exists a commandline flag to force WASM execution: --execution wasm
While stopping validators on a nondeterminism issue is certainly bad and should be avoided (e.g. using that same CLI flag), it might be tolerable for other full node network participants, as downtime of their functions (serving RPC in our case) is not nearly as critical to the network as a chain stall.
We might want to actually optimize non-critical nodes for the happy case where everything works and have the 2x speedup in runtime execution result in a higher number of RPCs served per node (got no numbers on the actual performance impact yet, using the 2x number from @bkchr's initial description here; the actual gain I guess will also depend on the weight used per block on average).
Going "native runtime free" as on the tin would remove this option of tolerating stability issues for a performance gain, so IMHO it's not desirable, since you can already force the WASM runtime with the above CLI flag.
Did you benchmark the execution? And do you have that many calls into the runtime on your RPC nodes?
Did you benchmark the execution? And do you have that many calls into the runtime on your RPC nodes?
No benchmarks yet. I can only intuit at this point that, in the case of full-ish blocks and benchmark-scale machines, a 2x performance penalty would hurt RPC throughput.
We do expect high RPC load on our nodes through custom wallet syncing logic in the near future. I realize these RPC serving nodes can be easily scaled horizontally, but what I'm arguing here is that this change would condemn non-critical network participants to waste resources for a gain in stability which may not be strictly necessary.
A proposal could be to force --execution wasm on validators but leave --execution native_else_wasm as the default on full nodes, leaving them to opt into wasm execution with --execution wasm.
A "2x" slowdown estimate in runtime execution speed in no way translates into a 2x slowdown in the number of RPC requests that can be served. The slowdown obviously depends on the kind of requests that you get, but I would expect it to be basically 0. Until recently, PolkadotJS didn't even expose the possibility to call into the runtime through the RPC layer.
got no numbers on the actual performance impact yet
I don't think it is reasonable to argue that the performance hit is not worth the removal of the native runtime without having any number to back this up.
I don't think it is reasonable to argue that the performance hit is not worth the removal of the native runtime without having any number to back this up.
I agree, which is why I'm pointing to the general problem of wasting CPU cycles here. Wallet RPC is just our use case. Other people will be running indexing (e.g. subsquid) on their full nodes; they're also impacted by this change. How much so, I do not know.
but I would expect it to be basically 0.
Great if true, and then I have no more issue with this. My basis for argument is the assumption that a node that's executing block weight close to the benchmark limit will have a high CPU load if it's exactly on spec. That would leave little free resources to handle RPCs into the runtime. If that same node now had ~50% idle CPU because of native execution finishing faster, it'd have plenty of capacity for "extra" RPC work. Please correct me if this assumption is wrong.
If the runtime execution speed was suddenly divided by 2, then the weights would be multiplied by 2 in return.
The CPU usage problem that you describe is indeed correct, but it's independent of the time it takes to execute the runtime. No matter whether it takes 5 microseconds or 5 milliseconds to verify for example a balance transfer, the weights are adjusted so that a node running on the reference machine will use 100% CPU if the chain is at maximum capacity.
The time it takes for the runtime to execute influences the maximum number of transactions per second of a chain, but not (or very little) the CPU usage of the nodes.
No matter whether it takes 5 microseconds or 5 milliseconds to verify for example a balance transfer, the weights are adjusted so that a node running on the reference machine will use 100% CPU if the chain is at maximum capacity.
There's some subtlety here. Weights are always benchmarked with the WASM runtime, as it's obviously important for validators - who always use it - to not overflow execution time.
Full nodes executing the native runtime thus use weights that were created with the WASM runtime, but execute them 2x faster => lower CPU usage on full nodes executing natively, assuming reference hardware.
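To make this concrete, here is a toy back-of-the-envelope sketch of the argument (illustrative numbers only, not benchmark data):

```rust
// Toy illustration of the weight argument above; all numbers are made up.
fn main() {
    let block_budget_ms = 500.0; // execution budget per block on reference hardware

    // Suppose a transfer takes 0.10ms in wasm and 0.05ms natively (the "2x").
    let wasm_tx_ms = 0.10;
    let native_tx_ms = 0.05;

    // Weights are benchmarked with wasm, so a full block fits:
    let txs_per_block = block_budget_ms / wasm_tx_ms;
    println!("txs per full block: {txs_per_block}"); // 5000

    // A validator executing in wasm spends the whole budget:
    println!("wasm:   {} ms", txs_per_block * wasm_tx_ms); // ~500ms, ~100% of budget

    // A full node executing natively spends roughly half of it:
    println!("native: {} ms", txs_per_block * native_tx_ms); // ~250ms, ~50% idle
}
```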
My basis for argument is the assumption that a node that's executing block weight close to the benchmark limit will have a high CPU load if it's exactly on spec. That would leave little free resources to handle RPCs into the runtime. If that same node now had ~50% idle CPU because of native execution finishing faster, it'd have plenty of capacity for "extra" RPC work. Please correct me if this assumption is wrong.
You also need to consider how many of the RPC requests are querying recent state and how many of them are interested in "archive" data. The current native approach only helps if most of the calls are querying recent data, and the node is upgraded soon after an on-chain runtime upgrade. But yeah, we need hard benchmarking data to see how much extra resources are needed to avoid this native runtime optimization.
You also need to consider how many of the RPC requests are querying recent state and how many of them are interested in "archive" data. The current native approach only helps if most of the calls are querying recent data,
There's no relationship between the runtime and querying data through RPC requests. When an RPC request wants to read a storage value, it just reads directly from the database. The runtime is completely unrelated to this.
It is also worth mentioning that RPC throughput scales well with the number of CPU cores, whilst block import is serialized at the moment. Weight calculations do not account for any extra cores that may be used for serving RPC requests.
It is also worth mentioning that RPC throughput scales well with the number of CPU cores, whilst block import is serialized at the moment. Weight calculations do not account for any extra cores that may be used for serving RPC requests.
Thanks. Considering that the 100% CPU load from block execution is one-core-only, combined with the knowledge that RPC reads from the DB instead of adding load to the runtime thread, my concern is resolved.
For the purpose of the master tutorial, what I would love to have is a fixed node that can ingest different runtimes (one at a time, of course), and run any of them. This should ideally be as simple as changing a single line in Cargo.toml, or simply passing a --runtime wasm.wasm flag.
Will this be feasible once this is done?
I think this is already feasible using the first ugly approach. Assuming we only do wasm execution, we just need a glue trait that would convey some types like OpaqueBlock, and a few other types needed for building the RPC extensions (AccountId, Nonce) and the GenesisConfig.
As far as I know, genesis state will be built with wasm after this, and we are moving away from custom RPCs as well. So I think the amount of information that the client needs to know about the runtime is actually very very small.
Looking at bin/minimal/node gives a good small example demonstrating this.
Alternatively, these concrete types that the client needs to know about the runtime in order to build genesis, RPC and such can be auto-generated by construct_runtime. I tried to hint at this in mod prelude, which is mentioned in this idea.
@kianenigma You can do that already with a custom chain spec. A generic substrate node should be able to sync any chain with substrate-supported consensus. E.g. cargo run --release --chain polkadot.json in the substrate repo can sync polkadot.
@kianenigma You can do that already with a custom chain spec. A generic substrate node should be able to sync any chain with substrate-supported consensus. E.g. cargo run --release --chain polkadot.json in the substrate repo can sync polkadot.
That's a valuable new avenue that I had totally forgotten about, thanks!
But -- in order to create a chain-spec, you still need some kind of fully-functioning client integrated with a wasm runtime. My ideal scenario is to have a runtime that is itself capable of generating its initial state, and that can be loaded into an omni-client that can run it with minimal configuration.
Of course, as long as the consensus in the runtime and client match. For the sake of simplicity, assume manual-seal is being used.
For the tutorial, what I can do as the tutorial creator is pre-generate all chain specs and let each step of the tutorial have a .json file.
This is much better, but I am still hoping that we can (optionally) reduce the dependency of the client on the runtime even further.
My ideal scenario is to have a runtime that is itself capable of generating its initial state
This will be do-able after: https://github.com/paritytech/polkadot-sdk/issues/25
The wasm file will have an API to generate the genesis state/initial state.
omni-client that can run it with minimal configuration.
This will not be possible after this PR. I mean, yes, we will move closer to this as the native runtime is gone. However, a lot of runtime API calls etc. require that the node and the runtime know the types. This could also be changed, but it would require changing a lot of things. (Not impossible for sure.)
But yeah, for your use case not that important.
cargo run --release --chain polkadot.json
Actually this does not seem to work. The genesis config of the linked node's native runtime and the chain-spec need to have matching pallets for this approach to work. Otherwise I get an error that the provided chain spec has either missing pallets or too many of them.
Given this, I am somewhat baffled about why cargo run --release -- --chain polkadot.json does indeed work. 🤔
The genesis config of the linked node's native runtime and the chain-spec need to have matching pallets for this approach to work.
The polkadot.json is a raw genesis config and this doesn't require the native runtime to load it. So, not sure what you have done there.
Then it must be because I have not used a --raw chain spec, and now I see what the point of it is! :))
I don't know if I shared my opinion before, but there are 2 important points to using the native runtime:
- Speed (running 1000 tests using native is significantly faster, even when using interpreted wasm, which doesn't spend time compiling at boot)
- Debugging (so far, I wasn't able to get proper debugging of the wasm runtime; I tried building with debug symbols and telling wasmtime to debug it, but no success)
This will not be possible after this PR. I mean, yes, we will move closer to this as the native runtime is gone. However, a lot of runtime API calls etc. require that the node and the runtime know the types. This could also be changed, but it would require changing a lot of things. (Not impossible for sure.)
I was just thinking about this again. I accept that it is not possible to decouple the two 100%; that's not what I am asking for. What I am asking for is to make the interface between the two clear.
As it stands now, the client of a substrate-based chain is expected to arbitrarily import the runtime crate as a dependency, to use its types, values and items.
Moreover, the runtime code is assumed to know exactly which types it should declare as public in order to satisfy a client. In reality, it does not, and it is guesswork.
I think both of the above are bad situations. I don't know if it was left like this because it is not possible, or out of ignorance. The solution seems to be a trait to connect the two in a clear manner:
```rust
/// The client's representation of a runtime.
trait Runtime {
    type RuntimeApis;
    type OpaqueBlock;
    type GenesisConfig; // This will not be needed anymore, but you get the point ;)
}
```
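For illustration, a hypothetical sketch of how a client could stay generic over such a trait (none of these names exist in Substrate; MyRuntime and build_rpc_extensions are made up for the example):

```rust
// Hypothetical sketch only: a client generic over the proposed trait, so it
// never names concrete runtime types. `MyRuntime` and `build_rpc_extensions`
// are illustrative, not existing Substrate items.
trait Runtime {
    type OpaqueBlock;
    type AccountId;
    type Nonce;
}

// Defined on the runtime side; the client only sees the trait.
struct MyRuntime;

impl Runtime for MyRuntime {
    type OpaqueBlock = (); // placeholder types for the sketch
    type AccountId = u64;
    type Nonce = u32;
}

// Client-side code refers only to associated types like `R::AccountId`,
// so swapping the runtime requires no client changes.
fn build_rpc_extensions<R: Runtime>() {}

fn main() {
    build_rpc_extensions::<MyRuntime>();
}
```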
If you see any point in this, happy to move it over to a new issue.
Not sure we need such a trait. Neither GenesisConfig nor RuntimeApi will be required in the future. I would even go that far and say that you don't even need to import the runtime in your node anymore.
I would even go that far and say that you don't even need to import the runtime in your node anymore.
Hell yeah is all I have to say to that then. Yes, none of what I said is needed.
Bumping @crystalin's comment here:
I don't know if I shared my opinion before, but there are 2 important points to using the native runtime:
- Speed (running 1000 tests using native is significantly faster, even when using interpreted wasm, which doesn't spend time compiling at boot)
- Debugging (so far, I wasn't able to get proper debugging of the wasm runtime; I tried building with debug symbols and telling wasmtime to debug it, but no success)
I've also found debugging of the wasm runtime very hit-or-miss, with lots of variables being logged as just empty strings.
As try-runtime-cli develops and more debugging features are added, I'm wondering if there could be a benefit to using it with a native runtime.
The native runtime can already not be used anymore.
with lots of variables being logged as just empty strings.
This is just because we have implemented a lot of Debug impls to do this. Generally this is something that we should revisit and check the impact on the runtime. Generally, when things are not being used, the linker should strip them.
This is just because we have implemented a lot of Debug impls to do this. Generally this is something that we should revisit and check the impact on the runtime. Generally, when things are not being used, the linker should strip them.
Log messages would probably be the place which pulls them into the runtime image. A solution would be to disable the lower log levels at compile time by default.
A solution would be to disable the lower log levels at compile time by default.
We have already been doing this for on-chain builds for quite some time. We actually disable all logging. https://github.com/paritytech/polkadot/blob/c46d7426445ba64cf3ef3158e50e1c7665345a3f/runtime/polkadot/Cargo.toml#L300
I guess then it would be an interesting experiment to see by how much the size of a runtime goes up if we replace all RuntimeDebug by Debug. We could compare if there is a substantial difference when sp-api/disable-logging is set. With lto enabled, the linker should be able to eliminate the unused impls.
Yep. I tried RuntimeDebug once and noticed it was somehow more code with RuntimeDebug than Debug, and I never used RuntimeDebug in my code since. I didn't bother checking why that was the case though.
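To illustrate the stripped-output behaviour discussed above, a minimal sketch (based on how sp_core::RuntimeDebug behaves in no_std builds as far as I know; the exact placeholder text may vary by version):

```rust
// Minimal sketch of the RuntimeDebug behaviour discussed above.
use sp_core::RuntimeDebug;

#[derive(RuntimeDebug)]
pub struct Transfer {
    pub from: u64,
    pub amount: u128,
}

fn main() {
    let t = Transfer { from: 1, amount: 100 };
    // Native / `std` build: prints the full struct fields.
    // `no_std` wasm build: the derived impl is a stub that prints a
    // placeholder (e.g. "<wasm:stripped>"), which is why values show up
    // as "empty" in wasm logs.
    println!("{:?}", t);
}
```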
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/polkadot-release-analysis-v1-0-0/3585/4
When it comes to generating chain-spec files, I have some concerns: where the generating code should be placed and how it should work.
In the past, generating the chain-spec was simply done with a command (assuming the kusama-runtime feature is enabled in polkadot):
polkadot build-spec --chain kusama-staging --raw
Once support for the native runtime is removed, and with the work done in https://github.com/paritytech/polkadot-sdk/pull/1256, we will need the following building bricks to create a chain-spec file:
- a runtime with the GenesisBuilder API enabled,
- a patch for RuntimeGenesisConfig (or a JSON file containing the full config). The patch can be provided as a static file (example), or as a function which provides a patch (example, example).

When it comes to tooling for kusama/polkadot/westend/... development chain-spec creation, I see the following options:
1. Keep support in polkadot-node, without dependencies on runtime crates. This requires the build-spec command to be extended with a path to the wasm runtime file:

polkadot build-spec --chain kusama-staging --runtime /path/to/kusama-runtime.wasm --raw

We would also keep all the functions generating the patch in polkadot/node/service/src/chain_spec.rs.
2. Keep support in polkadot-node, with dependencies on runtime crates (only to get the right wasm blob, e.g. polkadot::WASM_BINARY):

# nothing changed here:
polkadot build-spec --chain kusama-staging --raw

In this case we also keep all the functions generating the patch in polkadot/node/service/src/chain_spec.rs.
3. Move the build-spec subcommand to a new tool (chain-spec-builder, which with the work done in #1256 already allows some generic chain-spec operations). The tool could be moved to a new repo (like try-runtime):

chain-spec-builder runtime -r ./kusama-runtime.wasm --raw-storage --chain-spec kusama-staging

In this scenario, the functions for generating the patches would be removed from the polkadot-sdk repo and moved to the chain-spec-builder code.
It would also be possible to remove the functions generating the JSON patches and keep the patches as files (one for each runtime and variant), but that could be inconvenient, as one may still need a way to easily modify the chain-spec.
Giving my 2 cents: I would prefer to keep it within the polkadot binary.
I would favor options 1 and 3. Option two would bring us back to the old model of including the runtime crate, which blows up compile time again.
The tool proposed in 3 will be really good and can be used by things like zombienet, which patches the chain specs anyway.
One other thing that has been lingering in my head for some time: we could support pointing to a Cargo.toml file via the CLI. When the node or the independent tool is pointed to a Cargo.toml file, we would build the crate and use the resulting wasm binary as the input binary. I can see this bringing a nice devex when you want to easily test changes to the runtime. It would then only require restarting the node, which would also recompile the crate.

node --chain my-runtime/Cargo.toml
I think wasm or Cargo.toml are both good. Building wasm is fairly easy and straightforward anyway.
Having another tool makes me worry about maintaining it and about it being hard to discover (for new people) or to follow progress on. (Right now, polkadot -h gives you a lot of useful info and commands.)
To re-iterate my previous thoughts on the chain-specs in native-runtime-free world:
Removing the RuntimeGenesisConfig struct is not enough to decouple the node/client side from the native runtime entirely.
There are still some leftovers required on the node side to build the chain-spec. Examples of the types (or consts) that one needs to know are as follows:
- AccountId,
- HostConfiguration,
- SessionKeys (BabeId, GrandpaId etc.),
- westend::BABE_GENESIS_EPOCH_CONFIG
These types need to be known at the time of creating the JSON chain-spec; one needs to know what the types are to properly put them into the JSON format of the runtime genesis config.
To overcome this problem, I propose moving the creation of the JSON representation of the runtime genesis config into the runtime itself, and exposing multiple configurations through the GenesisBuilder API.
The GenesisBuilder implementation may encapsulate any number of predefined named configurations (e.g. "dev", "testnet", "staging", "alice"), and make them accessible via the runtime API.
The named configuration within runtime would be provided in the form of JSON patch (meaning that only the customized fields would be included into it).
The GenesisBuilder API could be updated as follows (draft):
```rust
fn list_patches_names() -> Vec<String>;
fn get_patch(name: String) -> serde_json::Value;
fn get_default_config() -> serde_json::Value;
fn build_storage(value: serde_json::Value);
```
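For illustration, a hypothetical runtime-side sketch of how such named patches could be backed (the preset names and contents are made up; assumes serde_json's json! macro is usable here):

```rust
// Hypothetical sketch of the draft API above; preset names and contents
// are illustrative only, not existing code.
use serde_json::{json, Value};

fn list_patches_names() -> Vec<String> {
    vec!["dev".into(), "staging".into()]
}

fn get_patch(name: String) -> Value {
    match name.as_str() {
        // A patch carries only the customized fields; everything else
        // falls back to the default RuntimeGenesisConfig.
        "dev" => json!({ "sudo": { "key": "<dev account>" } }),
        "staging" => json!({ "staking": { "validatorCount": 10 } }),
        _ => Value::Null,
    }
}
```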
The ChainSpecBuilder (introduced in #1256) would be extended with one additional method, with_named_genesis_config_patch, allowing chain-spec construction with the following code:
```rust
type RococoChainSpec = ChainSpecBuilder<(), Extensions>;

RococoChainSpec::builder(rococo::WASM_BINARY, Extensions::default())
    .with_name("Rococo Staging Testnet")
    .with_id("rococo_staging_testnet")
    .with_chain_type(ChainType::Live)
    .with_named_genesis_config_patch("staging")
    .with_telemetry_endpoints(...)
    .with_boot_nodes(...)
    .with_protocol_id(DEFAULT_PROTOCOL_ID)
    .build()
```
The above sequence interacts with the GenesisBuilder API of the provided runtime, and builds the chain-spec.
(cc: @kianenigma)
fn get_patch(name: String) -> serde_json::Value; fn get_default_config() -> serde_json::Value;
I don't see a reason why these are two functions?
fn get_patch(name: String) -> serde_json::Value; fn get_default_config() -> serde_json::Value;
I don't see a reason why these are two functions?
get_default_config() is a wrapper around RuntimeGenesisConfig::default (as we need the default config to apply a patch [runtime-builtin or user-provided]); get_patch(name: String) allows getting a predefined built-in patch which shall be applied to the default value. (Having a patch allows for better flexibility, as we can put the patch into the chain spec.)
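For clarity, a sketch of what "applying a patch to the default config" could look like as a recursive JSON merge (a simplification; the real merge semantics may differ):

```rust
// Simplified illustration of patch application: object fields merge
// recursively, everything else in the patch overwrites the default.
use serde_json::Value;

fn apply_patch(default: &mut Value, patch: Value) {
    match (default, patch) {
        (Value::Object(base), Value::Object(overrides)) => {
            for (key, value) in overrides {
                apply_patch(base.entry(key).or_insert(Value::Null), value);
            }
        }
        (slot, value) => *slot = value,
    }
}

fn main() {
    let mut config = serde_json::json!({ "balances": { "devAccounts": 0 }, "sudo": { "key": null } });
    let patch = serde_json::json!({ "sudo": { "key": "<dev account>" } });
    apply_patch(&mut config, patch);
    // Only `sudo.key` changed; `balances` kept its default values.
    println!("{config}");
}
```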
But why return a patch in the first place if you can directly return the chain spec?
But why return a patch in the first place if you can directly return the chain spec?
A patch contains only the customized fields, so it stays applicable even when RuntimeGenesisConfig is changed in the future (e.g. a new pallet is added).

Okay, thank you for the explanation! Sounds reasonable. But maybe we can come up with a better name for "patch". Maybe just rename it to "config"? I mean, it is a config, just not the whole one :thinking: Okay, this actually brings me back to making the parameter to get_config optional; if it is None, it could return the default config.
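A sketch of that variant, in the style of the draft above (hypothetical signature):

```rust
// Hypothetical revision from the comment above: a single getter with an
// optional name that returns the default config when `name` is `None`.
fn get_config(name: Option<String>) -> serde_json::Value;
```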
The Vision
We want to get rid of the native runtime. Initially, the native runtime was added to overcome the performance "problems" of using wasmi for runtime execution. However, with wasmtime we are around 2x slower than native. While this still isn't perfect, it is way faster than wasmi, which is around 10x slower (all depending on the workload etc.). While performance plays an important role here, there are also other things to consider, like ensuring that the native and wasm runtimes are deterministic. We have already run into such determinism problems with Polkadot. While we can ensure that the exact same compiler version is used for building the Polkadot binary in our release process, we cannot ensure that everyone else is also using the same compiler version as we used for the wasm build.
There is also a nice write up and some discussions here: https://github.com/paritytech/substrate/issues/7288
The Plan
To remove the native runtime, we need to fix some things first:
Open Questions
If you want to help us out and contribute to this issue, in this section you can find open questions and tasks where we would appreciate any input.
There are maybe still some other things to consider and fix along the way, like only compiling the runtime for wasm instead of the current native & wasm build.
Here you can find the board with specific sub-tasks to this milestone: https://github.com/orgs/paritytech/projects/18/views/9