paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/
1.94k stars 711 forks source link

Merge PVF execution into Substrate repo? #2344

Open pepyakin opened 2 years ago

pepyakin commented 2 years ago

This is not necessarily a feature request or call to action, but just writing down the thoughts on this topic.

In polkadot there is a PVF validation host. That component is responsible for these things:

  1. The validation host (VH for brevity) is responsible for storing of the compiled artifacts on disk. That also implies periodic pruning.
  2. VH maintains a queue for compiling the PVFs and hosts a number of specialized processes that act as workers for compiling the code[^1].
  3. VH maintains a queue for serving the execution requests. Those are also implemented with workers.
    • The reasoning here was that a PVF could have an infinite loop. Nonetheless, that should not make the polkadot to indefinitely execute such code. You generally cannot cleanly abruptly stop a thread doing arbitrary things (such as executing machine code from wasm). At the time wasmtime did not support for limiting the execution time.
    • Execution of a given PVF is performed by relying on sc-executor-wasmtime.
  4. Coordinating job of making sure that PVFs are compiled before executing, and doing other things.

So the question that this issue raises if whether we want to merge this into Substrate. That may sound understandably counterintuitive, but bear with me. I was the same at the beginning and then the idea started slowly growing on me.

During the first overhaul of the PVF validation host, I ended up going with the most straightforward way to implement VH. Basically take the sc-executor-wasmtime crate and modify it so that it supports the implementation described above.

I was wondering if I should actually put the code in Substrate repo. I wondered because there were some pros:

However, mostly for time-to-ship minimization reasons I decided to leave these thoughts and just carry on with the implementation. I also thought at the time it was a little bit weird from the "logical" perspective that generic substrate has parts specific to Polkadot.

However, logic related to PVF ultimately percolated into the Substrate executor (sc-wasmtime-executor). It would be a mistake to pretend that Substrate engine is generic: it already knows about Polkadot execution environment. I also had to admit that my puristic stance is not really constructive and sometimes it makes sense for the upstream projects to embed parts from the downstream[^2].

To give you some examples:

  1. We have allow_missing_imports flag when creating a runtime. The reason why is that we usually want to throw an error if runtime tries to import something that the substrate host does not support. However, for Cumulus we don't want this behavior, because there Substrate Runtime and PVF are combined, and PVF provides only a subset of functions provided by the Substrate Runtime.
  2. sc-wasmtime-executor provides functions that are only used by the PVF: for creating an artifact and loading a module from an on-disk artifact.

So we ended up with an architecture where there are some modules expose some functions that are used for PVF. If we merged PVF code into Substrate we could design a better API that is tailored for the PVF use-case.

Apart from that, Substrate may also benefit from the parts that is already implemented in PVF VH.

Disk Caching. Initially, on disk caching was thought of as only PVF feature. The first naive attempt at it was substrate#8057 which was in substrate, but only because wasmtime required specific configuration.

But later I realized that perhaps for heavy archive nodes we may want to use on-disk caching. I think it is not unimaginable that a chain with big history can have 100 upgrades. This is the same the same number of parachains we target on Polkadot which motivated us to look into the on-disk caching in the first place.

Also there was a suspicion of mine that slowdowns we experience during an upgrade may be related to compiling of the runtime on several threads. I never verified that though.

Interrupting. See https://github.com/paritytech/substrate/issues/10648. Right now, this is achieved by spawning processes in VH. Alternatively, we could have a mode that uses wasmtime facilities for that. While it may not be the best idea to enable that by default for the PVF validation, we may use this approach for the fallbacks for https://github.com/paritytech/polkadot-sdk/issues/881

External Processes. Since we integrated a compiling wasm engine, wasmtime, there was a concern about its big surface area. Essentially, this is a big machine that transforms non-trivial bytecode format into machine code and jumps onto it which then uses a fair share of unsafe blocks. Even though its written in Rust, have very good quality assurance and worked on by brilliant engineers, there is still something to be concerned about, especially considering that this code executes right next to secret material that can be used to steal funds or either to the code that is responsible for validating of the network.

Back then we did not act on this because it felt not priority: at the time I am not even sure if we had production networks, we used wasmi by default and there were lots of other stuff to build. Now we are in different position though.

Therefore, maybe we will want to revisit this and possibly Substrate can optionally move some or all of wasm execution into an external process. This is similar to what Polkadot's PVF host already is doing. For Substrate it may be quite a bit more tricky, since unlike PVF, Substrate Runtime requires communication with the host for retrieving the storage. So not entirely sure about this one.

Related issue on polkadot side https://github.com/paritytech/polkadot-sdk/issues/882

Then lastly, we already were thinking about refactoring of both sides:

So maybe this could be addressed as well.

[^1]: Why do we do this? We need that because unlike Substrate runtime, PVFs are untrusted and can be submitted by anyone. There is a PVF pre-check process that ensures that a PVF can compile soundly. We fear compiler bombs.

[^2]: It is further invalidated if we moved forward with the ideas of making Substrate, Polkadot and Cumulus a monorepo — in that case we could make it back "pure" by extracting executor into another top-level project, but I digress.

pepyakin commented 2 years ago

A couple of more thoughts on this topic:

  1. https://github.com/paritytech/polkadot/pull/5712 is another example of why it would be beneficial to merge those. TL;DR: wasmtime only limits stack available to wasm but does not ensure it. Thus the caller should ensure that there is enough stack space available. It's a big ask. If the code lived here, then it would be much easier to check that this invariant holds. That said, it's not necessary to go all the way in merging with all the caching and stuff.
  2. The fact that we need to have this PR is a demonstration of the benefits of compartmentalization of wasm execution.