quilt / pm

Quilt Project Management Repo

Execution Toolchain for Ethereum 2.0 #2

Open lightclient opened 5 years ago

lightclient commented 5 years ago

Stateless execution is one of the defining factors of Ethereum 2.0. It will alleviate the I/O bottleneck faced by Ethereum 1.0 clients and minimize the amount of data needed for validators to begin validating a new shard. Although it solves some issues, it raises just as many, and a robust stateless system must satisfy a number of important characteristics.

Since May, we've been exploring many different facets of stateless systems including compile-time reflection of Rust structs to sparse merkle trees, efficient in-place lookup strategies for sparse merkle trees, single-pass authentication of merkle proofs, atomic modification of proofs in nested execution, a practical example of a stateless token execution environment (and required tooling), and general designs for a stateless execution environment for user contracts.

As we look towards prototyping an umbrella execution environment, we think it is time to try formalizing some of this toolchain so that more efforts can begin in parallel and efforts are not duplicated across projects and organizations. The main pieces of this toolchain that we see are:

  1. Runtime
     a. Stateless executor for consensus nodes
     b. Stateful executor for testing + state providers
  2. Authenticated multi-proof backend
     a. Proof generator
     b. Multi-proof merging
     c. EE & contract level library
  3. Orchestration tools
     a. Deployer
     b. Testing framework
     c. Long-running simulator
  4. EE & smart contract languages
  5. Client libraries (i.e. web3.js)
     a. Retrieve proofs from state providers
     b. Manage credentials for popular execution environments

Runtime

The runtime environment is the base of the whole toolchain. Its semantics define how the multi-proof backends operate, which in turn affects each subsequent aspect of the toolchain. The Ewasm team has done a lot of work defining this engine within the bounds of both Ethereum 1.0 and Ethereum 2.0. We have also begun to experiment with certain runtime heuristics in the ewasm-rt, such as wasm-spawning-wasm and atomic proof modification. These areas require more research, but are advancing. The biggest missing pieces in the runtime are still secure metering solutions, efficient bignum host functions, and multi-proof-dependent middleware for stateful execution. Although the middleware won't be utilized by consensus nodes, it will be important for state providers and proof generation in general.

Authenticated multi-proof backend

The multi-proof system is essentially the backbone of all stateless computation. There are numerous flavors of proof systems: sparse, SSZ sparse, Patricia, Knuth binary, etc., and even more ways of serializing them. The actual time and size complexity of each proof system varies and should be subject to more rigorous research and improvement. All of these systems can be abstracted, in terms of the toolchain as a whole, by thinking of them as authenticated key-value stores.

+-------------+                +---------------+                +------+
| data values |   < ------ >   | authenticator |   < ------ >   | code |
+-------------+                +---------------+                +------+

They act as an integrity shim on top of the actual data values. The runtime should generally be agnostic of the semantics of the proof backend, but it must make an exception in order to provide a stateful executor. That exception in particular is that it must a) maintain a list of get and set operations against the stateful data store, which can later be used to construct the necessary multi-proof and b) support middleware from the proof library which allows for runtime decisions (e.g. what should a get return for key K that hasn't been "officially" initialized yet?).
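To make the stateful-executor requirements concrete, here is a minimal Rust sketch (all names are hypothetical) of a backing store that records every get and set so the necessary multi-proof can be constructed afterwards, with a simple middleware hook that decides what a get returns for an uninitialized key:

```rust
use std::collections::HashMap;

/// Operations recorded during stateful execution; a proof generator can
/// later replay this log to decide which keys the multi-proof must cover.
#[derive(Debug, Clone)]
enum Op {
    Get(u32),
    Set(u32, u64),
}

/// Middleware hook: decides what a `get` returns for a key that hasn't
/// been "officially" initialized yet (here: a default of 0).
fn default_for_uninitialized(_key: u32) -> u64 {
    0
}

struct RecordingStore {
    data: HashMap<u32, u64>,
    log: Vec<Op>,
}

impl RecordingStore {
    fn new() -> Self {
        RecordingStore { data: HashMap::new(), log: Vec::new() }
    }

    fn get(&mut self, key: u32) -> u64 {
        self.log.push(Op::Get(key));
        *self.data.get(&key).unwrap_or(&default_for_uninitialized(key))
    }

    fn set(&mut self, key: u32, value: u64) {
        self.log.push(Op::Set(key, value));
        self.data.insert(key, value);
    }

    /// The deduplicated set of keys the multi-proof must include.
    fn touched_keys(&self) -> Vec<u32> {
        let mut keys: Vec<u32> = self
            .log
            .iter()
            .map(|op| match op {
                Op::Get(k) | Op::Set(k, _) => *k,
            })
            .collect();
        keys.sort();
        keys.dedup();
        keys
    }
}

fn main() {
    let mut store = RecordingStore::new();
    let balance = store.get(7); // uninitialized -> middleware default
    store.set(7, balance + 100);
    assert_eq!(store.touched_keys(), vec![7]);
}
```

The `touched_keys` log is exactly what a state provider would feed into proof generation for the transaction.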

Developing libraries that provide all the necessary tools is a large effort. I discovered this firsthand while developing a tool for composing proofs for sheth. Most of my work has revolved around implementations of sparse trees, so there may be additional things to consider for other types of trees, but the two areas of work here are a) compile-time reflection of data structures to general indices and b) efficient multi-proof libraries for general-index-based proofs.

Orchestration tools

Until now, we've been rolling our own orchestration tools. As we start looking towards building smart contracts on top of an execution environment that needs to be deployed to a runtime, we see the need to consolidate these tools more clearly. In fact, even the existing tools combined are not adequate to provide the developer experience we are striving for. I believe that we are in desperate need of both a testing framework for stateless contracts and a simulator for long-running processes. In both instances, it's important that the tools are highly interoperable with the proof libraries so that users can mostly ignore their existence.

I think Truffle is a great baseline for the testing DevEx that should be provided for stateless contracts:

pragma solidity ^0.5.0;

import "truffle/Assert.sol";
import "truffle/DeployedAddresses.sol";
import "../contracts/MetaCoin.sol";

contract TestMetaCoin {
  function testInitialBalanceUsingDeployedContract() public {
    MetaCoin meta = MetaCoin(DeployedAddresses.MetaCoin());
    Assert.equal(meta.getBalance(tx.origin), 10000, "Owner should have 10000 MetaCoin initially");
  }

  function testInitialBalanceWithNewMetaCoin() public {
    MetaCoin meta = new MetaCoin();
    Assert.equal(meta.getBalance(tx.origin), 10000, "Owner should have 10000 MetaCoin initially");
  }
}

In order to achieve that level of abstraction from the proof backend, it is critical to thoroughly define a standard interface for proof libraries.

EE & smart contract languages

Just as most developers don't program in x86 assembly or EVM assembly anymore, we need to investigate high-level solutions for writing EEs and smart contracts. Fortunately, WebAssembly is a target for LLVM, so most languages already compile to it. That is only about a quarter of the story, though. Many languages have their own runtimes, garbage collectors, and debugging infrastructure that get included in binaries. However, writing EE / smart contract code is analogous to embedded systems engineering, where every clock cycle and byte matters. We need to develop libraries for these languages, and tools such as Chisel, to support efficient execution and small bundle sizes. Additionally, multi-proof libraries must be written or implemented natively for other languages.

Client libraries

The final piece of the toolchain puzzle is the client libraries used to link application developers with users. Fortunately, there is already a lot of existing work that can be leveraged. For instance, if the other pieces are done correctly, it should be fairly easy to extend web3.js to first send a transaction to a state provider to calculate the proof before submitting it to the eth 2 mempool. It's likely that most of the work will be a) supporting the different flavors of execution environments and b) ironing out the incentive schemes for state providers.

villanuevawill commented 5 years ago

This is great. A couple additions:

  1. The orchestration tools in progress provide a simulation around multiple shards - which is linked here. The Truffle analogy is perfect and matches what we had described as the goal of the initial simulation. We found that the way we initially went about building the simulation was a bit too heavy and would take too much engineering effort to maintain (running an actual shard chain & beacon chain in tandem with the spec). As a result, we've decided to build very simple abstractions into Scout to achieve the same goals: being able to simulate cross-shard calls and a "real" Eth 2 environment. It should also provide the framework that relayers/state providers can integrate with. The relayers/state providers should be a part of this tooling/testing framework as well.
  2. EE & Smart Contract Languages - some of the most challenging pieces will be exploring how we abstract cross-shard calls for the developer as simple asynchronous calls. For example, whether a contract must be written on a messaging model (such as the actor model), a yanking/locking model (read/write locks), or just as a simple async call (which does not require atomicity). This also has an impact on the actual direction of the umbrella EE.
  3. We should define the state provider/incentive scheme space a bit more and open up an entirely separate issue around this. We'd also like to veer towards having a standard interface/api that they operate on. EEs should try and fit into this space or design.
  4. We need to figure out how we actually submit transactions to the block producer. Is it published directly via a "broadcast network" or is it managed via a mempool that shares/distributes transactions as in Eth 1? Investigation around the state provider or relay network should help solve this.

We should break a lot of this out even further into subtasks, etc. so we can begin assigning/finding people from the community to get involved here.

Also, "The biggest missing pieces to the runtime still are efficient bignum host functions and multi-proof dependent middleware for stateful execution. Although this won't be utilized by consensus nodes, it will be important for state providers and proof generation in general." - This assumes interpreters are truly the final choice/route. It would be nice to begin formalizing a decision here.

Great job @lightclient !

lightclient commented 5 years ago

To expand more on the multi-proof section, I see two distinct areas of development.

Low-level proof database

The lowest level of the proof stack should be indifferent to the high-level data structures it maintains. The design space for { efficient | compact | static | deterministic | etc } proofs is rather large. There exist only keys K and values V. This is akin to the memory layout in the EVM, where the only concept of storage is a mapping of u256 keys to u256 values. Higher levels of abstraction can emulate more complex types (see Solidity's layout of state variables).

Interface

The minimal interface a low-level proof database should provide EE / contract developers is as follows:

fn get(&self, key: K) -> Option<V>;
fn set(&mut self, key: K, value: V) -> Result<Option<V>>;
fn root(&mut self) -> Result<V>;
fn from_bytes(bytes: Vec<u8>) -> Multiproof;
fn from_map(map: HashMap<K, V>) -> Multiproof; // only used by generation frameworks

Note: this should be used more for inspiration, as it is a Rust-specific interface; the true underlying interface would need to a) be language-agnostic and b) minimize memcpy occurrences.
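As a toy illustration of the shape of this interface, here is an in-memory stand-in with concrete u64 keys and values. All names are hypothetical, `from_bytes` is omitted, and the "root" is an ordinary hash fold over sorted entries rather than a real Merkle root:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy in-memory stand-in for the low-level proof database. The root is
/// a deterministic hash fold over sorted entries -- NOT a real Merkle
/// root -- so it only illustrates the interface, not the cryptography.
struct Multiproof {
    entries: HashMap<u64, u64>,
}

impl Multiproof {
    fn from_map(map: HashMap<u64, u64>) -> Multiproof {
        Multiproof { entries: map }
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.entries.get(&key).copied()
    }

    /// Returns the previous value for the key, if any.
    fn set(&mut self, key: u64, value: u64) -> Option<u64> {
        self.entries.insert(key, value)
    }

    fn root(&self) -> u64 {
        // Sort entries so the fold is independent of insertion order.
        let mut pairs: Vec<(&u64, &u64)> = self.entries.iter().collect();
        pairs.sort();
        let mut hasher = DefaultHasher::new();
        for (k, v) in pairs {
            k.hash(&mut hasher);
            v.hash(&mut hasher);
        }
        hasher.finish()
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(1u64, 10u64);
    let mut proof = Multiproof::from_map(map);
    let before = proof.root();
    assert_eq!(proof.get(1), Some(10));
    assert_eq!(proof.set(1, 11), Some(10));
    // Any modification to an entry changes the commitment.
    assert_ne!(proof.root(), before);
}
```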

Serialization & In-memory operations

A critical attribute of the proof backend is its size when serialized and the time and space complexity of traversing it in serialized form. Since the operations defined in the interface above will be executed within a metered virtual machine where every operation counts, it's critical that they are as performant as possible. For example, throwing a proof into a standard hash table would likely incur a lot of extra computation we'd like to avoid (e.g. hashing the key to determine the location of the value).
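One way to avoid that hash-table overhead is to keep the proof in a key-sorted serialized layout and look values up by binary search over the serialized form. A minimal sketch under that assumption (names hypothetical):

```rust
/// A proof serialized as (key, value) pairs sorted by key. Lookup is a
/// binary search directly over the serialized layout -- no hashing and
/// no rebuilding into a hash table, which matters inside a metered VM.
struct SerializedProof {
    // Flat, key-sorted layout; in practice this would be a byte slice
    // handed to the VM and parsed in place.
    pairs: Vec<(u64, u64)>,
}

impl SerializedProof {
    fn from_pairs(mut pairs: Vec<(u64, u64)>) -> SerializedProof {
        // Sort once at construction so every lookup is O(log n).
        pairs.sort_by_key(|&(k, _)| k);
        SerializedProof { pairs }
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.pairs
            .binary_search_by_key(&key, |&(k, _)| k)
            .ok()
            .map(|i| self.pairs[i].1)
    }
}

fn main() {
    let proof = SerializedProof::from_pairs(vec![(9, 90), (1, 10), (5, 50)]);
    assert_eq!(proof.get(5), Some(50));
    assert_eq!(proof.get(2), None);
}
```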

High-level wrapper

Having a robust proof backend is great, but using it directly is not ergonomic from a developer's perspective. It is important that complex user objects can be expressed simply and mapped into a proof backend. It would be even better if the high-level wrapper were agnostic of the backend, so backends could be swapped out on a whim. In order to achieve this, some compile-time processing needs to occur that translates user objects to a key-value memory layout that can be loaded into a multi-proof via the from_map interface.

There are two conflicting use cases for the high-level wrapper: efficient computation within a metered virtual machine and simple proof generation on a local machine.

Efficient Execution

Run-time slowdowns are often caused by unnecessary memcpy calls. These can generally be avoided by building the wrapper as an interface to the proof backend, rather than as an actual object that stores values. An example of a wrapper in this case would be something that, at compile time, parameterizes the keys a structure would occupy in memory. Since this use case is only focused on on-chain execution, it is also assumed that the proof backend has been authenticated and that all values necessary to compute the transition are available (or can be calculated from a default value).
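A minimal sketch of such a zero-copy wrapper, assuming a hypothetical fixed little-endian layout with an 8-byte balance followed by an 8-byte nonce:

```rust
use std::convert::TryInto;

/// A zero-copy "interface type": it holds only a reference into the
/// proof's backing buffer and reads fields at compile-time-known
/// offsets, so no field is copied into a separate object.
/// Hypothetical layout: bytes 0..8 = balance (LE u64), 8..16 = nonce.
struct AccountView<'a> {
    buf: &'a [u8],
}

impl<'a> AccountView<'a> {
    const BALANCE_OFFSET: usize = 0;
    const NONCE_OFFSET: usize = 8;

    fn balance(&self) -> u64 {
        u64::from_le_bytes(
            self.buf[Self::BALANCE_OFFSET..Self::BALANCE_OFFSET + 8]
                .try_into()
                .unwrap(),
        )
    }

    fn nonce(&self) -> u64 {
        u64::from_le_bytes(
            self.buf[Self::NONCE_OFFSET..Self::NONCE_OFFSET + 8]
                .try_into()
                .unwrap(),
        )
    }
}

fn main() {
    // Pretend this buffer is a region of the authenticated proof.
    let mut buf = [0u8; 16];
    buf[0..8].copy_from_slice(&100u64.to_le_bytes());
    buf[8..16].copy_from_slice(&7u64.to_le_bytes());

    let view = AccountView { buf: &buf };
    assert_eq!(view.balance(), 100);
    assert_eq!(view.nonce(), 7);
}
```

In practice a procedural macro could derive the offsets (and the backing keys) from the struct definition itself.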

Proof generation

On the other end of the spectrum, the proof generator should have access to the high-level constructs of the language in which the structures are defined (e.g. it should act as an actual object instead of an interface). This should make it simple to write generators which create the multi-proof that a transaction expects.

A valuable interface for these high level constructs is the following function:

fn to_map(&self) -> HashMap<K, V>;

Things get more complex once the high level structures are used inside of other structures that implement the interface.
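For the simple, non-nested case, a to_map implementation might look like the following sketch. The index scheme here is a hypothetical toy (base index times two, plus the field position), not real SSZ generalized indices:

```rust
use std::collections::HashMap;

/// A "stateful" version of an account that proof-generation tooling can
/// use directly. `to_map` flattens the fields into (index -> value)
/// pairs that a backend's `from_map` could consume.
struct Account {
    balance: u64,
    nonce: u64,
}

impl Account {
    /// `base_index` is assumed to identify this account's slot in the
    /// tree; each field gets a child index beneath it (toy scheme).
    fn to_map(&self, base_index: u64) -> HashMap<u64, u64> {
        let mut map = HashMap::new();
        map.insert(base_index * 2, self.balance);
        map.insert(base_index * 2 + 1, self.nonce);
        map
    }
}

fn main() {
    let acct = Account { balance: 500, nonce: 2 };
    let map = acct.to_map(4);
    assert_eq!(map.get(&8), Some(&500));
    assert_eq!(map.get(&9), Some(&2));
}
```

In Rust this flattening is the kind of thing a `#[derive(...)]` procedural macro could generate at compile time from the struct definition.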

s1na commented 5 years ago

I'm starting to think about how to manage state in the Eth1 EE, and a lot of what you wrote here would be relevant to that. It'd be great if we can find a "proof middleware" that could abstract the exact type of tree/proof format and allow for efficient authentication, data lookup, and state modification.

Just to make sure I've got the general idea, the goal is to have a library (let's say in rust) which EE and state provider developers would use for the tree/proof part of the logic. I'm mostly interested in the EE itself so I'll focus on that from now on.

This library has two APIs. It's clear what the low-level API is doing, but just to make sure I wanted to ask: these methods are not host functions, right?

I'm not sure I got exactly what goes in the high-level API?

lightclient commented 5 years ago

It's clear what the low-level API is doing, but just to make sure I wanted to ask these methods are not host functions, right?

Yep, you're totally correct.

I'm not sure I got exactly what goes in the high-level API?

I don't think I laid it out very clearly, so that's probably why it isn't making much sense. I would imagine two "versions" of the same type:

An interface type that has only a pointer to the beginning of an Account in memory and an interface of getter/setter functions. This way EEs can minimize their copying.

struct Account(*const u8);

impl Account {
  fn balance(&self) -> &U256;
  fn nonce(&self) -> &u64;
  fn signature(&self) -> &[u8; 96];
}

A "stateful" type which can be used by proof generation tools to mock out values and convert them to proofs.

struct Account {
  balance: U256,
  nonce: u64,
  signature: [u8; 96]
}

The interface I defined to convert from high-level objects to proofs is the to_map interface, which can then be consumed by the proof's from_map. This seems to work okay for homogeneous values, but not so well otherwise. There are lots of details to work out still, and I'm sure there are a lot of improvements that could be made. I know in Rust these things can be achieved at compile time via procedural macros, and I did some initial work over the summer that served as some inspiration for this.

lightclient commented 5 years ago

This is kind of the feel that I want (I tried to make it rusty, but I'm sure it wouldn't compile):

let mut engine = Engine::new();
let wallet1 = engine.deploy(WalletContract::new());
let wallet2 = engine.deploy(WalletContract::new());
let dai = engine.deploy(DaiContract::new());

DaiContract::set_balance(dai, wallet1.address(), 100);
let tx = wallet1.call(dai.address(), DaiContract::Send(wallet1.address(), wallet2.address(), 50));

let new_state = engine.execute(vec![tx]);

assert_eq!(DaiContract::from_state(new_state).balance(wallet2.address()), 50);

Some interfaces that would be needed for this to work:

trait Contract<K, V> {
  fn asm() -> Vec<u8>;
  fn to_map(&self) -> HashMap<K, V>;
  fn address(&self) -> Option<Address>;
}

trait Engine<K, V> {
  fn deploy(&mut self, c: Box<dyn Contract<K, V>>) -> &dyn Contract<K, V>;
  fn execute(&mut self, txs: Vec<Transaction>) -> State;
}

trait Proof<K, V> {
  fn to_map(&self) -> HashMap<K, V>;
}

trait ProofBackend<K, V> {
  fn from_map(&mut self, map: HashMap<K, V>);
}