paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

PVF determinism #990

Open pepyakin opened 4 years ago

pepyakin commented 4 years ago

As of the time of writing, polkadot validation function execution is performed using the same executor that is used for executing Substrate runtimes. The Substrate executor doesn't guarantee determinism of execution, or more precisely, it doesn't provide more than the base guarantees of wasm execution. If the chain-writer manages to exploit these effects then it's their fault.

We would like to eradicate these effects for the execution of PVF.

We should come up with a way to avoid that. (See this discussion for more details)

There are the following sources of non-determinism:

  1. Syntactic limits. Things like, the maximum size of a function body, the maximum number of imports, etc. This can be checked at the loading time.
  2. Binary limits. For example, the size of a module as a whole or one of its sections. This can be checked at the loading time.
  3. Validation limits. Validation of a wasm binary can be performed eagerly (i.e. at loading time) or lazily (i.e. a function is validated just before execution). So far, in our practice, we have employed eager validation.
  4. Execution. This consists of execution limits and permissible non-determinism of individual instructions.

Let's break down the point about execution in detail.

Execution limits are specified here. These limits are specified in abstract terms of the specification. Practically, they can be divided into two groups: memory-limited and stack-limited. The size of linear memory and tables can be trivially limited, even statically. The stack-limited parameters, at least in the general situation (e.g. without limiting recursion, etc.), require dynamic counting.

Some instructions are non-deterministic. This is a local non-determinism, meaning that these instructions have not a single outcome but rather a finite set of outcomes for the same arguments. Theoretically, a wasm VM implementation that chooses the result from the set of possible outcomes is still deemed a conforming implementation.

The following is the list of such instructions:

  1. memory.grow, i.e. trying to increase the size of the linear memory. This is easily mitigated by limiting the maximal size of the memory or by counting.
  2. invoking a host function. Host functions are not limited in what changes they can make to the wasm universe, as long as these changes are valid (e.g. host functions cannot change the typing of another function).
  3. numeric operations. AFAIK, this only concerns floating point numbers.
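Point 3 is the practical sore spot: the spec permits an operation like `0.0 / 0.0` to return any NaN bit pattern, so two conforming engines can disagree on the exact bits. A minimal sketch of the canonicalization idea (a hypothetical helper, not any engine's actual implementation - wasmtime has this as an opt-in mode):

```rust
// Two values can both be NaN yet carry different bit patterns; a
// conforming wasm engine may emit either. Canonicalization collapses
// every NaN to one fixed pattern, restoring bit-for-bit determinism.
fn canonicalize_f32(x: f32) -> f32 {
    // Quiet NaN with positive sign and zero payload (assumed canonical form).
    const CANONICAL_NAN_BITS: u32 = 0x7fc0_0000;
    if x.is_nan() {
        f32::from_bits(CANONICAL_NAN_BITS)
    } else {
        x
    }
}
```

Any engine that canonicalizes NaN results after each float operation will agree with any other engine doing the same, regardless of which NaN its hardware happened to produce.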
burdges commented 4 years ago

I noticed threads mentioned in https://github.com/WebAssembly/design/blob/master/Nondeterminism.md so I guess any threads we use should come from the host, so several runtimes' entry points? I suppose the runtime could make some fork host call, and then a second 'join' host call that merges any state changes according to the selected memory model?

burdges commented 4 years ago

https://github.com/paritytech/substrate/issues/1459#issuecomment-566045237

rphmeier commented 4 years ago

Revisiting this - I'm no Wasm expert. What can we do to address this?

burdges commented 4 years ago

I'm not overly worried about resource limits because if we've some block dispute caused by resource limits then we've a straightforward bug in resource allocation. We should declare the required resources in the block header so that allocations happen when we start running the block. If a reasonable allocation fails, then validators should trigger a system reboot. We could clean up resource usage first and/or restart the host process, but really we should just fix our own leaks, so that we can assume leaks were caused by other processes, and kill them all with a reboot.

Aside from resources, I think the first questions is: Are we merely avoiding floating point operations? Or did we disable them outright in the builds of wasmtime and wasmi used by substrate? I'd presume the latter.

After that, our next question is: Are we really more confident in wasmi being deterministic than in wasmtime being deterministic? What differences exist? We want wasmtime to agree with wasmi as much as possible, but if wasmi is more deterministic in some fundamental way then..

We discussed separate wasmi and wasmtime validity claims for backing validators. We cannot permit approval checks to start if backers could still retract their validity claim, so we've two plausible sequences for backing validator checks, either

  1. backing checkers run both wasmtime and wasmi in either order before claiming validity, or else
  2. we pipeline wasmi beside availability, like
    • backers run wasmtime validity first and eventually announce their wasmtime validity claim, which gets posted on-chain and triggers availability distribution,
    • backers continue running wasmi validity checks and eventually announce their wasmi validity claim, and
    • we post on-chain wasmi validity claims alongside availability claims, which only together trigger approval checkers to announce their assignments.

I think 1 and 2 sound acceptable in raw computation terms since we'd love 5+ times more secondary checkers, who then only run wasmtime. We pay this price in latency during the backing and/or availability phases however, which complicates cumulus pipelining blocks. We'll eventually need cumulus to do "local relay chain state simulation" for pipelining regardless, but that's not really an MVP feature.

As a more MVP approach, if we skip 1 and/or 2 and go with pure wasmtime everywhere, then we rely on a

  1. promise that governance will sort out any discrepancies and correct any divergence between wasmi and wasmtime.
pepyakin commented 4 years ago

ℹ️ TL;DR: Resources, specifically the stack limits, are more or less the only thing that worries me.

I will only address the wasm section since I don't feel confident enough to comment on the protocol construction part.

I'm not overly worried about resource limits

So here, I assume by resource limits you mean limits on memory and stack. Memory limits are indeed trivial - we can just set a maximum limit and call it a day. That comes from the fact that the implementation of a memory doesn't differ too much from its semantic representation, and every engine implements it more or less the same way: a linear memory is essentially a byte array whose size is a multiple of the wasm page size (i.e. 64k).
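To illustrate, here is a toy model (not any real engine's code) of a linear memory with a statically declared page cap; with such a cap in place, memory.grow fails at exactly the same point on every conforming engine:

```rust
// Wasm linear memory is a byte array grown in 64 KiB pages. With a
// declared maximum, the failure point of growing it is deterministic.
const PAGE_SIZE: usize = 64 * 1024;

struct LinearMemory {
    data: Vec<u8>,
    max_pages: usize, // statically declared maximum, e.g. from the module
}

impl LinearMemory {
    // Mirrors the semantics of `memory.grow`: returns the old size in
    // pages on success, or an error once the declared cap would be exceeded.
    fn grow(&mut self, delta_pages: usize) -> Result<usize, &'static str> {
        let old_pages = self.data.len() / PAGE_SIZE;
        if old_pages + delta_pages > self.max_pages {
            return Err("deterministic failure: declared memory limit reached");
        }
        self.data.resize((old_pages + delta_pages) * PAGE_SIZE, 0);
        Ok(old_pages)
    }
}
```

Every engine implements essentially this, which is why memory limits are the easy half of the problem.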

With stacks it is a tad bit more complicated.

ℹ️ TL;DR: Stacks in WebAssembly are not deterministic. There are no set limits. This problem doesn't have anything to do with leaks or other implementation pathologies.

In a nutshell, semantically, it is constructed as follows. There is a stack. A stack is a sequence of call frames. A call frame resembles something like this:

struct Frame {
  locals: [Value], // If a function declares local variables, they are stored here. Function parameters are counted as locals!
  labels: [Label], // some bookkeeping for control flow. E.g. an `if` block pushes here something.
                   // This is a result of wasm being a _structured stack machine_, i.e. a stack machine
                   // that has structured control-flow in compared to goto like mechanisms
  operand_stack: [Value], // wasm is semantically a stack machine. Those values are stored here.
}

(this is not precisely how it is described in the spec, but hopefully it will do for our discussion here).

Then, there is the call instruction defined in the wasm instruction set. Each time a wasm program executes a call to a function a new call frame is created. The frame's locals are zero-initialized according to the numbers and types declared in the callee function.

Then, there are execution limits in place, i.e. how many locals, operands or labels there can be at most across all the frames at a single moment in time. Furthermore, these limits are not specified numerically anywhere. There is no minimum or maximum. Only one thing is specified: there have to be some limits. These limits should be set so that reaching them doesn't bring down the whole system - it should be handled gracefully. There is even a test in the official testsuite that verifies that infinite recursion stops at some moment, otherwise the test doesn't pass.

All of this makes the call instruction non-deterministic, in the sense that the result of a call is either the successful creation of a new call frame or a trap due to reaching one of the limits.

Now, the implementation might differ substantially from the description presented above.

Typically, in an interpreter this stack is modeled using a vector on the heap, and the interpreter uses a constant amount of the native machine stack. Wasmi uses a u64 for each Value slot (even if the value is actually a u32). Also, wasmi doesn't actually model the label stack, since before execution it converts code from the structured-stack-machine representation to one that uses traditional goto-like mechanisms. OTOH, wasmi uses some more data: it caches some references to speed up some operations, it maintains a flag for whether a call frame is initialized, it maintains the return type, and so on.

In a compiler-based engine, all functions are compiled to machine code. The details depend on the exact compiler in question, but here are some examples.

In an optimizing compiler, labels will be eliminated, as in the case of wasmi. Then, locals and operand stacks will be compiled down to a single representation (typically, SSA) and there will be no difference between locals and operands - they all just become [SSA] values. During lowering to native code, for performance reasons and ABI concerns, some of the values will go into registers. If it is not possible to put a value into a register (or if the ABI requires otherwise), the value can be spilled into some preallocated buffer on the stack. To find out what goes where, one typically needs to see the results of the function's compilation. Therefore, there is a function that maps a function body to the size of stack consumed, and it is anything but easily predictable. Everything is further complicated by the facts that there can be function inlining, that host functions might be called on the same stack, that stack frames must typically be aligned to a 16-byte boundary, and so on. It can get bizarre, in the sense that a call frame can take no space whatsoever.

A linear single-pass compiler might be similar, but lacking the luxury of analysis, it typically lays out the locals and operands greedily and naively on the native machine stack. There might be some quirks where values must be flushed to the stack.

Note that in the case where a compiler is used, the most straightforward and performant way to enforce the limits is to just check how many bytes of the native machine stack have been consumed.

In summary, in practical implementations, the wasm stack will be implemented very differently, and thus can have different and irreconcilable limits. The size of a call frame of a wasm function is a function (no pun intended, a mathematical function) not only of the number of locals, parameters and operand stack slots, but also of other unknown parameters depending on the particular engine (we will touch on this again in the mitigation section below). To make it worse, the stack space budget is not a single number (i.e. interpreters can count operands and locals, compilers don't and use native stack bytes, and nobody cares about labels although they could).

Next up, let's see why we care.

How could that be exploited?

Assume we have a module like the following:

(module
    (func $evil (param $x i32)
        (local i32 i32 i32 i32 i32)
        (if
          ;; If $x == 0
          (i32.eqz (local.get $x))
          (then (return))
          (else
            ;; else: evil($x - 1)
            (call $evil
              (i32.sub
                (local.get $x)
                (i32.const 1)
              )
            )
          )
        )
    )
)

In essence, this module just recurses x times, where x is specified as an argument. Therefore, call $evil could potentially fail at a different x on different engines.

A word about different engines

Now, you (@burdges) compared wasmi vs wasmtime. I think this is a false dichotomy, atm, since both engines can change (and did change in the past) in ways that make them give different results with respect to each other.

In other words, technically, wasmtime 0.228 could fail at a different x in the example above than, say, wasmtime 0.229. This happens because ensuring consistent results is hard and potentially harms performance, therefore nobody gives a hoot about that.

Even wasmi had changes in that respect in the past. (And this is why we were talking about establishing a gold standard of execution).

In other words, various implementations form a set of possible behaviors which also grows over time. We also might want to migrate to other engines in the future.

Mitigations for the stack issue

As I already mentioned in the issue description, in the general case, we might be required to leverage dynamic counting to overcome this problem.

Now, by dynamic counting I mean dynamically counting the semantic size of the stack. I.e. we introduce a math function that returns the size of a call frame in some abstract units, which includes the number of locals, parameters and probably other data, such as the return address. Then we maintain a global counter which is local to the execution thread. If that counter exceeds some value, we interrupt execution. In the prologue of each function we increment this counter according to the semantic stack frame size of that function, and in the epilogue we decrement it.

Note that we do this even if the actual compiled function doesn't consume any resources of the underlying implementation (say, all locals and operands fit into registers and no frame was opened).

The main objective here is to make sure that the engine defined trap is never reached. Our dynamic counter should kick in before.

As far as performance goes, it depends on the implementation, but theoretically it is possible to implement this very efficiently: there is one read/write in the prologue and epilogue to a hot location, and a cold branch that should almost always be correctly predicted. Then, we might be able to elide the counting for the static parts of the call graph (haven't dug into this though).
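A minimal sketch of the scheme, with made-up values for the limit and per-frame cost (a real deployment would derive these from the engines' actual limits); the `recurse` function stands in for an instrumented version of the `$evil` module above:

```rust
use std::cell::Cell;

// Assumed limit in abstract units; hypothetical, chosen for illustration.
const STACK_LIMIT: u32 = 65_536;

thread_local! {
    // The global counter, local to the execution thread.
    static STACK_COUNTER: Cell<u32> = Cell::new(0);
}

// Instrumented prologue: charge the semantic frame size
// (locals + params + fixed overhead) against the counter.
fn enter_frame(frame_cost: u32) -> Result<(), &'static str> {
    STACK_COUNTER.with(|c| {
        let new = c.get().saturating_add(frame_cost);
        if new > STACK_LIMIT {
            return Err("deterministic trap: semantic stack exhausted");
        }
        c.set(new);
        Ok(())
    })
}

// Instrumented epilogue: release the frame's budget.
fn leave_frame(frame_cost: u32) {
    STACK_COUNTER.with(|c| c.set(c.get() - frame_cost));
}

// With a fixed cost per frame, this traps at the same recursion depth
// on any host, regardless of how the engine lays out native frames.
fn recurse(depth: u32) -> Result<(), &'static str> {
    const FRAME_COST: u32 = 64; // hypothetical per-frame semantic size
    enter_frame(FRAME_COST)?;
    let result = if depth == 0 { Ok(()) } else { recurse(depth - 1) };
    leave_frame(FRAME_COST);
    result
}
```

The key property is that the counter traps before any engine-defined native limit can be reached, so the engine's own (non-deterministic) limit becomes unobservable.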

Since contracts require this, we already have machinery that performs this kind of instrumentation.

Other areas

As I mentioned, there are other operations that are potentially non-deterministic. But they are way simpler to handle.

I think that should cover broadly all the cases of wasm non-determinism.

pepyakin commented 4 years ago

Ah, yeah, and also the point about "determinism differences". If it is possible to compare whether one implementation is "more deterministic" than another, then I prefer to avoid that. This is because a non-deterministic operation, in the wasm sense, is an operation that for the same inputs (the input operands, parameters, the state of memory, stack, etc) returns not a single result value but one value picked from a set of possible results. Examples:

  1. the call instruction can either perform the call or trap due to stack resource exhaustion,
  2. memory.grow can either succeed or signal an out-of-memory condition,
  3. module loading can return either success or error depending on whether the module being loaded exceeds some implementation defined limits or not.

Therefore, I think it'd be better to compare the differences in behavior between wasmi and wasmtime.

  1. First, they have different serialization/deserialization code, which implies different syntactic and binary limits. They both perform eager validation though.
  2. In both, the determinism of host function behavior depends on the host functions themselves.
  3. I am not sure about NaN handling in wasmi. In wasmtime there is an optional canonicalization mode.
  4. In both, memory grow / allocation actually fails gracefully if the system refuses to allocate more memory. However, both are written in Rust, which generally is not friendly to out-of-memory conditions.

And, finally, the stack limits are different - which makes a function call behave differently in the two engines when the boundary conditions of either engine are reached.

burdges commented 4 years ago

Yes, I suppose wasmi/wasmtime version upgrades might pose the most real world threat, interesting thanks! :)

I'd think the relay chain could dictate the required wasmtime versions, so backing checkers run all required versions, while approval checkers run only what they like, and correct nodes stop their validator candidacy if they lack a required wasmtime version. It's cheaper to do two or three wasmtime runs than one wasmi run, excluding the compilation time.

Yes, instrumenting the stack sounds handy there. If that performance hit matters, then you can run the explicit counter only for backing validators, and derive from that an overestimate that ensures we cannot exhaust the stack. It's fixable if we botch that overestimate sometimes. :)

I'm dubious that NaNs are the only problem with floats, but I dunno how tightly wasm specifies them, so maybe..

Thanks!

rphmeier commented 4 years ago

cc @arkpar you may find this one interesting as well

arkpar commented 4 years ago

Invoking host functions is viewed as non-deterministic from the PoV of the WebAssembly spec, but we can ensure that only deterministic functions are available to the STVF.

Are there any host functions that could potentially be non-deterministic? Storage access can't cause IO errors and would only panic on a missing trie node, which is deterministic. Since there's no IO, the only source of non-determinism that I can think of is the host running out of stack or heap memory. Do we have host functions that allocate or could be invoked recursively?

pepyakin commented 4 years ago

Since we are the embedders, it is only up to us which host functions to define, and we can restrict the set to deterministic ones only (which we already do).

I think memory is not such a big concern either IMO. WebAssembly memories can be limited in size statically (i.e. in the declaration) and they are capped at an absolute maximum of 4GB. Allocations up to that size should be fine on a properly configured x86-64 machine, even if there is not enough physical memory available. If that happens though, it would be suboptimal, so we probably need to note under "system requirements" that the operator needs a certain amount of RAM. But still, it should work.

So that leaves us with only the stack.

Do we have host functions that allocate or could be invoked recursively?

Yeah, as I just mentioned I don't think that allocations are a concern.

There are functions that can lead to recursion: specifically, those from the sandboxing module. As a recap: sandboxing is a set of APIs similar to those provided by a regular WebAssembly VM, so they can be used to spawn wasm VMs while already executing within a wasm VM. Since a wasm module needs host functions to interact with the embedder, that API allows the runtime/STVF (we refer to it as a supervisor) to expose some of its functions to the instantiated nested wasm module (a guest). We use that for the implementation of contracts (but perhaps there are other use-cases as well), where the runtime/STVF controls the guest contracts.

Taking contracts as an example, recursion can arise when a contract calls another contract. Specifically something like this:

  1. the supervisor creates a guest, exposing some host functions to it
  2. then the supervisor calls into the guest
  3. the guest, in the course of execution, decides to call a host function (e.g. a contract wants to call another contract)
  4. the host function is called and control is transferred to the supervisor (i.e. the STVF/runtime)
  5. then the supervisor can create another guest wasm module and call into it (e.g. pass control to the called contract).
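A toy model (not the real sandboxing API; all names and the limit are hypothetical) of the recursion in steps 1-5: each contract-to-contract call re-enters the supervisor and pushes more native stack, so an explicit nesting cap makes the failure point deterministic rather than dependent on the host's stack size:

```rust
// Hypothetical nesting cap, standing in for whatever limit the
// supervisor would actually enforce on contract call depth.
const MAX_NESTING: u32 = 32;

// Step 1-2: the supervisor instantiates a guest and calls into it.
fn call_contract(depth: u32) -> Result<u32, &'static str> {
    if depth > MAX_NESTING {
        return Err("deterministic failure: contract nesting limit exceeded");
    }
    run_guest(depth)
}

// Steps 3-5: the guest may invoke a host function that re-enters the
// supervisor, which instantiates and calls into another guest.
fn run_guest(depth: u32) -> Result<u32, &'static str> {
    if depth < MAX_NESTING {
        call_contract(depth + 1) // host call transfers control back up
    } else {
        Ok(depth) // deepest guest returns without calling out
    }
}
```

Without such a cap, the point at which the supervisor/guest tower blows the native stack would differ between hosts and engine versions, which is exactly the non-determinism discussed above.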
arkpar commented 4 years ago

I think memory is not such a big concern either IMO. WebAssembly memories can be limited in size statically (i.e. in the declaration) and they are capped at an absolute maximum of 4GB.

I was talking about host functions and host memory. Something like batch signature verification allocates queues on the host heap. It also potentially creates system threads, which is another source of non-determinism.

Sandboxing may potentially exhaust system memory as well. To be fair, we already have a non-deterministic execution timeout. I.e. candidate validity depends on the CPU resources of the validator. Perhaps memory could be treated in a similar fashion? As far as I understand, the important bit here is to not allow crafting a candidate that would cause a 50% split on the validity opinion. @burdges @rphmeier would the network be in trouble if someone manages to craft a candidate that causes an execution timeout on 50% of the network, but passes validation on the rest?

pepyakin commented 4 years ago

Not to comment on the second part, but I still think that memory is not the biggest concern for us.

Yeah, I think what I said still applies to the memory allocated by the host. The fact that it creates threads also doesn't change much, as long as the results are deterministic. However, we need to keep in mind that threads might consume other kinds of resources. But I believe that we can impose limits on the STVF so that it would be impossible to exceed the system requirements we specify.

To be more specific here is an example.

Imagine that we generously limit the STVF's linear memory to 4GB. Then, we say that each sandboxed instance can consume up to 128MB and there can be only 32 instances at the same time. This gives us another 4GB. Remember, this is virtual memory.

Then, I assume that most of the host functions release the buffers they use after they return. Upon inspection we can compute an upper bound, but let's be generous and say that those cases contribute another 1GB to the peak memory usage.

Then there are functions that maintain some state on the host side, such as the signature verification you mentioned and the already-mentioned sandboxing functionality. You can have some reasonable number of signature verifications in flight, idk, I'd pick 1024. The consumption coming from the sandboxing module's bookkeeping should be negligible. But let's say it's in total another 1GB.

Summing this up gives us merely 10GB with these over-estimated numbers, a good chunk of which typically isn't backed by physical memory. That number doesn't look too bad in the modern world and certainly can be served on most systems even without any special measures. We can of course tweak these numbers, and say that e.g. the STVF can only consume up to 1GB. Even if the machine doesn't possess the amount of physical memory we state in the system requirements for a validator node, it can still execute all the logic (unless swap is disabled), and even has a chance not to time out.
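The back-of-the-envelope budget above can be sanity-checked directly; all numbers are this comment's own generous over-estimates, expressed in GB:

```rust
// Peak virtual memory budget for one STVF execution, per the estimates
// in the discussion above. Every figure here is an assumed upper bound,
// not a measured or enforced value.
fn peak_memory_budget_gb() -> u64 {
    let stvf_linear_memory = 4;                // STVF linear memory capped at 4GB
    let sandboxed_instances = 32 * 128 / 1024; // 32 instances x 128MB = 4GB
    let transient_host_buffers = 1;            // buffers freed when host calls return
    let persistent_host_state = 1;             // e.g. queued signature verifications
    stvf_linear_memory + sandboxed_instances + transient_host_buffers + persistent_host_state
}
```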

With respect to threads, I believe similar logic can be applied. We can constrain the amount of work/threads that a STVF can request on one side, and on the other side put specific numbers in the system requirements for a validator node.

burdges commented 4 years ago

We cannot consider timeouts to be slashable, but afaik we have not carefully thought through the related issues:

Imagine one validator declares that some candidate timed out. Is there any scenario under which we'll approve this block? Yes, we should approve if that validator just sucks, is being a jerk, etc.

Could we have two separate timeouts with timeout_long = 8ish * timeout_short? All validators must run blocks for timeout_long before aborting, but they report four states:

We accept a simple majority vote for approved vs. rejected among approval and backing validators if all report valid, with no penalties. Any invalid vote always escalates to all validators checking.

If all checkers report either valid but rejected or aborted at timeout_long, then we reject the block and likely slash the parachain's bond. We could avoid this case by requiring that all backing checkers report valid and approved, so if you've a slow backing checker then your parachain halts while assigned to them.

If some validator reports approved and some validator reports aborted at timeout_long, then our lives become complex: we could ask the slow validator to run the block even longer, or we could treat the slow validator like a no-show and replace them, which almost sounds unavoidable honestly.

We'll eventually want no-shows to report their current state, ala "slow availability" or "slow candidate", see https://github.com/paritytech/polkadot/pull/1518/files. I want to avoid asking for this complexity in a Polkadot MVP though, but..

I've asked for the no-show scheme even in an MVP to address the threat of denial of service attacks on validators. We'd select its parameters in a way that sounds suboptimal for slow candidates.

We should probably make a separate issue for discussing the timeout logic..

rphmeier commented 4 years ago

I don't think we should ever be slashing the parachain's deposit. The other concern I have is that long-running candidates might occupy all of our validation resources, so a shorter timeout (< blocktime) would be better during backing. During approval the longer timeout makes more sense.

burdges commented 4 years ago

I've moved timeouts discussions to https://github.com/paritytech/polkadot/issues/1628 so maybe this issue could refocus on determinism.