pepyakin opened 4 years ago
I noticed threads mentioned in https://github.com/WebAssembly/design/blob/master/Nondeterminism.md so I guess any threads we use should come from the host, i.e. several entry points into the runtime? I suppose the runtime could make some 'fork' host call, and then a second 'join' host call that merges any state changes according to the selected memory model?
Revisiting this - I'm no Wasm expert. What can we do to address this?
I'm not overly worried about resource limits because if we've some block dispute caused by resource limits then we've a straightforward bug in resource allocation. We should declare the required resources in the block header so that allocations happen when we start running the block. If a reasonable allocation fails, then validators should trigger a system reboot. We could clean up resource usage first and/or restart the host process, but really we should just fix our own leaks, so that we can assume leaks were caused by other processes, and kill them all with a reboot.
Aside from resources, I think the first question is: Are we merely avoiding floating point operations? Or did we disable them outright in the builds of wasmtime and wasmi used by substrate? I'd presume the latter.
After that, our next question is: Are we really more confident in wasmi being deterministic than in wasmtime being deterministic? What differences exist? We want wasmtime to agree with wasmi as much as possible, but if wasmi is more deterministic in some fundamental way then..
We discussed separate wasmi and wasmtime validity claims for backing validators. We cannot permit approval checks to start if backers could still retract their validity claim, so we've two plausible sequences for backing validator checks, either
I think 1 and 2 sound acceptable in raw computation terms since we'd love 5+ times more secondary checkers, who then only run wasmtime. We pay this price in latency during the backing and/or availability phases however, which complicates cumulus pipelining blocks. We'll eventually need cumulus to do "local relay chain state simulation" for pipelining regardless, but that's not really an MVP feature.
As a more MVP approach, if we skip 1 and/or 2 and go with pure wasmtime everywhere, then we promise more work under
ℹ️ TL;DR: Resources, specifically the stack limits, are more or less the only thing that worries me.
I will only address the wasm section since I don't feel confident enough to comment on the protocol construction part.
> I'm not overly worried about resource limits
So here, I assume by resource limits you mean limits on memory and stack. Memory limits are indeed trivial - we can just set a maximum limit and call it a day. That comes from the fact that the implementation of a memory doesn't differ too much from its semantic representation, and every engine implements it more or less the same: a linear memory is essentially a byte array whose size is a multiple of the wasm page size (i.e. 64k).
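For illustration, here is a minimal sketch of how such a cap can be enforced with a recent wasmtime embedding API (the `HostState` wrapper, the function name, and the exact cap are assumptions for the example, not anything from this thread):

```rust
use wasmtime::{Engine, Store, StoreLimits, StoreLimitsBuilder};

// Store data holding the resource limits for this execution.
struct HostState {
    limits: StoreLimits,
}

// Create a store whose linear memories may never exceed `max_bytes`.
fn store_with_memory_cap(engine: &Engine, max_bytes: usize) -> Store<HostState> {
    let limits = StoreLimitsBuilder::new()
        .memory_size(max_bytes) // hard cap on every linear memory in this store
        .build();
    let mut store = Store::new(engine, HostState { limits });
    store.limiter(|state| &mut state.limits);
    store
}
```

The same cap can equally be imposed statically, by declaring a maximum size on the memory itself.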
ℹ️ TL;DR: Stacks in WebAssembly are not deterministic. There are no set limits. This problem doesn't have anything to do with leaks or other implementation pathologies.
In a nutshell, semantically, it is constructed like the following. There is a stack. A stack is a sequence of call frames. A call frame resembles something like this:
```rust
struct Frame {
    locals: [Value],        // If a function declares local variables, they are stored here.
                            // Function parameters are counted as locals!
    labels: [Label],        // Some bookkeeping for control flow, e.g. an `if` block pushes
                            // something here. This is a result of wasm being a
                            // _structured stack machine_, i.e. a stack machine that has
                            // structured control flow as opposed to goto-like mechanisms.
    operand_stack: [Value], // Wasm is semantically a stack machine; those values are stored here.
}
```
(this is not precisely how it is described in the spec, but hopefully it will do for our discussion here).
Then, there is the `call` instruction defined in the wasm instruction set. Each time a wasm program executes a call to a function, a new call frame is created. The frame's locals are zero-initialized according to the numbers and types declared in the callee function.
Then, there are execution limits in place, i.e. how many locals, operands or labels there can be at most across all the frames at a single moment in time. However, these limits are not specified numerically anywhere: there is no minimum or maximum. Only one thing is specified: there have to be some limits, and reaching them should be handled gracefully rather than bringing down the whole system. There is even a test in the official testsuite that verifies that infinite recursion traps at some point.
All of this makes the `call` instruction non-deterministic, in the sense that the result of a call is either the successful creation of a new call frame or a trap due to reaching one of these limits.
Now, the implementation might differ substantially from the description presented above.
Typically, in an interpreter this stack is modeled using a vector on the heap, and the interpreter itself uses a constant amount of the native machine stack. Wasmi uses a u64 for each `Value` slot (even if the value is actually a u32). Also, wasmi doesn't actually model the label stack, since before execution it converts the code from the structured-stack-machine representation to one that uses traditional goto-like mechanisms. On the other hand, wasmi stores some extra data: it caches some references to speed up some operations, it maintains a flag indicating whether a call frame is initialized, it maintains the return type, and so on.
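To make the uniform-slot representation concrete, a tiny sketch (the type and method names are invented for illustration):

```rust
// Sketch of a wasmi-style uniform value slot: every local/operand occupies
// one u64 cell regardless of its declared wasm type.
#[derive(Copy, Clone)]
struct ValueSlot(u64);

impl ValueSlot {
    fn from_i32(v: i32) -> Self {
        ValueSlot(v as u32 as u64) // an i32 still consumes a full u64 slot
    }
    fn as_i32(self) -> i32 {
        self.0 as u32 as i32
    }
}
```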
In a compiler-based engine, all functions are compiled to machine code. The details depend on the exact compiler in question, but here are some examples.
In an optimizing compiler, labels will be eliminated, as in the case of wasmi. Then locals and operand stacks will be compiled down to a single representation (typically SSA), and there will be no difference between locals and operands - they all just become [SSA] values. During lowering to native code, for performance reasons and ABI concerns, some of the values will go into registers. If it is not possible to put a value into a register (or if the ABI requires otherwise), the value can be spilled into some preallocated buffer on the stack. To find out what goes where, one typically needs to see the results of compiling the function. So there is a function mapping a function body to the size of stack consumed, and it is anything but easily predictable. Everything is further complicated by the fact that there could be function inlining, that host functions might be called on the same stack, that stack frames must typically be aligned to a 16-byte boundary, and so on. It can get bizarre, in the sense that a call frame can take no space whatsoever.
In a linear single-pass compiler the picture might be similar, but lacking the luxury of analysis, such compilers typically lay out the locals and operands greedily and naively on the native machine stack. There might be some quirks where values must be flushed to the stack.
Note that when a compiler is used, the most straightforward and performant way to enforce the limits is to simply check how many bytes of the native machine stack have been consumed.
In summary, practical implementations implement the wasm stack very differently, and thus can have different and irreconcilable limits. The size of a call frame of a wasm function is a function (no pun intended, a mathematical function) not only of the number of locals, parameters and operand stack slots, but also of other unknown parameters depending on the particular engine (we will touch on this again later, in the mitigation section). To make it worse, the stack space budget is not a single number (interpreters can count operands and locals, compilers don't and use native stack bytes instead, and nobody counts labels although they could).
Next up, let's see why we care.
Assume we have a module like the following:
```wasm
(module
  (func $evil (param $x i32)
    (local i32 i32 i32 i32 i32)
    (if
      ;; if $x == 0
      (i32.eqz (local.get $x))
      (then (return))
      (else
        ;; else: evil($x - 1)
        (call $evil
          (i32.sub
            (local.get $x)
            (i32.const 1)
          )
        )
      )
    )
  )
)
```
In essence, this module just recurses the `x` number of times specified as an argument. Therefore, `call $evil` could potentially fail at a different `x` on different engines.
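To see the divergence concretely, one could probe the depth at which a given engine traps. A hedged sketch using the wasmtime Rust API (the export, the search bounds, and the probing approach are my own assumptions; the module body is the `$evil` function from above):

```rust
use wasmtime::{Engine, Instance, Module, Store};

// Probe the largest recursion depth that completes without trapping.
fn max_depth() -> anyhow::Result<i32> {
    let engine = Engine::default();
    let module = Module::new(
        &engine,
        r#"(module
             (func $evil (export "evil") (param $x i32)
               (local i32 i32 i32 i32 i32)
               (if (i32.eqz (local.get $x))
                 (then (return))
                 (else (call $evil (i32.sub (local.get $x) (i32.const 1)))))))"#,
    )?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let evil = instance.get_typed_func::<i32, ()>(&mut store, "evil")?;

    // Binary search: Ok(()) means the call completed, Err(_) means it trapped
    // with stack exhaustion somewhere at or below `mid` frames.
    let (mut lo, mut hi) = (0, 10_000_000);
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2;
        match evil.call(&mut store, mid) {
            Ok(()) => lo = mid,
            Err(_) => hi = mid - 1,
        }
    }
    Ok(lo)
}
```

Running something like this across engines, or across two versions of the same engine, is exactly where the results start to diverge.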
Now, you (@burdges) compared wasmi vs wasmtime. I think this is a false dichotomy, atm, since both engines could be changed (and did change in the past) in such a way that they would give different results with respect to each other. In other words, technically, `wasmtime 0.228` could fail at a different `x` in the example above than, say, `wasmtime 0.229`. This happens because ensuring stable results across versions is hard and potentially harms performance, therefore nobody gives a hoot about that.
Even wasmi had changes in that respect in the past. (And this is why we were talking about establishing a gold standard of execution).
In other words, various implementations form a set of possible behaviors which also grows over time. We also might want to migrate to other engines in the future.
As I already mentioned in the issue description, in the general case we might be required to leverage dynamic counting to overcome this problem.
Now, by dynamic counting I mean dynamically counting the semantic size of the stack. I.e. we introduce a function that returns the size of a call frame in some abstract units, accounting for the number of locals, parameters and probably other data, such as the return address. Then we maintain a global counter which is local to the execution thread. If that counter exceeds some value, we interrupt execution. In the prologue of each function we increment this counter according to the semantic stack frame size of that function; in the epilogue we decrement it.
Note that we do this even if the actual compiled function doesn't consume any resources of the underlying implementation (say, all locals and operands fit into registers and no frame was opened).
The main objective here is to make sure that the engine-defined trap is never reached; our dynamic counter should kick in first.
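A minimal sketch of this scheme, written as plain Rust for readability; in a real implementation the prologue/epilogue would be injected into the wasm module itself, and the cost formula and limit below are assumed placeholders:

```rust
use std::cell::Cell;

// Assumed budget for the whole wasm stack, in abstract units.
const STACK_LIMIT: u32 = 65_536;

thread_local! {
    // Per-execution-thread counter of the semantic stack size.
    static STACK_HEIGHT: Cell<u32> = Cell::new(0);
}

/// The trap raised when the semantic stack budget is exhausted.
#[derive(Debug)]
struct StackExhausted;

/// Semantic cost of one call frame: locals + parameters plus a fixed
/// per-frame overhead standing in for the return address and bookkeeping.
fn frame_cost(num_locals: u32, num_params: u32) -> u32 {
    num_locals + num_params + 1
}

/// Injected prologue: charge the frame, trap if the budget is exceeded.
fn enter_frame(cost: u32) -> Result<(), StackExhausted> {
    STACK_HEIGHT.with(|h| {
        let new = h.get().saturating_add(cost);
        if new > STACK_LIMIT {
            return Err(StackExhausted);
        }
        h.set(new);
        Ok(())
    })
}

/// Injected epilogue: refund the frame.
fn exit_frame(cost: u32) {
    STACK_HEIGHT.with(|h| h.set(h.get() - cost));
}
```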
As far as performance goes, it depends on the implementation, but theoretically it is possible to implement this very efficiently: there is one read/write in the prologue and epilogue to a hot location, plus a cold branch that should almost always be correctly predicted. Then, we might be able to elide the counting for the static parts of the call graph (haven't dug into this though).
Since contracts require this, we already have machinery that relies on this kind of instrumentation.
As I mentioned, there are other operations that are potentially non-deterministic, but they are way simpler to handle.
- `wasmparser` (employed in wasmtime) actually ensures them, but I think that should be something easy to change. We can ensure that the wasm blob is compatible with every engine we use even before that engine kicks in.
- `memory.grow`: we assume that it always succeeds up to the limit set by the polkadot node / the STVF's wasm blob.
- … (in `wasmtime`, AFAIK).

I think that should cover broadly all the cases of wasm non-determinism.
Ah, yeah, and also the point about "determinism differences". Even if it is possible to compare whether one implementation is "more deterministic" than another, I'd prefer to avoid that. This is because a non-deterministic operation, in the wasm sense, is an operation that for the same input arguments (the input operands, parameters, the state of memory, stack, etc) returns not a single result value but one return value picked from a set of possible results. Examples:

- the `call` instruction can either perform the call or trap due to stack resource exhaustion,
- `memory.grow` can either succeed or signal an out-of-memory condition.

Therefore, I think it'd be better to compare what the differences in behavior between wasmi and wasmtime are.
And, finally, the stack limits are different, which makes a function call behave differently in the two engines when the boundary conditions of either engine are reached.
Yes, I suppose wasmi/wasmtime version upgrades might pose the most real-world threat, interesting, thanks! :)
I'd think the relay chain could dictate the required wasmtime versions, so backing checkers run all required versions, while approval checkers run only what they like, and correct nodes stop their validator candidacy if they lack a required wasmtime version. It's cheaper to do two or three wasmtime runs than one wasmi run, excluding the compilation time.
Yes, instrumenting the stack sounds handy there. If that performance hit matters, then you can run the explicit counter only for backing validators, and derive from that an overestimate that ensures we cannot exhaust the stack. It's fixable if we botch that overestimate sometimes. :)
I'm dubious that NaNs are the only problem with floats, but I dunno how tightly wasm specifies them, so maybe..
Thanks!
cc @arkpar you may find this one interesting as well
> invoking host functions are viewed as non deterministic from PoV of the WebAssembly spec, but we can ensure that only deterministic functions are available to the STVF.
Are there any host functions that could potentially be non-deterministic? Storage access can't cause IO errors and would only panic on a missing trie node, which is deterministic. Since there's no IO, the only source of non-determinism that I can think of is the host running out of stack or heap memory. Do we have host functions that allocate or could be invoked recursively?
Since we are the embedders it is only up to us which host functions to define and we can restrict the set only to deterministic ones (which we already do).
I think memory is not such a big concern either. WebAssembly memories can be limited in size statically (i.e. in the declaration) and they are capped at an absolute maximum of 4GB. Allocations up to that size should be fine on a properly configured x86-64 machine, even if there is not enough physical memory available. If that happens, though, it would be suboptimal, so we probably need to note under "system requirements" that the operator needs a certain amount of RAM. But still, it should work.
So that leaves us with only the stack.
> Do we have host functions that allocate or could be invoked recursively?
Yeah, as I just mentioned I don't think that allocations are a concern.
There are functions that can lead to recursion: specifically those from the sandboxing module. As a recap: sandboxing is a set of APIs similar to those provided by a regular WebAssembly VM, so they can be used to spawn wasm VMs while already executing within a wasm VM. Since a wasm module needs host functions to interact with the embedder, that API allows the runtime/STVF (we refer to it as the supervisor) to expose some of its functions to the instantiated nested wasm module (a guest). We use that for the implementation of contracts (but perhaps there are other use-cases as well), where the runtime/STVF controls the guest contracts.
Taking contracts as an example, recursion can arise when a contract calls another contract. Specifically, something like this: contract A, running in a guest VM, invokes a host function to call contract B; the supervisor then instantiates a nested guest VM for B while A's call is still on the stack, and B may in turn call C, and so on.
> I think memory is not such a big concern either IMO. WebAssembly memories can be limited in size statically (i.e. in declaration) and they are capped with the absolute maximum of 4GB.
I was talking about host functions and host memory. Something like batch signature verification allocates queues on the host heap. It also potentially creates system threads, which is another source of non-determinism.
Sandboxing may potentially exhaust system memory as well. To be fair, we already have a non-deterministic execution timeout, i.e. candidate validity depends on the CPU resources of the validator. Perhaps memory could be treated in a similar fashion? As far as I understand, the important bit here is to not allow crafting a candidate that would cause a 50% split in the validity opinion. @burdges @rphmeier would the network be in trouble if someone manages to craft a candidate that causes an execution timeout on 50% of the network, but passes validation on the rest?
Not to comment on the second part, but I still think that memory is not the biggest concern for us.
Yeah, I think what I said still applies to the memory allocated by the host. The fact that it creates threads also doesn't change much as long as the results are deterministic. However, we need to keep in mind that threads might consume other kinds of resources. But I believe that we can impose limits on the STVF so that it would be impossible to exceed the system requirements we specify.
To be more specific here is an example.
Imagine that we generously limit the STVF's linear memory to 4GB. Then we say that each sandboxed instance can consume up to 128MB and there can be only 32 instances at a time. This gives us another 4GB. Remember, this is virtual memory.
Then, I assume that most of the host functions release the buffers they use after they return. Upon inspection we could compute an upper bound, but let's be generous and say that those cases contribute another 1GB to the peak memory usage.
Then there are functions that maintain some state on the host side, such as the signature verification you mentioned and the already-mentioned sandboxing functionality. You can have some number of signature verifications in flight - idk the reasonable value, but I'd pick 1024. The consumption coming from the sandboxing module's bookkeeping should be negligible. But let's say it's another 1GB in total.
Summing this up gives us merely 10GB with these over-estimated numbers, a good chunk of which typically isn't backed by physical memory. That number doesn't look too bad in the modern world and can certainly be served on most systems even without any special measures. We can of course tweak these numbers and say that e.g. the STVF can only consume up to 1GB. Even if the machine doesn't possess the amount of physical memory we state in the system requirements for the validator node, it can still execute all the logic unless swap is disabled, and even has a chance to not time out.
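For reference, the same back-of-the-envelope budget in one place (every figure is the assumed over-estimate from this comment):

```rust
// Virtual-memory budget sketch; every figure is an assumed over-estimate.
const STVF_LINEAR_MEMORY: u64 = 4 << 30;      // 4GB: generous wasm32 cap
const SANDBOX_INSTANCES: u64 = 32;            // concurrent guest VMs
const SANDBOX_MEMORY_EACH: u64 = 128 << 20;   // 128MB per guest instance
const TRANSIENT_HOST_BUFFERS: u64 = 1 << 30;  // 1GB: buffers freed on return
const PERSISTENT_HOST_STATE: u64 = 1 << 30;   // 1GB: e.g. in-flight sig checks

const TOTAL: u64 = STVF_LINEAR_MEMORY
    + SANDBOX_INSTANCES * SANDBOX_MEMORY_EACH // another 4GB
    + TRANSIENT_HOST_BUFFERS
    + PERSISTENT_HOST_STATE;                  // = 10GB in total
```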
With respect to threads, I believe similar logic can be applied. We can constrain the amount of work/threads that an STVF can request on one side, and on the other put specific requirements in the system requirements for a validator node.
We cannot consider timeouts to be slashable, but afaik we have not carefully thought through the related issues:
Imagine one validator declares that some candidate timed out. Is there any scenario under which we'll approve this block? Yes, we should approve if that validator just sucks, is being a jerk, etc.
Could we have two separate timeouts with timeout_long = 8ish * timeout_short? All validators must run blocks for timeout_long before aborting, but they report four states:

- invalid,
- valid and approved (finished within timeout_short),
- valid but rejected (finished only after timeout_short), and
- aborted at timeout_long.
We accept a simple majority vote for approved vs. rejected among approval and backing validators if all report valid, with no penalties. Any invalid vote always escalates to all validators checking.
If all checkers report either valid but rejected or aborted at timeout_long, then we reject the block and likely slash the parachain's bond. We could avoid this case by requiring that all backing checkers report valid and approved, so if you've a slow backing checker then your parachain halts while assigned to them.
If some validator reports approved and some validator reports aborted at timeout_long then our lives become complex: We could ask the slow validator to run the block even longer. We could treat the slow validator like a no show and replace them, which almost sounds unavoidable honestly.
We'll eventually want no-shows to report their current state, ala "slow availability" or "slow candidate"; see https://github.com/paritytech/polkadot/pull/1518/files. I want to avoid asking for this complexity in a Polkadot MVP though, but..
I've asked for the no-show scheme even in an MVP to address the threat of denial-of-service attacks on validators. We'd select its parameters in a way that sounds suboptimal for slow candidates.
We should make a separate issue for discussing timeout logic probably..
I don't think we should ever be slashing the parachain's deposit. The other concern I have is that long-running candidates might occupy all of our validation resources, so a shorter timeout (< blocktime) would be better during backing. During approval the longer timeout makes more sense.
I've moved timeouts discussions to https://github.com/paritytech/polkadot/issues/1628 so maybe this issue could refocus on determinism.
At the time of writing, polkadot validation function execution is performed using the same executor that is used for executing Substrate runtimes. The Substrate executor doesn't guarantee determinism of execution, or, more precisely, it doesn't provide more than the base guarantees of wasm execution. If the chain-writer manages to exploit these, then it's their fault.
We would like to eradicate these effects for the execution of PVF.
We should come up with a way to avoid that. (See this discussion for more details.)
There are the following sources of non-determinism:
Let's break down the point about execution in detail.
Execution limits are specified here. These limits are specified in the abstract terms of the specification. Practically, they can be divided into two groups: memory-limited and stack-limited. The size of linear memory and tables can be trivially limited, even statically. The stack-limited parameters, at least in the general situation (e.g. without limiting recursion, etc), require dynamic counting.
There are some instructions that are non-deterministic. This is local non-determinism, meaning that these instructions have not a single outcome but rather a finite set of outcomes for the same arguments. Theoretically, a wasm VM implementation that chooses the result from the set of possible outcomes is still deemed a conforming implementation.
The following is the list of such instructions:
- `memory.grow`, i.e. trying to increase the size of the linear memory. This is easily mitigated by limiting the maximal size of the memory or by counting.
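As a quick illustration of the static-limit mitigation, a sketch against the wasmtime embedding API (the module and page counts are invented for the example):

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // Linear memory declared with min = 1 page and max = 2 pages (64KiB each).
    let module = Module::new(&engine, r#"(module (memory (export "mem") 1 2))"#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let mem = instance.get_memory(&mut store, "mem").unwrap();

    // Growing within the declared maximum succeeds and returns the old size.
    assert_eq!(mem.grow(&mut store, 1)?, 1);
    // Growing past the declared maximum fails deterministically.
    assert!(mem.grow(&mut store, 1).is_err());
    Ok(())
}
```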