Open eskimor opened 1 year ago
For the stack limit, I just had the following idea: Extend the upcoming time dispute mechanism:
Approval checkers could not only report time, but also maximum stack depth.
We can assume some upper bound on how much the stack depth fluctuates across supported architectures, implementations, etc.
Let's assume that upper bound on the fluctuation is a factor of 2. Then we can have the backers commit to some bound X. Approval checkers are allowed a much larger limit, like 6*X. In the honest case, approval checkers will never exceed the limit, hence no indeterminism.
For the dishonest case, we do the same as in time disputes: we start charging the backers once the approval checkers report that the stack depth was larger than 2*X, because we can then be sure, even with implementation differences, that the backers are faulty and are the ones to punish. If backers push it further they could still trigger a dispute, but given the data we would not punish the validators raising that dispute, and would likely slash the backers heavily instead.
We might be able to extend the "time dispute" mechanism to address all sources of indeterminism that are not otherwise solvable.
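To make the thresholds in the stack-depth idea concrete, here is a minimal sketch of how a reported depth could be classified. The factors 2 and 6 are the example values from the comment above; the names `StackVerdict` and `judge_stack_depth` are hypothetical and do not refer to any existing Polkadot API.

```rust
/// Assumed upper bound on stack depth fluctuation across implementations
/// (the example factor of 2 from the comment above).
const FLUCTUATION_FACTOR: u64 = 2;
/// Assumed multiplier for the hard limit given to approval checkers
/// (the example factor of 6 from the comment above).
const CHECKER_LIMIT_FACTOR: u64 = 6;

/// Hypothetical outcome of comparing an observed depth to the backers' bound X.
#[derive(Debug, PartialEq, Eq)]
enum StackVerdict {
    /// Depth is at most 2*X: explainable by implementation differences.
    WithinBound,
    /// Depth is above 2*X but within 6*X: backers are provably at fault and get charged.
    ChargeBackers,
    /// Depth is above 6*X: the checker hit its own limit, so a dispute is raised.
    Dispute,
}

/// Classify the maximum stack depth an approval checker observed against
/// the bound X that the backers committed to.
fn judge_stack_depth(committed_bound: u64, observed_depth: u64) -> StackVerdict {
    // Saturating arithmetic avoids overflow for absurdly large committed bounds.
    let charge_threshold = committed_bound.saturating_mul(FLUCTUATION_FACTOR);
    let hard_limit = committed_bound.saturating_mul(CHECKER_LIMIT_FACTOR);

    if observed_depth > hard_limit {
        StackVerdict::Dispute
    } else if observed_depth > charge_threshold {
        StackVerdict::ChargeBackers
    } else {
        StackVerdict::WithinBound
    }
}
```

With X = 100, a reported depth of 150 stays within the bound, 300 would lead to charging the backers, and anything above 600 ends in a dispute. In this sketch the dispute case is the only one where checkers can disagree on validity, and it is only reachable when backers exceeded their bound by more than the assumed fluctuation.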
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/25
Here's a couple to start the list:
Indeterminism source | Explanation | Mitigations | Status |
---|---|---|---|
Differences in preparation/execution time on different machines | Can lead to jobs timing out on some machines and not others | Mitigated somewhat by counting CPU time instead of wall clock time, the former being more independent of system load. | implemented |
Differences in available memory on different machines | Can lead to jobs hitting OOM on some machines and not others. | Some mitigations are being researched, such as https://github.com/paritytech/polkadot-sdk/issues/745 and https://github.com/paritytech/polkadot-sdk/issues/767. | future |
Not exactly indeterminism, but closely related:
The NaN representation of floating-point numbers. Wasmtime has an option for that, but other implementations might use a different representation.
The allocator algorithm. At the moment, every implementation has to copy the exact behavior of the Substrate allocator down to the smallest detail. I would personally be strongly in favor of removing this allocator altogether (it's in general a poor design for several reasons) and moving the memory allocation to the runtime, but that needs a refactor of many host functions.
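On the NaN point, the following sketch shows why the representation matters: two values can both be NaN yet differ in their bits, so anything that hashes or stores the raw bits can diverge between implementations. Canonicalization rewrites every NaN to one fixed pattern. The helper `canonicalize_nan` is hypothetical, and the chosen pattern (positive quiet NaN with only the top mantissa bit set) is an assumption about which canonical form an implementation would pick.

```rust
/// Assumed canonical NaN for f64: sign bit clear, exponent all ones,
/// only the most significant mantissa bit set.
const CANONICAL_NAN_F64: u64 = 0x7ff8_0000_0000_0000;

/// Replace any NaN with the single canonical bit pattern; pass other values through.
fn canonicalize_nan(value: f64) -> f64 {
    if value.is_nan() {
        f64::from_bits(CANONICAL_NAN_F64)
    } else {
        value
    }
}

/// Two NaNs with different bit patterns (different payload and sign).
fn example_nans() -> (f64, f64) {
    let with_payload = f64::from_bits(0x7ff8_0000_0000_0001);
    let negative = f64::from_bits(0xfff8_0000_0000_0000);
    (with_payload, negative)
}
```

Calling `canonicalize_nan` on both values from `example_nans` yields identical bits, while ordinary numbers are returned unchanged. An executor would have to apply this after every floating-point operation that can produce a NaN, which is why it is normally a compiler-level setting and not something done by hand.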
I found an old issue listing sources of indeterminism: https://github.com/paritytech/polkadot-sdk/issues/990. I haven't gone through it in depth, but I see that some of them have already been mentioned here.
In order to systematically address all sources of indeterminism we have in PVF execution (and preparation), we should start with a list.
I would like to have in the guide a list of all possible sources we can think of, with sections following that list explaining each one in detail together with implemented or possible mitigations.