paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Create complete and exhaustive list of sources of indeterminism in PVF #653

Open eskimor opened 1 year ago

eskimor commented 1 year ago

In order to systematically address all sources of indeterminism we have in PVF execution (and preparation), we should start with a list.

I would like to have in the guide a list of all possible sources we can think of, with sections following that list explaining each one in detail together with implemented or possible mitigations.

eskimor commented 1 year ago

For the stack limit, I just had the following idea: extend the upcoming time dispute mechanism.

Approval checkers could not only report time, but also maximum stack depth.

Assumption

We are able to assume some upper bound for the stack depth fluctuations across supported architectures/implementations/etc.

Idea

Let's assume the above upper bound for fluctuation is a factor of 2. Then we can have the backers commit to some bound X, while approval checkers are allowed a much larger limit, like 6*X. In the honest case, approval checkers will never exceed their limit, hence no indeterminism.

For the dishonest case, we do the same as in time disputes: we start charging the backers once the approval checkers report a stack depth larger than 2*X, because we can then be sure, even allowing for implementation differences, that the backers are at fault and are the ones to punish. If backers push it further, they could still trigger a dispute, but given the data we would not punish the dispute-raising validators, and would likely slash the backers heavily instead.
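
The thresholds above can be sketched in code. This is only an illustration of the idea, not an existing API: the names `StackVerdict`, `classify`, `FLUCTUATION_FACTOR` and `CHECKER_LIMIT_FACTOR` are hypothetical, the factors 2 and 6 are the example values from this comment, and treating "above 6*X" as the point where a dispute is triggered is one reading of "if backers push it further".

```rust
/// Hypothetical outcome of comparing an approval checker's observed
/// maximum stack depth against the bound X the backers committed to.
#[derive(Debug, PartialEq, Eq)]
enum StackVerdict {
    /// Depth stayed within the committed bound X.
    WithinCommitment,
    /// Depth is above X but at most 2*X: explainable by
    /// implementation/architecture differences, so nobody is charged.
    WithinFluctuation,
    /// Depth is above 2*X but at most 6*X: execution still succeeds,
    /// but the backers are clearly at fault and get charged.
    ChargeBackers,
    /// Depth is above 6*X: the checker hits its own limit; under the
    /// interpretation assumed here, this is where a dispute would be raised.
    ExceedsCheckerLimit,
}

/// Assumed upper bound on depth fluctuation across implementations.
const FLUCTUATION_FACTOR: u64 = 2;
/// Assumed (much larger) limit that approval checkers run with.
const CHECKER_LIMIT_FACTOR: u64 = 6;

/// Classify an observed stack depth relative to the committed bound.
fn classify(committed_bound: u64, observed_depth: u64) -> StackVerdict {
    // Saturating multiplication avoids overflow for huge bounds.
    let fluctuation_limit = committed_bound.saturating_mul(FLUCTUATION_FACTOR);
    let checker_limit = committed_bound.saturating_mul(CHECKER_LIMIT_FACTOR);

    if observed_depth <= committed_bound {
        StackVerdict::WithinCommitment
    } else if observed_depth <= fluctuation_limit {
        StackVerdict::WithinFluctuation
    } else if observed_depth <= checker_limit {
        StackVerdict::ChargeBackers
    } else {
        StackVerdict::ExceedsCheckerLimit
    }
}
```

For example, with a committed bound of 100, an observed depth of 150 would fall under `WithinFluctuation`, while 250 would fall under `ChargeBackers`. The gap between 2*X and 6*X is what lets checkers finish execution and report data instead of failing nondeterministically.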

Possible Extensions

We might be able to extend the "time dispute" mechanism to address all sources of indeterminism that cannot be solved otherwise.

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/25

mrcnski commented 1 year ago

Here's a couple to start the list:

| Indeterminism source | Explanation | Mitigations | Status |
| --- | --- | --- | --- |
| Differences in preparation/execution time on different machines | Can lead to jobs timing out on some machines and not others. | Mitigated somewhat by counting CPU time instead of wall clock time, the former being more independent of system load. | implemented |
| Differences in available memory on different machines | Can lead to jobs hitting OOM on some machines and not others. | Some mitigations are being researched, such as https://github.com/paritytech/polkadot-sdk/issues/745 and https://github.com/paritytech/polkadot-sdk/issues/767. | future |

tomaka commented 1 year ago

Not exactly indeterminism, but closely related:

mrcnski commented 1 year ago

I found an old issue listing sources of indeterminism: https://github.com/paritytech/polkadot-sdk/issues/990. I haven't gone through it in depth, but I see some have already been mentioned here.