paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

Run preparation for all existing PVFs on the network before a release updating wasmtime #657

Open · eskimor opened 1 year ago

eskimor commented 1 year ago

We assume that once a PVF has passed pre-checking, it will also compile fine in the future. We should therefore make sure that whenever we upgrade wasmtime or change the preparation process in any way, all existing/already registered PVFs on the network still pass pre-checking with those changes.

Why is this important?

If previously working PVFs suddenly stop working, even for a single parachain/parathread, relay chain finality would stall until the issue is resolved. Better to discover such issues before we hit production.

@ordian I believe you already have some tooling in place which could help with this? @Sophia-Gold Once we have the tooling, we would need to coordinate with the release team and CI/CD to automate those checks.

The safe path would be to run those compilations as part of each release.

sandreim commented 1 year ago

@eskimor can you detail a bit why the relay chain would stall if a PVF stops working? I was expecting this to either cause disputes or result in the candidates not being backed at all.

eskimor commented 1 year ago

Hah, I actually had this elaboration, but deleted it as not relevant. :laughing:

For the second: you cannot rely on candidates not being backed, because backers can be malicious. For the first: we don't raise disputes on preparation errors, as those should have been caught by pre-checking, so if one occurs it has to be some issue with the local node. That assumption only holds, though, if node upgrades don't break things.

But even if we rolled that decision back, we would indeed have disputes, which is not better at all: honest nodes would get slashed, and there could be lots of them.

Sophia-Gold commented 1 year ago

I think we should roll that decision back. It's better to have disputes than the relay chain stopping, and slashes can be refunded through governance if it's our fault.

Of course, we should also still have the tests in this issue.

sandreim commented 1 year ago

Thanks, that makes sense.

eskimor commented 1 year ago

> I think we should roll that decision back. It's better to have disputes than the relay chain stopping, and slashes can be refunded through governance if it's our fault.
>
> Of course, we should also still have the tests in this issue.

I was briefly thinking about that as well, but I don't think it improves the situation: if we rolled back the decision we would have a dispute storm instead, which would result in either security threats or equal liveness threats. Regardless of whether we have a finality stall or a dispute storm, we would need to push a fix to recover the network.

There are arguments on both sides as to which would be better, a finality stall or a dispute storm, but the reality is that both are pretty bad and we should have quality control in place to prevent either.

burdges commented 1 year ago

Afaik there is nothing remotely controversial about your statement:

> We should therefore make sure that whenever we upgrade wasmtime or change the preparation process in any way, all existing/already registered PVFs on the network still pass pre-checking with those changes.

But when? Do we just say governance should always do this on some range of configurations? Or do we make the chain enforce re-votes on re-build results of PVFs? Or do we schedule re-builds as parathread blocks, so that only approval checkers vote yes or no on the re-build?

We might answer this exactly like we answered the original "who builds" question: should all validators re-build anyway? If yes, then we have no reason to play any fancy approval-checker games; also, everyone then needs to re-build regardless of whether governance checks. If no, then yeah, those other options still exist.

Anyways, we do not necessarily have so much choice here, depending upon what a wasmtime upgrade means.

Sophia-Gold commented 1 year ago

> I was briefly thinking about that as well, but I don't think it improves the situation: if we rolled back the decision we would have a dispute storm instead, which would result in either security threats or equal liveness threats. Regardless of whether we have a finality stall or a dispute storm, we would need to push a fix to recover the network.
>
> There are arguments on both sides as to which would be better, a finality stall or a dispute storm, but the reality is that both are pretty bad and we should have quality control in place to prevent either.

I'm not sure it particularly matters if we're able to prevent this situation through testing, but in my understanding it wouldn't result in a dispute storm unless multiple PVFs fail to compile and multiple backers are dishonest. I don't know where the threshold is for multiple disputes halting the relay chain, but we'd for sure want to be able to slash the backers.

To make sure I understand this, what would we do if this test fails? Would the PVF need to be updated before the release? I'm not sure if this is what @burdges is asking.

burdges commented 1 year ago

We always risk a dispute storm if the PVF or candidate works on some node configurations but not others. It's maybe not a "storm" if we have only one bad parachain, but one could imagine many bad parathreads being caused by some crate. And malicious backers are not even necessary for this. We're less enamored of alternative node configurations now than previously, but they remain a reality.

We'll deactivate any PVF that fails to re-compile under the new wasmtime. We do not give PVF authors an opportunity to stall a relay chain upgrade they dislike by crafting a PVF to break it.

All I really said is: if all nodes must rebuild all PVFs anyway, then we should just rebuild them all as if they were re-uploaded afresh, including voting. I do not know whether all nodes must rebuild all PVFs anyway.

We could keep both old and new wasmtimes available, and then slowly transition PVFs from the old wasmtime to the new one. We would transition first all true system parachains/parathreads, second all common-good chains in whatever order, third all full parachains with higher locked DOT going first, and finally all other parathreads in whatever order, maybe by time since their last block.
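As a rough sketch of that ordering (the types, fields, and para-id representation here are made-up illustrations, not anything from the Polkadot codebase):

```rust
use std::cmp::Reverse;

// Illustrative priority key: variants are declared in transition order, so the
// derived Ord puts system chains first and remaining parathreads last.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum TransitionPriority {
    /// True system parachains/parathreads go first.
    System,
    /// Common-good chains next, in whatever order.
    CommonGood,
    /// Full parachains, with more locked DOT transitioning earlier
    /// (`Reverse` so that a larger lock sorts first).
    Parachain { locked_dot: Reverse<u128> },
    /// Remaining parathreads, e.g. ordered by time since their last block.
    Parathread { blocks_since_last_block: u64 },
}

/// Return para ids in the order they would move from the old wasmtime to the new one.
fn transition_order(mut paras: Vec<(u32, TransitionPriority)>) -> Vec<u32> {
    paras.sort_by(|a, b| a.1.cmp(&b.1));
    paras.into_iter().map(|(para_id, _)| para_id).collect()
}
```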

Another suggestion: any wasmtime update includes a random 32-byte nonce, which by convention we should secretly produce as sha3("relaxed update" || some_secret_random_32_bytes). We never reveal these pre-images, except that if we need a fast transition for security reasons we instead produce the nonce as sha3("security update" || cve_commit), where cve_commit = sha3(cve_number || some_secret_random_32_bytes). We add a "security update reveal" transaction that reveals cve_commit and immediately halts backing of untransitioned parachains. We choose when to reveal cve_commit to compromise between security and disruption.
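A minimal sketch of how those nonces could be produced, assuming the sha3 and rand crates; the function names and domain strings are only illustrations of the convention above, not an existing API:

```rust
use rand::RngCore;
use sha3::{Digest, Sha3_256};

/// Hash helper: sha3(domain || payload).
fn commit(domain: &[u8], payload: &[u8]) -> [u8; 32] {
    let mut hasher = Sha3_256::new();
    hasher.update(domain);
    hasher.update(payload);
    hasher.finalize().into()
}

/// Nonce for a routine ("relaxed") wasmtime update; the random pre-image stays secret.
fn relaxed_update_nonce() -> [u8; 32] {
    let mut secret = [0u8; 32];
    rand::thread_rng().fill_bytes(&mut secret);
    commit(b"relaxed update", &secret)
}

/// Nonce for a security update: commits to a CVE number via an inner commitment
/// cve_commit = sha3(cve_number || secret), which can later be revealed on chain.
fn security_update_nonce(cve_number: &str) -> ([u8; 32], [u8; 32]) {
    let mut secret = [0u8; 32];
    rand::thread_rng().fill_bytes(&mut secret);
    let cve_commit = commit(cve_number.as_bytes(), &secret);
    let nonce = commit(b"security update", &cve_commit);
    // Reveal cve_commit only when halting backing of untransitioned parachains is warranted.
    (nonce, cve_commit)
}
```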

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/27

eskimor commented 1 year ago

To be clear, I suggest that Parity re-checks all PVFs as part of the release process of a new Parity node version, on some CI machine. I don't think this needs to be done on validators, although that CI machine should run the exact same thing validators would. If any PVF fails this check, we would not release until we fixed the issue (a rough sketch of such a gate follows the list below):

  1. Either the release is faulty - just fix it.
  2. Or the PVFs relied on some previous flaw in wasmtime that has now been fixed. In that case we would need to ask parachain devs to upgrade their PVFs, or disable them via governance before the release, or both.
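Sketched roughly, and assuming CI can enumerate the currently registered PVFs (para id plus validation code) and run the candidate release's preparation step on each; precheck is passed in as a closure here because the actual preparation entry point is a node implementation detail this sketch does not name:

```rust
/// Illustrative release gate: returns the para ids whose PVFs no longer pass
/// preparation with the candidate release, or Ok(()) if all of them still do.
fn release_gate<F>(pvfs: &[(u32, Vec<u8>)], precheck: F) -> Result<(), Vec<u32>>
where
    F: Fn(&[u8]) -> Result<(), String>,
{
    let failures: Vec<u32> = pvfs
        .iter()
        .filter(|(_, code)| precheck(code.as_slice()).is_err())
        .map(|(para_id, _)| *para_id)
        .collect();

    // An empty failure list means the release is safe to ship with respect to
    // PVF preparation; otherwise block the release and investigate (case 1 or 2 above).
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}
```

The important part is that whatever stands in for precheck runs exactly the preparation code validators would run with the candidate release.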

Although Jeff's suggestion makes sense as well, especially the part about disabling parachains in case of a security vulnerability.

But given that we don't want implementation details like the wasmtime version used to be part of the spec, this seems infeasible.

Any other alternative node implementation team should have similar testing, but that would be their business.

ordian commented 1 year ago

The initial version of pvf-checker is available here: https://github.com/paritytech/pvf-checker.

bkchr commented 1 year ago

Nice! It would be really nice to also go back in history. Actually, it would be much better to test all PVFs that have ever existed.

ordian commented 1 year ago

It can now accept --at-block <hash> to query PVFs at a specific block hash, assuming that block is not pruned on the RPC node. Testing all PVFs that have ever existed would probably require a scraper service that collects all PVFs while syncing from genesis. Some PVFs are already not passing checks, like 2268 on Kusama, so I added a --skip <para_id> flag for that.

Feel free to open an issue on the repo for any feature requests.

eskimor commented 1 year ago

Why would we care for any non-current PVFs? It should be enough if everything that is used right now works; future upgrades will be detected by pre-checking (a bit racy, but still). Previously existing PVFs sound like a nice bonus that can run on a best-effort basis?

bkchr commented 1 year ago

> Why would we care for any non-current PVFs?

To have a bigger testing space. Maybe the current files are not using some niche feature that was used before. Especially in light of different wasm engines, it is better to test with more input data imo.