s0me0ne-unkn0wn commented 9 months ago

Overview

The inability to execute a properly prepared artifact (and raising a dispute over a valid candidate as a result) has a year-long story that started after the Polkadot incident in March 2023 (paritytech/polkadot#6862). Efforts to mitigate it started with node version check (paritytech/polkadot#6861), evolved over time into more checks and safety measures (#1918, #2895, to name a few), and are still ongoing (#2742, #661).

Overviewing the broader picture of the problem, it increasingly looks like plugging holes in a spaghetti strainer. In this issue, I try to summarize many discussions carried out on that topic and propose a possible solution, for which I can't take credit; it was born by the hivemind somewhere in discussions.

The problem

execute_artifact is the point where actual parachain runtime construction, instantiation and execution takes place. It is an unsafe function that comes with the following security contract:

The caller must ensure that the compiled artifact passed here was: 1) produced by prepare, 2) was not modified,

There are two problems here: first, this contract is not always held (and, more broadly, it cannot be held every single time in the real world), and second, it is somewhat incomplete.

Let's consider some known incidents to understand what are the obstacles to upholding that contract.

Dirty node upgrade

The node now consists of three binaries: the node itself, the preparation worker, and the execution worker. That leaves a lot of room for the node operator in the sense of how he can screw up the node upgrade:

Upgrade all the binaries but forget to restart the node;
Upgrade only the node binary, leaving workers from the previous version;
Upgrade only one worker and forget to upgrade the other one;
All the possible combinations of the aforementioned scenarios.

Some day, we found a good way of handling that: to let node version and worker version cross-check. "Version" was meant to be the commit hash, but that resulted in an awful developer experience because every change to the code required manually rebuilding workers. So, that was relaxed, and now we check the node's semantic version. That is still problematic: versions are not bumped every hour, and upgrading from the latest stable version to master can still lead to undetected version mismatch.

Node hardware downgrade

Sometimes, node operators decide to save a bit of money by transferring their VMs from an expensive VPS plan, where the VMs were running on Intel Xeon CPUs, to a not-so-expensive plan with consumer-grade Intel Core CPUs. Pre-compiled artifacts that survived the transfer couldn't be executed on the Core hardware as they were compiled for Xeon hardware and used its extended set of features.

That was "fixed" in #2895, but that's a regressive approach. We wanted artifact persistence for optimization, and we still want it.

Wasmtime version change

This is closely related to the "dirty node upgrade" scenario. Wasmtime versions are not backward-compatible in the general case. A later Wasmtime version may refuse to execute an artifact produced by a former version. That is the second reason for removing artifact persistence, and that's also why #2398 was raised.

The solution

The aforementioned execute_artifact must distinguish between error types, returning a concrete error to the caller so the caller can know if the problem is with runtime construction, instantiation, or execution itself.

Given that the runtime construction is checked during the PVF pre-checking process, it shouldn't fail during the actual execution, so the runtime construction error means the artifact is either corrupted or there's some kind of artifact version mismatch. In that case, the caller must discard the artifact, remove it from the filesystem, and re-submit the PVF in question into the preparation queue.

After the PVF is re-prepared, execution must be retried.

If the runtime construction fails for the second time in a row, that means that some external conditions have changed, and the PVF cannot be executed anymore. That is a deterministic error, and raising a dispute for the candidate is okay.

Outcomes

We wouldn't need #2398 and #2742 anymore. In case of Wasmtime version mismatch, we would just re-prepare the artifact (which we need to do in that case anyway) and execute successfully during the retry.
We could enable artifact persistence again, even without extensive checks for node upgrade events. If the node has upgraded, or the hardware has been downgraded, okay, whatever, that's no nevermind of ours, everything just gets re-prepared and executed.
In the "dirty node upgrade" scenario, as far as I can imagine in my head, only one concern remains: one worker is upgraded, and another is not. All the other scenarios are covered by re-preparation. And this is only the concern in the "latest-to-master" upgrade scenario, as upgrades from version to version are still guarded by the version-checking mechanism.

CC @eskimor @koute @bkchr

This issue is open for external contributions

CC @maksimryndin

maksimryndin commented 9 months ago

@s0me0ne-unkn0wn thank you for an excellent summary! I am ready to take this issue.

In the "dirty node upgrade" scenario, as far as I can imagine in my head, only one concern remains: one worker is upgraded, and another is not. All the other scenarios are covered by re-preparation. And this is only the concern in the "latest-to-master" upgrade scenario, as upgrades from version to version are still guarded by the version-checking mechanism.

Is it viable to use commits hash (as before) for release profile and fallback for semver for debug profile to alleviate a DevEx?

s0me0ne-unkn0wn commented 9 months ago

Is it viable to use commits hash (as before) for release profile and fallback for semver for debug profile to alleviate a DevEx?

IIRC, we discussed it (and maybe even implemented it at some point?), but it turned out that virtually nobody runs cargo test without -r as it's taking an infinite amount of memory, storage, and time otherwise.

Anyway, if our failure scenario is "someone who upgraded from the latest stable version to self-built master and forgot to overwrite only one worked binary of two," it's already much, much better than now.

eskimor commented 9 months ago

Is it really only runtime construction that can fail? E.g. for instructions that are not present on the current hardware: Sounds like it could only hit that instruction when actually passing in the PoV?

s0me0ne-unkn0wn commented 9 months ago

Is it really only runtime construction that can fail?

From what we've seen in #2863, yes, Wasmtime checks for everything, including artifact version and host architecture, during the module deserialization, aka runtime construction.

koute commented 9 months ago

Well, I think the two best options here are probably:

1) Do not persist across artifacts across restarts. (Which is what we have now after https://github.com/paritytech/polkadot-sdk/pull/2895) 2) Make the persistence guaranteed to be infallible.

I would prefer (1) as it's simpler, but if we really want persistence across restarts then I would prefer to just make this infallible and keep it conservative.

So, how would we make this infallible? I think something like this should be pretty much watertight in practice.

First, hash all of these together:

The currently running binary.
The worker binaries.
/proc/cpuinfo.
Current kernel version.

This will give us a hash that will change if either any of the executables change, or the environment changes in a major way. Then, use this hash to make sure we only try to load artifacts that match it. And also maybe verify checksums of the artifacts themselves, so that we know they weren't corrupted/overwritten/mangled.

So something like this (sans error handling):

use std::os::unix::ffi::OsStrExt;

// Only done once at startup.
let env_hash = {
    let mut h = blake3::Hasher::new();
    h.update(&std::fs::read("/proc/self/exe").unwrap());
    h.update(&prepare_worker_bytes);
    h.update(&execute_worker_bytes);
    h.update(&std::fs::read("/proc/cpuinfo").unwrap());
    h.update(nix::sys::utsname::uname().unwrap().release().as_bytes());
    h.finalize()
};

// Have a separate directory per each unique environment, so that we can easily delete stale ones.
let cache_path = cache_root_path.join(env_hash.to_string());

// For every unique PVF blob.
let artifact_path = cache_path.join(format!("{wasm_blob_hash}.bin"));
let checksum_path = cache_path.join(format!("{wasm_blob_hash}.checksum"));

// Load the expected checksum of the artifact.
let Ok(expected_artifact_checksum) = std::fs::read(checksum_path) else { return; };

// Load the artifact itself to hash it.
let Ok(artifact_data) = std::fs::read(&artifact_path) else { return; }
let artifact_checksum = {
    let mut h = blake3::Hasher::new();
    h.update(&artifact_data);
    // Use the environment hash as salt in case a user gets clever and starts moving these files.
    h.update(env_hash.as_bytes());
    // Same for the hash of the original blob.
    h.update(wasm_blob_hash.as_bytes());
    h.finalize().to_string()
};

if artifact_checksum != expected_artifact_checksum {
    // Something's wrong; the checksum doesn't match.
    let _ = std::fs::remove(artifact_path);
    let _ = std::fs::remove(checksum_path);
    return;
}

// Artifact is OK; we can load it!

This would keep the persistence working if someone simply restarts the binary, but if they migrate to a different machine, or change/update the operating system, or update one of the Polkadot binaries, or shuffle the artifacts around, or the artifacts bitrots on their disk, then the cache would be invalidated. And you wouldn't need to bother with having to retry and rebuild, since this pretty much should never fail in practice due to how conservative it is.

But of course, it's up to you how you want to handle this. :P

s0me0ne-unkn0wn commented 9 months ago

@koute that's fair, and thanks for the detailed description! But I still think the re-preparation is superior to that approach for two reasons:

It rules out more possible errors. Re-preparation (nearly) guarantees we can execute even if an artifact gets corrupted during the node run, or if someone replaces binaries in place without restart, or whatever. And it's cheap! The checks you mention would introduce an unforgivable overhead if we ran them before every execution. With re-preparation, the only overhead is an attempt to deserialize the module, which doesn't take long. Otherwise, the whole process is the same as if we didn't have an artifact at all, so we can easily afford that;
Your approach handles "known unknowns", that is, situations we've already encountered. Re-preparation can handle "unknown unknowns" as well; if something screws up in a new way, we just re-prepare and run. And that's why I'm saying that's not just for artifact persistence but to stop researching new failure points, creating hashes of hashes, and ending up with a whole Merkle tree of execution environment state :)

koute commented 9 months ago

It rules out more possible errors.

Please correct me if I'm wrong, but it wouldn't handle subtle incompatibilities that wouldn't result in the runtime construction step failing, right?

Imagine a hypothetical scenario where a single bit is flipped inside of the compiled artifact which still results in the artifact being valid (as in: can be successfully loaded and executed by wasmtime), but now it will give wrong results at execution time and result in a dispute. How would repreparation help here?

The whole idea of me suggesting what I did suggest was to catch more possible cases than what just can be caught by detecting that the artifact fails to load and recompiling it.

The checks you mention would introduce an unforgivable overhead if we ran them before every execution.

You would need to run them only before loading the module with wasmtime, and they should be relatively cheap (blake3 hashes at 7GB/s, and you can mmap the artifact instead of std::fs::reading it).

maksimryndin commented 9 months ago

@s0me0ne-unkn0wn @koute I've drafted a PR for this issue with the solution from the issue description. As regards for hashing, there are https://github.com/paritytech/polkadot-sdk/issues/2399 and https://github.com/paritytech/polkadot-sdk/issues/677 as well.

I would also wonder the rationale behind such a complex design of pvf validation host. I understand security implications but cannot grasp the deep nesting of processes. If I understood right, the host process creates a pool of worker processes. Each worker process creates a child process on execution, which in turn creates 3 threads. What are the reasons? In any case, on every execution request we pay the price of a process fork. Why do we need the pool then?

@s0me0ne-unkn0wn I am ready for a call if it is necessary :)

eskimor commented 9 months ago

Do we really need to bother with artifact persistence across restarts, with Polkavm approaching? I am also not quite in the loop, has this already been a problem in practice (that we delete)?

koute commented 9 months ago

Do we really need to bother with artifact persistence across restarts, with Polkavm approaching?

Although we should be able to have it working relatively soon it'll be still a little while before PolkaVM is officially stabilized and production ready, so the effort we put into maintaining our current WASM-based environment isn't necessarily wasted, although I would probably suggest to mostly treat it as being in maintenance mode and not invest too much time into it.

I am also not quite in the loop, has this already been a problem in practice (that we delete)?

This is also something I'd like to know - have any of our users actually complained that the artifacts are not cached, or does no one care?

s0me0ne-unkn0wn commented 9 months ago

Imagine a hypothetical scenario where a single bit is flipped inside of the compiled artifact which still results in the artifact being valid (as in: can be successfully loaded and executed by wasmtime), but now it will give wrong results at execution time and result in a dispute. How would repreparation help here?

That's a valid concern (considering that the artifact is just an ELF and is not guarded by any CRC), but that's a different issue as well. I probably didn't properly define the scope of this one, it's more like "get rid of disputes due to runtime construction errors forever, with future guarantees". We can implement hash checks as an additional measure. BTW, it would be immensely useful if PolkaVM artifacts had some internal integrity guarantee. But I'm sure you've already thought it through :)

Do we really need to bother with artifact persistence across restarts, with Polkavm approaching? I am also not quite in the loop, has this already been a problem in practice (that we delete)?

We haven't got any real problems related to artifact pruning yet. That was supposed to be an optimization, which was targeting on-demand parachains and coretime, when a node would need to prepare much more artifacts than now.

But it's neither the single nor the main problem we're addressing here. It's just a nice bonus. I just want us not to have disputes due to runtime construction errors ever again, and it gives us the opportunity to re-enable artifact persistence automagically, so why not do that?

paritytech / polkadot-sdk

PVF: Consider re-preparing artifact on failed runtime construction #3139

Overview

The problem

Dirty node upgrade

Node hardware downgrade

Wasmtime version change

The solution

Outcomes