@s0me0ne-unkn0wn thank you for an excellent summary! I am ready to take this issue.
- In the "dirty node upgrade" scenario, as far as I can imagine in my head, only one concern remains: one worker is upgraded, and another is not. All the other scenarios are covered by re-preparation. And this is only the concern in the "latest-to-master" upgrade scenario, as upgrades from version to version are still guarded by the version-checking mechanism.
Is it viable to use the commit hash (as before) for the release profile and fall back to semver for the debug profile to ease the DevEx concern?
> Is it viable to use the commit hash (as before) for the release profile and fall back to semver for the debug profile to ease the DevEx concern?
IIRC, we discussed it (and maybe even implemented it at some point?), but it turned out that virtually nobody runs `cargo test` without `-r`, as it takes an infinite amount of memory, storage, and time otherwise.
Anyway, if our failure scenario is "someone who upgraded from the latest stable version to self-built `master` and overwrote only one of the two worker binaries," it's already much, much better than now.
Is it really only runtime construction that can fail? E.g. for instructions that are not present on the current hardware: sounds like it would only hit that instruction when actually passing in the PoV?
> Is it really only runtime construction that can fail?
From what we've seen in #2863, yes, Wasmtime checks for everything, including artifact version and host architecture, during the module deserialization, aka runtime construction.
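For illustration, here is a minimal sketch (not the actual PVF host code) of where that check happens with the `wasmtime` crate: an artifact produced by a different Wasmtime version or compiled for a different CPU is rejected by `Module::deserialize`, i.e. at runtime construction, before any candidate is executed.

```rust
use wasmtime::{Engine, Module};

/// Runtime construction from a pre-compiled artifact. An artifact produced by
/// a different Wasmtime version, or compiled for different CPU features, is
/// rejected right here, at deserialization time, not later during execution.
fn construct_runtime(engine: &Engine, artifact_bytes: &[u8]) -> anyhow::Result<Module> {
    // SAFETY: `artifact_bytes` must come from `Engine::precompile_module` /
    // `Module::serialize` and must not have been tampered with -- which is
    // exactly the part of the contract a dirty upgrade can violate.
    unsafe { Module::deserialize(engine, artifact_bytes) }
}
```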
Well, I think the two best options here are probably:
1) Do not persist artifacts across restarts. (Which is what we have now after https://github.com/paritytech/polkadot-sdk/pull/2895.)
2) Make the persistence guaranteed to be infallible.
I would prefer (1) as it's simpler, but if we really want persistence across restarts then I would prefer to just make this infallible and keep it conservative.
So, how would we make this infallible? I think something like this should be pretty much watertight in practice.
First, hash all of these together:
- the node binary itself (`/proc/self/exe`),
- the prepare worker binary,
- the execute worker binary,
- `/proc/cpuinfo`,
- the kernel release (`uname`).

This will give us a hash that will change if either any of the executables change, or the environment changes in a major way. Then, use this hash to make sure we only try to load artifacts that match it. And also maybe verify checksums of the artifacts themselves, so that we know they weren't corrupted/overwritten/mangled.
So something like this (sans error handling):
```rust
use std::os::unix::ffi::OsStrExt;
// Only done once at startup.
let env_hash = {
let mut h = blake3::Hasher::new();
h.update(&std::fs::read("/proc/self/exe").unwrap());
h.update(&prepare_worker_bytes);
h.update(&execute_worker_bytes);
h.update(&std::fs::read("/proc/cpuinfo").unwrap());
h.update(nix::sys::utsname::uname().unwrap().release().as_bytes());
h.finalize()
};
// Have a separate directory per each unique environment, so that we can easily delete stale ones.
let cache_path = cache_root_path.join(env_hash.to_string());
// For every unique PVF blob.
let artifact_path = cache_path.join(format!("{wasm_blob_hash}.bin"));
let checksum_path = cache_path.join(format!("{wasm_blob_hash}.checksum"));
// Load the expected checksum of the artifact.
let Ok(expected_artifact_checksum) = std::fs::read_to_string(checksum_path) else { return; };
// Load the artifact itself to hash it.
let Ok(artifact_data) = std::fs::read(&artifact_path) else { return; };
let artifact_checksum = {
let mut h = blake3::Hasher::new();
h.update(&artifact_data);
// Use the environment hash as salt in case a user gets clever and starts moving these files.
h.update(env_hash.as_bytes());
// Same for the hash of the original blob.
h.update(wasm_blob_hash.as_bytes());
h.finalize().to_string()
};
if artifact_checksum != expected_artifact_checksum {
// Something's wrong; the checksum doesn't match.
let _ = std::fs::remove_file(artifact_path);
let _ = std::fs::remove_file(checksum_path);
return;
}
// Artifact is OK; we can load it!
```
This would keep the persistence working if someone simply restarts the binary, but if they migrate to a different machine, or change/update the operating system, or update one of the Polkadot binaries, or shuffle the artifacts around, or the artifacts bit-rot on their disk, then the cache would be invalidated. And you wouldn't need to bother with retrying and rebuilding, since this pretty much should never fail in practice due to how conservative it is.
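As a follow-up to the "so that we can easily delete stale ones" comment above, a minimal sketch of the startup sweep, reusing the `cache_root_path` and `env_hash` names from the snippet (so the same assumptions apply):

```rust
use std::path::Path;

// On startup, after computing `env_hash`, remove every per-environment cache
// directory that doesn't match it: those belong to an old binary/OS/CPU
// combination and can never be loaded again.
fn prune_stale_caches(cache_root_path: &Path, env_hash: &str) -> std::io::Result<()> {
    let current = cache_root_path.join(env_hash);
    for entry in std::fs::read_dir(cache_root_path)? {
        let path = entry?.path();
        if path.is_dir() && path != current {
            let _ = std::fs::remove_dir_all(&path);
        }
    }
    Ok(())
}
```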
But of course, it's up to you how you want to handle this. :P
@koute that's fair, and thanks for the detailed description! But I still think the re-preparation is superior to that approach for two reasons:
1. It rules out more possible errors.
2. The checks you mention would introduce an unforgivable overhead if we ran them before every execution.
> It rules out more possible errors.

Please correct me if I'm wrong, but it wouldn't handle subtle incompatibilities that don't result in the runtime construction step failing, right? Imagine a hypothetical scenario where a single bit is flipped inside the compiled artifact which still results in the artifact being valid (as in: it can be successfully loaded and executed by `wasmtime`), but now it gives wrong results at execution time and results in a dispute. How would re-preparation help here? The whole idea of my suggestion was to catch more possible cases than can be caught by just detecting that the artifact fails to load and recompiling it.
> The checks you mention would introduce an unforgivable overhead if we ran them before every execution.
You would need to run them only before loading the module with `wasmtime`, and they should be relatively cheap (blake3 hashes at ~7 GB/s, and you can `mmap` the artifact instead of `std::fs::read`ing it).
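A minimal sketch of that cheap pre-load check, assuming the `memmap2` crate (the salting with `env_hash` and `wasm_blob_hash` from the earlier snippet is omitted for brevity):

```rust
use std::{fs::File, path::Path};

// Hash the artifact without copying it into memory: map the file and feed the
// mapping to blake3. This runs once, right before handing the artifact to
// wasmtime, not before every execution.
fn artifact_checksum(artifact_path: &Path) -> std::io::Result<blake3::Hash> {
    let file = File::open(artifact_path)?;
    // SAFETY: the artifact cache directory is owned by the node, so the file
    // is not expected to be mutated while it is mapped.
    let map = unsafe { memmap2::Mmap::map(&file)? };
    Ok(blake3::hash(&map))
}
```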
@s0me0ne-unkn0wn @koute I've drafted a PR for this issue with the solution from the issue description. Regarding hashing, there are also https://github.com/paritytech/polkadot-sdk/issues/2399 and https://github.com/paritytech/polkadot-sdk/issues/677.
I also wonder about the rationale behind such a complex design of the PVF validation host. I understand the security implications but cannot grasp the deep nesting of processes. If I understood correctly, the host process creates a pool of worker processes, and each worker process creates a child process on execution, which in turn creates three threads. What are the reasons? In any case, on every execution request we pay the price of a process fork. Why do we need the pool then?
@s0me0ne-unkn0wn I am ready for a call if it is necessary :)
Do we really need to bother with artifact persistence across restarts, with PolkaVM approaching? I am also not quite in the loop, has this already been a problem in practice (that we delete)?
> Do we really need to bother with artifact persistence across restarts, with PolkaVM approaching?
Although we should be able to have it working relatively soon, it'll still be a little while before PolkaVM is officially stabilized and production-ready, so the effort we put into maintaining our current WASM-based environment isn't necessarily wasted. That said, I would probably suggest mostly treating it as being in maintenance mode and not investing too much time into it.
> I am also not quite in the loop, has this already been a problem in practice (that we delete)?
This is also something I'd like to know - have any of our users actually complained that the artifacts are not cached, or does no one care?
> Imagine a hypothetical scenario where a single bit is flipped inside the compiled artifact which still results in the artifact being valid (as in: it can be successfully loaded and executed by `wasmtime`), but now it gives wrong results at execution time and results in a dispute. How would re-preparation help here?
That's a valid concern (considering that the artifact is just an ELF and is not guarded by any CRC), but it's a different issue as well. I probably didn't properly define the scope of this one; it's more like "get rid of disputes due to runtime construction errors forever, with future guarantees". We can implement hash checks as an additional measure. BTW, it would be immensely useful if PolkaVM artifacts had some internal integrity guarantee. But I'm sure you've already thought that through :)
> Do we really need to bother with artifact persistence across restarts, with PolkaVM approaching? I am also not quite in the loop, has this already been a problem in practice (that we delete)?
We haven't had any real problems related to artifact pruning yet. That was supposed to be an optimization targeting on-demand parachains and coretime, when a node would need to prepare many more artifacts than it does now.
But it's neither the only nor the main problem we're addressing here. It's just a nice bonus. I just want us to never have disputes due to runtime construction errors again, and this gives us the opportunity to re-enable artifact persistence automagically, so why not do that?
Overview
The inability to execute a properly prepared artifact (and raising a dispute over a valid candidate as a result) has a year-long story that started after the Polkadot incident in March 2023 (paritytech/polkadot#6862). Efforts to mitigate it started with node version check (paritytech/polkadot#6861), evolved over time into more checks and safety measures (#1918, #2895, to name a few), and are still ongoing (#2742, #661).
Looking at the broader picture of the problem, it increasingly resembles plugging holes in a spaghetti strainer. In this issue, I try to summarize the many discussions carried out on this topic and propose a possible solution, for which I can't take credit; it was born by the hivemind somewhere in those discussions.
The problem
`execute_artifact` is the point where the actual parachain runtime construction, instantiation, and execution take place. It is an `unsafe` function that comes with a security contract. There are two problems here: first, this contract is not always upheld (and, more broadly, it cannot be upheld every single time in the real world), and second, it is somewhat incomplete.
Let's consider some known incidents to understand the obstacles to upholding that contract.
Dirty node upgrade
The node now consists of three binaries: the node itself, the preparation worker, and the execution worker. That leaves the node operator a lot of room to screw up a node upgrade.
At some point, we found a good way of handling that: letting the node version and the worker versions cross-check each other. "Version" originally meant the commit hash, but that resulted in an awful developer experience because every change to the code required manually rebuilding the workers. So that was relaxed, and now we check the node's semantic version. That is still problematic: versions are not bumped every hour, and upgrading from the latest stable version to `master` can still lead to an undetected version mismatch.
Node hardware downgrade
Sometimes, node operators decide to save a bit of money by transferring their VMs from an expensive VPS plan, where the VMs were running on Intel Xeon CPUs, to a not-so-expensive plan with consumer-grade Intel Core CPUs. Pre-compiled artifacts that survived the transfer couldn't be executed on the Core hardware as they were compiled for Xeon hardware and used its extended set of features.
That was "fixed" in #2895, but that's a regressive approach. We wanted artifact persistence for optimization, and we still want it.
Wasmtime version change
This is closely related to the "dirty node upgrade" scenario. Wasmtime versions are not backward-compatible in the general case: a later Wasmtime version may refuse to execute an artifact produced by an earlier version. That is the second reason for removing artifact persistence, and that's also why #2398 was raised.
The solution
The aforementioned `execute_artifact` must distinguish between error types, returning a concrete error to the caller so the caller knows whether the problem lies with runtime construction, instantiation, or the execution itself.

Given that runtime construction is checked during the PVF pre-checking process, it shouldn't fail during the actual execution, so a runtime construction error means the artifact is either corrupted or there's some kind of artifact version mismatch. In that case, the caller must discard the artifact, remove it from the filesystem, and re-submit the PVF in question into the preparation queue.
After the PVF is re-prepared, execution must be retried.
If the runtime construction fails for the second time in a row, that means that some external conditions have changed, and the PVF cannot be executed anymore. That is a deterministic error, and raising a dispute for the candidate is okay.
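A minimal sketch of that decision logic, with hypothetical error and outcome types (the actual `execute_artifact` signature and queue plumbing in polkadot-sdk are more involved):

```rust
use std::path::Path;

// Hypothetical error type: the point is that runtime construction failures
// are reported separately from failures of the execution itself.
enum ExecuteError {
    /// The artifact couldn't be deserialized: it is corrupted or was produced
    /// for a different environment.
    RuntimeConstruction(String),
    /// The candidate itself failed to execute.
    Execution(String),
}

// Hypothetical outcome type for the caller.
enum Outcome {
    Valid(Vec<u8>),
    /// The artifact was discarded; the PVF goes back into the preparation
    /// queue and execution is retried afterwards.
    Reprepare,
    /// Deterministic failure; raising a dispute for the candidate is okay.
    Invalid(String),
}

fn handle_execution_result(
    artifact_path: &Path,
    already_reprepared: bool,
    result: Result<Vec<u8>, ExecuteError>,
) -> Outcome {
    match result {
        Ok(output) => Outcome::Valid(output),
        // First construction failure: discard the artifact and re-prepare
        // instead of disputing right away.
        Err(ExecuteError::RuntimeConstruction(_)) if !already_reprepared => {
            let _ = std::fs::remove_file(artifact_path);
            Outcome::Reprepare
        }
        // A second construction failure in a row, or a genuine execution
        // error, is treated as deterministic.
        Err(ExecuteError::RuntimeConstruction(e)) | Err(ExecuteError::Execution(e)) => {
            Outcome::Invalid(e)
        }
    }
}
```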
Outcomes
CC @eskimor @koute @bkchr
This issue is open for external contributions
CC @maksimryndin