Should be done before parathreads!
Unsurprisingly, all of the syscall options are Linux-only, and we again run into portability concerns. Looks like there is potentially some alternative for MacOS, though it's marked deprecated and would also increase our complexity/testing burden.
However, I'm wondering if running workers on Docker would be an acceptable solution. It would solve the portability issues we are repeatedly running into. The performance cost should (theoretically) be negligible, and we would get seccomp (and cgroups for https://github.com/paritytech/polkadot-sdk/issues/767). These syscalls should be supported on Docker on Macs, at least according to docker info on my M1 Mac.
Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 1
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 7.667GiB
 Name: docker-desktop
 ID: SCCV:JEJX:743F:W33L:I3VH:NTKL:GKQF:EMIT:E5BR:Q5LG:HMOW:NNYS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
cc @koute in case you're not subscribed.
I think sandboxing is pretty much non-negotiable so we have to go through with it regardless of any portability concerns. (And if we want security we are pretty much forced to use OS-specific APIs to do it.)
For macOS (and any other OS) for now I'd suggest we just don't sandbox in there and refuse to start a validator unless a mandatory e.g. --running-validator-on-this-os-is-unsecure (or something along those lines) command line argument is given. Virtually no one is currently running validators on macOS anyway, so our priority should be to make this work well on Linux, which the majority uses, and then we can expand the support to other OSes if we want. (Or as you'd say - just have a blessed Docker setup for macOS.)
Is there any data on OS distribution for validators? It's something I've been curious about for a while. cc @wpank. It would definitely be nice to drop our portability concerns as it simplifies things.
I agree that sandboxing is non-negotiable!
What portability concerns?
What portability concerns?
I just mean things we've been wanting to do that do not have good cross-platform support/alternatives. cgroups as mentioned above is one example. So far no such features have been an absolute "must-have", until sandboxing -- so we haven't yet had to make a clear decision about how to handle non-Linux systems.
As we need this now, I'll begin work on it.
so we haven't yet had to make a clear decision about how to handle non-Linux systems.
I think the "solution" @koute proposed above sounds like a good enough solution for now? We don't need to support Mac or Linux kernels from 2016 or whatever.
I’ve been researching seccomp and playing around with a PoC, and I have some concerns. The most sensible strategy is blocking by default, and explicitly specifying allowed syscalls. But, there are many different versions of syscalls and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the arch and C std lib validators can run on, to a handful that we have approved and tested...
And maybe I’m overcomplicating this, but I also realized the difficulty of testing every possible path through execution. E.g. if there is a panic during execution, it may trigger some unexpected syscalls. Or if a signal happens and some signal handler runs some unexpected code (and we have to account for the sigreturn syscall). Or something else I haven't thought of. I guess in these exceptional cases we would kill the process due to an unexpected syscall and it would be fine, because we retry once before disputing. But all these assumptions feel so fragile...
The most sensible strategy is blocking by default, and explicitly specifying allowed syscalls.
Yes.
But, there are many different versions of syscalls and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the arch and c std lib validators can run on, to a handful that we have approved and tested...
Yes, different versions of libc could use different syscalls, and that is a problem.
We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.
Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.
There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.
Another (crazier) alternative would be to maybe compile the PVF worker as a unikernel (e.g. with something like this) and run it under KVM. This would also give us easy portability to other OSes (we could relatively easily make it also work on macOS).
At a minimum, we would need to restrict the arch
That was originally the plan, wasn't it? For now only amd64's going to be supported.
In general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should be actually practical for us to implement.
We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.
Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.
Why not use existing proven technology (e.g. Docker) at that point? Either way, this would be very nice for determinism, too.
That was originally the plan, wasn't it? For now only amd64's going to be supported.
Oh, I didn't know that. Has this been discussed before?
In general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should be actually practical for us to implement.
Yes, I might be overthinking it. If something unexpected happens (e.g. a signal handler runs some unaccounted syscall), killing the process is probably what we want to do anyway.
I would be for gradually improving security here. E.g. start with launching the worker threads with a different user in a chroot environment. Assuming that different user does not have access to keys and such, we already have 3 security barriers in place that an attacker has to break through to get to keys or otherwise manipulate the machine:
It would also be good to identify some threat model. What are we actually worried about? Things that come to mind:
In addition, limiting the attack surface is generally a good idea obviously. Any syscall that is exposed could have some bug leading to privilege escalation. At the same time, if we are too restrictive we risk DoSing parachains and causing consensus issues on upgrades. Therefore ramping the security up there slowly might make sense. E.g. in addition to the above start by only blocking syscalls that are clearly not needed (e.g. access to network), then once we have more tooling/tests or whatever in place to be able to block more syscalls with confidence, we do so.
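To make that first step concrete, here is a rough sketch of what a "block only what's clearly not needed" filter could look like, using the seccompiler crate that comes up later in this thread (API and argument order are from memory, so treat it as illustrative rather than authoritative):
use std::collections::BTreeMap;
use seccompiler::{BpfProgram, SeccompAction, SeccompFilter, SeccompRule};

// Sketch: default-allow filter that only blocks network-related syscalls.
// Call this on the worker thread; the filter is inherited by threads spawned afterwards.
fn block_network_syscalls() -> Result<(), Box<dyn std::error::Error>> {
    let blocked = [
        libc::SYS_socket,
        libc::SYS_socketpair,
        libc::SYS_connect,
        libc::SYS_bind,
        libc::SYS_listen,
    ];
    let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
    for syscall in blocked {
        rules.insert(syscall, vec![]); // empty rule list = match every invocation
    }
    let filter = SeccompFilter::new(
        rules,
        SeccompAction::Allow,       // mismatch (not listed): allow, for now
        SeccompAction::KillProcess, // match (listed): kill the worker
        std::env::consts::ARCH.try_into()?,
    )?;
    let program: BpfProgram = filter.try_into()?;
    seccompiler::apply_filter(&program)?;
    Ok(())
}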
I agree that gradually amping up the security is a good idea. I started with seccomp per @koute's suggestion in https://github.com/paritytech/polkadot-sdk/issues/828, but it seems like that still requires more research/effort to not break determinism, i.e. we have to control the C lib and arch. I'm not concerned about dependency upgrades (wasmtime) because new syscalls would be caught in testing; it would just slow down our ability to upgrade wasmtime in case of critical CVEs.
E.g. start with launching the worker threads with a different user in a chroot environment.
In my research I found that chroot is not an effective security mechanism, and easy to break out of. I do think that running unshare with CLONE_NEWUSER is a good start, though that requires the calling process to not be threaded, so we would have to spawn an intermediate process first, as seen here.
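To make the intermediate-process idea concrete, a minimal sketch of the fork-then-unshare dance (hypothetical, with error handling and uid/gid map setup elided) might look like this:
// Sketch: fork a fresh (single-threaded) child, move it into a new user
// namespace, then exec the actual worker binary from there.
fn spawn_unshared_worker(worker_path: &std::ffi::CStr) {
    unsafe {
        match libc::fork() {
            -1 => panic!("fork failed"),
            0 => {
                // Child: freshly forked, hence single-threaded, so unshare is allowed.
                if libc::unshare(libc::CLONE_NEWUSER) != 0 {
                    libc::_exit(1);
                }
                // In a real implementation: write uid_map/gid_map, drop capabilities,
                // chdir into an empty directory, etc., before exec'ing.
                let argv = [worker_path.as_ptr(), std::ptr::null()];
                libc::execv(worker_path.as_ptr(), argv.as_ptr());
                libc::_exit(1); // only reached if execv failed
            }
            child => {
                // Parent: supervise the intermediate process as usual.
                let mut status: libc::c_int = 0;
                libc::waitpid(child, &mut status, 0);
            }
        }
    }
}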
Also, is it safe to assume that validators are running as root?
One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.
Or, if they can get some source of randomness they can do an untargeted attack, and this wouldn't require much of an escalation at all. I.e. a baddie can do significant economic damage by voting against with 1/3 chance, without even stealing keys or completely replacing the binary.
One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.
by e.g. killing its own worker?
by e.g. killing its own worker?
Yeah, or also by causing an error (right now all execution errors lead to invalid). If it always did this, then all validators would vote the same way, but if it could cause an invalid vote only sometimes, in either a targeted or untargeted way, then it would trigger some chaos.
So I agree that we can iteratively bump up security (splitting up the changes is good anyway), but eventually we need it to be pretty airtight. Luckily preparation/execution should require very few capabilities: opening/reading a file, memory operations. (This may not be the full list; so far I ran into a bug and enabled seccomp on the wrong thread.) (Would be nice to obviate the first one and read the file ahead of time, passing in the contents to the worker. Not sure if that is supported, but it would also make this concern moot.)
BTW I just found this issue about specializing on Linux -- linking it here since I didn't know about it. Let's share more thoughts on the topic there.
There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.
@koute Can you elaborate on this?
There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.
@koute Can you elaborate on this?
Calling some of libc's functions may end up calling different syscalls depending on the kernel the host's running. For example, there is the clone syscall and the Linux 5.3+-only clone3, so glibc will try to use clone3 and if that fails will fall back to clone. Although now that I think about it, IIRC for clone3/clone at least glibc looks at whether clone3 returns ENOSYS and then tries clone instead of explicitly checking the kernel version.
Nevertheless, there are two possible issues here:
1) Different libc versions can call different "versions" of a particular syscall for the same libc function.
2) Different libc versions can call a syscall or can instead emulate in userspace the functionality of a particular libc function.
This I think can be mostly solved by:
1) hooking into the uname syscall through seccomp and getting it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
2) by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)
So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.
opening/reading a file
This is something that could also be done by the parent process and then sending/receiving the artifact as we do the other communication.
Thanks for the proposal @koute!
This I think can be mostly solved by:
- hooking into the uname syscall through seccomp and get it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
- by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)
I don't think that (1) is possible with seccomp, though please correct me if I'm wrong. The closest thing is SECCOMP_RET_ERRNO, which doesn't quite work (we could use it for (2) though):
This value results in the SECCOMP_RET_DATA portion of the
filter's return value being passed to user space as the
errno value without executing the system call.
Without (1), does your proposal still work?
Although now that I think about it, IIRC for clone3/clone at least glibc looks at whether clone3 returns ENOSYS and then tries clone instead of explicitly checking the kernel version.
Looks like musl does this too, but it also has a lot of ifdefs, e.g. #ifdef SYS_select. So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.
So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.
I would really want to test this on different kernel versions as any oversights here can lead to disputes/slashing.
Also, can we assume an oldest supported kernel version, or that amd64 is the only arch? There are some values in the Validators guide under "Reference Hardware," but it explicitly states that this is "not a hard requirement". And while I see now that we only provide an amd64 release binary, I don't see a warning anywhere for validators not to build the source and run on e.g. arm, unless I missed it. That said, I think we do need to restrict "secure mode" (seccomp mode) to amd64 and arm64 because seccompiler only supports those.
1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)
2. We build against musl as @koute proposed above. This would ensure that in legitimate execution we never call an unexpected syscall unless we upgrade musl.
(2) requires coordination with the build team. Not clear how big of an effort it is, but non-trivial for sure.
For (5), I believe that for unrecognized syscalls we should kill the process as opposed to returning ENOSYS. With the latter strategy, if an unknown syscall is called (e.g. by an attacker) the thread would keep running and eventually (probably?) return an error, but the error could really be anything. This is bad because we would like to treat certain errors like file-not-found as internal (do not vote against the candidate). I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.
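For concreteness, the kill-by-default direction I have in mind would look roughly like this with seccompiler (a sketch only -- the allowed list is a placeholder until we have real tracing data, and the API details are from memory):
use std::collections::BTreeMap;
use seccompiler::{BpfProgram, SeccompAction, SeccompFilter, SeccompRule};

// Sketch: allow-list filter installed on the thread doing the actual PVF work.
// Any syscall not in the list kills the worker process, which gives us a
// deterministic failure mode instead of an arbitrary error.
fn install_allowlist() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder list; the real one would come from tracing/static analysis.
    let allowed = [
        libc::SYS_mmap,
        libc::SYS_munmap,
        libc::SYS_madvise,
        libc::SYS_futex,
        libc::SYS_exit,
        libc::SYS_rt_sigreturn,
    ];
    let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
    for syscall in allowed {
        rules.insert(syscall, vec![]); // no argument filtering, just allow the syscall
    }
    let filter = SeccompFilter::new(
        rules,
        SeccompAction::KillProcess, // mismatch (not allow-listed): kill the process
        SeccompAction::Allow,       // match (allow-listed): let it through
        std::env::consts::ARCH.try_into()?,
    )?;
    let program: BpfProgram = filter.try_into()?;
    seccompiler::apply_filter(&program)?;
    Ok(())
}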
I don't think that (1) is possible with seccomp, though please correct me if I'm wrong.
AFAIK it is possible. You use SECCOMP_RET_TRACE and then have the parent process modify the syscall's return value with ptrace.
So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.
Yes, this must be amd64-only for now. Each architecture has slightly different syscalls (e.g. amd64 uses the mmap syscall which takes the offset in bytes and arm uses the mmap2 syscall which takes the offset in pages) and the same syscalls can have different syscall numbers (but this can be easily mitigated by just grabbing the syscall numbers from the libc crate).
As I've previously said, should not be a big deal since essentially everyone who's running a validator runs on amd64. Later on we can expand it to aarch64 once this works properly on amd64.
1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)
Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.
2. We build against musl as @koute proposed above.
This does have some logistical issues though. Do we build everything under musl and just ship one polkadot binary? We'd have to disallow running a validator with non-musl binaries, which from the users' POV might be a pain (some people compile their own binaries). Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it.
My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs. This will work automatically for the users manually compiling polkadot (as long as they have the musl target installed, but we can just print an error like we do for WASM if it's missing), doesn't need much extra work from the CI people (just needs the target to be rustup-installed in the CI docker image), and in theory is also better for security, as a smaller binary will give fewer opportunities for attackers to chain together ROP gadgets if they get remote code execution somehow. (Although with good enough sandboxing it probably won't matter, but still.)
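To sketch what that build.rs could do (the crate name and env var here are made up, and this glosses over the locking and env-var care that wasm-builder takes):
// build.rs sketch: build the worker crate for the musl target into OUT_DIR and
// tell the main crate where the resulting binary ended up.
use std::{env, path::PathBuf, process::Command};

fn main() {
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    let status = Command::new(env::var("CARGO").unwrap_or_else(|_| "cargo".into()))
        .args(["build", "--release", "--target=x86_64-unknown-linux-musl"])
        .args(["-p", "polkadot-pvf-worker"]) // hypothetical worker crate name
        .arg("--target-dir")
        .arg(&out_dir) // separate target dir avoids fighting over the outer build's one
        .status()
        .expect("failed to run cargo");
    assert!(
        status.success(),
        "building the PVF worker failed; is the x86_64-unknown-linux-musl target installed?"
    );
    let worker = out_dir.join("x86_64-unknown-linux-musl/release/polkadot-pvf-worker");
    println!("cargo:rustc-env=PVF_WORKER_PATH={}", worker.display());
}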
I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.
Most likely the call would return an error, and that would be passed along to the worker (so e.g. if there's an unwrap there then it'd panic). So this shouldn't really be any less deterministic (as long as the libc used is the same).
Thanks! Indeed, looks like that would work if we hooked up ptrace; here's an amd64-specific way to do it. That would be a nice solution. We would have a small surface area of syscalls, making it more secure and also more predictable and deterministic (because we know that only one of e.g. clone and clone3 would ever be used). I'm also good with restricting the arch and keeping the initial scope low, since other validators can just run in insecure mode for now.
My only reservation is that I don't feel confident about relying on ENOSYS from unsupported syscalls to trigger a panic. I'm not convinced it would panic in the file-not-found case, and there also may be other error states in the future that we would want to treat as internal, i.e. not vote against. (Also, we currently treat panics as internal errors, though I have an issue open to change that.)
My approach (casting a wider net of syscalls and killing the process when seccomp is triggered) still seems safer to me. The main issue with it is just that we would need to revisit the whitelist whenever we upgrade musl, so we would need to have some process around that, but I expect it to be done rarely. This way we also don't need the "hack" with uname and ptrace. Let me know what you think.
My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.
I agree that a separate binary for the workers makes sense. It would indeed be better for security and executing with a known libc would be a win for determinism. Writing the binary to disk on execution would avoid the UX problems that are being discussed on the forum.
Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.
Definitely, thanks for the tip!
This way we also don't need the "hack" with uname and ptrace. Let me know what you think.
In this case we'll need to audit musl's sources and check whether it checks for the kernel version anywhere and handle that appropriately. (Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)
In this case we'll need to audit musl's sources and check whether it checks for the kernel version anywhere and handle that appropriately.
Is uname the only way to get the kernel version? Doing a quick ripgrep for it in the source doesn't come up with anything.
(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)
Agreed. Is it possible with seccomp when we also have a kill filter applied? From man:
Note that a tracer process will not be notified if another
filter returns an action value with a precedence greater
than SECCOMP_RET_TRACE.
SECCOMP_RET_KILL_PROCESS has the highest precedence. But then I read this:
If multiple filters exist, they are all executed, in reverse
order of their addition to the filter tree—that is, the most
recently installed filter is executed first.
Does this mean that if we install the kill filter first and the trace filter last, that the latter will run and notify the tracer, before the kill filter is hit? I was wondering if you knew, because the man page is confusing and Google is failing me -- otherwise I'll write up a PoC to test this.
(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)
IMO it starts getting ridiculous. We should not force any kernel/glibc/OS version.
(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)
IMO it starts getting ridiculous. We should not force any kernel/glibc/OS version.
@bkchr You've misunderstood; it's the other way around. (: Sorry, I should have been more clear.
By "pin the kernel version" I meant to just pin the version string which the sandboxed PVF worker sees (basically always pretend we're running on an older kernel), precisely so that the exact host kernel doesn't matter (as long as it's new enough). Basically to make it impossible for the PVF worker to do something like this:
if kernel_version > 5.13 {
use_new_syscall();
} else {
use_old_syscall();
}
If it'll always see the minimum supported kernel version then it won't be able to do this. And since we want to sandbox which syscalls it can do we must ensure that it'll always use the same ones. (But as I said, it might be unnecessary if the libc doesn't explicitly check the version anywhere, although it could be nice to have just in case if it'd suddenly start doing so and we wouldn't notice.)
We do not want to force a specific kernel/glibc/OS version on the host (besides a reasonable requirement that it's not ancient) as that'd be a horrible user experience. Everything should work out-of-box and not require any fiddling.
I believe there is no need to do this right now anyway. Currently uname is never called either in preparation or in execution or in the musl source -- if this ever changes in the future, that would be a new syscall that should be caught in a musl upgrade. It doesn't make sense to add this complication (an attached tracer program and extra syscall filter) until we need it, if we ever do.
(And thanks for ignoring my previous post; I realized in the middle of the night that it makes no sense. The uname filter could be applied independently of the default kill-the-process filter.)
So it seems we have a solid plan now. The main challenge is the musl binary. Later today I need to figure out who I can work with on the build team to make that a reality.
And thanks for ignoring my previous post;
Sorry, that was not deliberate; I just didn't have the time to fully reply yet. I was meaning to reply in more detail today. (:
Hah, I thought you were being nice because my post makes no sense. We don't want two filters running for uname. 🙂 But I'd definitely be interested in learning more details about precedences!
By "pin the kernel version" I meant to just pin the version string which the sandboxed PVF worker sees (basically always pretend we're running on an older kernel), precisely so that the exact host kernel doesn't matter (as long as it's new enough). Basically to make it impossible for the PVF worker to do something like this:
Ahh, thank you for the explanation @koute!
So it seems we have a solid plan now. The main challenge is the musl binary. Later today I need to figure out who I can work with on the build team to make that a reality.
What is the solid plan? Whatever you do here should be doable by everybody and not just one person at Parity. Everybody will need to be able to build the binary in the required way, so it's best to write a proper guide instead of trying to teach it to one person. In general I would favor @koute's approach of doing the build in build.rs (we could strip it down to just build the validation worker; currently we use the entire polkadot binary, which is not needed for the worker, and then compilation time would also be less). However, this should be hidden behind profile=production.
As I understand it, the current plan is a slight update of the one here:
3. Build the worker binary in build.rs against musl.
8. Establish a process around musl upgrades to update needed syscalls (should be done rarely).
Corrections welcome. I would need help with (3) and (8).
3. Build the worker binary in build.rs against musl. I would need help with (3).
This should be relatively easy. We should first start with some refactoring to have a minimal bin crate which only compiles in the PVF worker and nothing else. (This should also help with compile times as we won't have to compile the whole of polkadot twice.)
8. Establish a process around musl upgrades to update needed syscalls (should be done rarely).
We are not in control of what musl version people will use. So we need to somehow find out the musl version. In a perfect world we would have a set of PoVs that we could execute to trigger all of the syscalls, to record them. However, it is probably not possible to be sure that we hit all syscalls in this testing phase.
Some kind of static analysis for syscalls would be nice :thinking:
We are not in control of what musl version people will use.
In a way yes, but considering musl is shipped with rustc/rustup's x86_64-unknown-linux-musl target, it's not completely up to the users. So we could just test on a range of supported rustc versions plus whatever's on nightly.
Some kind of static analysis for syscalls would be nice :thinking:
Hmmm.... actually, this is an interesting idea. With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls? @mrcnski Can you ping me once we have a binary with just the PVF worker? I could try this out then and see if it's feasible.
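As a rough illustration of the idea (a heuristic only: it catches direct syscall instructions whose number is loaded into %eax right beforehand, assumes objdump's AT&T output format, and is not the script that ends up being used further down):
// Heuristic sketch: list syscall numbers that appear to be invoked directly.
use std::{collections::BTreeSet, process::Command};

fn main() {
    let out = Command::new("objdump")
        .args(["-d", "prepare_worker"]) // example binary path
        .output()
        .expect("failed to run objdump");
    let mut syscalls = BTreeSet::new();
    let mut last_eax: Option<u64> = None;
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        let line = line.trim_end();
        // e.g. "  4a2b10:  b8 3c 00 00 00   mov    $0x3c,%eax"
        if line.ends_with(",%eax") {
            if let (Some(start), Some(end)) = (line.rfind("$0x"), line.rfind(',')) {
                last_eax = u64::from_str_radix(&line[start + 3..end], 16).ok();
            }
        } else if line.ends_with("syscall") {
            // e.g. "  4a2b15:  0f 05            syscall"
            if let Some(n) = last_eax.take() {
                syscalls.insert(n);
            }
        }
    }
    println!("possible direct syscalls: {:?}", syscalls);
}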
Hmmm.... actually, this is an interesting idea. With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls? @mrcnski Can you ping me once we have a binary with just the PVF worker? I could try this out then and see if it's feasible.
Sure! I am quite interested to see what you find. I'm curious how you will separate out the noise (we are only interested in syscalls from the validate_using_artifact and prepare_artifact threads).
@koute I have a WIP branch here separating out the binaries. You should already be able to build in the node/core/pvf module. You should get prepare_worker and execute_worker binaries.
Next steps are embedding the binaries in polkadot and figuring out how to build with musl.
I realized that ltrace won't work, since it can't trace a single thread, at least not without adding some non-trivial machinery. So I'm curious if something comes of the static analysis approach. Otherwise, we should whitelist all related syscalls (e.g. if we observe mmap then we also let through mmap2, mmap3, mmap3.14 or whatever).
Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it.
My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.
@koute Can you point me to where we do this?
Apparently someone has had an issue with musl conflicting with jemalloc: https://kobzol.github.io/rust/ci/2021/05/07/building-rust-binaries-in-ci-that-work-with-older-glibc.html Not sure if this is still the case.
- Implement a flag that enables tracing of libc calls in PVF jobs.
- Determine what libc calls are made by PVF jobs. (by executing real PVFs)
That way you don't follow all the possible execution paths; I believe fuzz testing is needed to be (relatively) sure.
- Determine every syscall that each libc call we observed could make (just by looking through the musl source for each function).
Why do you think only libc functions are doing syscalls (or did I get it wrong)?
With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls?
That was also the first thing I thought of; something like objdump -d execute-worker | grep -B1 syscall would do the trick, but what about the WASM JIT-compiled code itself? Does it use any syscalls? Does the Wasmtime runtime code?
That way, you don't follow all the possible execution paths, I believe fuzz testing is needed to be (relatively) sure.
Fuzzing is on my todo list; the issue is that we would need to do it on every kernel version. That's why I also wanted to get a list of libc functions and analyze those.
That was also the first thing I thought of; something like objdump -d execute-worker | grep -B1 syscall would do the trick
We want to get the syscalls made only by a single thread, and not by the whole worker. Otherwise we get file operations, socket operations, and more that we don't want to allow in the whitelist. I tried that command anyway but couldn't tell what syscalls were being made; I see output like:
7d288a: 31 c0 xor %eax,%eax
7d288c: ff 15 3e 40 49 00 call *0x49403e(%rip) # c668d0 <syscall@GLIBC_2.2.5>
I ran the logging on linux-gnu-x86-64 with zombienet and undying-collator.
9 - mmap
11 - munmap
13 - rt_sigaction
14 - rt_sigprocmask
28 - madvise
60 - exit
98 - getrusage
131 - sigaltstack
11 - munmap
14 - rt_sigprocmask
24 - sched_yield
28 - madvise
60 - exit
131 - sigaltstack
157 - prctl
202 - futex
228 - clock_gettime
Note: I looked into moving the fs access out of the thread, but it turns out we do a direct memory mapping (mmap, madvise and munmap). So I think we should whitelist these syscalls to avoid a performance regression.
Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it. My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.
@koute Can you point me to where we do this?
For WASM the wasm-builder crate takes care of building the WASM blob; it is called in a build.rs like this, and then the blob's included in the native code like this.
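For the worker binary the analogous move might look roughly like this (a sketch; PVF_WORKER_PATH is a hypothetical env var that a build script like the one sketched earlier would set):
// Sketch: embed the pre-built musl worker binary and write it out before spawning it.
use std::{fs, io, os::unix::fs::PermissionsExt, path::{Path, PathBuf}};

const PVF_WORKER_BIN: &[u8] = include_bytes!(env!("PVF_WORKER_PATH"));

fn extract_worker(cache_dir: &Path) -> io::Result<PathBuf> {
    let path = cache_dir.join("pvf-worker");
    fs::write(&path, PVF_WORKER_BIN)?;
    // Owner-only and executable; the host then spawns it via std::process::Command.
    fs::set_permissions(&path, fs::Permissions::from_mode(0o700))?;
    Ok(path)
}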
Why do you think only libc functions are doing syscalls (or did I get it wrong)?
In general that's going to be the case; the majority of code (> 99%) is not going to be making syscalls directly. (There are some exceptions, e.g. the rustix crate makes them directly by default, but we want to disable that anyway since it makes profiling with Bytehound harder.)
but what about WASM JIT-compiled code itself? Does it use any syscalls? Does the Wasmtime runtime code?
AFAIK the JITted code doesn't do any syscalls directly; it always has to go through a host function.
- Establish a process around musl upgrades to update needed syscalls (should be done rarely).
We are not in control what musl version people will use. So, we need to somehow find out the musl version. In a perfect world we would have a set of PoVs that we could execute to trigger all of the sys calls, to record them. However, this is probably also not possible to be sure that we hit all sys calls in this testing phase.
Maybe we should be always recording the syscalls and musl versions people use. Before actually enabling the filtering we should already have a complete list of syscalls from validators running on mainnet and testnets for weeks with real PoVs. I believe it is feasible to log and report via Telemetry for example.
Agreed @sandreim! But I hit a snag with logging, quoting from here:
I don't think we can change where it logs. ☹️ I also don't know if the log location is always the same on each machine. On my GCP instance it's kern.log but the man page says it goes in the audit log, which I thought was audit.log.
I've done a bit of Googling but still don't know where seccomp is "supposed" to log. I'm afraid it may depend on the machines, configurations, kernels, distros, etc. Not a Linux user myself so I don't know where to find the answer.
So I tried to experiment with statically checking which syscalls a given binary can potentially use, and I can report that this was a success.
Here's the list for prepare_worker from @mrcnski's branch compiled for the --release profile:
$ ./list-syscalls.rb prepare_worker
Running objdump...
Parsing objdump output...
Found 77 functions doing direct syscalls
Found 32 functions doing indirect syscalls
Functions per syscall:
0 (read) [2 functions]
_rjem_je_pages_boot
read
1 (write) [3 functions]
_rjem_je_malloc_write
_rjem_je_wrtmessage
write
2 (open) [5 functions]
__init_libc
__map_file
_rjem_je_pages_boot
open
realpath
3 (close) [6 functions]
__map_file
__stdio_close
_rjem_je_pages_boot
close
fcntl
realpath
4 (stat) [1 functions]
fstatat
5 (fstat) [2 functions]
__map_file
fstatat
7 (poll) [2 functions]
__init_libc
poll
8 (lseek) [1 functions]
lseek
9 (mmap) [2 functions]
__init_tls
mmap
10 (mprotect) [1 functions]
mprotect
11 (munmap) [2 functions]
__unmapself
munmap
12 (brk) [2 functions]
__expand_heap
sbrk
13 (rt_sigaction) [2 functions]
__libc_sigaction
abort
14 (rt_sigprocmask) [8 functions]
__block_all_sigs
__block_app_sigs
__libc_sigaction
__pthread_create
__restore_sigs
abort
pthread_sigmask
start
15 (rt_sigreturn) [1 functions]
__restore_rt
20 (writev) [2 functions]
__stdio_write
writev
24 (sched_yield) [1 functions]
sched_yield
25 (mremap) [1 functions]
mremap
28 (madvise) [2 functions]
madvise
posix_madvise
39 (getpid) [1 functions]
getpid
41 (socket) [1 functions]
socket
42 (connect) [1 functions]
connect
45 (recvfrom) [1 functions]
recvfrom
53 (socketpair) [1 functions]
socketpair
55 (getsockopt) [1 functions]
getsockopt
56 (clone) [1 functions]
__clone
60 (exit) [5 functions]
_Exit
__clone
__unmapself
pthread_exit
start
61 (wait4) [1 functions]
waitpid
62 (kill) [1 functions]
kill
72 (fcntl) [5 functions]
fcntl
fstatat
open
socket
socketpair
79 (getcwd) [1 functions]
getcwd
87 (unlink) [1 functions]
unlink
89 (readlink) [1 functions]
readlink
96 (gettimeofday) [1 functions]
clock_gettime
97 (getrlimit) [1 functions]
getrlimit
98 (getrusage) [1 functions]
getrusage
99 (sysinfo) [1 functions]
sysinfo
110 (getppid) [1 functions]
getppid
131 (sigaltstack) [1 functions]
sigaltstack
144 (sched_setscheduler) [1 functions]
__pthread_create
157 (prctl) [1 functions]
prctl
158 (arch_prctl) [1 functions]
__set_thread_area
200 (tkill) [2 functions]
abort
raise
202 (futex) [39 functions]
_ZN11parking_lot10raw_rwlock9RawRwLock21unlock_exclusive_slow17h61503e7277b6f0a5E
_ZN11parking_lot10raw_rwlock9RawRwLock21unlock_exclusive_slow17ha0c630b95dc468edE
_ZN11parking_lot7condvar7Condvar19wait_until_internal17hcb9d2fca1f4f4b38E
_ZN11parking_lot9raw_mutex8RawMutex9lock_slow17heea83734fedad8e1E
_ZN11static_init12phase_locker4sync13transfer_lock17ha02b3c6e9250d56aE
_ZN16parking_lot_core9word_lock8WordLock9lock_slow17hae697b4476ee60abE
_ZN16parking_lot_core9word_lock8WordLock9lock_slow17hdb213fea64923d87E
_ZN3std10sys_common4once5futex4Once4call17h2f44d6b7e1729488E
_ZN3std10sys_common4once5futex4Once4call17h9dc68d51eb09f1fbE
_ZN3std10sys_common4once5futex4Once4call17hc037e91effa7fa2fE
_ZN3std10sys_common4once5futex4Once4call17he0bbb62abf4de447E
_ZN3std3sys4unix5futex10futex_wait17h9944e141004324baE
_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17h89d5f0a549c030aeE
_ZN3std3sys4unix5locks12futex_rwlock6RwLock14read_contended17he239402dfc02afd4E
_ZN3std3sys4unix5locks12futex_rwlock6RwLock15write_contended17hb26a22f98d4f4ed1E
_ZN3std3sys4unix5locks13futex_condvar7Condvar4wait17he8dabc36a2ae18a5E
_ZN3std6thread4park17hbeef6ed68ff7e8d6E
__bin_chunk
__lock
__lockfile
__pthread_cond_timedwait
__pthread_create
__pthread_mutex_trylock_owner
__timedwait_cp
__tl_sync
__tl_unlock
__unlock
__unlockfile
__vm_unlock
__wait
alloc_fwd
alloc_rev
malloc
pthread_cond_signal
pthread_exit
pthread_mutex_timedlock
pthread_mutex_unlock
pthread_rwlock_unlock
unlock
203 (sched_setaffinity) [1 functions]
sched_setaffinity
204 (sched_getaffinity) [2 functions]
sched_getaffinity
sysconf
213 (epoll_create) [1 functions]
epoll_create1
218 (set_tid_address) [2 functions]
__init_tp
start
228 (clock_gettime) [1 functions]
clock_gettime
231 (exit_group) [1 functions]
_Exit
232 (epoll_wait) [1 functions]
epoll_pwait
233 (epoll_ctl) [1 functions]
epoll_ctl
262 (newfstatat) [1 functions]
fstatat
273 (set_robust_list) [2 functions]
__pthread_mutex_trylock_owner
pthread_exit
281 (epoll_pwait) [1 functions]
epoll_pwait
284 (eventfd) [1 functions]
eventfd
290 (eventfd2) [1 functions]
eventfd
291 (epoll_create1) [1 functions]
epoll_create1
302 (prlimit64) [1 functions]
getrlimit
309 (getcpu) [1 functions]
sched_getcpu
318 (getrandom) [1 functions]
_ZN3std3sys4unix4rand19hashmap_random_keys17h7507563321f8d22bE
Used syscalls:
0 (read)
1 (write)
2 (open)
3 (close)
4 (stat)
5 (fstat)
7 (poll)
8 (lseek)
9 (mmap)
10 (mprotect)
11 (munmap)
12 (brk)
13 (rt_sigaction)
14 (rt_sigprocmask)
15 (rt_sigreturn)
20 (writev)
24 (sched_yield)
25 (mremap)
28 (madvise)
39 (getpid)
41 (socket)
42 (connect)
45 (recvfrom)
53 (socketpair)
55 (getsockopt)
56 (clone)
60 (exit)
61 (wait4)
62 (kill)
72 (fcntl)
79 (getcwd)
87 (unlink)
89 (readlink)
96 (gettimeofday)
97 (getrlimit)
98 (getrusage)
99 (sysinfo)
110 (getppid)
131 (sigaltstack)
144 (sched_setscheduler)
157 (prctl)
158 (arch_prctl)
200 (tkill)
202 (futex)
203 (sched_setaffinity)
204 (sched_getaffinity)
213 (epoll_create)
218 (set_tid_address)
228 (clock_gettime)
231 (exit_group)
232 (epoll_wait)
233 (epoll_ctl)
262 (newfstatat)
273 (set_robust_list)
281 (epoll_pwait)
284 (eventfd)
290 (eventfd2)
291 (epoll_create1)
302 (prlimit64)
309 (getcpu)
318 (getrandom)
This list will probably get cut down when the binary's built with LTO and/or if we further cut the dependencies used.
It's also not entirely bulletproof, but as long as there's no code which deliberately tries to trigger syscalls in a sneaky way or does anything super strange, it should be able to essentially extract all of the syscalls used.
The script used to do this can be found here:
https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb
(Written in Ruby because it's just a lot more productive language when experimenting than Python, but if we're going to hook it up to CI or something I can eventually rewrite it if necessary.)
A note regarding building the PVF worker to use musl: this unfortunately requires a musl-enabled C compiler (musl-gcc) to be installed, which can be kind of a pain to get installed depending on the distro. That's not a very good user experience. But fortunately it looks like we can probably work around this by just faking a musl-compatible compiler by e.g. having a script like this:
musl-gcc:
#!/bin/sh
gcc "$@"
Since this is not actually used to link any binaries it should most likely work just fine.
Nice, thank you!!!
This seems to be getting all the syscalls for the process and not just the thread that does the work, so there are syscalls that I would not want to let through (read, write, socket, etc. Also getrandom, to address my concerns here.). So we should have a blacklist for the whitelist. 🙃 The rest should mostly be fine to allow. ~50 syscalls is still a much lower surface area for attacks than 400+.
This seems to be getting all the syscalls for the process and not just the thread that does the work
Could we perhaps further cut down on the amount of stuff this binary does?
I am not sure, it already doesn't do much. The socket operations are necessary, at least we need some kind of IPC. The work thread itself does very little and I already refactored it to do bare minimum work in a WIP branch.
I had an idea that instead of CI (or in addition to), we could include your script in the binary itself (ideally by RIIRing it). Since validators can build against their own version of musl it would be their responsibility to run a subcommand and have polkadot verify itself. I think validators already are recommended to do some testing before they run. It could be a prereq of secure mode.
Overview
As future work for the PVF validation host we envisioned that the workers should be hardened.
Ideally, if a worker process were compromised, the attacker wouldn't get the rest of the system on a silver platter. To do that, we may want to consider:
Summary / Roadmap
https://hackmd.io/@gIXf9c2_SLijWKSGjgDJrg/BkNikRcf2
Potentially Useful Resources
https://www.redhat.com/sysadmin/container-security-seccomp