paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

PVF host: sandbox/harden worker process #882

Closed. pepyakin closed this issue 11 months ago.

pepyakin commented 2 years ago

Overview

As future work for the PVF validation host we envisioned that the workers should be hardened.

Ideally, if a worker process were compromised, the attacker wouldn't be handed the rest of the system on a silver platter. To that end, we may want to consider:

  1. running workers under a different user
  2. jailing: unshare(2) or clone(2) with fresh namespaces
  3. seccomp
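
For illustration of (1), here is a minimal sketch of dropping the worker to a dedicated unprivileged user before exec, using std's Unix CommandExt; the UID/GID and binary path are made up, and the parent needs the privileges (e.g. CAP_SETUID/CAP_SETGID) to switch users:

    use std::os::unix::process::CommandExt;
    use std::process::{Child, Command};

    // Hypothetical UID/GID of a dedicated, unprivileged "pvf-worker" user.
    const WORKER_UID: u32 = 1001;
    const WORKER_GID: u32 = 1001;

    fn spawn_worker_as_unprivileged_user(worker_bin: &str) -> std::io::Result<Child> {
        Command::new(worker_bin)
            // Switch to the dedicated user/group in the child before exec, so a
            // compromised worker cannot read the validator's keystore.
            .uid(WORKER_UID)
            .gid(WORKER_GID)
            .spawn()
    }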

Summary / Roadmap

https://hackmd.io/@gIXf9c2_SLijWKSGjgDJrg/BkNikRcf2

Potentially Useful Resources

https://www.redhat.com/sysadmin/container-security-seccomp

eskimor commented 1 year ago

Should be done before parathreads!

mrcnski commented 1 year ago

Unsurprisingly, all of the syscall options are Linux-only, and we again run into portability concerns. It looks like there is potentially an alternative for macOS, though it's marked deprecated and would also increase our complexity/testing burden.

However, I'm wondering if running workers in Docker would be an acceptable solution. It would solve the portability issues we are repeatedly running into. The performance cost should (theoretically) be negligible, and we would get seccomp (and cgroups for https://github.com/paritytech/polkadot-sdk/issues/767). These features should be supported by Docker on Macs, at least according to docker info on my M1 Mac:

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 1
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.49-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 7.667GiB
 Name: docker-desktop
 ID: SCCV:JEJX:743F:W33L:I3VH:NTKL:GKQF:EMIT:E5BR:Q5LG:HMOW:NNYS
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
mrcnski commented 1 year ago

cc @koute in case you're not subscribed.

koute commented 1 year ago

I think sandboxing is pretty much non-negotiable so we have to go through with it regardless of any portability concerns. (And if we want security we are pretty much forced to use OS-specific APIs to do it.)

For macOS (and any other OS) I'd suggest that, for now, we just don't sandbox there and refuse to start a validator unless an explicit command-line argument, e.g. --running-validator-on-this-os-is-unsecure (or something along those lines), is given. Virtually no one is currently running validators on macOS anyway, so our priority should be to make this work well on Linux, which the majority uses, and then we can expand support to other OSes if we want. (Or, as you'd say, just have a blessed Docker setup for macOS.)
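
A rough sketch of what that gate could look like; the flag name is only illustrative (taken from the suggestion above), not an actual CLI option:

    /// Refuse to start a validator on platforms where the PVF workers cannot be
    /// sandboxed, unless the operator explicitly opted out via a hypothetical
    /// flag such as `--running-validator-on-this-os-is-unsecure`.
    fn ensure_sandbox_or_explicit_opt_out(allow_insecure: bool) -> Result<(), String> {
        if cfg!(target_os = "linux") {
            // Namespaces/seccomp sandboxing is only implemented on Linux.
            return Ok(());
        }
        if allow_insecure {
            eprintln!("WARNING: running a validator without PVF worker sandboxing");
            Ok(())
        } else {
            Err("refusing to start a validator: PVF worker sandboxing is not supported on this OS".into())
        }
    }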

mrcnski commented 1 year ago

Is there any data on OS distribution for validators? It's something I've been curious about for a while. cc @wpank. It would definitely be nice to drop our portability concerns as it simplifies things.

I agree that sandboxing is non-negotiable!

bkchr commented 1 year ago

What portability concerns?

mrcnski commented 1 year ago

What portability concerns?

I just mean things we've been wanting to do that do not have good cross-platform support/alternatives. cgroups as mentioned above is one example. So far no such features have been an absolute "must-have", until sandboxing -- so we haven't yet had to make a clear decision about how to handle non-Linux systems.

As we need this now, I'll begin work on it.

bkchr commented 1 year ago

so we haven't yet had to make a clear decision about how to handle non-Linux systems.

I think the "solution" @koute proposed above sounds like a good enough solution for now? We don't need to support Mac or Linux kernels from 2016 or whatever.

mrcnski commented 1 year ago

I’ve been researching seccomp and playing around with a PoC, and I have some concerns. The most sensible strategy is blocking by default and explicitly specifying allowed syscalls. But there are many different versions of syscalls, and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the architectures and C standard libraries validators can run on to a handful that we have approved and tested...

And maybe I’m overcomplicating this, but I also realized the difficulty of testing every possible path through execution. E.g. if there is a panic during execution, it may trigger some unexpected syscalls. Or if a signal happens and some signal handler runs some unexpected code (and we have to account for the sigreturn syscall). Or something else I haven't thought of. I guess in these exceptional cases we would kill the process due to an unexpected syscall and it would be fine, because we retry once before disputing. But all these assumptions feel so fragile...

koute commented 1 year ago

The most sensible strategy is blocking by default, and explicitly specifying allowed syscalls.

Yes.

But, there are many different versions of syscalls and new ones are added all the time. This is a problem because when workers start running syscalls we didn’t account for, we will get disputes. At a minimum, we would need to restrict the arch and c std lib validators can run on, to a handful that we have approved and tested...

Yes, different versions of libc could use different syscalls, and that is a problem.

We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.

Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

Another (crazier) alternative would be to compile the PVF worker as a unikernel (e.g. with something like this) and run it under KVM. This would also give us easy portability to other OSes (we could relatively easily make it also work on macOS).

At a minimum, we would need to restrict the arch

That was originally the plan, wasn't it? For now only amd64's going to be supported.


In general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should actually be practical for us to implement.

mrcnski commented 1 year ago

We could just compile the PVF worker to use musl to make sure it's completely independent of the host's libc. A little annoying, but should be doable.

Another option would be to bundle a minimum known set of libraries that are necessary to run the PVF worker (our own mini docker, I suppose). We want to namespace/cgroup the process anyway, so that wouldn't be too big of a stretch.

Why not use existing proven technology (e.g. Docker) at that point? Either way, this would be very nice for determinism, too.

That was originally the plan, wasn't it? For now only amd64's going to be supported.

Oh, I didn't know that. Has this been discussed before?

In general I think the set of syscalls that the worker actually needs should be quite limited, which is why I think this form of sandboxing should actually be practical for us to implement.

Yes, I might be overthinking it. If something unexpected happens (e.g. a signal handler runs some unaccounted syscall), killing the process is probably what we want to do anyway.

eskimor commented 1 year ago

I would be for gradually improving security here, e.g. starting with launching the workers as a different user in a chroot environment. Assuming that user does not have access to keys and such, we already have three security barriers an attacker has to break through to get to the keys or otherwise manipulate the machine:

  1. Break through the wasm sandboxing.
  2. Break through chroot.
  3. Privilege escalation.

It would also be good to identify some threat model. What are we actually worried about? Things that come to mind:

  1. Stealing keys - that would be pretty bad. Assuming memory protection is in place, this is already very hard to do given the above three. We could make it even harder, by not allowing network access via seccomp for example.
  2. Taking control over the validator. E.g. replacing the polkadot binary with a polkadot-evil binary. Should again not be possible with the above.
  3. Intercepting and manipulating packets - effect very similar to 2, hard to do without also being able to do 1 or 2.
  4. ?

In addition, limiting the attack surface is generally a good idea obviously. Any syscall that is exposed could have some bug leading to privilege escalation. At the same time, if we are too restrictive we risk DoSing parachains and causing consensus issues on upgrades. Therefore ramping the security up there slowly might make sense. E.g. in addition to the above start by only blocking syscalls that are clearly not needed (e.g. access to network), then once we have more tooling/tests or whatever in place to be able to block more syscalls with confidence, we do so.

mrcnski commented 1 year ago

I agree that gradually amping up the security is a good idea. I started with seccomp per @koute's suggestion in https://github.com/paritytech/polkadot-sdk/issues/828, but it seems that still requires more research/effort to not break determinism, i.e. we have to control the C library and architecture. I'm not concerned about dependency upgrades (wasmtime) because new syscalls would be caught in testing; it would just slow down our ability to upgrade wasmtime in case of critical CVEs.

E.g. start with launching the worker threads with a different user in a chroot environment.

In my research I found that chroot is not an effective security mechanism, and easy to break out of. I do think that running unshare with CLONE_NEWUSER is a good start, though that requires the calling process to not be threaded, so we would have to spawn an intermediate process first, as seen here.
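
A minimal sketch of that intermediate-process shape, assuming the libc crate: std's pre_exec hook runs after fork() in the single-threaded child, so unshare(2) with CLONE_NEWUSER is permitted there even though the validator process itself is multi-threaded (paths and flags are illustrative):

    use std::os::unix::process::CommandExt;
    use std::process::{Child, Command};

    // Runs after fork() in the child, just before exec'ing the worker.
    fn enter_fresh_namespaces() -> std::io::Result<()> {
        // SAFETY: plain FFI call; the freshly forked child has exactly one thread,
        // which is what unshare(CLONE_NEWUSER) requires.
        let ret = unsafe { libc::unshare(libc::CLONE_NEWUSER | libc::CLONE_NEWNS) };
        if ret != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }

    fn spawn_worker_in_new_user_namespace(worker_bin: &str) -> std::io::Result<Child> {
        let mut cmd = Command::new(worker_bin);
        // SAFETY: the hook only makes a single async-signal-safe syscall.
        unsafe { cmd.pre_exec(enter_fresh_namespaces) };
        cmd.spawn()
    }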

Also, is it safe to assume that validators are running as root?

  1. ?

One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.

Or, if they can get some source of randomness they can do an untargeted attack, and this wouldn't require much of an escalation at all. I.e. a baddie can do significant economic damage by voting against with 1/3 chance, without even stealing keys or completely replacing the binary.

eskimor commented 1 year ago

One concern I have is that an attacker can target certain validators and make them vote invalid and get them slashed.

by e.g. killing its own worker?

mrcnski commented 1 year ago

by e.g. killing its own worker?

Yeah, or also by causing an error (right now all execution errors lead to an invalid vote). If it always did this, then all validators would vote the same way, but if it could cause an invalid vote only sometimes, in either a targeted or untargeted way, then it would trigger some chaos.

So I agree that we can iteratively bump up security (splitting up the changes is good anyway), but eventually we need it to be pretty airtight. Luckily, preparation/execution should require very few capabilities: opening/reading a file, memory operations. (This may not be the full list; so far I ran into a bug and enabled seccomp on the wrong thread.) (It would be nice to obviate the first one and read the file ahead of time, passing the contents to the worker. Not sure if that is supported, but it would also make this concern moot.)

mrcnski commented 1 year ago

BTW I just found this issue about specializing on Linux -- linking it here since I didn't know about it. Let's share more thoughts on the topic there.

mrcnski commented 1 year ago

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

@koute Can you elaborate on this?

koute commented 1 year ago

There's also the issue of the libc picking different syscalls depending on the kernel that the host's running. But we could probably easily simulate that by just mocking the kernel version to an older one and then running the thing and see if it works.

@koute Can you elaborate on this?

Calling some libc functions may end up invoking different syscalls depending on the kernel the host's running. For example, there is the clone syscall and, on Linux 5.3+, also clone3, so glibc will try to use clone3 and, if that fails, fall back to clone. Although now that I think about it, IIRC for clone3/clone at least, glibc looks at whether clone3 returns ENOSYS and then tries clone, instead of explicitly checking the kernel version.

Nevertheless, there are two possible issues here:

1) Different libc versions can call different "versions" of a particular syscall for the same libc function.
2) Different libc versions can either call a syscall or emulate the functionality of a particular libc function in userspace.

This I think can be mostly solved by:

1) hooking into the uname syscall through seccomp and get it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
2) by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)

So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.
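
A small sketch of that last check (refusing to run on a too-old kernel), assuming the libc crate; the version parsing is simplistic and the minimum version would be whatever we decide to support:

    /// Returns whether the running kernel is at least `min_major.min_minor`.
    fn kernel_is_new_enough(min_major: u32, min_minor: u32) -> std::io::Result<bool> {
        let mut uts: libc::utsname = unsafe { std::mem::zeroed() };
        if unsafe { libc::uname(&mut uts) } != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // The release string looks like "5.15.49-linuxkit"; take major.minor.
        let release = unsafe { std::ffi::CStr::from_ptr(uts.release.as_ptr()) }.to_string_lossy();
        let mut parts = release.split(|c: char| !c.is_ascii_digit());
        let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
        Ok((major, minor) >= (min_major, min_minor))
    }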

bkchr commented 1 year ago

opening/reading a file

This is something that could also be done by the parent process and then sending/receiving the artifact as we do the other communication.
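
A sketch of that idea, assuming the existing host <-> worker Unix socket can carry a length-prefixed blob; names and framing are illustrative, not the actual protocol:

    use std::io::{Read, Write};
    use std::os::unix::net::UnixStream;
    use std::path::Path;

    // Host side: read the artifact and stream the bytes to the worker,
    // so the worker itself never needs open/read syscalls.
    fn send_artifact(stream: &mut UnixStream, artifact_path: &Path) -> std::io::Result<()> {
        let bytes = std::fs::read(artifact_path)?;
        stream.write_all(&(bytes.len() as u64).to_le_bytes())?;
        stream.write_all(&bytes)
    }

    // Worker side: receive the artifact bytes over the already-open socket.
    fn recv_artifact(stream: &mut UnixStream) -> std::io::Result<Vec<u8>> {
        let mut len = [0u8; 8];
        stream.read_exact(&mut len)?;
        let mut bytes = vec![0u8; u64::from_le_bytes(len) as usize];
        stream.read_exact(&mut bytes)?;
        Ok(bytes)
    }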

mrcnski commented 1 year ago

Thanks for the proposal @koute!

This I think can be mostly solved by:

  1. hooking into the uname syscall through seccomp and get it to always return the same values for the kernel version (so the "calls different syscall based on the kernel version" problem is solved),
  2. by having it return ENOSYS for blocked syscalls and not abort the process when an unknown syscall is called (so the "opportunistically tries to call newer versions of syscalls when available" problem is solved, as it'll fall back to the older one)

I don't think that (1) is possible with seccomp, though please correct me if I'm wrong. The closest thing is SECCOMP_RET_ERRNO which doesn't quite work (we could use it for (2) though):

This value results in the SECCOMP_RET_DATA portion of the
filter's return value being passed to user space as the
errno value without executing the system call.

Without (1), does your proposal still work?

Although now that I think about it, IIRC for clone3/clone at least glibc looks at whether clone3 returns ENOSYS and then tries clone instead of explicitly checking the kernel version.

Looks like musl does this too, but it also has a lot of ifdefs e.g. #ifdef SYS_select. So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.

So actually we probably don't even need to test on different kernel versions; maybe except just run on the oldest supported kernel version to make sure all of the used syscalls are available there and refuse to run if an even older kernel is encountered.

I would really want to test this on different kernel versions as any oversights here can lead to disputes/slashing.

Also, can we assume an oldest supported kernel version, or that amd64 is the only arch? There are some values in the Validators guide under "Reference Hardware," but it explicitly states that this is "not a hard requirement". And while I see now that we only provide an amd64 release binary, I don't see a warning anywhere telling validators not to build from source and run on e.g. arm, unless I missed it. That said, I think we do need to restrict "secure mode" (seccomp mode) to amd64 and arm64, because seccompiler only supports those.

Counter-proposal

  1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)
  2. We build against musl as @koute proposed above.
  3. Determine every syscall that could be made by each libc call we observed (just by looking through the musl source for each function).
  4. Whitelist these syscalls.
  5. Set the seccomp filter to kill the process on all unrecognized syscalls.
  6. Implement "secure mode" flag and restrict it to amd64 and arm64.

This would ensure that in legitimate execution we never call an unexpected syscall unless we upgrade musl.

(2) requires coordination with the build team. Not clear how big of an effort it is, but non-trivial for sure.

For (5), I believe that for unrecognized syscalls we should kill the process as opposed to returning ENOSYS. With the latter strategy, if an unknown syscall is called (e.g. by an attacker) the thread would keep running and eventually (probably?) return an error, but the error could really be anything. This is bad because we would like to treat certain errors like file-not-found as internal (do not vote against the candidate). I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.
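
For a concrete picture of steps (4)-(6), here is a rough sketch assuming the libc crate and the seccompiler crate mentioned above (and its SeccompFilter/apply_filter API); the allowlist is deliberately abbreviated and purely illustrative, the real one would come from steps (1)-(4):

    use std::convert::TryInto;
    use seccompiler::{apply_filter, BpfProgram, SeccompAction, SeccompFilter};

    fn install_worker_seccomp_filter() -> Result<(), Box<dyn std::error::Error>> {
        // Abbreviated allowlist; an empty rule vector means the syscall is
        // allowed unconditionally.
        let rules = vec![
            (libc::SYS_mmap, vec![]),
            (libc::SYS_munmap, vec![]),
            (libc::SYS_madvise, vec![]),
            (libc::SYS_futex, vec![]),
            (libc::SYS_exit, vec![]),
        ]
        .into_iter()
        .collect();

        let filter = SeccompFilter::new(
            rules,
            // Anything not on the allowlist kills the whole process (step 5).
            SeccompAction::KillProcess,
            // Allowlisted syscalls run normally.
            SeccompAction::Allow,
            // Step 6: only build/apply this on the supported architectures.
            std::env::consts::ARCH.try_into().expect("unsupported arch for seccomp"),
        )?;
        let program: BpfProgram = filter.try_into()?;
        apply_filter(&program)?;
        Ok(())
    }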

koute commented 1 year ago

I don't think that (1) is possible with seccomp, though please correct me if I'm wrong.

AFAIK it is possible. You use SECCOMP_RET_TRACE and then have the parent process modify the syscall's return value with ptrace.

So I am guessing it will also try different syscalls based on the architecture it's compiled on, which complicates things unless we restrict the arch.

Yes, this must be amd64-only for now. Each architecture has slightly different syscalls (e.g. amd64 uses the mmap syscall which takes the offset in bytes and arm uses the mmap2 syscall which takes the offset in pages) and the same syscalls can have different syscall numbers (but this can be easily mitigated by just grabbing the syscall numbers from the libc crate).

As I've previously said, this should not be a big deal since essentially everyone who's running a validator runs on amd64. Later on we can expand it to aarch64 once this works properly on amd64.

1. Trace and determine what libc calls are made by the PVF jobs. (by executing real PVFs)

Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.

2. We build against musl as @koute proposed above.

This does have some logistical issues though. Do we build everything under musl and just ship one polkadot binary? We'd have to disallow running a validator with non-musl binaries, which from the users' POV might be a pain (some people compile their own binaries). Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it.

My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs. This will work automatically for the users manually compiling polkadot (as long as they have the musl target installed, but we can just print an error like we do for WASM if it's missing), doesn't need much extra work from the CI people (just needs the target to be rustup-installed in the CI docker image), and in theory is also better for security as a smaller binary will give less opportunities for attackers to chain together ROP gadgets if they get remote code execution somehow. (Although with good enough sandboxing it probably won't matter, but still.)
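
A very rough sketch of what such a build.rs step could do (this is not the wasm-builder mechanism, just the general idea; the worker crate name and target dir are hypothetical, and it assumes the musl target is rustup-installed):

    // build.rs of the polkadot binary crate (sketch).
    use std::process::Command;

    fn main() {
        let status = Command::new("cargo")
            .args([
                "build",
                "--release",
                "-p", "polkadot-pvf-worker",               // hypothetical worker crate
                "--target", "x86_64-unknown-linux-musl",
                // Separate target dir so we don't fight over the outer build's lock.
                "--target-dir", "target/pvf-worker",
            ])
            .status()
            .expect("failed to invoke cargo");
        if !status.success() {
            panic!(
                "building the PVF worker failed; is the musl target installed? \
                 (rustup target add x86_64-unknown-linux-musl)"
            );
        }
        // The resulting binary can then be embedded via include_bytes! and
        // written to disk before spawning, as discussed above.
    }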

I'm not sure what would happen in reality if we tried to read a file and ENOSYS was returned, but that's the point -- we need a deterministic failure mode.

Most likely the call would return an error, which would be passed along to the worker (so e.g. if there's an unwrap there, it'd panic). So this shouldn't really be any less deterministic (as long as the libc used is the same).

mrcnski commented 1 year ago

Thanks! Indeed, looks like that would work if we hooked up ptrace; here's an amd64-specific way to do it. That would be a nice solution. We would have a small surface area of syscalls, making it more secure and also more predictable and deterministic (because we know that only one of e.g. clone and clone3 would ever be used). I'm also good with restricting the arch and keeping the initial scope low, since other validators can just run in insecure mode for now.

My only reservation is that I don't feel confident about relying on ENOSYS from unsupported syscalls to trigger a panic. I'm not convinced it would panic in the file-not-found case, and there also may be other error states in the future that we would want to treat as internal i.e. not vote against. (Also, we currently treat panics as internal errors, though I have an issue open to change that.)

My approach (casting a wider net of syscalls and killing the process when seccomp is triggered) still seems safer to me. The main issue with it is just that we would need to revisit the whitelist whenever we upgrade musl, so we would need some process around that, but I expect it to be done rarely. This way we also don't need the "hack" with uname and ptrace. Let me know what you think.

My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.

I agree that a separate binary for the workers makes sense. It would indeed be better for security and executing with a known libc would be a win for determinism. Writing the binary to disk on execution would avoid the UX problems that are being discussed on the forum.

Yes. It'd be good to have e.g. some debugging flag or something for polkadot which would run the PVF worker under strace and dump those logs somewhere.

Definitely, thanks for the tip!

koute commented 1 year ago

This way we also don't need the "hack" with uname and ptrace. Let me know what you think.

In this case we'll need to audit musl's sources and check whether it checks for the kernel version anywhere and handle that appropriately. (Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)

mrcnski commented 1 year ago

In this case we'll need to audit musl's sources and check whether it checks for the kernel version anywhere and handle that appropriately.

Is uname the only way to get the kernel version? Doing a quick ripgrep for it in the source doesn't come up with anything.

(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)

Agreed. Is it possible with seccomp when we also have a kill filter applied? From man:

   Note that a tracer process will not be notified if another
   filter returns an action value with a precedence greater
   than SECCOMP_RET_TRACE.

SECCOMP_RET_KILL_PROCESS has the highest precedence. But then I read this:

   If multiple filters exist, they are all executed, in reverse
   order of their addition to the filter tree—that is, the most
   recently installed filter is executed first.

Does this mean that if we install the kill filter first and the trace filter last, that the latter will run and notify the tracer, before the kill filter is hit? I was wondering if you knew, because the man page is confusing and Google is failing me -- otherwise I'll write up a PoC to test this.

bkchr commented 1 year ago

(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)

IMO it starts getting ridiculous. We should not force any kernel/glibc/OS version.

koute commented 1 year ago

(Although it'd be safer to pin the kernel version anyway just in case; one less source of non-determinism.)

IMO it starts getting ridiculous. We should not force any kernel/glibc/OS version.

@bkchr You've misunderstood; it's the other way around. (: Sorry, I should have been more clear.

By "pin the kernel version" I meant to just pin the version string which the sandboxed PVF worker sees (basically always pretend we're running on an older kernel), precisely so that the exact host kernel doesn't matter (as long as it's new enough). Basically to make it impossible for the PVF worker to do something like this:

    if kernel_version > 5.13 {
        use_new_syscall();
    } else {
        use_old_syscall();
    }

If it'll always see the minimum supported kernel version then it won't be able to do this. And since we want to sandbox which syscalls it can do we must ensure that it'll always use the same ones. (But as I said, it might be unnecessary if the libc doesn't explicitly check the version anywhere, although it could be nice to have just in case if it'd suddenly start doing so and we wouldn't notice.)

We do not want to force a specific kernel/glibc/OS version on the host (besides a reasonable requirement that it's not ancient) as that'd be a horrible user experience. Everything should work out-of-box and not require any fiddling.

mrcnski commented 1 year ago

I believe there is no need to do this right now anyway. Currently uname is never called, either in preparation or execution, or anywhere in the musl source -- if this ever changes in the future, that would be a new syscall that should be caught in a musl upgrade. It doesn't make sense to add this complication (an attached tracer program and an extra syscall filter) until we need it, if we ever do.

(And thanks for ignoring my previous post; I realized in the middle of the night that it makes no sense. The uname filter could be applied independently of the default kill-the-process filter.)

So it seems we have a solid plan now. The main challenge is the musl binary. Later today I need to figure out who I can work with on the build team to make that a reality.

koute commented 1 year ago

And thanks for ignoring my previous post;

Sorry, that was not deliberate; I just didn't have the time to fully reply yet. I was meaning to reply in more detail today. (:

mrcnski commented 1 year ago

Hah, I thought you were being nice because my post makes no sense. We don't want two filters running for uname. 🙂 But I'd definitely be interested in learning more details about precedences!

bkchr commented 1 year ago

By "pin the kernel version" I meant to just pin the version string which the sandboxed PVF worker sees (basically always pretend we're running on an older kernel), precisely so that the exact host kernel doesn't matter (as long as it's new enough). Basically to make it impossible for the PVF worker to do something like this:

Ahh, thank you for the explanation @koute!

So it seems we have a solid plan now. The main challenge is the musl binary. Later today I need to figure out who I can work with on the build team to make that a reality.

What is the solid plan? Whatever you do here should be doable by everybody and not just one person at Parity. Everybody will need to be able to build the binary in the required way, so it's best to write a proper guide instead of trying to teach it to one person. In general I would favor @koute's approach of doing the build in build.rs (we could strip it down to just building the validation worker; currently we use the entire polkadot binary, which is not needed for the worker, and then compilation time would also be lower). However, this should be hidden behind profile=production.

mrcnski commented 1 year ago

As I understand it, the current plan is a slight update of the one here:

  1. Implement a flag that enables tracing of libc calls in PVF jobs.
  2. Determine what libc calls are made by PVF jobs. (by executing real PVFs)
  3. Build the worker binary in build.rs against musl.
  4. Determine every syscall that could be made by each libc call we observed (just by looking through the musl source for each function).
  5. Whitelist these syscalls.
  6. Set the seccomp filter to kill the process on all unrecognized syscalls.
  7. Implement the "secure mode" flag and restrict it to Linux amd64 for now.
  8. Establish a process around musl upgrades to update needed syscalls (should be done rarely).
  9. And yes, document all this.

Corrections welcome. I would need help with (3) and (8).

koute commented 1 year ago

3. Build the worker binary in build.rs against musl.

I would need help with (3)

This should be relatively easy. We should first start with some refactoring to have a minimal bin crate which only compiles in the PVF worker and nothing else. (This should also help with compile times as we won't have to compile the whole of polkadot twice.)

bkchr commented 1 year ago

8. Establish a process around musl upgrades to update needed syscalls (should be done rarely).

We are not in control of what musl version people will use, so we need to somehow find out the musl version. In a perfect world we would have a set of PoVs that we could execute to trigger all of the syscalls, in order to record them. However, it is probably not possible to be sure that we hit all syscalls in this testing phase.

bkchr commented 1 year ago

Some kind of static analysis for syscalls would be nice :thinking:

koute commented 1 year ago

We are not in control what musl version people will use.

In a way, yes, but considering musl is shipped with rustc/rustup's x86_64-unknown-linux-musl target, it's not completely up to the users. So we could just test on a range of supported rustc versions plus whatever's on nightly.

Some kind of static analysis for syscalls would be nice :thinking:

Hmmm.... actually, this is an interesting idea. With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls? @mrcnski Can you ping me once we have a binary with just the PVF worker? I could try this out then and see if it's feasible.
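
A naive sketch of that kind of scan, assuming objdump is available: it only catches the easy case where the syscall number is loaded into eax as an immediate right before a syscall instruction (the real script shared later in this thread is considerably more thorough):

    use std::collections::BTreeSet;
    use std::process::Command;

    // Very naive static scan: record `mov $N,%eax` immediates that precede a
    // `syscall` instruction. Indirect loads and calls through libc's syscall()
    // wrapper are missed entirely.
    fn direct_syscalls(binary: &str) -> std::io::Result<BTreeSet<u64>> {
        let output = Command::new("objdump").args(["-d", binary]).output()?;
        let text = String::from_utf8_lossy(&output.stdout);
        let mut found = BTreeSet::new();
        let mut last_eax_imm: Option<u64> = None;
        for line in text.lines().map(str::trim) {
            if line.ends_with("syscall") {
                if let Some(nr) = last_eax_imm {
                    found.insert(nr);
                }
            } else if line.contains("mov") && line.contains("%eax") {
                // e.g. "mov    $0x9,%eax" -> syscall number 9 (mmap)
                if let Some(imm) = line.split("$0x").nth(1).and_then(|s| s.split(',').next()) {
                    last_eax_imm = u64::from_str_radix(imm, 16).ok();
                }
            }
        }
        Ok(found)
    }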

mrcnski commented 1 year ago

Hmmm.... actually, this is an interesting idea. With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls? @mrcnski Can you ping me once we have a binary with just the PVF worker? I could try this out then and see if it's feasible.

Sure! I am quite interested to see what you find. I'm curious how you will separate out the noise (we are only interested in syscalls from the validate_using_artifact and prepare_artifact threads).

mrcnski commented 1 year ago

@koute I have a WIP branch here separating out the binaries. You should already be able to build in the node/core/pvf module. You should get prepare_worker and execute_worker binaries.

Next steps are embedding the binaries in polkadot and figuring out how to build with musl.

mrcnski commented 1 year ago

I realized that ltrace won't work, since it can't trace a single thread, at least not without adding some non-trivial machinery. So I'm curious if something comes of the static analysis approach. Otherwise, we should whitelist all related syscalls (e.g. if we observe mmap then we also let through mmap2, mmap3, mmap3.14 or whatever).

Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it.

My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.

@koute Can you point me to where we do this?

mrcnski commented 1 year ago

Apparently someone has had an issue with musl conflicting with jemalloc: https://kobzol.github.io/rust/ci/2021/05/07/building-rust-binaries-in-ci-that-work-with-older-glibc.html Not sure if this is still the case.

s0me0ne-unkn0wn commented 1 year ago
  • Implement a flag that enables tracing of libc calls in PVF jobs.
  • Determine what libc calls are made by PVF jobs. (by executing real PVFs)

That way you don't follow all the possible execution paths; I believe fuzz testing is needed to be (relatively) sure.

  • Determine every syscall that could be made by each libc call we observed (just by looking through the musl source for each function).

Why do you think only libc functions are doing syscalls (or did I get it wrong)?

With good enough LTO maybe this could be doable by just disassembling the binary with objdump and grepping for the syscalls?

That was also the first thing I thought of; something like objdump -d execute-worker | grep -B1 syscall would do the trick. But what about the WASM JIT-compiled code itself? Does it use any syscalls? Does the Wasmtime runtime code?

mrcnski commented 1 year ago

That way, you don't follow all the possible execution paths, I believe fuzz testing is needed to be (relatively) sure.

Fuzzing is on my todo list; the issue is that we would need to do it on every kernel version. That's why I also wanted to get a list of libc functions and analyze those.

mrcnski commented 1 year ago

Also was the first thing I thought of, something like objdump -d execute-worker |grep -B1 syscall would do the trick

We want to get the syscalls made only by a single thread, and not by the whole worker. Otherwise we get file operations, socket operations, and more that we don't want to allow in the whitelist. I tried that command anyway but couldn't tell what syscalls were being made; I see output like:

  7d288a:       31 c0                   xor    %eax,%eax
  7d288c:       ff 15 3e 40 49 00       call   *0x49403e(%rip)        # c668d0 <syscall@GLIBC_2.2.5>
mrcnski commented 1 year ago

I ran the logging on linux-gnu-x86-64 with zombienet and undying-collator.

Preparation syscalls

9  - mmap
11 - munmap
13 - rt_sigaction
14 - rt_sigprocmask
28 - madvise
60 - exit
98 - getrusage
131 - sigaltstack

Execution syscalls

11 - munmap
14 - rt_sigprocmask
24 - sched_yield
28 - madvise
60 - exit
131 - sigaltstack
157 - prctl
202 - futex
228 - clock_gettime

Note: I looked into moving the fs access out of the thread, but it turns out we do a direct memory mapping (mmap, madvise and munmap). So I think we should whitelist these syscalls to avoid a performance regression.
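
For context, a hedged sketch of what such a direct mapping looks like (using the memmap2 crate; this is illustrative, not the actual artifact-loading code, which happens inside the execution environment), and where the mmap/madvise/munmap calls come from:

    use std::fs::File;
    use std::path::Path;
    use memmap2::{Advice, Mmap};

    // Map the compiled artifact instead of read()-ing it into a buffer.
    fn load_artifact(path: &Path) -> std::io::Result<Mmap> {
        let file = File::open(path)?;                    // open(2)
        // SAFETY: the artifact file is not expected to change while mapped.
        let map = unsafe { Mmap::map(&file)? };          // mmap(2)
        map.advise(Advice::Sequential)?;                 // madvise(2)
        Ok(map)                                          // munmap(2) on drop
    }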

koute commented 1 year ago

Alternatively we could build a dedicated PVF worker executable and embed that into polkadot, and then when we want to use it we write it to disk and execute it. My vote would probably be to build a separate binary, e.g. do something similar to what we do for WASM runtimes and just build the binary in build.rs.

@koute Can you point me to where we do this?

For WASM the wasm-builder crate takes care of building the WASM blob which is then called in a build.rs like this and then the blob's included in the native code like this.

Why do you think only libc functions are doing syscalls (or did I get it wrong)?

In general that's going to be the case; the majority of code (> 99%) is not going to be making syscalls directly. (There are some exceptions, e.g. rustix crate makes them directly by default, but we want to disable that anyway since it makes profiling with Bytehound harder.)

but what about WASM JIT-compiled code itself? Does it use any syscalls? Does the Wasmtime runtime code?

AFAIK the JITted code doesn't do any syscalls directly; it always has to go through a host function.

sandreim commented 1 year ago
  1. Establish a process around musl upgrades to update needed syscalls (should be done rarely).

We are not in control of what musl version people will use, so we need to somehow find out the musl version. In a perfect world we would have a set of PoVs that we could execute to trigger all of the syscalls, in order to record them. However, it is probably not possible to be sure that we hit all syscalls in this testing phase.

Maybe we should always be recording the syscalls and musl versions people use. Before actually enabling the filtering we should already have a complete list of syscalls from validators running on mainnet and testnets for weeks with real PoVs. I believe it is feasible to log and report via telemetry, for example.

mrcnski commented 1 year ago

Agreed @sandreim! But I hit a snag with logging, quoting from here:

I don't think we can change where it logs. ☹️ I also don't know if the log location is always the same on each machine. On my GCP instance it's kern.log but the man page says it goes in the audit log, which I thought was audit.log.

I've done a bit of Googling but still don't know where seccomp is "supposed" to log. I'm afraid it may depend on the machines, configurations, kernels, distros, etc. Not a Linux user myself so I don't know where to find the answer.

koute commented 1 year ago

So I tried to experiment with statically checking which syscalls a given binary can potentially use, and I can report that this was a success.

Here's the list for prepare_worker from @mrcnski's branch compiled for the --release profile:

$ ./list-syscalls.rb prepare_worker
Running objdump...
Parsing objdump output...
Found 77 functions doing direct syscalls
Found 32 functions doing indirect syscalls
Functions per syscall:
    0 (read) [2 functions]
        _rjem_je_pages_boot
        read
    1 (write) [3 functions]
        _rjem_je_malloc_write
        _rjem_je_wrtmessage
        write
    2 (open) [5 functions]
        __init_libc
        __map_file
        _rjem_je_pages_boot
        open
        realpath
    3 (close) [6 functions]
        __map_file
        __stdio_close
        _rjem_je_pages_boot
        close
        fcntl
        realpath
    4 (stat) [1 functions]
        fstatat
    5 (fstat) [2 functions]
        __map_file
        fstatat
    7 (poll) [2 functions]
        __init_libc
        poll
    8 (lseek) [1 functions]
        lseek
    9 (mmap) [2 functions]
        __init_tls
        mmap
    10 (mprotect) [1 functions]
        mprotect
    11 (munmap) [2 functions]
        __unmapself
        munmap
    12 (brk) [2 functions]
        __expand_heap
        sbrk
    13 (rt_sigaction) [2 functions]
        __libc_sigaction
        abort
    14 (rt_sigprocmask) [8 functions]
        __block_all_sigs
        __block_app_sigs
        __libc_sigaction
        __pthread_create
        __restore_sigs
        abort
        pthread_sigmask
        start
    15 (rt_sigreturn) [1 functions]
        __restore_rt
    20 (writev) [2 functions]
        __stdio_write
        writev
    24 (sched_yield) [1 functions]
        sched_yield
    25 (mremap) [1 functions]
        mremap
    28 (madvise) [2 functions]
        madvise
        posix_madvise
    39 (getpid) [1 functions]
        getpid
    41 (socket) [1 functions]
        socket
    42 (connect) [1 functions]
        connect
    45 (recvfrom) [1 functions]
        recvfrom
    53 (socketpair) [1 functions]
        socketpair
    55 (getsockopt) [1 functions]
        getsockopt
    56 (clone) [1 functions]
        __clone
    60 (exit) [5 functions]
        _Exit
        __clone
        __unmapself
        pthread_exit
        start
    61 (wait4) [1 functions]
        waitpid
    62 (kill) [1 functions]
        kill
    72 (fcntl) [5 functions]
        fcntl
        fstatat
        open
        socket
        socketpair
    79 (getcwd) [1 functions]
        getcwd
    87 (unlink) [1 functions]
        unlink
    89 (readlink) [1 functions]
        readlink
    96 (gettimeofday) [1 functions]
        clock_gettime
    97 (getrlimit) [1 functions]
        getrlimit
    98 (getrusage) [1 functions]
        getrusage
    99 (sysinfo) [1 functions]
        sysinfo
    110 (getppid) [1 functions]
        getppid
    131 (sigaltstack) [1 functions]
        sigaltstack
    144 (sched_setscheduler) [1 functions]
        __pthread_create
    157 (prctl) [1 functions]
        prctl
    158 (arch_prctl) [1 functions]
        __set_thread_area
    200 (tkill) [2 functions]
        abort
        raise
    202 (futex) [39 functions]
        _ZN11parking_lot10raw_rwlock9RawRwLock21unlock_exclusive_slow17h61503e7277b6f0a5E
        _ZN11parking_lot10raw_rwlock9RawRwLock21unlock_exclusive_slow17ha0c630b95dc468edE
        _ZN11parking_lot7condvar7Condvar19wait_until_internal17hcb9d2fca1f4f4b38E
        _ZN11parking_lot9raw_mutex8RawMutex9lock_slow17heea83734fedad8e1E
        _ZN11static_init12phase_locker4sync13transfer_lock17ha02b3c6e9250d56aE
        _ZN16parking_lot_core9word_lock8WordLock9lock_slow17hae697b4476ee60abE
        _ZN16parking_lot_core9word_lock8WordLock9lock_slow17hdb213fea64923d87E
        _ZN3std10sys_common4once5futex4Once4call17h2f44d6b7e1729488E
        _ZN3std10sys_common4once5futex4Once4call17h9dc68d51eb09f1fbE
        _ZN3std10sys_common4once5futex4Once4call17hc037e91effa7fa2fE
        _ZN3std10sys_common4once5futex4Once4call17he0bbb62abf4de447E
        _ZN3std3sys4unix5futex10futex_wait17h9944e141004324baE
        _ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17h89d5f0a549c030aeE
        _ZN3std3sys4unix5locks12futex_rwlock6RwLock14read_contended17he239402dfc02afd4E
        _ZN3std3sys4unix5locks12futex_rwlock6RwLock15write_contended17hb26a22f98d4f4ed1E
        _ZN3std3sys4unix5locks13futex_condvar7Condvar4wait17he8dabc36a2ae18a5E
        _ZN3std6thread4park17hbeef6ed68ff7e8d6E
        __bin_chunk
        __lock
        __lockfile
        __pthread_cond_timedwait
        __pthread_create
        __pthread_mutex_trylock_owner
        __timedwait_cp
        __tl_sync
        __tl_unlock
        __unlock
        __unlockfile
        __vm_unlock
        __wait
        alloc_fwd
        alloc_rev
        malloc
        pthread_cond_signal
        pthread_exit
        pthread_mutex_timedlock
        pthread_mutex_unlock
        pthread_rwlock_unlock
        unlock
    203 (sched_setaffinity) [1 functions]
        sched_setaffinity
    204 (sched_getaffinity) [2 functions]
        sched_getaffinity
        sysconf
    213 (epoll_create) [1 functions]
        epoll_create1
    218 (set_tid_address) [2 functions]
        __init_tp
        start
    228 (clock_gettime) [1 functions]
        clock_gettime
    231 (exit_group) [1 functions]
        _Exit
    232 (epoll_wait) [1 functions]
        epoll_pwait
    233 (epoll_ctl) [1 functions]
        epoll_ctl
    262 (newfstatat) [1 functions]
        fstatat
    273 (set_robust_list) [2 functions]
        __pthread_mutex_trylock_owner
        pthread_exit
    281 (epoll_pwait) [1 functions]
        epoll_pwait
    284 (eventfd) [1 functions]
        eventfd
    290 (eventfd2) [1 functions]
        eventfd
    291 (epoll_create1) [1 functions]
        epoll_create1
    302 (prlimit64) [1 functions]
        getrlimit
    309 (getcpu) [1 functions]
        sched_getcpu
    318 (getrandom) [1 functions]
        _ZN3std3sys4unix4rand19hashmap_random_keys17h7507563321f8d22bE

Used syscalls:
    0 (read)
    1 (write)
    2 (open)
    3 (close)
    4 (stat)
    5 (fstat)
    7 (poll)
    8 (lseek)
    9 (mmap)
    10 (mprotect)
    11 (munmap)
    12 (brk)
    13 (rt_sigaction)
    14 (rt_sigprocmask)
    15 (rt_sigreturn)
    20 (writev)
    24 (sched_yield)
    25 (mremap)
    28 (madvise)
    39 (getpid)
    41 (socket)
    42 (connect)
    45 (recvfrom)
    53 (socketpair)
    55 (getsockopt)
    56 (clone)
    60 (exit)
    61 (wait4)
    62 (kill)
    72 (fcntl)
    79 (getcwd)
    87 (unlink)
    89 (readlink)
    96 (gettimeofday)
    97 (getrlimit)
    98 (getrusage)
    99 (sysinfo)
    110 (getppid)
    131 (sigaltstack)
    144 (sched_setscheduler)
    157 (prctl)
    158 (arch_prctl)
    200 (tkill)
    202 (futex)
    203 (sched_setaffinity)
    204 (sched_getaffinity)
    213 (epoll_create)
    218 (set_tid_address)
    228 (clock_gettime)
    231 (exit_group)
    232 (epoll_wait)
    233 (epoll_ctl)
    262 (newfstatat)
    273 (set_robust_list)
    281 (epoll_pwait)
    284 (eventfd)
    290 (eventfd2)
    291 (epoll_create1)
    302 (prlimit64)
    309 (getcpu)
    318 (getrandom)

This list will probably get cut down when the binary's built with LTO and/or if we further cut the dependencies used.

It's also not entirely bulletproof, but as long as there's no code which deliberately tries to trigger syscalls in a sneaky way or does anything super strange, it should be able to extract essentially all of the syscalls used.

The script used to do this can be found here:

https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb

(Written in Ruby because it's just a much more productive language to experiment in than Python, but if we're going to hook it up to CI or something I can eventually rewrite it if necessary.)


A note regarding building the PVF worker to use musl: this unfortunately requires a musl-enabled C compiler (musl-gcc) to be installed, which can be kind of a pain to get installed depending on the distro. That's not a very good user experience. But fortunately it looks like we can probably work around this by just faking a musl-compatible compiler by e.g. having a script like this:

musl-gcc:

#!/bin/sh

gcc "$@"

Since this is not actually used to link any binaries it should most likely work just fine.

mrcnski commented 1 year ago

Nice, thank you!!!

This seems to be getting all the syscalls for the process and not just the thread that does the work, so there are syscalls that I would not want to let through (read, write, socket, etc.; also getrandom, to address my concerns here). So we should have a blacklist for the whitelist. 🙃 The rest should mostly be fine to allow. ~50 syscalls is still a much lower attack surface than 400+.

koute commented 1 year ago

This seems to be getting all the syscalls for the process and not just the thread that does the work

Could we perhaps further cut down on the amount of stuff this binary does?

mrcnski commented 1 year ago

I am not sure; it already doesn't do much. The socket operations are necessary; at the least we need some kind of IPC. The work thread itself does very little, and I already refactored it to do the bare minimum of work in a WIP branch.

I had an idea that instead of CI (or in addition to), we could include your script in the binary itself (ideally by RIIRing it). Since validators can build against their own version of musl it would be their responsibility to run a subcommand and have polkadot verify itself. I think validators already are recommended to do some testing before they run. It could be a prereq of secure mode.