paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

PVF: Research the use of CPU virtualization for PVF execution #652

Open sandreim opened 1 year ago

sandreim commented 1 year ago

For an improved security posture, we should consider running the PVF on a KVM virtualized CPU. As I/O is not required and communication with the host can happen via a memory-mapped region, there shouldn't be any performance degradation, even in the case of nested virtualization.
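To make the "communication via a memory-mapped region" idea concrete, here is a minimal sketch of a length-prefixed mailbox over a shared region. The region is modelled as a plain byte slice; the layout (4-byte little-endian length header followed by the payload) and the names `write_message`, `read_message` and `MailboxError` are assumptions for illustration, not anything from the PVF code. The reader bounds-checks the length because the other side of the region would be untrusted.

```rust
/// Hypothetical layout: 4-byte little-endian length, then the payload bytes.
const HEADER_LEN: usize = 4;

#[derive(Debug, PartialEq)]
enum MailboxError {
    /// The payload does not fit into the shared region.
    TooLarge,
}

/// Writes `payload` into the region, prefixed with its length.
fn write_message(region: &mut [u8], payload: &[u8]) -> Result<(), MailboxError> {
    let end = HEADER_LEN
        .checked_add(payload.len())
        .ok_or(MailboxError::TooLarge)?;
    if end > region.len() || payload.len() > u32::MAX as usize {
        return Err(MailboxError::TooLarge);
    }
    region[..HEADER_LEN].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    region[HEADER_LEN..end].copy_from_slice(payload);
    Ok(())
}

/// Reads a message back, returning `None` if the header claims more bytes
/// than the region holds (the writer may be malicious).
fn read_message(region: &[u8]) -> Option<&[u8]> {
    let mut header = [0u8; HEADER_LEN];
    header.copy_from_slice(region.get(..HEADER_LEN)?);
    let len = u32::from_le_bytes(header) as usize;
    region.get(HEADER_LEN..HEADER_LEN.checked_add(len)?)
}
```

In a real setup the slice would be backed by guest memory registered with the hypervisor, and some signalling mechanism would be needed on top; both are out of scope for this sketch.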

bkchr commented 1 year ago

Sounds like this could be easier than using seccomp?

sandreim commented 1 year ago

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology. AWS, GCP, and other cloud providers use KVM to virtualize their compute and memory.

sandreim commented 1 year ago

Building a PoC would help us understand any shortcomings in the context of our very specific use case. We don't even need device emulation or a real VM, just a single virtualized CPU with a few memory regions. We could easily fork https://github.com/firecracker-microvm/firecracker and strip it down to our basic needs.

bkchr commented 1 year ago

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology

Yeah for sure, I know :D What I meant is that doing this is much better than trying to predict what kind of syscalls we are doing and prohibiting the others. I'm still afraid that this will fail at some point when we overlook some syscall.

sandreim commented 1 year ago

Yes, I am also afraid of stalling the chain because we did not add a syscall to the whitelist. seccomp is a defense-in-depth measure, for when all else has failed and you want to reduce the blast radius of the incident.

mrcnski commented 1 year ago

No, it is not. Hardware virtualization provides far better security than any other software sandboxing technology

Yeah for sure, I know :D What I meant is that doing this is much better than trying to predict what kind of syscalls we are doing and prohibiting the others. I'm still afraid that this will fail at some point when we overlook some syscall.

That's definitely a concern. I've been putting in a lot of work to mitigate that. Using @koute's script at build-time should mostly prevent even building a binary that contains disallowed syscalls. (In practice, though, the worker threads actually use quite few syscalls, about a dozen.)
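As a rough illustration of what such a build-time check amounts to, the sketch below compares a list of syscalls found in a binary against an allowlist and reports the offenders. It is not @koute's script: the function name `disallowed_syscalls` is hypothetical, and how the observed syscalls get extracted from the binary is left out entirely.

```rust
use std::collections::BTreeSet;

/// Returns the observed syscalls that are missing from the allowlist,
/// sorted and de-duplicated. A build step could fail if this is non-empty.
fn disallowed_syscalls<'a>(observed: &[&'a str], allowlist: &BTreeSet<&str>) -> Vec<&'a str> {
    let mut offenders: Vec<&'a str> = observed
        .iter()
        .copied()
        .filter(|name| !allowlist.contains(name))
        .collect();
    offenders.sort_unstable();
    offenders.dedup();
    offenders
}
```

The syscall names passed in would be whatever the analysis step produces; an allowlist such as `futex`, `mmap`, `write` is only an example and not the set the workers really use.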

But yeah, we want additional measures because just blocking syscalls is not enough -- if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness. So say they make the worker job vote against with 50% chance, that would stall the chain.

koute commented 1 year ago

Note that AFAIK seccomp should essentially work out-of-box everywhere while KVM might require some extra setup from the users (e.g. some distros disallow access to KVM and require the user to be added to a special kvm group), or might not work at all on machines without support for hardware virtualization (not sure how common that is nowadays).

But you're right that hardware virtualization would be technically more secure.

It should be worth it to make a proof of concept and test it out in practice.

if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness

Hmm... preventing the attacker from acquiring a source of randomness is going to be tricky. In the presence of remote code execution it's not possible to disallow access to a source of randomness. Even if you disallow things like creating threads or measuring time through seccomp (although IIRC grabbing the time on amd64 goes through the vDSO shim, so it might not even be possible to sandbox that with seccomp as no syscalls are involved), you still have e.g. the rdrand hardware instruction, which the attacker could execute to just ask the CPU directly for some random bytes. AFAIK the only way to prevent that is to use virtualization and set the appropriate VMX bit to make the VM abort when that's called. There might be more corner cases here.
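The rdrand point can be shown in a few lines: the sketch below obtains random bytes from the CPU without issuing any syscall, so a seccomp filter never sees it. The wrapper name `hardware_random` is made up for this example; `_rdrand64_step` and `is_x86_feature_detected!` are the standard library's x86_64 intrinsic and feature-detection macro. On other architectures the sketch just returns `None`.

```rust
/// Asks the CPU for 64 random bits via RDRAND. No syscall is involved.
#[cfg(target_arch = "x86_64")]
fn hardware_random() -> Option<u64> {
    #[target_feature(enable = "rdrand")]
    unsafe fn rdrand64() -> Option<u64> {
        let mut value = 0u64;
        // The intrinsic returns 1 when the hardware produced a value.
        let ok = unsafe { std::arch::x86_64::_rdrand64_step(&mut value) };
        if ok == 1 {
            Some(value)
        } else {
            None
        }
    }

    if is_x86_feature_detected!("rdrand") {
        // Safe to call: the CPU reports RDRAND support.
        unsafe { rdrand64() }
    } else {
        None
    }
}

/// Fallback for targets without RDRAND, so the sketch compiles everywhere.
#[cfg(not(target_arch = "x86_64"))]
fn hardware_random() -> Option<u64> {
    None
}
```

An attacker with code execution would of course emit the instruction directly rather than go through feature detection; the check is only there to keep the example well-behaved.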

sandreim commented 1 year ago

Note that AFAIK seccomp should essentially work out-of-box everywhere while KVM might require some extra setup from the users (e.g. some distros disallow access to KVM and require the user to be added to a special kvm group), or might not work at all on machines without support for hardware virtualization (not sure how common that is nowadays).

AFAIK it is widely supported, and if it is not, the validator can run without it. Indeed, it requires some extra setup, but this is a small price to pay for the increased security. It could easily be part of the validator setup guide.

But you're right that hardware virtualization would be technically more secure.

It should be worth it to make a proof of concept and test it out in practice.

Yeah, that is something we should pursue!

if an attacker can break out of the WASM sandbox they probably can also get out-of-bounds memory, which essentially gives them a source of randomness

Hmm... preventing the attacker from acquiring a source of randomness is going to be tricky. In the presence of remote code execution it's not possible to disallow access to a source of randomness. Even if you disallow things like creating threads or measuring time through seccomp (although IIRC grabbing the time on amd64 goes through the vDSO shim, so it might not even be possible to sandbox that with seccomp as no syscalls are involved), you still have e.g. the rdrand hardware instruction, which the attacker could execute to just ask the CPU directly for some random bytes. AFAIK the only way to prevent that is to use virtualization and set the appropriate VMX bit to make the VM abort when that's called. There might be more corner cases here.

What we would need is to develop and maintain a CPU template that customizes which features we expose through CPUID and which we allow. See https://github.com/firecracker-microvm/firecracker/tree/main/src/vmm/src/cpuid for an example aimed at general-purpose VMs. AFAIK this was not really supported on aarch64, but things might have changed. In our case we would have something very restrictive, or maybe we want to enable some CPU instruction set extensions that WASM can use for increased performance.

bkchr commented 1 year ago

It should be worth it to make a proof of concept and test it out in practice.

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

koute commented 1 year ago

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Yep. But I think we still can first do the work of splitting the worker into a separate binary (and stripping it as much as possible) without necessarily sandboxing it yet, as that will be necessary regardless of which approach we pick.

mrcnski commented 1 year ago

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Agreed. I already have much of the seccomp logging implemented, so IMO it makes sense to finish that before a big context switch. And then the logging can run on validators for a while while I work on virtualization. And yeah, I will first split out the worker binaries (without musl-builder for now).[^1]

[^1]: Not having musl may make the syscalls less deterministic in theory, but in practice very few are triggered and the logging may show us that musl is not strictly needed.

bkchr commented 1 year ago

Okay ty!

sandreim commented 1 year ago

@mrcnski I would highly recommend that this is done before moving forward with the seccomp implementation.

Agreed. I already have much of the seccomp logging implemented, so IMO it makes sense to finish that before a big context switch. And then the logging can run on validators for a while while I work on virtualization. And yeah, I will first split out the worker binaries (without musl-builder for now).[^1]

IMO a virtualization PoC is a big rock to push uphill. We should first enable the logging to at least collect some data while we work on the PoC.

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/1

alindima commented 1 year ago

I like this idea and I think it'll indeed provide far better security than any process-level sandboxing 👍🏻 but as Andrei said, it'll be challenging.

What we would need is to develop and maintain a CPU template that customizes which features we expose through CPUID and which we allow. See https://github.com/firecracker-microvm/firecracker/tree/main/src/vmm/src/cpuid for an example aimed at general-purpose VMs. AFAIK this was not really supported on aarch64, but things might have changed. In our case we would have something very restrictive, or maybe we want to enable some CPU instruction set extensions that WASM can use for increased performance.

Actually, even if a given CPUID template is set (one which doesn't advertise a certain instruction), a guest program can still try to execute that instruction (and it'll work). CPUID only helps with non-malicious guests, by providing a common view of the CPU features to all guests. We'd still need to set the right VMX bits to trap on instructions like RDRAND, on top of CPUID masking.
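For reference, the CPUID-masking half of this could look roughly like the sketch below. `CpuidEntry` and `hide_hw_rng` are hypothetical names; the struct only loosely mirrors the leaf/subleaf/register shape of a CPUID entry and is not Firecracker's or KVM's actual type. The bit positions are the architectural ones: RDRAND is CPUID leaf 0x1, ECX bit 30. RDSEED (leaf 0x7, subleaf 0, EBX bit 18) is not mentioned in the thread and is included here only as another instruction of the same kind.

```rust
/// Simplified stand-in for one CPUID leaf as presented to the guest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CpuidEntry {
    function: u32, // CPUID leaf (EAX input)
    index: u32,    // CPUID subleaf (ECX input)
    eax: u32,
    ebx: u32,
    ecx: u32,
    edx: u32,
}

const LEAF_FEATURES: u32 = 0x1;
const LEAF_EXT_FEATURES: u32 = 0x7;
const ECX_RDRAND: u32 = 1 << 30; // CPUID.01H:ECX[30]
const EBX_RDSEED: u32 = 1 << 18; // CPUID.(EAX=07H,ECX=0):EBX[18]

/// Clears the RDRAND and RDSEED feature bits so the guest does not see them
/// advertised. This does NOT stop the guest from executing the instructions.
fn hide_hw_rng(entries: &mut [CpuidEntry]) {
    for entry in entries.iter_mut() {
        match (entry.function, entry.index) {
            (LEAF_FEATURES, _) => entry.ecx &= !ECX_RDRAND,
            (LEAF_EXT_FEATURES, 0) => entry.ebx &= !EBX_RDSEED,
            _ => {}
        }
    }
}
```

As the comment above explains, this only changes what a cooperative guest believes about the CPU. Actually trapping the instruction needs the corresponding VM-execution control to be set by the hypervisor, which this sketch does not attempt.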