rust-vmm / seccompiler

Provides easy-to-use Linux seccomp-bpf jailing.
https://crates.io/crates/seccompiler
Apache License 2.0
72 stars 11 forks source link

Documentation/clarification on SeccompCmpArgLen #59

Open boustrophedon opened 1 year ago

boustrophedon commented 1 year ago

Is it basically always okay to use SeccompCmpArgLen::Qword on 64 bit systems? I'm doing so in my tests and I don't seem to see any issues, but I also don't have anything close to exhaustive tests of all syscalls.

Libseccomp seems to always do the full 64 bit comparison on 64 bit systems: https://github.com/seccomp/libseccomp/blob/main/src/db.c#L1665

I don't have any plans currently to support 32 bit systems but I'm just not sure when I might encounter issues. Does it exist basically so that the backend can generate either 32 or 64 bit ebpf regardless of what the host architecture is? i.e. When actually running the ebpf, you should always use Qword on 64 bit systems and Dword on 32 bit systems?

alindima commented 1 year ago

Hi. Have a look here to see why this arg len was introduced: https://github.com/firecracker-microvm/firecracker/pull/1227 The issue linked in that PR description has a lot of details.

Let me know if this solves the issue. I think you're right, we should better document this enum

boustrophedon commented 1 year ago

Thanks for the link! I've now read the thread and am able to reproduce the issue in both libseccomp and libseccomp-rs. I think to summarize (mostly the same as serban300's summary):

glibc's ioctl definition takes a c_ulong for its second argument, whereas musl takes a c_int. man 2 ioctl says they are 32 bit constants, and so does posix, but glibc is a pain as always.

# in rust's libc crate
[linux_like (main)]$ rg 'pub fn ioctl' | rg 'gnu|musl'
linux/musl/mod.rs:    pub fn ioctl(fd: ::c_int, request: ::c_int, ...) -> ::c_int;
linux/gnu/mod.rs:    pub fn ioctl(fd: ::c_int, request: ::c_ulong, ...) -> ::c_int;

While you can cast x as libc::c_int or x as libc::c_ulong depending on what libc you're building for, the issue is that since musl treats it as signed, it does a sign-extending mov into the kernel's 64 bit argument register:

# excerpt from `objdump -d /usr/lib/musl/lib/libc.a`
Disassembly of section .text.ioctl:

0000000000000000 <ioctl>:
   0:   48 83 ec 58             sub    $0x58,%rsp
   4:   48 63 ff                movslq %edi,%rdi
   7:   48 63 f6                movslq %esi,%rsi                     # rsi is the second parameter of a syscall
   a:   48 89 54 24 30          mov    %rdx,0x30(%rsp)
   f:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
  16:   00 00
  18:   48 89 44 24 18          mov    %rax,0x18(%rsp)
  1d:   31 c0                   xor    %eax,%eax
  1f:   48 8d 44 24 60          lea    0x60(%rsp),%rax
  24:   c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
  2b:   48 89 44 24 08          mov    %rax,0x8(%rsp)
  30:   48 8d 44 24 20          lea    0x20(%rsp),%rax
  35:   48 89 44 24 10          mov    %rax,0x10(%rsp)
  3a:   b8 10 00 00 00          mov    $0x10,%eax
  3f:   0f 05                   syscall

So when a 32 bit int greater than 0x8000_0000 gets passed as a signed 32 bit parameter, the sign bit gets extended to the upper 32 bits.

On the seccomp/BPF side, struct seccomp_data (see <linux/seccomp.h>) has the arguments uniformly in an array of unsigned 64 bit ints, so there's no way to tell whether it was originally "supposed to be" a 32 bit integer or not. You have to know when creating the filter itself, so you can know whether the argument is "supposed to be 32 bit" and you should only compare the lsb, or whether you should compare the entire parameter.


libseccomp and libseccomp-rs repros https://gist.github.com/boustrophedon/aa46c0c6c28b38a542a73d2205471c07

boustrophedon commented 1 year ago

So I'm wondering if the best thing to do would be to create a "quirks file" either in seccompiler or in extrasafe that simply keeps track of which syscalls have 32 bit arguments. The only risk I can think of is that a syscall expanding the type of a parameter, which seems unlikely, but even that can probably be mitigated by requiring that the msb all be 0 or 1 in the filter so that nothing gets through.

The alternative is that this work is done in each project that ues seccomp individually.

Do you know if anyone spoke to the libseccomp people or the seccomp maintainers about this issue?

boustrophedon commented 1 year ago

Ah, I think they are aware because I found this section in the manpage:


       It is recommended that whenever possible developers avoid using
       the SCMP_CMP() and SCMP_A{0-5}() macros and use the variants
       which are explicitly 32 or 64-bit.  This should help eliminate
       problems caused by an unwanted sign extension of negative datum
       values.
alindima commented 1 year ago

I don't think maintaining in seccompiler a list of syscalls that use 32-bit and 64-bit params is feasible. Every project that wants to use secomp is supposed to audit their filter and decide which argument length to use depending on whether they use musl/glibc and their respective versions.

This is quite a confusing issue, so maybe having it better documented somewhere would help. Both in the readme and in the rust docs of SeccompCmpArgLen Since you got a pretty good understanding of it, would you like to contribute it? 😄

boustrophedon commented 1 year ago

Every project that wants to use secomp is supposed to audit their filter and decide which argument length to use depending on whether they use musl/glibc and their respective versions.

I think this is why seccomp hasn't seen wider adoption and why people praise openbsd's pledge and unveil over it - it's just simpler to use. 99.99% of developers have no idea whether they're using an ioctl parameter with the 32nd bit set or not when they develop a web application with a directory traversal vulnerability.

I'll experiment with a quirks list inside extrasafe.

Regarding the documentation, I can maybe write a PR but I have a bunch of other things I also need to do so it might take some time.

alindima commented 1 year ago

it's just simpler to use

Maybe, for basic use-cases, but I believe seccomp is arguably more flexbile for an advanced user. AFAICT, pledge is a higher-level primitive than seccomp, one that probably can have an equivalent wrapper over seccomp, on linux, at least to some extent.

Now, system call filtering is a very difficult thing to get right for an application, simply because most of them are not calling system calls directly and don't have control over which system calls their dependencies are using. My opinion is that if you want a simple but secure sandbox, use light-weight virtualization. IMO seccomp is better as a last-resort mechanism (to simply limit the attack surface of the kernel in the unlikely event that a sandbox escape does happen).

Anyway, the vast majority of use-cases shouldn't be concerned IMO with restricting parameters of system calls. For the majority of them, I think simple filtering on the syscall number is probably enough.

Regarding the documentation, I can maybe write a PR but I have a bunch of other things I also need to do so it might take some time.

same here. I'll leave this issue open to track the documentation change, in case anybody gets some time to write it