sys/linux: generating ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) may break VM

ramosian-glider commented 3 months ago

The following program:

r0 = openat$kvm(0x0, &(0x7f0000000040), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
r3 = mmap$KVM_VCPU(&(0x7f0000009000/0x1000)=nil, 0x930, 0x280000f, 0x11, r2, 0x0)
syz_memcpy_off$KVM_EXIT_HYPERCALL(r3, 0x20, &(0x7f00000001c0)="fb4149dd033be3ac2cc4a22332a77b23b08986814d7bb14c94a6ab8031d1dfd92f00000000010000005a9610fbff67521ce16f8f1f449a7a835673312b54ebb2aa7fc869d22627e7", 0x0, 0x48)
mmap$KVM_VCPU(&(0x7f0000000000/0xa000)=nil, 0x930, 0x1000001, 0x11, r2, 0x0)
openat$kvm(0x0, &(0x7f0000000040), 0x0, 0x0)
openat$kvm(0xffffff9c, &(0x7f0000000040), 0x0, 0x0)
mmap$KVM_VCPU(&(0x7f0000000000/0x14000)=nil, 0x930, 0x0, 0x5c1fd1b656592f1, 0xffffffffffffffff, 0x0)
mmap$KVM_VCPU(&(0x7f0000001000/0x2000)=nil, 0x930, 0x2000003, 0x4120932, 0xffffffffffffffff, 0x0)
openat$kvm(0xffffffffffffff9c, &(0x7f0000000080), 0x0, 0x0)
ioctl$KVM_IRQFD(0xffffffffffffffff, 0x4020ae76, &(0x7f0000000140)={0xffffffffffffffff, 0x6})
r4 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000080), 0x0, 0x0)
r5 = ioctl$KVM_CREATE_VM(r4, 0xae01, 0x0)
ioctl$KVM_CREATE_DEVICE(r5, 0xc00caee0, &(0x7f0000000140)={0x4, <r6=>0xffffffffffffffff, 0x1})
ioctl$KVM_SET_DEVICE_ATTR(r6, 0x40086602, &(0x7f0000000040)={0x5cfe, 0x0, 0x0, 0x0})

breaks fuzzing in a subtle way, setting a number of root filesystems attributes that prevent syz-executor from creating and deleting files:

# lsattr -d /
-uS-iadAc-j---------

This happens because the last call, ioctl$KVM_SET_DEVICE_ATTR(r6, 0x40086602, &(0x7f0000000040)={0x5cfe, 0x0, 0x0, 0x0}), has an incorrect ioctl number corresponding to FS_IOC_GETFLAGS, and because it happens to be called on a file descriptor somehow pointing at the root filesystem.

The latter part is not clear to me yet. The fd in question has a value of 6, and is supposed to be returned by KVM_CREATE_DEVICE, but could as well be some fd created by the executor.

I'll dig into this, but anyway I believe we need to disallow mutating ioctl numbers to value 0x40086602, unless that's FS_IOC_GETFLAGS.

dvyukov commented 3 months ago

0x40086602 is FS_IOC_SETFLAGS:

sys/linux/fs_ioctl.txt.const:FS_IOC_SETFLAGS = 1074292226, 386:arm:1074030082, mips64le:ppc64le:2148034050

dvyukov commented 3 months ago

0x5cfe maps to these attributes:

#define FS_UNRM_FL          0x00000002 /* Undelete */
#define FS_COMPR_FL         0x00000004 /* Compress file */
#define FS_SYNC_FL          0x00000008 /* Synchronous updates */
#define FS_IMMUTABLE_FL         0x00000010 /* Immutable file */
#define FS_APPEND_FL            0x00000020 /* writes to file may only append */
#define FS_NODUMP_FL            0x00000040 /* do not dump file */
#define FS_NOATIME_FL           0x00000080 /* do not update atime */
#define FS_NOCOMP_FL            0x00000400 /* Don't compress */
#define FS_ENCRYPT_FL           0x00000800 /* Encrypted file */
#define FS_BTREE_FL         0x00001000 /* btree format dir */
#define FS_INDEX_FL         0x00001000 /* hash-indexed directory */
#define FS_JOURNAL_DATA_FL      0x00004000 /* Reserved for ext3 */

ramosian-glider commented 3 months ago

Yeah, sorry, I mistyped the name. It is FS_IOC_SETFLAGS that is harmful.

ramosian-glider commented 3 months ago

As far as I understand, it's still possible for the fuzzer to accidentally random ioctl constants that don't match the discrimination.

Shall I handle it in Neutralize() in sys/targets/common.go?

dvyukov commented 3 months ago

Neutralizing it is the last resort, since then we won't ever test it on any filesystem. And it seems to be doing lots of complex things that are fs specific.

We can try to attack this from different angles:

prevent it from opening / (which is shouldn't)
always mount tmpfs and hide root fs completly

We already prohibit a bunch of ioctls for similar reasons (FIFREEZE, EXT4_IOC_SHUTDOWN, EXT4_IOC_RESIZE_FS), so if we hide root fs, we could re-enable these as well.

I am trying to understand how it gets access to /, but I can't reproduce it, I always get:

#0 [1304ms] -> ioctl$KVM_SET_DEVICE_ATTR(0xffffffffffffffff, 0x40086602, 0x20000040)
#0 [1304ms] <- ioctl$KVM_SET_DEVICE_ATTR=0xffffffffffffffff errno=9

Is it reliably reproducible for you? If so, please run with -debug flag to see what fd it gets.

ramosian-glider commented 3 months ago

Neutralizing it is the last resort, since then we won't ever test it on any filesystem. And it seems to be doing lots of complex things that are fs specific.

I was going to do something along the lines of:

+               if c.Args[1].(*prog.ConstArg).Val == arch.FS_IOC_SETFLAGS &&
+                       c.Meta.Name != "ioctl$FS_IOC_SETFLAGS" {
+                       c.Args[1].(*prog.ConstArg).Val = arch.FS_IOC_GETFLAGS
+               }

, i.e. only neutralize if the ioctl number doesn't match its discrimination.

This won't solve the problem of generating a perfectly valid program doing ioctl$FS_IOC_SETFLAGS() though.

Is it reliably reproducible for you? If so, please run with -debug flag to see what fd it gets.

Yeah, I'm on it.

FWIW there's also a similar problem with FS_IOC_FSSETXATTR

dvyukov commented 3 months ago

This won't solve the problem of generating a perfectly valid program doing ioctl$FS_IOC_SETFLAGS() though.

Exactly, so it won't solve the problem. It's even easier for the fuzzer to do the same using ioctl$FS_IOC_SETFLAGS.

ramosian-glider commented 3 months ago

log.txt

Here's the log from my VM. I am trying to reduce the repro now, but it is extremely fragile. I suppose it also depends on one of the FDs opened by the executor.

dvyukov commented 3 months ago

Fun. This indeed works:

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fds[2];
    pipe(fds);
    openat(fds[1], "/dev/kvm", 0, 0);
    sleep(1000);
}

What is this fd?

$ cat /proc/890477/fdinfo/5
pos:    0
flags:  0100000
mnt_id: 25
ino:    724

dvyukov commented 3 months ago

It looks like it's just "/dev/kvm".

But in today's episode of Fun With Fuzzers:

0x5cfe returns EINVAL for this ioctl, but it looks like it passed memory mmaped from kvm vcpu to open openat and the ioctl :) So we don't know the actual flags value, nor the opened file name. If the mmaped vcpu happened to have /\x00 at that location, then it could open root.

#0 [60126ms] -> mmap$KVM_VCPU(0x20000000, 0x930, 0x1000001, 0x11, 0x5, 0x0)
#0 [60137ms] <- mmap$KVM_VCPU=0x20000000
SIGSEGV on 0x20000040, skipping
#0 [60149ms] -> openat$kvm(0x0, 0x20000040, 0x0, 0x0)
#0 [60157ms] <- openat$kvm=0x6
...
#0 [60269ms] -> ioctl$KVM_SET_DEVICE_ATTR(0x6, 0x40086602, 0x20000040)

dvyukov commented 3 months ago

The mmaped memory most likely has 0s (at least among values that can possibly succeed when being using as a file name in openat), but opening an empty string from pipe dir fails, so that wasn't that :)

pipe2([3, 4], 0)                        = 0
openat(4, "", O_RDONLY)                 = -1 ENOENT (No such file or directory)

ramosian-glider commented 3 months ago

Note that mmap() calls are requesting 0x930 bytes, which the executor probably rounds up, but prog2c happily leaves as is (trying a C repro now)

ramosian-glider commented 3 months ago

Running a slightly modified C repro under strace:

...
mmap(0x1ffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1ffff000
mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mmap(0x21000000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x21000000
write(1, "executing program\n", 18executing program
)     = 18
openat(0, "/dev/kvm", O_RDONLY)         = 3
ioctl(3, KVM_CREATE_VM, 0)              = 4
ioctl(4, KVM_CREATE_VCPU, 0)            = 5
mmap(0x20009000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC|PROT_SEM|PROT_GROWSUP|0x800000, MAP_SHARED|MAP_FIXED, 5, 0) = 0x20009000
mmap(0x20000000, 4096, PROT_READ|PROT_GROWSDOWN, MAP_SHARED|MAP_FIXED, 5, 0) = 0x20000000
newfstatat(1, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0xcc, 0x40), ...}, AT_EMPTY_PATH) = 0
ioctl(1, TCGETS, {c_iflag=BRKINT|ICRNL|IXON|IMAXBEL, c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD|HUPCL|CLOCAL, c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
write(1, "path: '/' (0x1000000002f)\n", 26path: '/' (0x1000000002f)
) = 26
openat(AT_FDCWD, "/", O_RDONLY)         = 6
mmap(0x20000000, 4096, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS|MAP_POPULATE|MAP_NONBLOCK|MAP_EXECUTABLE|MAP_HUGETLB|0x16002c0|25<<MAP_HUGE_SHIFT, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x20001000, 4096, PROT_READ|PROT_WRITE|PROT_GROWSUP, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_GROWSDOWN|MAP_DENYWRITE|MAP_STACK|MAP_FIXED_NOREPLACE|1<<MAP_HUGE_SHIFT, -1, 0) = 0x20001000
ioctl(-1, KVM_IRQFD, 0x20000140)        = -1 EBADF (Bad file descriptor)
openat(AT_FDCWD, "/dev/kvm", O_RDONLY)  = 7
ioctl(7, KVM_CREATE_VM, 0)              = 8
ioctl(8, KVM_CREATE_DEVICE, 0x20000140) = 0
ioctl(6, FS_IOC_SETFLAGS, [FS_UNRM_FL|FS_COMPR_FL|FS_SYNC_FL|FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL|FS_NOATIME_FL|FS_NOCOMP_FL|FS_ENCRYPT_FL|FS_INDEX_FL|FS_JOURNAL_DATA_FL]) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

so the mapped CPU data contains 0x1000000002f, which is indeed treated as "/\0". Essentially we're doing the following:

Set up a VM with a virtual CPU that does nothing, but provides some static data in its struct kvm_run.
Map that CPU's struct kvm_run at address 0x20000000 with PROT_READ, so that it overlaps with syzlang's heap.
Attempt to write "/dev/kvm" at address 0x20000040 and fail, because the memory is protected.
Call openat(AT_FDCWD, "/", O_RDONLY) instead of opening /dev/kvm, because struct kvm_run has 0x1000000002f at address 0x20000040.
Remap address 0x20000000 with PROT_WRITE, so that program arguments can be stored in it again.
Prepare the arguments for KVM_IRQFD at address 0x20000140, placing the totally random value 6 at address 0x20000144. This value coincidentally happens to be the value returned by openat() at step 4.
Call ioctl(-1, KVM_IRQFD, 0x20000140), which does nothing interesting (we only wanted to write 6 to address 0x20000144).
Create a new VM and a virtual device, initializing a struct kvm_device at address 0x20000140, so that the output kvm_device.fd parameter is located at address 0x20000144.
Take fd (which is 6) from the output parameter of ioctl(KVM_CREATE_DEVICE) and pass it to ioctl(FS_IOC_SETFLAGS), breaking the file system.

dvyukov commented 3 months ago

so the mapped CPU data contains 0x1000000002f, which is indeed treated as "/\0".

This is hilarious.

I don't think we can prevent all such cases, fuzzer came with other creative ways to break attempts to restrict it. It's more reliable to prevent these things on sandbox level. I think we should mount tmpfs as rootfs for sandbox=none and then re-enable all prohibited syscalls.

ramosian-glider commented 3 months ago

We could also do a couple more things to reduce the risk of such cases.

First, it makes little sense to call discriminated versions of openat() with random string parameters, so maybe we need to allocate them in a region that can't be mprotect()ed or mmap()ed by the program.

Second, we'd better not reuse addresses for structs allocated for different syscalls. Given the typical program size, we can afford placing them at a new address every time. In return we'll have more reproducible results that are easier to minimize (calls do not implicitly depend on each other). An alternative would be to explicitly zero out a struct before initializing its fields.

dvyukov commented 3 months ago

All of this will also reduce fuzzer's ability to find bugs (passing device mmaped memory as a file name looks like a good corner case to test), so it's a tough choice.

ramosian-glider commented 3 months ago

Does having overlapping structs really help much? To me it looks more like a loophole that allows the fuzzer to work around resource type checks by writing something into one struct as an int and then reading it back as an fd.

dvyukov commented 3 months ago

I think we try to avoid overlapping structs as much as possible, where it's reasonably easy to do.

ramosian-glider commented 3 months ago

My current intent is to factor out the ./syz-tmp bits from namespace_sandbox_proc() and make sure every sandbox uses them.

dvyukov commented 3 months ago

My current intent is to factor out the ./syz-tmp bits from namespace_sandbox_proc() and make sure every sandbox uses them.

This would be good for sandbox=none, but I am not sure about sandbox=setuid. sandbox=none can mount own filesystems of all types, so it can test them. sandbox=setuid can't so it will test just this tmpfs, and won't test the actual rootfs at all. But this looks important -- the reachable surface is already pretty small under sandbox=setuid, and executing open/read/write/ftruncate on the actual rootfs (e.g. ext4) is a significant part of that surface. If we do tmpfs in this case, we won't test what an unpriv user can do with the root fs. But sandbox=setuid also totally shouldn't be able to change rootfs attrs (if it manages to get same failures woth broken footfs, we already found a very serious bug).

ramosian-glider commented 3 months ago

Hmm, some tests start failing when I am trying to mount tmpfs in do_sandbox_none(). I suppose this is because we don't run the tests under root.

I could do unshare(CLONE_NEWUSER) to pretend I am root, but I doubt I can elevate the capabilities for mounting tmpfs in the tests.

I could as well skip the tmpfs part for non-root users, but it will remain untested then, which is undesirable.

ramosian-glider commented 3 months ago

Oh wait, there's CAP_LINUX_IMMUTABLE that we can drop in order to prevent setting FS_IMMUTABLE_FL and FS_APPEND_FL. This won't stop the fuzzer from messing up with other attributes, but at least it won't break the basic functionality like file/directory creation.

ramosian-glider commented 3 months ago

Yet another angle to tackle this (e.g. to make less harmful attributes do not persist after an executor run) would be to do chattr -R = / on executor startup. I still think we need to drop CAP_LINUX_IMMUTABLE to make sure temporary files are deleted.

dvyukov commented 3 months ago

Yet another angle to tackle this (e.g. to make less harmful attributes do not persist after an executor run) would be to do chattr -R = / on executor startup.

What executor process do you mean? There are usually 4 hierarchical processes now.

I still think we need to drop CAP_LINUX_IMMUTABLE to make sure temporary files are deleted.

You mean entirely and don't test it on any filesystems?

ramosian-glider commented 3 months ago

What executor process do you mean? There are usually 4 hierarchical processes now.

I was going to suggest doing that around makeCommand() in pkg/ipc/ipc.go, but apparently it is gone now :)

We need to reset the attributes before preparing to every program execution, what would be the best place to do so?

You mean entirely and don't test it on any filesystems?

That was my intent, right. Given that we cannot reliably mount tmpfs (see above), this might be the lesser evil. It will still be possible to set other attributes, except for FS_IMMUTABLE_FL and FS_APPEND_FL.

dvyukov commented 3 months ago

We need to reset the attributes before preparing to every program execution, what would be the best place to do so?

This looks expensive. We would need to read every dir on disk and read attributes of every file? While a test can take just 40 ms.

dvyukov commented 3 months ago

Given that we cannot reliably mount tmpfs (see above)

Do you mean this one?

Hmm, some tests start failing when I am trying to mount tmpfs in do_sandbox_none(). I suppose this is because we don't run the tests under root.

What tests have failed? Were they for linux, or test OS? For which OSes have you changed sandbox setup?

dvyukov commented 3 months ago

What tests have failed? Were they for linux, or test OS? For which OSes have you changed sandbox setup?

I don't see where we run test programs for Linux OS in tests. We build executor for Linux in pkg/runtest/executor_test.go, but there we only run executor unit-tests "syz-executor test", it shouldn't do sandbox setup.

ramosian-glider commented 3 months ago

I think there were failing tests in pkg/ipc on Linux, but those are gone now. Let me update my setup, I am pretty sure even the original bug is going to manifest in a different way now (if at all)

dvyukov commented 3 months ago

If mount was failing, changing attribute of every file on the disk will fail too, right. Both things are not something that tests should be doing.

But I had memories that mounting tmpfs in unpriv operation nowadays.

ramosian-glider commented 3 months ago

As far as I understand, mounting tmpfs is not enough. We want to chroot into it, after which we'll also need to mount /dev and procfs under that new root.

dvyukov commented 3 months ago

Right. Then the question is what test runs executor for Linux OS.

google / syzkaller

sys/linux: generating ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) may break VM #4939