kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0

Host Memory Exhaustion Attack from Inside Kata Containers #3373

Open WatchDoug opened 2 years ago

WatchDoug commented 2 years ago

This is Yutian Yang from Zhejiang University. Our team has discovered a new attack from inside Kata containers that leads to host memory exhaustion.

The root cause lies in the Linux kernel and is present even in the latest version. Briefly speaking, the kernel memcg does not charge posix locks allocated by user processes. The bug report and patches can be found at https://lore.kernel.org/linux-mm/20210902215519.AWcuVc3li%25akpm@linux-foundation.org/

Unfortunately, we find that even virtualized containers like Kata are affected by this bug. With the "-o posix_lock" option enabled, the Kata runtime forwards posix lock allocation requests to virtiofsd on the host, and virtiofsd then allocates the posix locks on the guest's behalf. Although virtiofsd processes are limited by memcg, the memory consumed by posix locks in the kernel is not properly charged. Attackers inside containers can thus allocate a huge number of posix locks to exhaust all memory on the node. Note that the number of posix locks is not limited by rlimit/sysctl by default.
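For illustration only, here is a minimal Go sketch of the allocation pattern described above (it is not the PoC mentioned below, and the file path and loop bound are arbitrary): each F_SETLK on a non-adjacent byte range creates a separate kernel lock object, and before the fix referenced above those objects were not charged to the caller's memcg. When virtiofsd runs with "-o posix_lock", the guest forwards these requests to it and the locks end up being taken on the host.

```go
package main

import (
	"io"
	"log"
	"os"
	"syscall"
)

func main() {
	// Any file on the virtiofs-backed filesystem will do; /tmp is just an
	// example path.
	f, err := os.Create("/tmp/lockfile")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Deliberately tiny loop bound: the point is only that each lock on a
	// non-adjacent one-byte range (the gaps keep the kernel from merging
	// adjacent locks) allocates its own struct file_lock, and that nothing
	// (rlimit/sysctl) bounds how many can be taken by default.
	for i := int64(0); i < 1000; i++ {
		lk := syscall.Flock_t{
			Type:   syscall.F_WRLCK,
			Whence: io.SeekStart,
			Start:  2 * i, // leave a one-byte gap so locks are not coalesced
			Len:    1,
		}
		if err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lk); err != nil {
			log.Fatalf("F_SETLK at offset %d: %v", 2*i, err)
		}
	}
}
```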

We have developed a PoC that causes host memory exhaustion. We would be glad to share it via email if you are interested in reproducing the attack.

We also want to discuss whether there is a way to mitigate such problems. A quick mitigation could be disabling the "-o posix_lock" option. However, how can the functionality be enabled without triggering the kernel bug before it is patched upstream?

fidencio commented 2 years ago

@WatchDoug, thanks for raising the issue. I see this as a possible CVE, even though the issue is not in Kata Containers itself. For now, I'd like to ask you to treat this as a CVE and thus follow our CVE process (please see: https://github.com/kata-containers/community/blob/main/VMT/VMT.md).

Also, I'd like to take a look at the PoC, and that could be done via https://launchpad.net/katacontainers.io, as everything there can be done privately.

Last but not least, let's discuss the mitigation plans and whatnot in the issue opened on launchpad.

/cc @kata-containers/architecture-committee!

And a huge thanks to @gkurz for promptly bringing this to our attention!

gkurz commented 2 years ago

Please Cc me (gkurz) on the launchpad issue as well.

gkurz commented 2 years ago

Discussion has moved to launchpad.

rhvgoyal commented 2 years ago

Nice find. I think using "-o no_posix_lock" is the short-term mitigation for this issue until we figure out a proper way to handle it.

BTW, remote posix locks are disabled by default in virtiofsd, so to run into this issue one has to explicitly enable them. The following is the commit that disabled remote posix locks by default.

commit 88fc107956a5812649e5918e0c092d3f78bb28ad
Author: Vivek Goyal <vgoyal@redhat.com>
Date:   Mon Jul 27 12:18:41 2020 -0400

virtiofsd: Disable remote posix locks by default

Remote posix locks are useful only if a filesystem is shared across multiple VMs using virtiofs. I believe kata is using virtiofs only for rootfs, which is prepared separately for each VM using overlayfs. Hence there is no sharing, and hence kata should not need to enable remote posix locks.

rhvgoyal commented 2 years ago

Is kata enabling posix locks by default? Is there a reason they need remote posix locks enabled? Anyway, the functionality is not complete; it does not support waiting posix locks.

So I believe the first thing we should probably do is to not enable remote posix locks in kata by default.

rhvgoyal commented 2 years ago

Is this specific to virtiofsd only? Can a simple privileged (and unprivileged) process on the host drive the system out of memory without being OOM killed?

gkurz commented 2 years ago

Is kata enabling posix locks by default? Is there a reason they need remote posix locks enabled? Anyway, the functionality is not complete; it does not support waiting posix locks.

No, kata explicitly passes "-o no_posix_lock" for other reasons, but it is possible for the end user to provide extra options that get appended to the virtiofsd command line.

So I believe that first thing we should probably do is not enable remote posix locks in kata by default.

kata might be able to filter out options that the user should really not pass.
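For example, here is a rough sketch of such filtering (the filterVirtiofsdArgs helper is hypothetical and not actual Kata runtime code; virtiofsd also accepts comma-separated "-o" option lists, which a real sanitizer would need to handle too):

```go
package main

import (
	"fmt"
	"strings"
)

// filterVirtiofsdArgs drops any user-supplied "-o <opts>" pair whose options
// would re-enable remote posix locks. Hypothetical helper for illustration;
// the actual Kata option plumbing may differ.
func filterVirtiofsdArgs(args []string) []string {
	var out []string
	skipNext := false
	for i, a := range args {
		if skipNext {
			skipNext = false
			continue
		}
		if a == "-o" && i+1 < len(args) &&
			strings.Contains(args[i+1], "posix_lock") &&
			!strings.Contains(args[i+1], "no_posix_lock") {
			skipNext = true // drop both "-o" and its value
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	extra := []string{"-o", "posix_lock", "-o", "cache=auto"}
	fmt.Println(filterVirtiofsdArgs(extra)) // prints: [-o cache=auto]
}
```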

gkurz commented 2 years ago

Is this specific to virtiofsd only? Can a simple privileged (and unprivileged) process on the host drive the system out of memory without being OOM killed?

I haven't tried yet but looking at the kernel fix it looks like it can happen with any process.

rhvgoyal commented 2 years ago

I guess the simplest short-term fix is for kata to disallow the "-o posix_lock" option until a proper long-term fix gets committed to the Linux kernel.

haslersn commented 2 years ago

I believe kata is using virtiofs only for rootfs

Also for filesystem PVCs (Kubernetes terminology), right?

gkurz commented 2 years ago

I believe kata is using virtiofs only for rootfs

Also for filesystem PVCs (Kubernetes terminology), right?

yes

haslersn commented 2 years ago

For filesystem PVCs I'd argue it's important to enable remote locks. (Blocking remote locks aren't supported yet, but that's another story. Better not supported than data corruption.)

gkurz commented 2 years ago

For filesystem PVCs I'd argue it's important to enable remote locks. (Blocking remote locks aren't supported yet, but that's another story. Better not supported than data corruption.)

Remote locks currently have several issues that justify keeping them disabled by default, IMHO. Users can still ask their admin to enable the 'virtio_fs_extra_args' annotation so that they can pass '-o posix_lock' themselves if they're doing locking in a shared directory.

haslersn commented 2 years ago

@gkurz The problem is that sysadmins are not always aware that an app performs locking (typically an app developer doesn't document this requirement, because on runc it just works), and -o no_posix_lock doesn't lead to an error but rather to local locking semantics. So the sysadmins will not consider enabling it and will one day wonder why their app corrupted its data.
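As a small sketch of the risk being described (assuming Go; the /data/shared.log path is a hypothetical PVC mount point, and this is not code from any specific app): an application like the one below relies on a posix lock for mutual exclusion. With "-o no_posix_lock" that lock only excludes other processes inside the same guest, so a second pod on another node mounting the same filesystem PVC takes its own, purely local lock and the writes can interleave silently.

```go
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	// Hypothetical shared file on a filesystem PVC mount.
	f, err := os.OpenFile("/data/shared.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Whole-file write lock (Start=0, Len=0). Under "-o no_posix_lock" this
	// is only enforced among processes within the same guest kernel.
	lk := syscall.Flock_t{Type: syscall.F_WRLCK}
	if err := syscall.FcntlFlock(f.Fd(), syscall.F_SETLKW, &lk); err != nil {
		log.Fatal(err)
	}
	defer func() {
		lk.Type = syscall.F_UNLCK
		syscall.FcntlFlock(f.Fd(), syscall.F_SETLK, &lk)
	}()

	if _, err := f.WriteString("record\n"); err != nil {
		log.Fatal(err)
	}
}
```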

gkurz commented 2 years ago

@haslersn I understand your concern, but this is out of scope here. With or without kata, POSIX locks can currently be used by a container to hog the host memory; this issue is just about finding a mitigation.

Please contact the virtiofs people about the final availability of POSIX locks in the C virtiofsd (part of QEMU, work in progress) and the Rust virtiofsd (not started yet).

00xc commented 2 years ago

Discussion has moved to launchpad.

Hi, I cannot access the discussion there. I'd like to know if/when the bug will be opened to the public. Thanks!