nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

How to get /sys/fs/cgroup filesystem mounted in rw if sysbox enabled "system" container is started by kubernetes with read-only root-fs? #821

Open FFock opened 1 month ago

FFock commented 1 month ago

We have the issue that root users of a sysbox-enabled container can store large files in the root (/) directory of the container (or in new sub-directories they create). This can cause disk pressure on the underlying kubernetes worker node and can even trigger a DoS on that node (no other container can store any data in /var/lib/ on the host anymore).

To prevent such a scenario, setting the root filesystem to read-only on kubernetes is the preferred method.

Unfortunately, sysbox then changes the mount mode of /sys/fs/cgroup as follows, from:

TARGET                              SOURCE                                                                     FSTYPE   OPTIONS
/                                   overlay                                                                    overlay  rw,relatime,lowerdir=/var/lib/containers/storage/overlay/l/D6IPGY273EBXCRCM65XCRK
|-/sys                              sysfs                                                                      sysfs    rw,nosuid,nodev,noexec,relatime
| |-/sys/firmware                   tmpfs                                                                      tmpfs    ro,relatime,uid=755360,gid=755360,inode64
| |-/sys/fs/cgroup                  cgroup                                                                     cgroup2  rw,nosuid,nodev,noexec,relatime
| |-/sys/devices/virtual            sysboxfs[/sys/devices/virtual]                                             fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| |-/sys/kernel                     sysboxfs[/sys/kernel]                                                      fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| `-/sys/module/nf_conntrack/parameters
|                                   sysboxfs[/sys/module/nf_conntrack/parameters]                              fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other

to

TARGET                    SOURCE                                                                       FSTYPE   OPTIONS
/                         overlay                                                                      overlay  ro,relatime,lowerdir=/var/lib/containers/storage/overlay/l/JX7BIS6OKXXVE2O65CJJ2KSCON:/var/lib/containers/storage/overlay/l/JX7BIS6OKXX
|-/sys                    sysfs                                                                        sysfs    rw,nosuid,nodev,noexec,relatime
| |-/sys/firmware         tmpfs                                                                        tmpfs    ro,relatime,uid=296608,gid=296608,inode64
| |-/sys/fs/cgroup        cgroup                                                                       cgroup2  ro,nosuid,nodev,noexec,relatime
| |-/sys/devices/virtual  sysboxfs[/sys/devices/virtual]                                               fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| |-/sys/kernel           sysboxfs[/sys/kernel]                                                        fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| `-/sys/module/nf_conntrack/parameters
|                         sysboxfs[/sys/module/nf_conntrack/parameters]                                fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other

You may notice that /sys is listed as read-write (rw) in the second listing, although it is read-only right after the container starts: I was able to remount it with sudo mount -o remount,rw /sys without problems. The same approach fails for /sys/fs/cgroup, which must be "rw" to run inner docker containers:

$ sudo mount -o remount,rw /sys/fs/cgroup
mount: /sys/fs/cgroup: permission denied.
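To check the effective mode from a script rather than by reading findmnt output, a minimal sketch along these lines may help. The function name mount_mode is hypothetical; it assumes the usual /proc/self/mounts layout (source, target, fstype, options) and accepts an alternative file so it can be tried against captured output.

```shell
# Hypothetical helper: print "ro" or "rw" for a mount target.
# Reads /proc/self/mounts by default; a second argument can point
# to a saved copy in the same format.
mount_mode() {
  target="$1"
  mounts_file="${2:-/proc/self/mounts}"
  awk -v t="$target" '
    # Field 2 is the mount point, field 4 the comma-separated options.
    $2 == t {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
        if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
    }
    # If several mounts are stacked on the target, the last one wins.
    END { if (mode) print mode; else exit 1 }
  ' "$mounts_file"
}

# Usage inside the container:
#   mount_mode /sys/fs/cgroup
```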

The sysbox documentation (see https://github.com/nestybox/sysbox/blob/d61db2575c197fc8d37b54efb25027f454b75c17/docs/user-guide/security.md?plain=1#L218) states that the option allow-immutable-remounts=true can be set to allow such a remount.

My question now is: how and where do I set the option allow-immutable-remounts=true on a kubernetes sysbox deployment?

FFock commented 1 month ago

To answer part of my question myself: the allow-immutable-remounts=true option can be activated on the host by editing /lib/systemd/system/sysbox-fs.service (e.g. with vi) and changing the line with ExecStart=/usr/bin/sysbox-fs to:

ExecStart=/usr/bin/sysbox-fs --allow-immutable-remounts=true

Then restart sysbox with:

systemctl stop sysbox
systemctl start sysbox
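For several worker nodes, the unit-file edit can be scripted instead of done by hand. The following is only a sketch: the function name enable_immutable_remounts is hypothetical, it assumes GNU sed and that the unit has an ExecStart line mentioning sysbox-fs, and it should be tried on a copy of the unit file first.

```shell
# Hypothetical helper: append the flag to the sysbox-fs ExecStart line.
# Does nothing if the flag is already present, so it is safe to re-run.
enable_immutable_remounts() {
  unit_file="$1"
  flag="--allow-immutable-remounts=true"
  if grep -qF -- "$flag" "$unit_file"; then
    return 0
  fi
  # Append the flag at the end of the matching ExecStart line (GNU sed -i).
  sed -i "/^ExecStart=.*sysbox-fs/ s|\$| $flag|" "$unit_file"
}

# Usage on the host (as root):
#   enable_immutable_remounts /lib/systemd/system/sysbox-fs.service
#   systemctl daemon-reload   # make systemd re-read the edited unit
#   systemctl stop sysbox
#   systemctl start sysbox
```

Note that systemd only picks up an edited unit file after systemctl daemon-reload, and that a package upgrade may overwrite files under /lib/systemd/system.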

After that the error message changes when I try to remount /sys/fs/cgroup to rw:

$ sudo mount -o remount,rw /sys/fs/cgroup
mount: /sys/fs/cgroup: mount(2) system call failed: Function not implemented.

So now the question is more general: "Is this a bug in sysbox, or how can I enable a rw cgroup fs with a read-only root fs?"

ctalledo commented 1 month ago

Hi @FFock,

We have the issue, that root users of a sysbox enabled container can store large files in the root (/) directory (or self-created new sub-directories) of the container. This can cause disk-pressure on the underlying kubernetes worker node as well as trigger a DoS on that worker node (no other containers can store any data in /var/lib/ of the host node anymore).

To prevent such a scenario, putting the root filesystem to read-only on kubernetes is the preferred method.

Got it, makes sense.

So, to provide a bit of background on how it works:

When a container is started with --read-only, Sysbox will honor that and set all the container mounts to read-only. By default, Sysbox will disallow the container from remounting those as read-write, although I can see that it allows it for /sys (a bug) but not for other mounts (including those under /sys, such as /sys/fs/cgroup). For example:

$ docker run --runtime=sysbox-runc -it --rm --read-only nestybox/ubuntu-jammy-docker

# /sys is mounted read-only as expected
root@079cca62228a:/# findmnt | grep "|\-/sys"
|-/sys     sysfs        sysfs    ro,nosuid,nodev,noexec,relatime

# But /sys can be remounted to read-write (a bug)
root@079cca62228a:/# mount -o remount,rw /sys
root@079cca62228a:/# findmnt | grep "|\-/sys" 
|-/sys     sysfs       sysfs    rw,nosuid,nodev,noexec,relatime

# However other mounts (e.g., /sys/fs/cgroup) can't be remounted to read-write (as expected):
root@079cca62228a:/# mount -o remount,rw,bind /sys/fs/cgroup
mount: /sys/fs/cgroup: permission denied.

Now, Sysbox can be configured to allow the remount, by passing the --allow-immutable-remounts=true flag to sysbox-fs via its systemd service (/lib/systemd/system/sysbox-fs.service). For example, assuming that flag is set, then the remount of /sys/fs/cgroup from read-only -> read-write is now allowed:

$ docker run --runtime=sysbox-runc -it --rm --read-only nestybox/ubuntu-jammy-docker
root@079cca62228a:/# mount -o remount,rw,bind /sys/fs/cgroup
root@5845499df3de:/# findmnt | grep "|\-/sys"
| |-/sys/fs/cgroup    cgroup      cgroup2   rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

So now the question is more general: "Is this a bug in sysbox, or how can I enable a rw cgroup fs with a read-only root fs?"

So based on the above, and assuming you are starting the container with --read-only (or the equivalent in K8s), you'll need to configure sysbox-fs with --allow-immutable-remounts=true; otherwise Sysbox won't allow the remount of /sys/fs/cgroup to read-write.
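For reference, the K8s counterpart of docker's --read-only is the container-level securityContext.readOnlyRootFilesystem field. The sketch below prints a pod manifest that sets it. The pod and container names are placeholders, and the sysbox-runc runtime class and the CRI-O userns annotation are assumptions based on the usual Sysbox K8s setup, so adjust them to your cluster.

```shell
# Hypothetical helper: print a pod manifest with a read-only root filesystem.
# Names are placeholders; the runtimeClassName and annotation are assumed
# to match a standard Sysbox-on-K8s installation.
print_readonly_sysbox_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sysbox-readonly-demo
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: syscont
    image: nestybox/ubuntu-jammy-docker
    securityContext:
      readOnlyRootFilesystem: true
EOF
}

# Usage:
#   print_readonly_sysbox_pod | kubectl apply -f -
```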

Hope that helps!

FFock commented 1 month ago

When the container is started read-only and sysbox-fs is started with --allow-immutable-remounts=true, the remount of /sys/fs/cgroup fails with the following error (as noted in my first comment on the original issue):

$ sudo mount -o remount,rw /sys/fs/cgroup
mount: /sys/fs/cgroup: mount(2) system call failed: Function not implemented.

So that is a bug, right?
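One difference between the two remount attempts in this thread may be worth ruling out: the failing command uses -o remount,rw, while the example that works uses -o remount,rw,bind. Per mount(8), remount on its own changes the filesystem (superblock) options, whereas remount,bind changes only the per-mount-point flags. Whether that explains the "Function not implemented" (ENOSYS) error here is an assumption, not something confirmed in this thread. A small sketch for trying the bind variant follows; the MOUNT_BIN variable is a hypothetical hook for dry runs.

```shell
# Hypothetical helper: remount a target read-write using the bind variant,
# which only touches per-mount-point flags. Set MOUNT_BIN=echo for a dry run.
remount_rw_bind() {
  target="$1"
  "${MOUNT_BIN:-mount}" -o remount,rw,bind "$target"
}

# Usage inside the container (as root):
#   remount_rw_bind /sys/fs/cgroup
# Dry run that only prints the arguments:
#   MOUNT_BIN=echo remount_rw_bind /sys/fs/cgroup
```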

FFock commented 3 weeks ago

Hi @ctalledo, did you get some time to look into this bug?

I have not found any workaround yet. From my point of view it is a critical security issue, because the promised virtualisation for root access in kubernetes containers without "privileged" access rights is broken if the container can create/write arbitrary files on the host/node system!