google / nsjail

A lightweight process isolation tool that utilizes Linux namespaces, cgroups, rlimits and seccomp-bpf syscall filters, leveraging the Kafel BPF language for enhanced security.
https://nsjail.dev
Apache License 2.0
2.96k stars, 275 forks

Cannot make nsjail work on cgroupsv2 system #196

Open carlbordum opened 2 years ago

carlbordum commented 2 years ago

For example, when I run nsjail with --use_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/NSJAIL, I still see errors like

writeBufToFile():95 Couldn't open '/sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max' for writing: No such file or directory

If I understand cgroups v2 correctly, it should look for /sys/fs/cgroup/NSJAIL/memory.max, not /sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max.

/sys/fs/cgroup/NSJAIL exists.

disconnect3d commented 2 years ago

Can you show your mounts, or at least the output of mount | grep cgroup? Also, this isn't in a Docker container, is it?

carlbordum commented 2 years ago

It is in a docker container. What needs to be different?

mateuszlewko commented 2 years ago

The issue persists even if we bind the cgroup mount point to the container:

docker run -v /sys/fs/cgroup:/sys/fs/cgroup --privileged --rm -it nsjailcontainer nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-19T20:16:52+0000] Mode: STANDALONE_ONCE
[I][2022-05-19T20:16:52+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-19T20:16:52+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-19T20:16:52+0000] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-19T20:16:52+0000][1] logParams():267 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-19T20:16:52+0000] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-19T20:16:52+0000][1] logParams():277 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[E][2022-05-19T20:16:52+0000][1] writeBufToFile():100 Couldn't write '1' bytes to file '/sys/fs/cgroup/NSJAIL.6/cgroup.procs' (fd='6'): No such file or directory
[W][2022-05-19T20:16:52+0000][1] addPidToProcList():73 Could not update cgroup.procs
[E][2022-05-19T20:16:52+0000][1] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=6
[F][2022-05-19T20:16:52+0000][1] runChild():469 Launching child process failed

With the mount, nsjail can write to memory.max, but it can't move the process into the newly created cgroup.

To be fair, the issue also occurs outside of nsjail: manually moving a process to the new cgroup doesn't work in a privileged Docker container either, while it does work outside the container, as expected. Do you know why that is and how to overcome it? What permissions are needed to move a process to a cgroup?

disconnect3d commented 2 years ago

EDIT: below you can see some diagnosis of your issues, but I am wondering: is there any particular reason you want to use nsjail with cgroups v2 instead of v1?


Docker enables lots of options that may influence whether you can perform a certain operation. For example, even with the --privileged flag, Docker still uses Linux namespaces, specifically the cgroup namespace, so /sys/fs/cgroup/ shows only the cgroup hierarchy created in this container (or rather, in the namespaces that were created for it). But yeah, as @mateuszlewko showed, bind mounting the "host" cgroup mount point should help here.

Fwiw, it is hard to diagnose your issues without more details about the commands you executed or the environment you run this in. But anyway, let's try to help :).

I have tried to reproduce your issues on my side on Ubuntu 21.04 and my first issue was that /sys/fs/cgroup is read-only:

$ ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:07+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:07+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:07+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:07+0200] Uid map: inside_uid:99999 outside_uid:1000 count:1 newuidmap:false
[I][2022-05-20T01:27:07+0200] Gid map: inside_gid:99999 outside_gid:1000 count:1 newgidmap:false
[W][2022-05-20T01:27:07+0200][30182] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30183', 0700) failed: Read-only file system
[E][2022-05-20T01:27:07+0200][30182] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30183
[F][2022-05-20T01:27:07+0200][1] runChild():469 Launching child process failed

$ sudo ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:09+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:09+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:09+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:09+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:27:09+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[W][2022-05-20T01:27:09+0200][30188] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30189', 0700) failed: Read-only file system
[E][2022-05-20T01:27:09+0200][30188] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30189
[F][2022-05-20T01:27:09+0200][1] runChild():469 Launching child process failed

On my side, this is because I have both cgroups v1 and v2, and v2 is mounted at a different path:

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)

I was able to resolve this issue with the --cgroupv2_mount=/sys/fs/cgroup/unified flag:

$ sudo ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2 --cgroupv2_mount=/sys/fs/cgroup/unified
[I][2022-05-20T01:29:02+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:29:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:29:02+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:29:02+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:29:02+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[I][2022-05-20T01:29:02+0200] Setting 'memory.max' to '104857600'
[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory
[W][2022-05-20T01:29:02+0200][30304] writeToCgroup():61 Could not update memory.max
[E][2022-05-20T01:29:02+0200][30304] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30305
[F][2022-05-20T01:29:02+0200][1] runChild():469 Launching child process failed

But as we can see, now I am getting the error that @carlbordum was getting:

[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory

So what happens here? Well, while the cgroup v2 memory controller does indeed expose such a file, it does not exist on my side because... I don't have the memory cgroup v2 controller enabled, or even available! :(

We can see that here: according to this kernel documentation page, the cgroup.controllers file should list the available controllers (e.g. memory io cpu):

$ cat /sys/fs/cgroup/unified/cgroup.controllers 
$ 

But it shows nothing instead! So why is that? Why are there no cgroupv2 controllers available?
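
The check above can be scripted. Below is a minimal sketch, not part of nsjail, that compares the contents of cgroup.controllers against a list of required controllers. The function names and the default list (memory, pids, cpu) are assumptions made for illustration.

```python
from pathlib import Path


def missing_controllers(controllers_text, required):
    """Return the required controllers absent from a cgroup.controllers listing."""
    available = set(controllers_text.split())
    return sorted(set(required) - available)


def check_cgroup_dir(cgroup_dir, required=("memory", "pids", "cpu")):
    """Read <cgroup_dir>/cgroup.controllers and report what is missing.

    The default controller list is an assumption, not taken from nsjail.
    """
    text = (Path(cgroup_dir) / "cgroup.controllers").read_text()
    return missing_controllers(text, required)


# An empty file, as in the output above, means nothing is available
print(missing_controllers("", ["memory", "cpu"]))  # ['cpu', 'memory']
# A populated v2 root reports nothing missing
print(missing_controllers("cpuset cpu io memory hugetlb pids\n", ["memory", "cpu"]))  # []
```

On the hybrid system described here, check_cgroup_dir("/sys/fs/cgroup/unified") would be expected to return every requested controller.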

If I understand correctly, this is related to what they write here:

cgroup2 filesystem has the magic number 0x63677270 (“cgrp”). All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root. Controllers which are not in active use in the v2 hierarchy can be bound to other hierarchies. This allows mixing v2 hierarchy with the legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller is no longer referenced in its current hierarchy. Because per-cgroup controller states are destroyed asynchronously and controllers may have lingering references, a controller may not show up immediately on the v2 hierarchy after the final umount of the previous hierarchy. Similarly, a controller should be fully disabled to be moved out of the unified hierarchy and it may take some time for the disabled controller to become available for other hierarchies; furthermore, due to inter-controller dependencies, other controllers may need to be disabled too.

While useful for development and manual configurations, moving controllers dynamically between the v2 and other hierarchies is strongly discouraged for production use. It is recommended to decide the hierarchies and controller associations before starting using the controllers after system boot.

During transition to v2, system management software might still automount the v1 cgroup filesystem and so hijack all controllers during boot, before manual intervention is possible. To make testing and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and make them always available in v2.

It seems that a given controller may be bound either to v1 or to v2, but never to both. I guess this kinda makes sense, and just to recap, my memory controller is indeed bound to v1, as my mount output showed:

cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)

So if you are in the same situation as me, I guess the easiest fix is to change the kernel boot parameters and add cgroup_no_v1=memory, and/or exclusions for other controllers (I'm not sure which ones nsjail uses). I guess removing all processes from cgroup v1 at runtime may be hard (e.g. since a lot of this may be managed by systemd, and idk if it supports v1->v2 migration).
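
To find out which controllers are currently held by a v1 hierarchy, /proc/cgroups can be parsed: a non-zero hierarchy ID means the controller is attached to v1. The sketch below is hedged accordingly. The sample file content and the list of controllers of interest are hypothetical, and the helper names are made up for this example.

```python
def v1_bound_controllers(proc_cgroups_text):
    """Return controllers whose hierarchy ID in /proc/cgroups is non-zero.

    A non-zero ID means the controller is attached to a cgroup v1 hierarchy.
    """
    bound = []
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        name, hierarchy, _num_cgroups, _enabled = line.split()
        if int(hierarchy) != 0:
            bound.append(name)
    return bound


def cgroup_no_v1_param(controllers):
    """Build the kernel command-line parameter that keeps controllers off v1."""
    return "cgroup_no_v1=" + ",".join(controllers)


# Hypothetical /proc/cgroups content for a hybrid v1/v2 system
SAMPLE_PROC_CGROUPS = """#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t5\t1\t1
memory\t9\t120\t1
pids\t0\t1\t1
"""

# Assumed list of controllers we care about; adjust to the limits you set
wanted = [c for c in v1_bound_controllers(SAMPLE_PROC_CGROUPS)
          if c in ("memory", "cpu", "pids")]
print(cgroup_no_v1_param(wanted))  # cgroup_no_v1=cpu,memory
```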

disconnect3d commented 2 years ago

@robertswiecki with what we see above, I guess we could improve nsjail UX by:

  1. Inspecting the <cgroup-v2-path>/cgroup.controllers file and erroring out with a nice log that one has to enable specific cgroup v2 controllers first? And maybe linking to the kernel docs.
  2. Maybe inspecting mounts and fetching the cgroup v2 path from there?
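
A rough sketch of the second idea, written in Python rather than in nsjail's C++ and not taken from nsjail itself: scan /proc/mounts for entries of type cgroup2. The sample text is a /proc/mounts-style rendering of the hybrid layout shown earlier in this thread, and the function names are illustrative.

```python
def cgroup2_mount_points(proc_mounts_text):
    """Return mount points whose filesystem type is cgroup2.

    Expects /proc/mounts format: device, mount point, fs type, options, ...
    """
    points = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "cgroup2":
            points.append(fields[1])
    return points


def read_cgroup2_mount_points(path="/proc/mounts"):
    """Convenience wrapper that reads the live mount table."""
    with open(path) as f:
        return cgroup2_mount_points(f.read())


SAMPLE_MOUNTS = (
    "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0\n"
    "cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n"
    "cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0\n"
)
print(cgroup2_mount_points(SAMPLE_MOUNTS))  # ['/sys/fs/cgroup/unified']
```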

carlbordum commented 2 years ago

Wow, I am completely blown away by your helpfulness for such a poor "bug" report.

I am specifically working on this little project. It is very reproducible, so I was confused about why it stopped working, but I think it's because my systems now run the cgroupv2 controller.

Is there any decent way to run nsjail commands that work with both cgroupv1 and cgroupv2, or does my program need to detect it and inject different flags?
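
One way a wrapper program could do the detection itself is sketched below. It relies on the heuristic that only a cgroup v2 directory contains a cgroup.controllers file, and it treats an empty file as "not usable", which covers the hybrid case analysed above. The candidate paths and function names are assumptions; the two flags are the ones already used in this thread.

```python
import os
import tempfile


def has_v2_controllers(path):
    """True if <path>/cgroup.controllers exists and lists at least one controller."""
    try:
        with open(os.path.join(path, "cgroup.controllers")) as f:
            return bool(f.read().split())
    except OSError:
        return False


def nsjail_cgroup_flags(candidates=("/sys/fs/cgroup", "/sys/fs/cgroup/unified")):
    """Return extra nsjail flags for the first usable cgroup v2 mount, else []."""
    for path in candidates:
        if has_v2_controllers(path):
            return ["--use_cgroupv2", "--cgroupv2_mount", path]
    return []  # nothing usable found: leave nsjail on its cgroup v1 code path


# Demo against a throwaway directory standing in for a cgroup v2 mount
with tempfile.TemporaryDirectory() as fake:
    with open(os.path.join(fake, "cgroup.controllers"), "w") as f:
        f.write("cpu memory pids\n")
    print(nsjail_cgroup_flags([fake]))
```

The returned list can be appended to the nsjail command line that the program already builds.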

edit: if you want, you can clone the project and run docker-compose up -d and docker-compose logs cody to see the error.

robertswiecki commented 2 years ago

@disconnect3d thank you for such cool in-depth analysis.

I'm vaguely familiar with cgroups2 myself, but I guess I can take a look at what can be improved here.

Though, if anyone will beat me to that, I won't complain :)

mattgodbolt commented 1 year ago

We're also seeing issues in trying to get nsjail running for Compiler Explorer on newer cgroups (on Ubuntu 22.04):

[D][2022-11-07T21:41:28-0600][6205] bool cgroup::createCgroup(const string&, pid_t)():41 Create '/sys/fs/cgroup/memory/ce-compile/NSJAIL.6207' for pid=6207
[W][2022-11-07T21:41:28-0600][6205] bool cgroup::createCgroup(const string&, pid_t)():43 mkdir('/sys/fs/cgroup/memory/ce-compile/NSJAIL.6207', 0700) failed: No such file or directory

(or with --use_cgroupv2)

[D][2022-11-07T21:51:35-0600][7547] bool cgroup2::createCgroup(const string&, pid_t)():47 Create '/sys/fs/cgroup/NSJAIL.7548' for pid=7548
[W][2022-11-07T21:51:35-0600][7547] bool cgroup2::createCgroup(const string&, pid_t)():49 mkdir('/sys/fs/cgroup/NSJAIL.7548', 0700) failed: Permission denied

is what we see, which may be similar. We wouldn't choose to run cgroups2 but Ubuntu 22.04 seems to have made it the default, and it's easier not to special case boot params to get it back to the old system.

mattgodbolt commented 1 year ago

In my case:

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$ mount | grep cg
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

so it doesn't quite seem the same issue as others have seen, though I'm having trouble with the cgcreate equivalent, so now I might "just" be hitting a hole in the tooling.

That said: with a bit more hacking and fiddling with settings, I supplied a different cgroups v2 mount (including the name of the parent cgroup I had cgcreate-d), and it got further:

[D][2022-11-07T22:06:19-0600][8754] bool cgroup2::createCgroup(const string&, pid_t)():47 Create '/sys/fs/cgroup/ce-compile/NSJAIL.8755' for pid=8755
[D][2022-11-07T22:06:19-0600][8754] bool cgroup2::addPidToProcList(const string&, pid_t)():70 Adding pid='8755' to cgroup.procs
[E][2022-11-07T22:06:19-0600][8754] bool util::writeBufToFile(const char*, const void*, size_t, int)():100 Couldn't write '4' bytes to file '/sys/fs/cgroup/ce-compile/NSJAIL.8755/cgroup.procs' (fd='6'): Permission denied
[W][2022-11-07T22:06:19-0600][8754] bool cgroup2::addPidToProcList(const string&, pid_t)():73 Could not update cgroup.procs
[E][2022-11-07T22:06:19-0600][8754] bool subproc::initParent(nsjconf_t*, pid_t, int)():392 Couldn't initialize cgroup 2 user namespace for pid=8755
[F][2022-11-07T22:06:19-0600][1] bool subproc::runChild(nsjconf_t*, int, int, int, int)():448 Launching child process failed

and ...

$ ls -l /sys/fs/cgroup/ce-compile/NSJAIL.8755/
total 0
-r--r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.controllers
-r--r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.events
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.freeze
--w------- 1 matthew matthew 0 Nov  7 22:06 cgroup.kill
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.max.depth
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.max.descendants
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.procs
-r--r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.stat
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.subtree_control
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.threads
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cgroup.type
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 cpu.pressure
-r--r--r-- 1 matthew matthew 0 Nov  7 22:06 cpu.stat
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 io.pressure
-rw-r--r-- 1 matthew matthew 0 Nov  7 22:06 memory.pressure

so it looks like I ought to be able to write to the file.

ndrewh commented 1 year ago

At least for my use-case (running nsjail in LISTEN mode as the root process in a Docker container), once I specified --use_cgroupv2, the issue was that /sys/fs/cgroup/cgroup.subtree_control was empty. This means that every cgroup created by nsjail cannot inherit any of the controllers (even though they are all present in /sys/fs/cgroup/cgroup.controllers).
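
The situation described above (controllers listed in cgroup.controllers but absent from cgroup.subtree_control) can be detected with a small comparison. This is an illustrative sketch with made-up function names, not the code from the fork.

```python
def undelegated(controllers_text, subtree_control_text, needed):
    """Controllers that are available in a cgroup but not enabled for its children."""
    available = set(controllers_text.split())
    enabled = set(subtree_control_text.split())
    return sorted((set(needed) & available) - enabled)


# Everything available, nothing passed down: child cgroups get no controllers
print(undelegated("cpuset cpu io memory hugetlb pids rdma misc", "",
                  ["memory", "pids", "cpu"]))  # ['cpu', 'memory', 'pids']
# Only cpu still needs to be enabled
print(undelegated("cpuset cpu io memory pids", "memory pids",
                  ["memory", "pids", "cpu"]))  # ['cpu']
```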

I worked some on a fix for this in my fork. All we really need to do is look at the root cgroup.subtree_control and make sure the controllers we need are there. If they aren't there, we need to add them (in fact, this is exactly what redpwn/jail does). A minor issue is that in order to modify cgroup.subtree_control, you have to move all processes from the root cgroup (this is apparently a thing called the "no internal processes" rule). Currently my patch only handles the case where nsjail is the only process in the root cgroup. This is good enough for my use-case, but probably not good enough for others.

My patch is here, and works for my use case, but likely needs some work to be useful to others: https://github.com/google/nsjail/compare/master...ndrewh:cgroupsv2-fix

(Footnote: if you're going to try my fork, nsjail needs to be the root process in the cgroup. You can accomplish this by invoking nsjail using the exec form of the CMD Dockerfile directive, i.e. use CMD ["/usr/bin/nsjail", ...], not CMD nsjail ...)

mattgodbolt commented 1 year ago

@ndrewh I'm not running on docker in my case; this is "just" on a plain Ubuntu 22.04 system

ndrewh commented 1 year ago

I looked at this a little more, since I know I've run into issues running on stock 22.04 as well. I tried this on a 22.04 desktop in virtualbox. I did see slightly different initial behavior in AWS, but I think what's here should still be helpful.

Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

For some reason, on boot the cpu controller is missing from cgroup.subtree_control (why? I have no idea):

$ cat /sys/fs/cgroup/cgroup.subtree_control
memory pids

Example 1: Running as root?

If you just straight up run nsjail now in the root cgroup (with sudo, so it can create its child cgroup), --cgroup_mem_max works fine, but if you set --cgroup_cpu_ms_per_sec you'll get:

$ sudo ./nsjail -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 -- /bin/bash -i
[I][2022-11-16T15:29:05-0500] Setting 'cpu.max' to '500000 1000000'
[E][2022-11-16T15:29:05-0500][4983] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/NSJAIL.4984/cpu.max' for writing: No such file or directory
[W][2022-11-16T15:29:05-0500][4983] writeToCgroup():61 Could not update cpu.max
[E][2022-11-16T15:29:05-0500][4983] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=4984
[F][2022-11-16T15:29:05-0500][1] runChild():483 Launching child process failed

Fix:

echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

(If you try the same thing on my fork, it'll do this last line for you. Whether this is desirable behavior in general or not, I am not sure.)
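
For reference, the same write can be done from a script. The sketch below builds the "+controller" payload and writes it to cgroup.subtree_control. It is an illustration with hypothetical helper names, not the fork's implementation, and the demo uses a temporary directory instead of a real cgroup.

```python
import os
import tempfile


def subtree_control_payload(controllers):
    """Format controllers the way cgroup.subtree_control expects: '+a +b'."""
    return " ".join("+" + c for c in controllers)


def enable_controllers(cgroup_dir, controllers):
    """Write the payload to <cgroup_dir>/cgroup.subtree_control."""
    path = os.path.join(cgroup_dir, "cgroup.subtree_control")
    with open(path, "w") as f:
        f.write(subtree_control_payload(controllers))


print(subtree_control_payload(["cpu"]))            # +cpu
print(subtree_control_payload(["cpu", "memory"]))  # +cpu +memory

# Demo on a throwaway directory; on a real system this needs write access
with tempfile.TemporaryDirectory() as fake:
    enable_controllers(fake, ["cpu"])
    with open(os.path.join(fake, "cgroup.subtree_control")) as f:
        print(f.read())  # +cpu
```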

Example 2: Creating a cgroup, running non-root (by adding user to /sys/fs/cgroup/cgroup.procs)

OK, but what if instead of using the root cgroup, we want to make a new cgroup (as @mattgodbolt was trying), and give our user permissions to use it?

sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest

I think the permissions error @mattgodbolt was running into is due to the fact you don't have permission to move processes out of the root cgroup? We can fix that:

sudo chown andrew:root /sys/fs/cgroup/cgroup.procs

Now nsjail can move its children into the appropriate cgroup, and we get a little further:

$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest/ -- /bin/bash -i
...
[E][2022-11-16T16:58:12-0500][3096] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest//NSJAIL.3097/memory.max' for writing: No such file or directory
[W][2022-11-16T16:58:12-0500][3096] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T16:58:12-0500][3096] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3097

We need just one more thing, since /sys/fs/cgroup/jailtest/cgroup.subtree_control is empty:

echo "+cpu +memory" > /sys/fs/cgroup/jailtest/cgroup.subtree_control

Now nsjail works :)

Example 3: Exec into a cgroup, do everything from there (non-root)

sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest3
sudo cgexec -g memory,cpu:jailtest3 sudo -s -u andrew
andrew@andrew2204:~/nsjail$

Now we are in the child cgroup... let's try to run nsjail

$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest3/ -- /bin/bash -i
...
[I][2022-11-16T17:13:58-0500] Setting 'memory.max' to '1000000'
[E][2022-11-16T17:13:58-0500][3251] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest3//NSJAIL.3252/memory.max' for writing: No such file or directory
[W][2022-11-16T17:13:58-0500][3251] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T17:13:58-0500][3251] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3252

Same issue... let's do the same thing, right?

$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control
bash: echo: write error: Device or resource busy

Why can't we do this? It's because the "no internal processes" rule won't let us have controllers in cgroup.subtree_control if our cgroup currently has processes. First, let's see how to fix this manually:

$ cat /sys/fs/cgroup/jailtest3/cgroup.procs
3281
3282
3283
3299
$ mkdir /sys/fs/cgroup/jailtest3/lol/
$ echo "3281" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3282" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3283" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control

(Since we spawned a shell, we have a couple of processes in the jailtest3 cgroup -- these have to be moved before we can add to subtree_control)
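
The manual steps above follow a fixed order: create a child cgroup, move every process into it one PID per write, then enable the controllers on the now-empty parent. The sketch below only builds that ordered list of writes, so it can be inspected before anything is applied. The function names are hypothetical, and the child name "lol" simply mirrors the commands above.

```python
import os


def plan_enable_with_procs(cgroup_dir, pids, child_name, controllers):
    """Return the ordered (path, data) writes that empty a cgroup and then
    enable controllers for its children.

    The child directory must be created (mkdir) before the writes are applied.
    """
    child_procs = os.path.join(cgroup_dir, child_name, "cgroup.procs")
    writes = [(child_procs, str(pid)) for pid in pids]  # one PID per write
    payload = " ".join("+" + c for c in controllers)
    writes.append((os.path.join(cgroup_dir, "cgroup.subtree_control"), payload))
    return writes


def apply_plan(cgroup_dir, child_name, writes):
    """Create the child cgroup and perform the writes in order."""
    os.makedirs(os.path.join(cgroup_dir, child_name), exist_ok=True)
    for path, data in writes:
        with open(path, "w") as f:
            f.write(data)


# Print the plan as the equivalent shell commands
for path, data in plan_enable_with_procs(
    "/sys/fs/cgroup/jailtest3", [3281, 3282, 3283], "lol", ["cpu", "memory"]
):
    print(f'echo "{data}" > {path}')
```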

Now nsjail works!


The point of my PR is to add controllers to cgroup.subtree_control (the very last command I ran in each of these examples) if they are not present. For the last example, it doesn't seem like nsjail ought to move all those processes into a subgroup blindly -- so my PR only handles the case where nsjail is the only process in the group.

robertswiecki commented 1 year ago

Thanks all for working on this and contributing. My cgroup1/2-foo is not great, but from what I can tell it works as expected.

./nsjail --config configs/bash-with-fake-geteuid.cfg --detect_cgroupv2 --cgroup_cpu_ms_per_sec 100 --cgroupv2_mount /sys/fs/cgroup/user.slice/user-1000.slice/user\@1000.service/
...
[JAILED-BASH:21:33:03:sh-5.2.2:/tmp]# openssl speed

And I can see that only 10% of a single CPU core is used (via top and with openssl speed results). So.. promising.

Without cgroups

[JAILED-BASH:21:35:48:sh-5.2.2:/tmp]# openssl speed
Doing md5 for 3s on 16 size blocks: 19115100 md5's in 3.00s

With cgroups

[JAILED-BASH:21:36:05:sh-5.2.2:/tmp]# openssl speed
Doing md5 for 3s on 16 size blocks: 2666542 md5's in 0.39s

Not exactly 10%, but close enough, assuming the cores were not isolated for the test.
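
As a quick sanity check of these numbers: the cpu.max format below ("quota period", both in microseconds) is inferred from the 'Setting cpu.max' log lines elsewhere in this thread, where 500 ms/s becomes '500000 1000000'. The helper name is made up, and the ratio is just arithmetic on the two openssl results above.

```python
def cpu_max_value(ms_per_sec, period_us=1_000_000):
    """Format a cpu.max line as '<quota_us> <period_us>'."""
    return f"{ms_per_sec * 1000} {period_us}"


print(cpu_max_value(100))  # 100000 1000000
print(cpu_max_value(500))  # 500000 1000000

unthrottled = 19115100  # md5 operations in the 3 s run without cgroups
throttled = 2666542     # md5 operations in the 3 s run with the 100 ms/s limit
ratio = throttled / unthrottled
print(f"{ratio:.1%}")   # roughly 14% of the unthrottled throughput

# openssl reported 0.39 s of CPU time over the 3 s run
print(f"{0.39 / 3.00:.0%}")  # 13%
```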

mattgodbolt commented 1 year ago

Some progress at least, with the c7c0adfffe79ebebfacca003f3cd8e27ef909185 version, and having run sudo cgcreate -a matthew -t matthew -g memory,pids,cpu:ce-sandbox I now get:

 $ nsjail --detect_cgroupv2 --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/ce-sandbox/ --verbose -- /bin/bash
...
[D][2022-11-28T17:16:31-0600][554658] addProc():243 Added pid=554659 with start time 1669677391 to the queue for IP: '[STANDALONE MODE]'
[D][2022-11-28T17:16:31-0600][554658] createCgroup():41 Create '/sys/fs/cgroup/memory/ce-compile/NSJAIL.554659' for pid=554659
[W][2022-11-28T17:16:31-0600][554658] createCgroup():43 mkdir('/sys/fs/cgroup/memory/ce-compile/NSJAIL.554659', 0700) failed: No such file or directory
[E][2022-11-28T17:16:31-0600][554658] initParent():429 Couldn't initialize cgroup user namespace for pid=554659
[F][2022-11-28T17:16:31-0600][1] runChild():483 Launching child process failed

It's not clear to me why I need to pass --cgroupv2_mount (but it makes things "better"). And I don't know what the error with "no such file or directory" means in this context; it doesn't seem to map to anything I can see from earlier comments. Overall context is me trying to update my project's setup for cgroup1 to cgroup2, after upgrading and now being left in an unfortunate state of not being able to run the jailing scripts on my local machine that my deployed system uses :) If a kind soul on this repo can help, I'd be super grateful.

ndrewh commented 1 year ago

@mattgodbolt An unfortunate fix:

 $ nsjail --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/ce-sandbox/ --verbose --detect_cgroupv2 -- /bin/bash

yup, argument order apparently matters (relative to the --config?), i guess :/

BTW - all of these options can be specified in the cfg file as well. Thank you for Compiler Explorer! ❤️

robertswiecki commented 1 year ago

Yup, the current way of parsing args is to run file config parsing at the moment the --config is spotted in the cmd-line.

I thought it was a clever way of doing things, but clever is not always the best :).

However, changing it now would a) break backwards compatibility and b) not be so easy to implement, because two passes over the cmdline arguments would be needed (first file, then args), or some way of caching them.
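
A toy model of this left-to-right behaviour, in Python and deliberately not nsjail's actual parser: options are applied in order, and the config file is loaded at the moment --config is seen, so whatever comes later wins. The assumption baked into the example is that loading the file assigns a value to detect_cgroupv2 even when the file does not mention it; the file name and contents are hypothetical.

```python
def parse_cmdline(argv, config_files):
    """Apply options left to right, loading a config file when --config is
    seen. Later settings overwrite earlier ones."""
    settings = {}
    i = 0
    while i < len(argv):
        if argv[i] == "--config":
            settings.update(config_files[argv[i + 1]])
            i += 2
        else:
            settings[argv[i].lstrip("-")] = True
            i += 1
    return settings


# Hypothetical config contents (see the assumption stated above)
CONFIGS = {"execute.cfg": {"detect_cgroupv2": False}}

before = parse_cmdline(["--detect_cgroupv2", "--config", "execute.cfg"], CONFIGS)
after = parse_cmdline(["--config", "execute.cfg", "--detect_cgroupv2"], CONFIGS)
print(before["detect_cgroupv2"], after["detect_cgroupv2"])  # False True
```

Under this model, a flag placed before --config is overwritten by the file, which matches the behaviour observed above.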

mattgodbolt commented 1 year ago

thanks all:

[D][2022-11-29T07:17:29-0600][556719] addPidToProcList():117 Adding pid='556720' to cgroup.procs
[E][2022-11-29T07:17:29-0600][556719] writeBufToFile():105 Couldn't write '6' bytes to file '/sys/fs/cgroup/ce-sandbox//NSJAIL.556720/cgroup.procs' (fd='6'): Permission denied
[W][2022-11-29T07:17:29-0600][556719] addPidToProcList():120 Could not update cgroup.procs
[E][2022-11-29T07:17:29-0600][556719] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=556720

is now more in line with the other stuff here I think?

> BTW - all of these options can be specified in the cfg file as well.

right! I'm just trying to use the existing config as it's far easier to supply a couple extra cmdline flags on a v2 system than it is to have two config files, one for v1 and one for v2 (Ideally I can support both as we transition).

> Thank you for Compiler Explorer!

You're so welcome! nsjail is a big part of what makes it (mostly) secure :)

--

> Yup, the current way of parsing args is to run file config parsing at the moment the --config is spotted in the cmd-line.

makes sense to me! thanks! :)

ndrewh commented 1 year ago

@mattgodbolt

The point of --detect_cgroupv2 (at least, as I intended it) was to allow you to specify options for both v1 and v2, and nsjail would infer which options are valid at runtime. So if you specify the 'cgroupv2_mount' and 'detect_cgroupv2: true' in the config file, it should be backwards-compatible. It will check if the v2 mount is a valid cgroupv2 filesystem and will use v2 only if it is.

As for the permissions error, I think you're closest to the "Example 2" in my previous comment. My guess is nsjail does not have permission to move the child out of the current cgroup. You can fix this by either (1) spawning nsjail inside a cgroup it has permissions to move children out of (e.g. via cgexec or Docker), or (2) modifying the permissions on the cgroup.procs file for nsjail's current cgroup (probably either the root one or the one associated with your terminal).
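
To make the permission requirement concrete: per the kernel's cgroup v2 delegation rules, a non-root writer needs write access to cgroup.procs in the destination cgroup and in the common ancestor of the source and destination cgroups. The sketch below lists those files for a given move. The function names and the source path in the first example are hypothetical.

```python
import os


def procs_files_needing_write(src_cgroup, dst_cgroup):
    """Return the cgroup.procs files a non-root user must be able to write
    to move a process from src_cgroup to dst_cgroup."""
    ancestor = os.path.commonpath([src_cgroup, dst_cgroup])
    return [
        os.path.join(dst_cgroup, "cgroup.procs"),
        os.path.join(ancestor, "cgroup.procs"),
    ]


def unwritable(paths):
    """Filter the list down to the files the current user cannot write."""
    return [p for p in paths if not os.access(p, os.W_OK)]


# nsjail started from a login session (hypothetical source cgroup):
# the common ancestor is the root cgroup, hence the chown on its cgroup.procs
print(procs_files_needing_write(
    "/sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope",
    "/sys/fs/cgroup/ce-sandbox/NSJAIL.556720",
))

# nsjail started inside the delegated cgroup, e.g. via cgexec:
# the common ancestor is the delegated cgroup itself
print(procs_files_needing_write(
    "/sys/fs/cgroup/ce-sandbox",
    "/sys/fs/cgroup/ce-sandbox/NSJAIL.556720",
))
```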

mattgodbolt commented 1 year ago

Awesome! Thanks that clears up a few things. I'll try fiddling with settings on 22.04 to see if I can work out what environmental things need changing both for me as a user and then also in the VM for the site (which can be more bespoke)

Cheers!

mattgodbolt commented 1 year ago

@ndrewh I was able to get things working with that chown! yay!

That works for my specific use case, but more generally, on a multi-tenant system, is there any way this can be made to work, do you think? Is that an Ubuntu issue?

ndrewh commented 1 year ago

@mattgodbolt I don't think it's an Ubuntu issue, I think it's just that you need nsjail to be in a cgroup that it has permissions to move its child processes out of.

I think the following should work on a multi-tenant system:

Make a new cgroup:

sudo cgcreate -a $USER -t $USER -g memory,cpu,pids:mygroup

Run nsjail in that new cgroup:

cgexec -g memory,cpu,pids:mygroup nsjail --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/mygroup/ --verbose --detect_cgroupv2 -- /bin/bash
...
[I][2022-11-29T23:57:01+0000] Detected cgroups version: 2
[I][2022-11-29T23:57:01+0000] nsjail is moving itself to a new child cgroup: /sys/fs/cgroup/mygroup//NSJAIL_SELF.268
...
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL_SELF.268' for pid=268
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='0' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '1' bytes to '/sys/fs/cgroup/mygroup//NSJAIL_SELF.268/cgroup.procs'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'memory' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '7' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'pids' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '5' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'cpu' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '4' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
...
[D][2022-11-29T23:57:01+0000][268] runChild():467 Creating new process with clone flags:CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET and exit_signal:SIGCHLD
...
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'memory.max' to '1342177280'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '10' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/memory.max'
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'pids.max' to '72'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '2' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/pids.max'
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'cpu.max' to '1000000 1000000'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '15' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cpu.max'
...

(Note: if you cgexec into a shell, as I did in "Example 3" in one of my previous comments, there are other hoops to jump through. This works out of the box if you are cgexec-ing nsjail directly.)
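For anyone trying to read the debug log above, the sequence of writes it records can be sketched by hand in shell. This is only an illustration of what the log lines show, not nsjail's implementation: the function names `setup_parent` and `setup_jail` and their argument order are made up here. The reason nsjail moves itself into `NSJAIL_SELF.<pid>` first is that cgroup v2 does not let a non-root cgroup hold processes and enable controllers for its children at the same time.

```shell
# Hand-rolled sketch of the cgroup v2 steps visible in the nsjail debug log.
# ASSUMPTION: function names and arguments are illustrative, not part of nsjail.

# setup_parent ROOT SELF_NAME
# Moves the calling shell into a leaf cgroup under ROOT, then enables the
# memory, pids and cpu controllers for ROOT's children (one write each,
# matching the three subtree_control writes in the log).
setup_parent() {
    local root=$1 self_name=$2 ctrl
    mkdir -p "$root/$self_name" || return 1
    # Writing 0 to cgroup.procs moves the writing process itself.
    echo 0 > "$root/$self_name/cgroup.procs" || return 1
    for ctrl in memory pids cpu; do
        echo "+$ctrl" > "$root/cgroup.subtree_control" || return 1
    done
}

# setup_jail ROOT CHILD_PID MEM_MAX PIDS_MAX CPU_MAX
# Creates ROOT/NSJAIL.<pid>, moves the child into it and applies the limits.
setup_jail() {
    local root=$1 pid=$2 mem_max=$3 pids_max=$4 cpu_max=$5
    local jail="$root/NSJAIL.$pid"
    mkdir -p "$jail" || return 1
    echo "$pid" > "$jail/cgroup.procs" || return 1
    echo "$mem_max" > "$jail/memory.max" || return 1
    echo "$pids_max" > "$jail/pids.max" || return 1
    echo "$cpu_max" > "$jail/cpu.max" || return 1
}
```

With the values from the log, the calls would look like `setup_parent /sys/fs/cgroup/mygroup NSJAIL_SELF.268` followed by `setup_jail /sys/fs/cgroup/mygroup 269 1342177280 72 "1000000 1000000"`. If one of these steps fails when run by hand as the same user, nsjail will most likely fail at the same point.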

mattgodbolt commented 1 year ago

I think I see. Thanks @ndrewh. Seems unfortunate to have to do the two steps (and specify the weird mount point thing too) but looks like it can be made to work. I'll have to see if that also works on cgroupv1 (I presume it does).

ndrewh commented 1 year ago

I'll try to summarize (hopefully correctly) here in case someone finds this later:


These groups do not have to be the same. It sounds like for many applications you could just as well create two cgroups:

$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailparentgroup
$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailchildgroup
$ sudo cgexec -g cpu,pids,memory:jailparentgroup ./nsjail --cgroup_mem_max 10000000 --cgroup_pids_max 50 --cgroup_cpu_ms_per_sec 500 --verbose --detect_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/jailchildgroup/ -R /usr -R /bin -R /lib -R /lib64 -- /bin/bash

The user would need full ownership of /sys/fs/cgroup/jailchildgroup and additionally permission on /sys/fs/cgroup/jailparentgroup/cgroup.procs -- if you cgcreate as above, you need no additional changes. (Creating separate groups as above also avoids any cgroup.subtree_control issues, since jailchildgroup would not have any processes, only sub-cgroups.)
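A quick pre-flight check for the two-group layout could look like the sketch below. It is a rough approximation under stated assumptions: "permission" on `cgroup.procs` is interpreted as write access, "full ownership" is reduced to "the directory exists and is writable", and the function name `check_two_group_perms` is hypothetical.

```shell
# check_two_group_perms CGROOT PARENT CHILD
# ASSUMPTION: illustrative helper, not part of nsjail or libcgroup.
# Returns 0 if the current user can write PARENT/cgroup.procs and can
# create sub-cgroups in CHILD; prints each problem it finds otherwise.
check_two_group_perms() {
    local cgroot=$1 parent=$2 child=$3 status=0
    if [ ! -w "$cgroot/$parent/cgroup.procs" ]; then
        echo "no write access: $cgroot/$parent/cgroup.procs"
        status=1
    fi
    if [ ! -d "$cgroot/$child" ] || [ ! -w "$cgroot/$child" ]; then
        echo "cannot create sub-cgroups in: $cgroot/$child"
        status=1
    fi
    return $status
}
```

For the groups created above this would be run as `check_two_group_perms /sys/fs/cgroup jailparentgroup jailchildgroup` by the user who will launch nsjail. Running it as root tells you little, since root passes the write checks regardless of ownership.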

Note: I don't think this trick improves the situation in a default (but privileged) Docker container, where your best bet is making sure that nsjail is the root process (and then nsjail will move itself to create a two-group scenario similar to the above).

Gregofi commented 5 months ago

Hi! I'm having issues similar to those described here, getting the error message

[E][2024-05-11T20:33:56+0000][148] writeBufToFile():105 Couldn't write '3' bytes to file '/sys/fs/cgroup/NSJAIL.149/cgroup.procs' (fd='6'): No such file or directory

although the file exists when checking with ls.

I am running inside Docker (26.1.2) with --privileged and the mount /sys/fs/cgroup:/sys/fs/cgroup. Running nsjail with the same config directly on the host system (not in Docker) works fine. I use detect_cgroupv2 and run on Arch Linux.

Output of mount | grep cgroup:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

Output of sudo cat /sys/fs/cgroup/NSJAIL.32/cgroup.controllers:

cpuset cpu io memory hugetlb pids rdma misc

The full config can be found here.

I tried to play around with it and couldn't figure it out. It seems to work under cgroupv1 on Debian Bookworm. Any help is greatly appreciated.

ndrewh commented 5 months ago

@Gregofi I believe cgroup.procs is for cgroupsv1. Specify --detect_cgroupv2 and it will switch to v2 if it is present.

ndrewh commented 5 months ago

@Gregofi Sorry, I realized I gave a completely bogus answer... you used detect_cgroupv2 and it still didn't work.

Couldn't write '3' bytes to file '/sys/fs/cgroup/NSJAIL.149/cgroup.procs' (fd='6'): No such file or dir

The full log might be helpful here, but it's trying to move the child into the cgroup which it just created... not sure why this would fail, since it's quite literally doing one after the other:

https://github.com/google/nsjail/blob/a00a0efabc0c1bd44e24c798a19d6e46eefedb8d/cgroup2.cc#L255-L256

Some troubleshooting guesses:

Best of luck!

Gregofi commented 5 months ago

Hi, thanks for your response. Yes, upon reading your comment I also suspect Docker permissions. I tried various things. However, using the explicit --user 0:0 mapping seems to lead to an error message (from https://github.com/google/nsjail/pull/219#issuecomment-1732501151) that suggests using --cgroupns host. This solved it and no error is reported.

The first error, ending with No such file or directory, is really confusing. However, I suspect that this is because of the cgroup pseudofilesystem, so I'm not sure if it can be improved. Again, thank you very much for your help, especially when this issue wasn't really caused by nsjail.
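One way to tell which situation a container is in before launching nsjail is to look at the cgroup v2 entry in /proc/self/cgroup. The sketch below rests on an assumption worth verifying locally: inside a private cgroup namespace the path usually shows up as `/`, while with `--cgroupns host` it is the full host-side path. The helper names `cgroup2_path` and `cgroup_ns_looks_private` are hypothetical.

```shell
# ASSUMPTION: illustrative helpers, not part of nsjail or Docker.

# cgroup2_path [FILE]
# Prints the cgroup v2 path (the "0::" line) from a /proc/<pid>/cgroup
# style file. Defaults to the current process.
cgroup2_path() {
    local file=${1:-/proc/self/cgroup}
    sed -n 's/^0:://p' "$file"
}

# cgroup_ns_looks_private [FILE]
# Heuristic: returns 0 when the cgroup v2 path is "/", which is what a
# process typically sees inside a private cgroup namespace.
cgroup_ns_looks_private() {
    [ "$(cgroup2_path "$@")" = "/" ]
}

# Example container launch with the host cgroup namespace, based on the
# commands quoted in this thread (the image name is a placeholder):
#   docker run --privileged --cgroupns host \
#       -v /sys/fs/cgroup:/sys/fs/cgroup --rm -it <image> \
#       nsjail --detect_cgroupv2 ...
```

If `cgroup_ns_looks_private` succeeds inside the container while the host's /sys/fs/cgroup is bind-mounted, the paths nsjail sees and the container's own view of its cgroup may not line up, which is a plausible (but unconfirmed) source of the confusing error above.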

mattgodbolt commented 3 months ago

For what it's worth we've been able to get this working in our systems now. But we've hit a new issue when updating to an even newer Ubuntu: #236