Open carlbordum opened 2 years ago
Can you show your mounts, or at least the output of mount | grep cgroup? Also, this isn't in a Docker container, is it?
It is in a docker container. What needs to be different?
The issue persists even if we bind the cgroup mount point to the container:
docker run -v /sys/fs/cgroup:/sys/fs/cgroup --privileged --rm -it nsjailcontainer nsjail --cgroup_mem_max 104857600 --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-19T20:16:52+0000] Mode: STANDALONE_ONCE
[I][2022-05-19T20:16:52+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-19T20:16:52+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-19T20:16:52+0000] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-19T20:16:52+0000][1] logParams():267 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-19T20:16:52+0000] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-19T20:16:52+0000][1] logParams():277 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[E][2022-05-19T20:16:52+0000][1] writeBufToFile():100 Couldn't write '1' bytes to file '/sys/fs/cgroup/NSJAIL.6/cgroup.procs' (fd='6'): No such file or directory
[W][2022-05-19T20:16:52+0000][1] addPidToProcList():73 Could not update cgroup.procs
[E][2022-05-19T20:16:52+0000][1] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=6
[F][2022-05-19T20:16:52+0000][1] runChild():469 Launching child process failed
With the mount, nsjail can write to memory.max but can't move the process into the created group.
To be fair, the issue also occurs outside of nsjail. Moving a process to the new cgroup manually doesn't seem to work in a privileged Docker container. As expected, it does work outside the container. Do you know why that is and how to overcome it? What permissions are needed to move a process to a cgroup?
EDIT: below you can see some diagnosis of your issues, but I am wondering: is there any particular reason you want to use nsjail with cgroups v2 instead of v1?
Docker enables lots of options that may influence whether you can or cannot do a certain operation. For example, even if you use the --privileged flag, Docker will still use Linux namespaces, and specifically the cgroup namespace, which makes /sys/fs/cgroup/ show only the cgroup controllers and group hierarchy that were created in this container (or rather: in the namespaces that were created for it). But yeah, as @mateuszlewko showed, bind mounting the "host" cgroup mount point should help here.
Fwiw, it is hard to diagnose your issues without more details about the commands you executed or the environment you run this in. But anyway, let's try to help :).
I have tried to reproduce your issues on my side on Ubuntu 21.04, and my first issue was that /sys/fs/cgroup is read-only:
$ ./nsjail --cgroup_mem_max 104857600 --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:07+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:07+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:07+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:07+0200] Uid map: inside_uid:99999 outside_uid:1000 count:1 newuidmap:false
[I][2022-05-20T01:27:07+0200] Gid map: inside_gid:99999 outside_gid:1000 count:1 newgidmap:false
[W][2022-05-20T01:27:07+0200][30182] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30183', 0700) failed: Read-only file system
[E][2022-05-20T01:27:07+0200][30182] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30183
[F][2022-05-20T01:27:07+0200][1] runChild():469 Launching child process failed
$ sudo ./nsjail --cgroup_mem_max 104857600 --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:09+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:09+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:09+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:09+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:27:09+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[W][2022-05-20T01:27:09+0200][30188] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30189', 0700) failed: Read-only file system
[E][2022-05-20T01:27:09+0200][30188] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30189
[F][2022-05-20T01:27:09+0200][1] runChild():469 Launching child process failed
On my side, this is because I have both cgroups v1 and v2 and v2 is mounted in a different path:
$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
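As a rough sketch (not part of nsjail), the cgroup2 mount point can be picked out of mount output like the above with a few lines of Python. The function name find_cgroup2_mount is made up for illustration, and the code assumes the "SOURCE on PATH type FSTYPE (OPTIONS)" layout shown in the listing, with no spaces in the paths.

```python
def find_cgroup2_mount(mount_output):
    """Return the first mount point whose filesystem type is cgroup2, or None.

    Assumes the "SOURCE on PATH type FSTYPE (OPTIONS)" layout printed by mount.
    """
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: [source, "on", path, "type", fstype, "(options)"]
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            if parts[4] == "cgroup2":
                return parts[2]
    return None


# Three lines copied from the mount output in this comment.
sample = """\
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
"""
print(find_cgroup2_mount(sample))  # /sys/fs/cgroup/unified
```

The returned path is the kind of value one would pass to --cgroupv2_mount on a hybrid v1/v2 system.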
I was able to resolve this issue with the --cgroupv2_mount=/sys/fs/cgroup/unified flag:
$ sudo ./nsjail --cgroup_mem_max 104857600 --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2 --cgroupv2_mount=/sys/fs/cgroup/unified
[I][2022-05-20T01:29:02+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:29:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:29:02+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:29:02+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:29:02+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[I][2022-05-20T01:29:02+0200] Setting 'memory.max' to '104857600'
[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory
[W][2022-05-20T01:29:02+0200][30304] writeToCgroup():61 Could not update memory.max
[E][2022-05-20T01:29:02+0200][30304] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30305
[F][2022-05-20T01:29:02+0200][1] runChild():469 Launching child process failed
But as we can see, now I am getting the error that @carlbordum was getting:
[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory
So what happens here? Well, while the cgroup v2 memory controller does indeed expose such a file, it does not exist on my side because... I don't have the cgroup v2 memory controller enabled or even available! :(
We can see that here: according to this kernel documentation page, the cgroup.controllers file should list the available controllers (e.g. memory io cpu):
$ cat /sys/fs/cgroup/unified/cgroup.controllers
$
But it shows nothing instead! So why is that? Why are there no cgroupv2 controllers available?
If I understand correctly, this is related to what they write here:
cgroup2 filesystem has the magic number 0x63677270 (“cgrp”). All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root. Controllers which are not in active use in the v2 hierarchy can be bound to other hierarchies. This allows mixing v2 hierarchy with the legacy v1 multiple hierarchies in a fully backward compatible way.
A controller can be moved across hierarchies only after the controller is no longer referenced in its current hierarchy. Because per-cgroup controller states are destroyed asynchronously and controllers may have lingering references, a controller may not show up immediately on the v2 hierarchy after the final umount of the previous hierarchy. Similarly, a controller should be fully disabled to be moved out of the unified hierarchy and it may take some time for the disabled controller to become available for other hierarchies; furthermore, due to inter-controller dependencies, other controllers may need to be disabled too.
While useful for development and manual configurations, moving controllers dynamically between the v2 and other hierarchies is strongly discouraged for production use. It is recommended to decide the hierarchies and controller associations before starting using the controllers after system boot.
During transition to v2, system management software might still automount the v1 cgroup filesystem and so hijack all controllers during boot, before manual intervention is possible. To make testing and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and make them always available in v2.
It seems that a given controller may be bound either to v1 or to v2 but never to both of them. I guess this kinda makes sense, and just to recap, my memory controller is indeed bound to v1, as my mount output showed:
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
So if you are in the same situation as me, I guess the easiest fix is to change the kernel boot parameters and add cgroup_no_v1=memory there, and/or exclude other controllers (I'm not sure which ones nsjail uses). I guess removing all processes from cgroup v1 at runtime may be hard (e.g. since a lot of this may be managed by systemd, and idk if it supports v1->v2 migration).
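A minimal sketch of how to check for this situation before launching nsjail, assuming you already have the contents of cgroup.controllers as a string. The helper name missing_v2_controllers and the list of needed controllers (memory, pids, cpu) are assumptions for illustration, not something taken from nsjail's source.

```python
def missing_v2_controllers(controllers_text, needed):
    """Return the controllers in `needed` that cgroup.controllers does not list."""
    available = set(controllers_text.split())
    return [name for name in needed if name not in available]


needed = ["memory", "pids", "cpu"]  # assumed set, adjust to the limits you use

# An empty cgroup.controllers, as in the hybrid v1/v2 setup described above.
print(missing_v2_controllers("", needed))  # ['memory', 'pids', 'cpu']

# A file that lists the memory and cpu controllers but not pids.
print(missing_v2_controllers("memory io cpu\n", needed))  # ['pids']
```

On a real system the text would come from reading <cgroup-v2-path>/cgroup.controllers; a non-empty result means those controllers are still bound to v1 or not available at all.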
@robertswiecki with what we see above, I guess we could improve nsjail UX by checking the <cgroup-v2-path>/cgroup.controllers file and erroring out with a nice log saying that one has to enable specific cgroup v2 controllers first? And maybe linking to the kernel docs.
Wow, I am completely blown away by your helpfulness for such a poor "bug" report.
I am specifically working on this little project. It is very reproducible, so I was confused about why it stopped working, but I think it's because my systems now run cgroupv2.
Is there any decent way to run nsjail commands that work with both cgroupv1 and cgroupv2, or does my program need to detect it and inject different flags?
edit: if you want, you can clone the project and run docker-compose up -d and docker-compose logs cody to see the error.
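For the "detect it and inject different flags" route, here is a hedged sketch of what the calling program could do. It relies on the fact that every cgroup v2 directory contains a cgroup.controllers file, while a v1 layout does not have one at the mount root. The function name cgroup_flags is hypothetical; the two flags are the ones used earlier in this thread.

```python
import os
import tempfile


def cgroup_flags(mount_point="/sys/fs/cgroup"):
    """Return extra nsjail flags if `mount_point` looks like a cgroup v2 hierarchy.

    A cgroup v2 directory contains a cgroup.controllers file; a v1 tmpfs
    holding per-controller mounts does not.
    """
    if os.path.isfile(os.path.join(mount_point, "cgroup.controllers")):
        return ["--use_cgroupv2", "--cgroupv2_mount", mount_point]
    return []


# Demonstration against throwaway directories instead of the real /sys/fs/cgroup.
with tempfile.TemporaryDirectory() as fake_v2:
    open(os.path.join(fake_v2, "cgroup.controllers"), "w").close()
    print(cgroup_flags(fake_v2)[:2])  # ['--use_cgroupv2', '--cgroupv2_mount']

with tempfile.TemporaryDirectory() as fake_v1:
    print(cgroup_flags(fake_v1))  # []
```

The returned list would be appended to the nsjail command line; an empty list leaves the v1 defaults in place.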
@disconnect3d thank you for such cool in-depth analysis.
I'm vaguely familiar with cgroups2 myself, but I guess I can take a look at what can be improved here.
Though, if anyone will beat me to that, I won't complain :)
We're also seeing issues in trying to get nsjail running for Compiler Explorer on newer cgroups (on Ubuntu 22.04):
[D][2022-11-07T21:41:28-0600][6205] bool cgroup::createCgroup(const string&, pid_t)():41 Create '/sys/fs/cgroup/memory/ce-compile/NSJAIL.6207' for pid=6207
[W][2022-11-07T21:41:28-0600][6205] bool cgroup::createCgroup(const string&, pid_t)():43 mkdir('/sys/fs/cgroup/memory/ce-compile/NSJAIL.6207', 0700) failed: No such file or directory
(or with --use_cgroupv2)
[D][2022-11-07T21:51:35-0600][7547] bool cgroup2::createCgroup(const string&, pid_t)():47 Create '/sys/fs/cgroup/NSJAIL.7548' for pid=7548
[W][2022-11-07T21:51:35-0600][7547] bool cgroup2::createCgroup(const string&, pid_t)():49 mkdir('/sys/fs/cgroup/NSJAIL.7548', 0700) failed: Permission denied
is what we see, which may be similar. We wouldn't choose to run cgroups2 but Ubuntu 22.04 seems to have made it the default, and it's easier not to special case boot params to get it back to the old system.
In my case:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$ mount | grep cg
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
so it doesn't quite seem to be the same issue as others have seen, though I'm having trouble with the cgcreate equivalent, so now I might "just" be hitting a hole in the tooling.
That said: with a bit more hacking and fiddling with settings, I supplied a different cgroups2 mount (including the name of the parent cgroup I had cgcreate-d), and it got further:
[D][2022-11-07T22:06:19-0600][8754] bool cgroup2::createCgroup(const string&, pid_t)():47 Create '/sys/fs/cgroup/ce-compile/NSJAIL.8755' for pid=8755
[D][2022-11-07T22:06:19-0600][8754] bool cgroup2::addPidToProcList(const string&, pid_t)():70 Adding pid='8755' to cgroup.procs
[E][2022-11-07T22:06:19-0600][8754] bool util::writeBufToFile(const char*, const void*, size_t, int)():100 Couldn't write '4' bytes to file '/sys/fs/cgroup/ce-compile/NSJAIL.8755/cgroup.procs' (fd='6'): Permission denied
[W][2022-11-07T22:06:19-0600][8754] bool cgroup2::addPidToProcList(const string&, pid_t)():73 Could not update cgroup.procs
[E][2022-11-07T22:06:19-0600][8754] bool subproc::initParent(nsjconf_t*, pid_t, int)():392 Couldn't initialize cgroup 2 user namespace for pid=8755
[F][2022-11-07T22:06:19-0600][1] bool subproc::runChild(nsjconf_t*, int, int, int, int)():448 Launching child process failed
and ...
$ ls -l /sys/fs/cgroup/ce-compile/NSJAIL.8755/
total 0
-r--r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.controllers
-r--r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.events
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.freeze
--w------- 1 matthew matthew 0 Nov 7 22:06 cgroup.kill
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.max.depth
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.max.descendants
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.procs
-r--r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.stat
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.subtree_control
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.threads
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cgroup.type
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 cpu.pressure
-r--r--r-- 1 matthew matthew 0 Nov 7 22:06 cpu.stat
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 io.pressure
-rw-r--r-- 1 matthew matthew 0 Nov 7 22:06 memory.pressure
so it looks like I ought to be able to write to the file.
At least for my use-case (running nsjail in LISTEN mode as the root process in a Docker container), once I specified --use_cgroupv2, the issue was that /sys/fs/cgroup/cgroup.subtree_control was empty. This means that no cgroup created by nsjail can inherit any of the controllers (even though they are all present in /sys/fs/cgroup/cgroup.controllers).
I worked some on a fix for this in my fork. All we really need to do is look at the root cgroup.subtree_control and make sure the controllers we need are there. If they aren't there, we need to add them (in fact, this is exactly what redpwn/jail does). A minor issue is that in order to modify cgroup.subtree_control, you have to move all processes out of the root cgroup (this is apparently a thing called the "no internal processes" rule). Currently my patch only handles the case where nsjail is the only process in the root cgroup. This is good enough for my use-case, but probably not good enough for others.
My patch is here, and works for my use case, but likely needs some work to be useful to others: https://github.com/google/nsjail/compare/master...ndrewh:cgroupsv2-fix
(Footnote: If you're going to try my fork, nsjail needs to be the root process in the cgroup. You can accomplish this by invoking nsjail using the exec form (the execve variant) of the CMD Dockerfile directive, i.e. use CMD ["/usr/bin/nsjail", ...], not CMD nsjail ... Note that the exec form is parsed as a JSON array, so it needs double quotes.)
@ndrewh I'm not running on docker in my case; this is "just" on a plain Ubuntu 22.04 system
I looked at this a little more, since I know I've run into issues running on stock 22.04 as well. I tried this on a 22.04 desktop in virtualbox. I did see slightly different initial behavior in AWS, but I think what's here should still be helpful.
Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
For some reason, on boot the cpu controller is missing from cgroup.subtree_control (why? I have no idea):
$ cat /sys/fs/cgroup/cgroup.subtree_control
memory pids
If you just straight up run nsjail now in the root cgroup (with sudo, so it can create its child cgroup), --cgroup_mem_max works fine, but if you set --cgroup_cpu_ms_per_sec you'll get:
$ sudo ./nsjail -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 -- /bin/bash -i
[I][2022-11-16T15:29:05-0500] Setting 'cpu.max' to '500000 1000000'
[E][2022-11-16T15:29:05-0500][4983] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/NSJAIL.4984/cpu.max' for writing: No such file or directory
[W][2022-11-16T15:29:05-0500][4983] writeToCgroup():61 Could not update cpu.max
[E][2022-11-16T15:29:05-0500][4983] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=4984
[F][2022-11-16T15:29:05-0500][1] runChild():483 Launching child process failed
Fix:
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
(If you try the same thing on my fork, it'll do this last line for you. Whether this is desirable behavior in general or not, I am not sure.)
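To make the fix above concrete, here is a small sketch of how one might compute what to write into cgroup.subtree_control from its current contents. The function name subtree_control_update is made up for this example; it only builds the string and does not touch the filesystem.

```python
def subtree_control_update(subtree_control_text, needed):
    """Return the string to write to cgroup.subtree_control, or "" if nothing is missing."""
    enabled = set(subtree_control_text.split())
    return " ".join("+" + name for name in needed if name not in enabled)


# The boot-time state shown above: cpu is missing.
print(subtree_control_update("memory pids\n", ["memory", "pids", "cpu"]))  # +cpu

# An empty subtree_control, with cpu and memory limits requested.
print(subtree_control_update("", ["cpu", "memory"]))  # +cpu +memory
```

Writing the returned string to the parent's cgroup.subtree_control is the same operation as the echo command above, and an empty string means there is nothing to enable.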
/sys/fs/cgroup/cgroup.procs)
OK, but what if, instead of using the root cgroup, we want to make a new cgroup (as @mattgodbolt was trying) and give our user permissions to use it?
sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest
I think the permissions error @mattgodbolt was running into is due to the fact that you don't have permission to move processes out of the root cgroup. We can fix that:
sudo chown andrew:root /sys/fs/cgroup/cgroup.procs
Now nsjail can move its children into the appropriate cgroup, and we get a little further:
$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest/ -- /bin/bash -i
...
[E][2022-11-16T16:58:12-0500][3096] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest//NSJAIL.3097/memory.max' for writing: No such file or directory
[W][2022-11-16T16:58:12-0500][3096] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T16:58:12-0500][3096] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3097
We need just one more thing, since /sys/fs/cgroup/jailtest/cgroup.subtree_control is empty:
echo "+cpu +memory" > /sys/fs/cgroup/jailtest/cgroup.subtree_control
Now nsjail works :)
sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest3
sudo cgexec -g memory,cpu:jailtest3 sudo -s -u andrew
andrew@andrew2204:~/nsjail$
Now we are in the child cgroup... let's try to run nsjail:
$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest3/ -- /bin/bash -i
...
[I][2022-11-16T17:13:58-0500] Setting 'memory.max' to '1000000'
[E][2022-11-16T17:13:58-0500][3251] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest3//NSJAIL.3252/memory.max' for writing: No such file or directory
[W][2022-11-16T17:13:58-0500][3251] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T17:13:58-0500][3251] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3252
Same issue... let's do the same thing, right?
$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control
bash: echo: write error: Device or resource busy
Why can't we do this? It's because the "no internal processes" rule won't let us have controllers in cgroup.subtree_control if our cgroup currently has processes. First, let's see how to fix this manually:
$ cat /sys/fs/cgroup/jailtest3/cgroup.procs
3281
3282
3283
3299
$ mkdir /sys/fs/cgroup/jailtest3/lol/
$ echo "3281" > /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3282" > /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3283" > /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control
(Since we spawned a shell, we have a couple of processes in the jailtest3 cgroup -- these have to be moved before we can add to subtree_control)
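The manual steps above can be sketched as a small planning function. This is only an illustration: evacuation_plan is a hypothetical helper that lists the writes to perform rather than performing them, and it issues one write per PID because cgroup.procs migrates a single process per write.

```python
def evacuation_plan(cgroup_dir, procs_text, child="leaf"):
    """List (path, pid) writes that move every listed PID into a child cgroup.

    `procs_text` is the content of <cgroup_dir>/cgroup.procs. The child
    directory is assumed to have been created with mkdir beforehand.
    """
    target = f"{cgroup_dir.rstrip('/')}/{child}/cgroup.procs"
    return [(target, pid) for pid in procs_text.split()]


# The three PIDs that were moved by hand above, with the same child name.
plan = evacuation_plan("/sys/fs/cgroup/jailtest3", "3281\n3282\n3283\n", child="lol")
for path, pid in plan:
    print(f'echo "{pid}" > {path}')
```

Once every write has succeeded and the parent's cgroup.procs is empty, the "+cpu +memory" write to cgroup.subtree_control no longer fails with "Device or resource busy". A PID that has already exited would simply fail to move and can be skipped.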
Now nsjail works!
The point of my PR is to add controllers to cgroup.subtree_control (the very last command I ran in each of these examples) if they are not present. For the last example, it doesn't seem like nsjail ought to move all those processes into a subgroup blindly -- so my PR only handles the case where nsjail is the only process in the group.
Thanks all for working on this and contributing. My cgroup1/2-foo is not great, but from what I can tell it works as expected.
./nsjail --config configs/bash-with-fake-geteuid.cfg --detect_cgroupv2 --cgroup_cpu_ms_per_sec 100 --cgroupv2_mount /sys/fs/cgroup/user.slice/user-1000.slice/user\@1000.service/
...
[JAILED-BASH:21:33:03:sh-5.2.2:/tmp]# openssl speed
And I can see that only 10% of a single CPU core is used (via top and with openssl speed results). So.. promising.
Without cgroups
[JAILED-BASH:21:35:48:sh-5.2.2:/tmp]# openssl speed
Doing md5 for 3s on 16 size blocks: 19115100 md5's in 3.00s
With cgroups
[JAILED-BASH:21:36:05:sh-5.2.2:/tmp]# openssl speed
Doing md5 for 3s on 16 size blocks: 2666542 md5's in 0.39s
Not exactly 10%, but close enough, assuming the cores were not isolated for the test.
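For reference, the arithmetic behind that comparison, as a short sketch. The cpu_max_value helper is hypothetical; its "quota period" format in microseconds is inferred from the 'cpu.max' log lines elsewhere in this thread (500 ms per second was logged as '500000 1000000'), not from nsjail's source.

```python
def cpu_max_value(ms_per_sec, period_us=1_000_000):
    """cpu.max string for a quota of `ms_per_sec` CPU milliseconds per second."""
    return f"{ms_per_sec * 1000} {period_us}"


print(cpu_max_value(500))  # 500000 1000000, as in the log line for 500 ms
print(cpu_max_value(100))  # 100000 1000000, i.e. a 10% quota

without_cgroups = 19115100  # md5 operations reported without the limit
with_cgroups = 2666542      # md5 operations reported with --cgroup_cpu_ms_per_sec 100
ratio = with_cgroups / without_cgroups
print(f"measured throughput ratio: {ratio:.3f}")
```

The measured ratio comes out at roughly 14% against an expected 10%.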
Some progress at least: with the c7c0adfffe79ebebfacca003f3cd8e27ef909185 version, and having run sudo cgcreate -a matthew -t matthew -g memory,pids,cpu:ce-sandbox, I now get:
$ nsjail --detect_cgroupv2 --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/ce-sandbox/ --verbose -- /bin/bash
...
[D][2022-11-28T17:16:31-0600][554658] addProc():243 Added pid=554659 with start time 1669677391 to the queue for IP: '[STANDALONE MODE]'
[D][2022-11-28T17:16:31-0600][554658] createCgroup():41 Create '/sys/fs/cgroup/memory/ce-compile/NSJAIL.554659' for pid=554659
[W][2022-11-28T17:16:31-0600][554658] createCgroup():43 mkdir('/sys/fs/cgroup/memory/ce-compile/NSJAIL.554659', 0700) failed: No such file or directory
[E][2022-11-28T17:16:31-0600][554658] initParent():429 Couldn't initialize cgroup user namespace for pid=554659
[F][2022-11-28T17:16:31-0600][1] runChild():483 Launching child process failed
It's not clear to me why I need to pass --cgroupv2_mount (but it makes things "better"). And I don't know what the error with "no such file or directory" means in this context; it doesn't seem to map to anything I can see from earlier comments. Overall context is me trying to update my project's setup from cgroup1 to cgroup2, after upgrading and now being left in an unfortunate state of not being able to run the jailing scripts on my local machine that my deployed system uses :) If a kind soul on this repo can help, I'd be super grateful.
@mattgodbolt An unfortunate fix:
$ nsjail --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/ce-sandbox/ --verbose --detect_cgroupv2 -- /bin/bash
yup, argument order apparently matters (relative to the --config?), I guess :/
BTW - all of these options can be specified in the cfg file as well. Thank you for Compiler Explorer! ❤️
Yup, the current way of parsing args is to run the config file parsing at the moment --config is spotted on the command line.
I thought it was a clever way of doing things, but clever is not always the best :).
However, changing it now would a) break backwards compatibility and b) not be so easy to implement, because two passes over the cmdline arguments would be needed (first the file, then the args), or some way of caching them.
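A toy model of that behaviour, to show why the position of --config matters. This is not nsjail's parser: parse_args, the dictionary-based config and the assumption that the config file sets detect_cgroupv2 to false (explicitly or through its defaults) are all made up for illustration.

```python
def parse_args(argv, config_files):
    """Apply options left to right, loading a config file when --config is seen."""
    settings = {}
    args = iter(argv)
    for arg in args:
        if arg == "--config":
            # The file is applied immediately, so it overrides anything set earlier.
            settings.update(config_files[next(args)])
        elif arg == "--detect_cgroupv2":
            settings["detect_cgroupv2"] = True
    return settings


# Assumed config content: the flag is off unless something later turns it on.
configs = {"execute.cfg": {"detect_cgroupv2": False}}

# Flag before --config: the config file wins and detection stays off.
print(parse_args(["--detect_cgroupv2", "--config", "execute.cfg"], configs))
# {'detect_cgroupv2': False}

# Flag after --config: the flag wins.
print(parse_args(["--config", "execute.cfg", "--detect_cgroupv2"], configs))
# {'detect_cgroupv2': True}
```

Under this model, any flag meant to override the config file has to come after --config on the command line.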
thanks all:
[D][2022-11-29T07:17:29-0600][556719] addPidToProcList():117 Adding pid='556720' to cgroup.procs
[E][2022-11-29T07:17:29-0600][556719] writeBufToFile():105 Couldn't write '6' bytes to file '/sys/fs/cgroup/ce-sandbox//NSJAIL.556720/cgroup.procs' (fd='6'): Permission denied
[W][2022-11-29T07:17:29-0600][556719] addPidToProcList():120 Could not update cgroup.procs
[E][2022-11-29T07:17:29-0600][556719] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=556720
is now more in line with the other stuff here I think?
BTW - all of these options can be specified in the cfg file as well.
right! I'm just trying to use the existing config as it's far easier to supply a couple extra cmdline flags on a v2 system than it is to have two config files, one for v1 and one for v2 (Ideally I can support both as we transition).
Thank you for Compiler Explorer!
You're so welcome! nsjail is a big part of what makes it (mostly) secure :)
--
Yup, the current way of parsing args is to run file config parsing at the moment the --config is spotted in the cmd-line.
makes sense to me! thanks! :)
@mattgodbolt
The point of --detect_cgroupv2 (at least, as I intended it) was to allow you to specify options for both v1 and v2, and nsjail would infer which options are valid at runtime. So if you specify 'cgroupv2_mount' and 'detect_cgroupv2: true' in the config file, it should be backwards-compatible. It will check if the v2 mount is a valid cgroupv2 filesystem and will use v2 only if it is.
As for the permissions error, I think you're closest to the "Example 2" in my previous comment. My guess is nsjail does not have permission to move the child out of the current cgroup. You can fix this by either (1) spawning nsjail inside a cgroup it has permissions to move children out of (e.g. via cgexec or Docker), or (2) modifying the permissions on the cgroup.procs file for nsjail's current cgroup (probably either the root one or the one associated with your terminal).
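To find out which cgroup.procs file that is, one option is to parse /proc/self/cgroup, where the cgroup v2 entry has the form "0::<path>". The sketch below is an assumption-laden illustration: current_cgroup_procs_path is a hypothetical helper, the default mount point is /sys/fs/cgroup, and the session scope path in the second call is an invented example.

```python
def current_cgroup_procs_path(proc_self_cgroup, mount_point="/sys/fs/cgroup"):
    """Return the cgroup.procs path of the caller's cgroup v2 group, or None.

    `proc_self_cgroup` is the text of /proc/self/cgroup.
    """
    for line in proc_self_cgroup.splitlines():
        if line.startswith("0::"):
            rel = line[len("0::"):].strip("/")
            base = mount_point.rstrip("/")
            return f"{base}/{rel}/cgroup.procs" if rel else f"{base}/cgroup.procs"
    return None


# A process sitting in the root cgroup.
print(current_cgroup_procs_path("0::/\n"))
# /sys/fs/cgroup/cgroup.procs

# A process in a (made-up) login session scope.
print(current_cgroup_procs_path("0::/user.slice/user-1000.slice/session-3.scope\n"))
# /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs
```

On a real system you would pass in the contents of /proc/self/cgroup and then check whether the returned file is writable by the user running nsjail.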
Awesome! Thanks that clears up a few things. I'll try fiddling with settings on 22.04 to see if I can work out what environmental things need changing both for me as a user and then also in the VM for the site (which can be more bespoke)
Cheers!
@ndrewh I was able to get things working with that chown! yay!
That works for my specific use case, but more generally, on a multi-tenant system, is there any way this can be made to work, do you think? Is that an Ubuntu issue?
@mattgodbolt I don't think it's an Ubuntu issue. I think it's just that you need nsjail to be in a cgroup that it has permission to move its child processes out of.
I think the following should work on a multi-tenant system:
Make a new cgroup:
sudo cgcreate -a $USER -t $USER -g memory,cpu,pids:mygroup
Run nsjail in that new cgroup
cgexec -g memory,cpu,pids:mygroup nsjail --config etc/nsjail/execute.cfg --cgroupv2_mount=/sys/fs/cgroup/mygroup/ --verbose --detect_cgroupv2 -- /bin/bash
...
[I][2022-11-29T23:57:01+0000] Detected cgroups version: 2
[I][2022-11-29T23:57:01+0000] nsjail is moving itself to a new child cgroup: /sys/fs/cgroup/mygroup//NSJAIL_SELF.268
...
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL_SELF.268' for pid=268
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='0' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '1' bytes to '/sys/fs/cgroup/mygroup//NSJAIL_SELF.268/cgroup.procs'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'memory' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '7' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'pids' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '5' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
[D][2022-11-29T23:57:01+0000][268] enableCgroupSubtree():61 Enable cgroup.subtree_control +'cpu' to '/sys/fs/cgroup/mygroup/' for pid=268
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '4' bytes to '/sys/fs/cgroup/mygroup//cgroup.subtree_control'
...
[D][2022-11-29T23:57:01+0000][268] runChild():467 Creating new process with clone flags:CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET and exit_signal:SIGCHLD
...
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'memory.max' to '1342177280'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '10' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/memory.max'
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'pids.max' to '72'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '2' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/pids.max'
[D][2022-11-29T23:57:01+0000][268] createCgroup():52 Create '/sys/fs/cgroup/mygroup//NSJAIL.269' for pid=269
[D][2022-11-29T23:57:01+0000][268] addPidToProcList():86 Adding pid='269' to cgroup.procs
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '3' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cgroup.procs'
[I][2022-11-29T23:57:01+0000] Setting 'cpu.max' to '1000000 1000000'
[D][2022-11-29T23:57:01+0000][268] writeBufToFile():109 Written '15' bytes to '/sys/fs/cgroup/mygroup//NSJAIL.269/cpu.max'
...
(note: if you cgexec into a shell as I did in "Example 3" in one of my previous comments, there are other hoops to jump through. This works out-of-the-box if you are directly cgexec-ing nsjail)
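For anyone debugging this by hand: the sequence in the debug log above (enable controllers in cgroup.subtree_control, create NSJAIL.<pid>, write the pid to cgroup.procs, then write the limit) can be replayed from a shell to see which step fails. This is only a rough sketch of what the log shows, not nsjail's actual code. The function name and its arguments are made up for illustration, and enabling the memory controller is an assumption based on memory.max being set later in the log.

```shell
# Hypothetical helper that mirrors the steps visible in the log above.
#   $1 = cgroup root (the directory passed as --cgroupv2_mount)
#   $2 = pid to move into the new cgroup
#   $3 = value for memory.max, in bytes
replay_cgroup2_setup() {
    root=$1
    pid=$2
    mem_max=$3
    child="$root/NSJAIL.$pid"

    # Enable each controller for children of the root cgroup, one write per
    # controller (the log shows separate writes for '+pids' and '+cpu';
    # 'memory' is assumed here).
    for ctrl in memory pids cpu; do
        printf '+%s' "$ctrl" > "$root/cgroup.subtree_control" || return 1
    done

    # Create the per-process cgroup, move the pid into it, then set the limit.
    mkdir -p "$child" || return 1
    printf '%s' "$pid" > "$child/cgroup.procs" || return 1
    printf '%s' "$mem_max" > "$child/memory.max" || return 1
}

# Example on a real system (values taken from the log; the pid must exist):
#   replay_cgroup2_setup /sys/fs/cgroup/mygroup 269 1342177280
```

If one of the writes fails, the shell prints the underlying error for that exact file, which is easier to read than the combined nsjail output.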
I think I see. Thanks @ndrewh. Seems unfortunate to have to do the two steps (and specify the weird mount point thing too) but looks like it can be made to work. I'll have to see if that also works on cgroupv1 (I presume it does).
I'll try to summarize (hopefully correctly) here in case someone finds this later:
--cgroupv2_mount is the root at which nsjail will create its individual child process cgroups. nsjail needs to have permission to create cgroups (i.e. make subdirectories) at this path, and the cgroup needs to have either no processes in it, or just nsjail (in the case where nsjail is in this group, nsjail will move itself into a subgroup for technical reasons).
The cgroup that nsjail is running in is also important, because nsjail needs to have permission to remove its child processes from that cgroup (nsjail needs to have permissions for its current cgroup's cgroup.procs file). By cgexec-ing you only need to chown the cgroup.procs file for that cgroup.
These groups do not have to be the same. It sounds like for many applications you could just as well create two cgroups:
jailparentgroup, that you run nsjail in (via cgexec)
jailchildgroup, that you pass as --cgroupv2_mount=/sys/fs/cgroup/jailchildgroup/, which is where nsjail would make subgroups with all the restrictions and move the child processes.
$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailparentgroup
$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailchildgroup
$ sudo cgexec -g cpu,pids,memory:jailparentgroup ./nsjail --cgroup_mem_max 10000000 --cgroup_pids_max 50 --cgroup_cpu_ms_per_sec 500 --verbose --detect_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/jailchildgroup/ -R /usr -R /bin -R /lib -R /lib64 -- /bin/bash
The user would need full ownership of /sys/fs/cgroup/jailchildgroup and additionally permission on /sys/fs/cgroup/jailparentgroup/cgroup.procs -- if you cgcreate as above, you need no additional changes. (Creating separate groups as above also avoids any cgroup.subtree_control issues, since jailchildgroup would not have any processes, only sub-cgroups.)
Note I don't think this trick improves the situation in a default (but privileged) Docker container, where your best bet is making sure that nsjail is the root process (and then nsjail will move itself to create a 2-group scenario similar to above).
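To check the two permission requirements from the summary before launching nsjail, something like the following could be used. It is a minimal sketch: both function names are hypothetical (they are not part of nsjail or libcgroup), and it assumes a pure cgroup v2 layout where /proc/<pid>/cgroup contains a line of the form "0::/path".

```shell
# Hypothetical helpers, not part of nsjail or libcgroup.

# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup file.
# cgroup v2 entries look like "0::/jailparentgroup".
cgroup2_path() {
    sed -n 's/^0:://p' "$1"
}

# Check the two requirements from the summary:
#   $1 = cgroup nsjail runs in (its cgroup.procs must be writable)
#   $2 = cgroup passed as --cgroupv2_mount (must be a writable directory)
check_jail_cgroups() {
    parent=$1
    child=$2
    rc=0
    if ! [ -w "$parent/cgroup.procs" ]; then
        echo "cannot write $parent/cgroup.procs"
        rc=1
    fi
    if ! { [ -d "$child" ] && [ -w "$child" ]; }; then
        echo "cannot create cgroups under $child"
        rc=1
    fi
    return "$rc"
}

# Example on a real system:
#   check_jail_cgroups "/sys/fs/cgroup$(cgroup2_path /proc/self/cgroup)" \
#       /sys/fs/cgroup/jailchildgroup
```

This only tests file permissions as seen by the current user; it does not catch cgroup namespace or subtree_control problems.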
Hi! I'm having similar issues described here, getting the error message
[E][2024-05-11T20:33:56+0000][148] writeBufToFile():105 Couldn't write '3' bytes to file '/sys/fs/cgroup/NSJAIL.149/cgroup.procs' (fd='6'): No such file or directory
although the file exists when checking with ls.
I am running inside Docker (26.1.2) with --privileged and mount /sys/fs/cgroup:/sys/fs/cgroup.
Running nsjail with the same config under the host system directly (not in Docker) works fine. I use the detect_cgroupv2 option and run on Arch Linux.
Output of mount | grep cgroup:
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
Output of sudo cat /sys/fs/cgroup/NSJAIL.32/cgroup.controllers:
cpuset cpu io memory hugetlb pids rdma misc
The full config can be found here.
I tried to play around with it and couldn't figure it out. It seems to work under cgroupv1 on Debian Bookworm. Any help is greatly appreciated.
@Gregofi I believe cgroup.procs is for cgroupsv1. Specify --detect_cgroupv2 and it will switch to v2 if it is present.
@Gregofi Sorry, I realized I gave a completely bogus answer... you used detect_cgroupv2 and it still didn't work.
Couldn't write '3' bytes to file '/sys/fs/cgroup/NSJAIL.149/cgroup.procs' (fd='6'): No such file or dir
The full log might be helpful here, but it's trying to move the child into the cgroup which it just created... not sure why this would fail, since it's quite literally doing one after the other:
https://github.com/google/nsjail/blob/a00a0efabc0c1bd44e24c798a19d6e46eefedb8d/cgroup2.cc#L255-L256
Some troubleshooting guesses:
/sys/fs/cgroup/cgroup.procs and /sys/fs/cgroup/NSJAIL.XXX/cgroup.procs (despite only one being accessed). (see here). I suspect that mapping as a volume into docker is screwing up the perms.
Best of luck!
Hi, thanks for your response. Yes, upon reading your comment I also suspect docker permissions. I tried various things. However, using the explicit --user 0:0 mapping seems to lead to an error message (from https://github.com/google/nsjail/pull/219#issuecomment-1732501151) that suggests using --cgroupns host. This solved it and no error is reported.
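For anyone landing here later, the working invocation would look roughly like the sketch below. The image name nsjailcontainer is the one from the first post; the function name and the DOCKER variable (a dry-run convenience) are made up for this example, and the nsjail arguments are whatever you already use.

```shell
# Sketch: start the container in the host cgroup namespace so that the
# cgroup paths nsjail writes to match the hierarchy bind-mounted at
# /sys/fs/cgroup. Function name and DOCKER variable are hypothetical.
run_jail_container() {
    ${DOCKER:-docker} run --rm -it --privileged \
        --cgroupns host \
        -v /sys/fs/cgroup:/sys/fs/cgroup \
        nsjailcontainer \
        nsjail --detect_cgroupv2 "$@"
}

# Example:
#   run_jail_container --cgroup_mem_max 104857600 -- /bin/bash
# Dry run that only prints the command line:
#   DOCKER=echo run_jail_container --cgroup_mem_max 104857600 -- /bin/bash
```

Note that --cgroupns host exposes the host's cgroup hierarchy to the container, so it is only appropriate where --privileged is already acceptable.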
The first error, ending with No such file or directory, is really confusing. However, I suspect that this is because of the cgroup pseudofilesystem, so not sure if it can be improved. Again, thank you very much for your help, especially when this issue wasn't really caused by nsjail.
For what it's worth we've been able to get this working in our systems now. But we've hit a new issue when updating to an even newer Ubuntu: #236
For example, when I run nsjail with --use_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/NSJAIL, I still see errors like
If I understand cgroups v2 correctly, it should look for /sys/fs/cgroup/NSJAIL/memory.max, not /sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max. /sys/fs/cgroup/NSJAIL exists.