google / nsjail

A lightweight process isolation tool that utilizes Linux namespaces, cgroups, rlimits and seccomp-bpf syscall filters, leveraging the Kafel BPF language for enhanced security.
https://nsjail.dev
Apache License 2.0

Ubuntu 24.x "permission denied" in `mount(/, /)` #236

Closed mattgodbolt closed 3 months ago

mattgodbolt commented 3 months ago

We have nsjail working on Ubuntu 20.x with cgroups v2 (despite initially hitting issues around #196), but on an upgraded machine now running 24.x we see this (tail of a log):

[I][2024-08-03T13:09:49-0500] Uid map: inside_uid:10240 outside_uid:1000 count:1 newuidmap:false
[I][2024-08-03T13:09:49-0500] Gid map: inside_gid:10240 outside_gid:1000 count:1 newgidmap:false
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGINT (2)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGQUIT (3)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGUSR1 (10)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGALRM (14)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGCHLD (17)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGTERM (15)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGTTIN (21)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGTTOU (22)
[D][2024-08-03T13:09:49-0500][2054703] setSigHandler():68 Setting sighandler for signal SIGPIPE (13)
[I][2024-08-03T13:09:49-0500] Detected cgroups version: 2
[D][2024-08-03T13:09:49-0500][2054703] runChild():471 Creating new process with clone flags:CLONE_NEWNS|CLONE_NEWCGROUP|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWNET and exit_signal:SIGCHLD
[D][2024-08-03T13:09:49-0500][2054703] addProc():251 Added pid=2054704 with start time 1722708589 to the queue for IP: '[STANDALONE MODE]'
[D][2024-08-03T13:09:49-0500][2054703] createCgroup():64 Create '/sys/fs/cgroup/ce-compile/NSJAIL.2054704' for pid=2054704
[D][2024-08-03T13:09:49-0500][2054703] addPidToProcList():47 Adding pid='2054704' to cgroup.procs
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '7' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/cgroup.procs'
[I][2024-08-03T13:09:49-0500] Setting 'memory.max' to '1342177280'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '10' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/memory.max'
[D][2024-08-03T13:09:49-0500][2054703] createCgroup():64 Create '/sys/fs/cgroup/ce-compile/NSJAIL.2054704' for pid=2054704
[D][2024-08-03T13:09:49-0500][2054703] addPidToProcList():47 Adding pid='2054704' to cgroup.procs
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '7' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/cgroup.procs'
[I][2024-08-03T13:09:49-0500] Setting 'pids.max' to '72'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '2' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/pids.max'
[D][2024-08-03T13:09:49-0500][2054703] createCgroup():64 Create '/sys/fs/cgroup/ce-compile/NSJAIL.2054704' for pid=2054704
[D][2024-08-03T13:09:49-0500][2054703] addPidToProcList():47 Adding pid='2054704' to cgroup.procs
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '7' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/cgroup.procs'
[I][2024-08-03T13:09:49-0500] Setting 'cpu.max' to '1000000 1000000'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '15' bytes to '/sys/fs/cgroup/ce-compile/NSJAIL.2054704/cpu.max'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '4' bytes to '/proc/2054704/setgroups'
[D][2024-08-03T13:09:49-0500][2054703] gidMapSelf():171 Writing '10240 1000 1
' to '/proc/2054704/gid_map'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '13' bytes to '/proc/2054704/gid_map'
[D][2024-08-03T13:09:49-0500][2054703] uidMapSelf():143 Writing '10240 1000 1
' to '/proc/2054704/uid_map'
[D][2024-08-03T13:09:49-0500][2054703] writeBufToFile():115 Written '13' bytes to '/proc/2054704/uid_map'
[D][2024-08-03T13:09:49-0500][1] setResGid():65 setresgid(10240)
[D][2024-08-03T13:09:49-0500][1] initNsFromChild():289 setgroups(0, [])
[D][2024-08-03T13:09:49-0500][1] initNsFromChild():296 setgroups(0, []) failed: Operation not permitted
[D][2024-08-03T13:09:49-0500][1] setResUid():81 setresuid(10240)
[D][2024-08-03T13:09:49-0500][1] mkdirAndTest():296 Created accessible directory in '/run/user/1000/nsjail'
[D][2024-08-03T13:09:49-0500][1] mkdirAndTest():296 Created accessible directory in '/run/user/1000/nsjail/root'
[E][2024-08-03T13:09:49-0500][1] initCloneNs():391 mount('/', '/', NULL, MS_REC|MS_PRIVATE, NULL): Permission denied
[F][2024-08-03T13:09:49-0500][1] runChild():487 Launching child process failed
[W][2024-08-03T13:09:49-0500][2054703] runChild():507 Received error message from the child process before it has been executed
[E][2024-08-03T13:09:49-0500][2054703] standaloneMode():275 Couldn't launch the child process
[D][2024-08-03T13:09:49-0500][2054703] main():376 Returning with 255

Seemingly it can't mount the root directory (?), which seems surprising. The command is:

nsjail --verbose --config etc/nsjail/compilers-and-tools.cfg -- /bin/bash

and the referenced cfg file is https://github.com/compiler-explorer/compiler-explorer/blob/main/etc/nsjail/compilers-and-tools.cfg (with the log_level set to DEBUG).

Additionally, these commands were run beforehand to get the cgroups working:

sudo cgcreate -a $USER:$USER -g memory,pids,cpu:ce-compile
sudo chown $USER:root /sys/fs/cgroup/cgroup.procs

mattgodbolt commented 3 months ago

I just thought to check dmesg and:

[266713.881047] audit: type=1400 audit(1722708589.045:468): apparmor="AUDIT" operation="userns_create" class="namespace" info="Userns create - transitioning profile" profile="unconfined" pid=2054703 comm="nsjail" requested="userns_create" target="unprivileged_userns"
[266713.893050] audit: type=1400 audit(1722708589.057:469): apparmor="DENIED" operation="capable" class="cap" profile="unprivileged_userns" pid=2054704 comm="nsjail" capability=6  capname="setgid"
[266713.893126] audit: type=1400 audit(1722708589.057:470): apparmor="DENIED" operation="mount" class="mount" info="failed mntpnt match" error=-13 profile="unprivileged_userns" name="/" pid=2054704 comm="nsjail" flags="rw, rprivate"

mattgodbolt commented 3 months ago

Looks like this came in 23.10: https://ubuntu.com/blog/ubuntu-23-10-restricted-unprivileged-user-namespaces; now trying to work out how to disable it. Leaving this issue open in case folks who have more experience have advice.
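
For anyone landing here with the same symptom, a quick check of whether the restriction is active may save some time. This is only a sketch: it assumes the sysctl `kernel.apparmor_restrict_unprivileged_userns` (the one used in the workaround further down this thread) is exposed at the usual procfs path, and the function name `userns_restriction_status` is made up for illustration.

```shell
#!/bin/sh
# Report whether AppArmor's unprivileged user namespace restriction looks enabled.
# Argument 1 (optional): path to the sysctl file. The default is the standard
# procfs location for kernel.apparmor_restrict_unprivileged_userns (assumed).
userns_restriction_status() {
    sysctl_file="${1:-/proc/sys/kernel/apparmor_restrict_unprivileged_userns}"
    if [ ! -r "$sysctl_file" ]; then
        # File missing or unreadable: kernel without this knob, or no AppArmor.
        echo "unknown"
        return 0
    fi
    case "$(cat "$sysctl_file")" in
        0) echo "off" ;;
        1) echo "restricted" ;;
        *) echo "unknown" ;;
    esac
}

userns_restriction_status
```

If this prints "restricted" and dmesg shows apparmor="DENIED" lines with profile="unprivileged_userns" like the ones above, the restriction is the likely cause.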

mattgodbolt commented 3 months ago

Per the above link, this is a workaround:

sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0

But the general advice is to make an AppArmor profile. Perhaps this is something nsjail can do?
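
As a rough idea of what such a profile could look like, here is a sketch modelled on the per-application profiles Ubuntu 24.04 ships for programs that need unprivileged user namespaces. It is untested against nsjail specifically. The binary path `/usr/local/bin/nsjail`, the profile name `nsjail`, and the helper function `nsjail_profile` are assumptions; adjust them to your install.

```shell
#!/bin/sh
# Print a minimal AppArmor profile that leaves one binary unconfined while
# allowing it to create user namespaces.
# Argument 1 (optional): absolute path to the nsjail binary (assumed default).
nsjail_profile() {
    bin_path="${1:-/usr/local/bin/nsjail}"
    cat <<EOF
abi <abi/4.0>,
include <tunables/global>

profile nsjail $bin_path flags=(unconfined) {
  userns,
}
EOF
}

nsjail_profile "$@"
```

Saved as, say, nsjail-profile.sh (a hypothetical name), it could be installed and loaded with:

sh nsjail-profile.sh /usr/local/bin/nsjail | sudo tee /etc/apparmor.d/nsjail
sudo apparmor_parser -r /etc/apparmor.d/nsjail

This keeps the system-wide restriction in place and exempts only the named binary, which is narrower than flipping the sysctls.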

disconnect3d commented 3 months ago

Unfortunately nsjail does not support AppArmor profiles at this moment (I believe they would be happy to do so). If you are running things via Docker (I guess you are not, but still maybe worth documenting it here) you can use --security-opt apparmor=unconfined.

I also believe there should be some way to disable AppArmor just for a single process. An alternative is to create an empty profile for it as well.

Some commands from here may be helpful: https://www.cyberciti.biz/faq/ubuntu-linux-howto-disable-apparmor-commands/

mattgodbolt commented 3 months ago

Thanks @disconnect3d. We're not running in Docker. I'm running on a vanilla install of Ubuntu 24.04 here, with only the setup commands above. An empty profile sounds OK too; thanks. Just worth knowing about this gotcha (maybe updating some docs somewhere?)

Will close now, as the sysctl disable "works", as would disabling AppArmor entirely and probably some kind of per-process disablement too.