Closed by nidoro 2 years ago
Hey,
Nsjail logs can tell you which mount options it passes to mount. For example, if you use the example from the Nsjail README, you will see (in the last line below) that /proc is mounted with an empty string as the mount options:
$ ./nsjail -Mo --user 0 --group 99999 -R /bin/ -R /lib -R /lib64/ -R /usr/ -R /sbin/ -T /dev -R /dev/urandom --keep_caps -- /bin/bash -i
[2017-05-24T17:08:02+0200] Mode: STANDALONE_ONCE
[2017-05-24T17:08:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/bin/bash', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:true, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/' type:'tmpfs' flags:MS_RDONLY|0 options:'' isDir:True
[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/proc' type:'proc' flags:MS_RDONLY|0 options:'' isDir:True
Now, while procfs does have options that control who can see what (mainly hidepid=0|1|2 along with gid=N), this is not what hides "all processes running in the system" here. What limits the processes the jailed process can see is the fact that the procfs mount occurred in a new PID namespace (which is determined by the clone_newpid:true flag shown in the "Jail parameters" line).
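To check the namespace flags without reading the whole log line by eye, the boolean fields can be pulled out with a small script. This is only a sketch: `parse_bool_flags` is a hypothetical helper, not part of Nsjail, and the sample line is a shortened copy of the "Jail parameters" line quoted above.

```python
import re

# Matches "name:true" / "name:false" pairs as they appear in the
# "Jail parameters" log line. Quoted values such as hostname:'NSJAIL'
# are deliberately ignored.
_BOOL_FLAG = re.compile(r"\b(\w+):(true|false)\b")


def parse_bool_flags(log_line):
    """Return every boolean key:value pair found in an nsjail log line."""
    return {name: value == "true" for name, value in _BOOL_FLAG.findall(log_line)}


# Shortened copy of the log line from the listing above (assumption: the
# format stays "key:true|false" separated by commas).
line = (
    "Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/bin/bash', "
    "daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, "
    "clone_newpid:true, clone_newipc:true, clone_newcgroup:false, keep_caps:true"
)
flags = parse_bool_flags(line)
print("new PID namespace:", flags["clone_newpid"])
```

If `clone_newpid` is true, the /proc mounted inside the jail belongs to the new PID namespace, which is what limits the visible processes.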
FYI, namespaces are a Linux kernel feature that lets you isolate which resources a given process can see. For example, if you create a new PID namespace and start a process in it, that process will have PID=1 inside the namespace. Of course, in the outside namespace it will have a different PID.
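One way to tell whether two processes share a PID namespace is to compare the identifier behind /proc/<pid>/ns/pid, a symlink whose target looks like `pid:[4026531836]`. The sketch below assumes Linux with procfs mounted; `parse_ns_link` and `pid_namespace_id` are hypothetical helper names.

```python
import os
import re
from typing import Optional

# Symlink targets under /proc/<pid>/ns/ have the form "pid:[<inode>]".
_NS_LINK = re.compile(r"^pid:\[(\d+)\]$")


def parse_ns_link(target: str) -> Optional[int]:
    """Extract the namespace identifier from a link target, or None."""
    match = _NS_LINK.match(target)
    return int(match.group(1)) if match else None


def pid_namespace_id(pid: str = "self", proc_root: str = "/proc") -> Optional[int]:
    """Return the PID namespace identifier of a process, or None if unavailable."""
    try:
        target = os.readlink(os.path.join(proc_root, pid, "ns", "pid"))
    except OSError:
        # No procfs at proc_root, no such process, or no permission.
        return None
    return parse_ns_link(target)


print("PID as seen from inside this namespace:", os.getpid())
print("PID namespace identifier:", pid_namespace_id())
```

Two processes that print the same identifier are in the same PID namespace. A process started in a new one (for example by Nsjail with clone_newpid:true) would print a different value.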
You can see this PID remapping clearly in the listing below, where we use the unshare program, which lets us create e.g. only a PID namespace and experiment with it (note: echo $$ displays the PID of the current bash process):
dc@jhtc:~$ bash
dc@jhtc:~$ echo $$
5522
dc@jhtc:~$ cp /bin/bash ./mybash && chmod a+x mybash
dc@jhtc:~$ sudo unshare --fork --pid ./mybash
root@jhtc:~# echo $$
1
root@jhtc:~# mkdir newprocfs
root@jhtc:~# mount -t proc proc ./newprocfs
root@jhtc:~# ls ./newprocfs
1 cmdline driver ioports kmsg meminfo partitions stat tty
14 consoles execdomains ipmi kpagecgroup misc sched_debug swaps uptime
acpi cpuinfo fb irq kpagecount modules schedstat sys version
asound crypto filesystems kallsyms kpageflags mounts scsi sysrq-trigger version_signature
buddyinfo devices fs kcore loadavg mtrr self sysvipc vmallocinfo
bus diskstats interrupts keys locks net slabinfo thread-self vmstat
cgroups dma iomem key-users mdstat pagetypeinfo softirqs timer_list zoneinfo
root@jhtc:~# cat ./newprocfs/1/cmdline
./mybash
root@jhtc:~# ls /proc/
1 11974 15668 2018 25 32329 42 483 519 56 7436 898 cmdline keys scsi
10 12 158 2019 2551 33 43 4831 52 57 76 9 consoles key-users self
10028 120 16 20889 25696 3351 4344 4832 520 58 77 9063 cpuinfo kmsg slabinfo
10355 121 16771 21 25897 3369 4350 4848 522 59 78 9064 crypto kpagecgroup softirqs
10356 122 18 22 26 34 4351 486 5222 6 8 907 devices kpagecount stat
10614 123 18144 22475 27 343 4352 4879 5223 60 8064 919 diskstats kpageflags swaps
10716 124 18162 22476 274 3442 4353 488 5277 61 8216 9493 dma loadavg sys
10753 125 1856 22485 27427 3443 4370 489 5328 6142 8345 9494 driver locks sysrq-trigger
11 12541 1887 22486 275 36 44 49 5346 62 8352 951 execdomains mdstat sysvipc
1132 13 1892 22487 28 37 45 490 54 63 857 9627 fb meminfo thread-self
11415 131 1896 22488 28456 3747 4539 491 549 6321 859 9628 filesystems misc timer_list
11599 1326 1897 22489 28688 38 46 494 55 638 860 9796 fs modules tty
11601 13875 1898 22490 28876 39 467 4988 5508 64 865 9797 interrupts mounts uptime
11602 13911 19 22491 30 4 4776 4997 5522 65 866 9812 iomem mtrr version
11603 13942 19174 22492 30526 40 48 50 5535 66 871 acpi ioports net version_signature
11604 14 19201 22493 30849 401 4813 500 5537 67 873 asound ipmi pagetypeinfo vmallocinfo
11605 140 19846 22528 30850 4013 4814 501 5538 6984 890 buddyinfo irq partitions vmstat
11606 15 2 22610 31 402 4815 51 5539 7 891 bus kallsyms sched_debug zoneinfo
1193 15243 20 24 32 4048 4816 5109 5556 7310 893 cgroups kcore schedstat
Now, if I run ps auxf in another console, I can see that in the outside (initial) PID namespace the real PID of mybash is 5539 (but inside its new PID namespace it is always referred to or targeted as PID=1, as we saw in the listing).
Also note that the old mount, /proc, created in the initial namespaces when my system was booted, is still accessible within the new PID namespace, and it renders the PIDs from that initial PID namespace. However, a newly created mount will render PIDs from the new namespace:
dc@jhtc:~$ ls ./newprocfs/
ls: cannot read symbolic link './newprocfs/self': No such file or directory
ls: cannot read symbolic link './newprocfs/thread-self': No such file or directory
1 consoles execdomains ipmi kpagecgroup misc sched_debug swaps uptime
acpi cpuinfo fb irq kpagecount modules schedstat sys version
asound crypto filesystems kallsyms kpageflags mounts scsi sysrq-trigger version_signature
buddyinfo devices fs kcore loadavg mtrr self sysvipc vmallocinfo
bus diskstats interrupts keys locks net slabinfo thread-self vmstat
cgroups dma iomem key-users mdstat pagetypeinfo softirqs timer_list zoneinfo
cmdline driver ioports kmsg meminfo partitions stat tty
dc@jhtc:~$ cat ./newprocfs/1/cmdline
./mybash
If you want to learn more about namespaces, I would recommend the following material:
Now, to answer your other questions:
How does the --disable_proc option relate to the mounting of the /proc? I can see that with --disable_proc enabled, /proc is not mounted. I'm not a sandbox guy, so I don't understand why /proc is mounted in one case and not in the other.
Well, if you pass the --disable_proc flag, /proc is NOT mounted. Nsjail also uses a mount namespace, so the whole filesystem the jailed process sees is different from the original filesystem. As a result, the /proc path within the jail is not the same /proc path as in the initial namespaces (often referred to as the "host" when talking about jails or containers, somewhat confusingly).
Do environment variables work when --disable_proc is set? I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.
Environment variables work; there is no reason they would not, as Nsjail passes them to the execve syscall when executing the jailed process. Note, however, that it does not copy the envvars of the environment you are in and add envvars on top of that; it only sets the envvars you explicitly passed. This can be observed here:
dc@jhtc:~$ nsjail -Mo -R /bin/ -R /lib -R /lib64/ -R /usr/ -R /sbin/ -T /dev --keep_caps --disable_proc --env MYENV=1234 -- /usr/bin/env
[I][2021-12-02T08:02:44+0000] Mode: STANDALONE_ONCE
[I][2021-12-02T08:02:44+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/env', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, keep_caps:true, disable_no_new_privs:false, max_cpus:0
[I][2021-12-02T08:02:44+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/bin/' -> '/bin/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/lib' -> '/lib' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/lib64/' -> '/lib64/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/usr/' -> '/usr/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/sbin/' -> '/sbin/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/dev' flags: type:'tmpfs' options:'size=4194304' dir:true
[I][2021-12-02T08:02:44+0000] Uid map: inside_uid:1000 outside_uid:1000 count:1 newuidmap:false
[I][2021-12-02T08:02:44+0000] Gid map: inside_gid:1000 outside_gid:1000 count:1 newgidmap:false
[I][2021-12-02T08:02:45+0000] Executing '/usr/bin/env' for '[STANDALONE MODE]'
MYENV=1234
[I][2021-12-02T08:02:45+0000] pid=5737 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
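The same "only what you pass explicitly" behaviour can be reproduced without Nsjail by handing a fresh environment to the child process. The sketch below models the behaviour described above and is not Nsjail's own code; `run_with_only` is a hypothetical helper, and it assumes /usr/bin/env exists, as it does in the listing.

```python
import subprocess


def run_with_only(env_vars, argv=("/usr/bin/env",)):
    """Run argv with exactly env_vars as its environment; return its output lines."""
    result = subprocess.run(
        list(argv),
        env=dict(env_vars),  # replaces the inherited environment entirely
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


lines = run_with_only({"MYENV": "1234"})
print(lines)
```

Nothing from the calling shell (PATH, HOME, LD_LIBRARY_PATH and so on) reaches the child, which is why a jailed program that relies on such variables needs them passed with --env.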
Regarding this part:
I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.
This is weird, and it is hard to tell much without seeing your specific case. Could it be that one of your processes inspects its own environment variables, or those of another spawned process, through the /proc/$PID/environ file? Also, don't forget that the jailed process sees completely different filesystem mounts unless you mount everything explicitly within the jail (so the PATH envvar needs to be set accordingly).
Thank you for taking your time to write such a complete answer. Your examples were very helpful. And thank you for pointing out resources for further reading.
First, the reason I got confused about what --disable_proc does is that I misunderstood one of the examples in the README.md page. The example says:
Execute echo command directly, **without a supervising process**
nsjail -Me --chroot / --disable_proc -- /bin/echo "ABC"
I then thought that --disable_proc was the option that made the process run without a supervising process, when in fact it was -Me. Now I understand that what --disable_proc does is disable procfs mounting.
Second, after reading your answer about environment variables, I tried to investigate why my sandbox configuration worked as I expected without --disable_proc and failed when I added --disable_proc. I now have it working in both cases, but I'm still confused about how --disable_proc relates to my problem (see below).
The following command line is a minimal working example of my use case.
/path/to/nsjail \
  -Mo \
  --chroot "/home/davidoro/nsjail-test" \
  -R "/lib64" \
  -R "/lib/x86_64-linux-gnu" \
  -R "/home/davidoro/ampl_linux-intel64:/cmd-executable" \
  -- /cmd-executable/cplex
The executable I'm trying to run, cplex, works fine with that command. However, when I add --disable_proc, the execution of cplex fails with an error:
/cmd-executable/cplex: error while loading shared libraries: libcplex2010.so: cannot open shared object file: No such file or directory
The linker is having trouble finding libcplex2010.so, which is in the same directory as the executable, i.e., /cmd-executable. I fixed the problem by adding --env "LD_LIBRARY_PATH=/cmd-executable" to the command. But now I'm confused about why it worked in the first place.
How does this relate to --disable_proc?
Also, allow me to ask two other questions:
What's the motivation for /proc being mounted by default?
What are the advantages/disadvantages of having (or not having) a supervising process? When should one choose -Me rather than -Mo?
It fails with --disable_proc, as ld.so probably handles $ORIGIN by using readlink on /proc/self/exe.
You would use -Me, for example, when you plug nsjail into something that is not aware of it and could e.g. send signals to the spawned process.
Ah, makes sense.
Ok, that is all for now. Thank you both for the help.
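The $ORIGIN explanation given above can be modelled in a few lines. This is a sketch of the guessed mechanism, not the dynamic loader's actual code: `resolve_origin` and `expand_origin` are hypothetical helpers, and it is only an assumption that cplex carries a library search path containing $ORIGIN.

```python
import os
from typing import Optional


def resolve_origin(proc_root: str = "/proc") -> Optional[str]:
    """Return the directory of the running executable via <proc_root>/self/exe."""
    try:
        exe = os.readlink(os.path.join(proc_root, "self", "exe"))
    except OSError:
        # This is the --disable_proc situation: no procfs, no link to read.
        return None
    return os.path.dirname(exe)


def expand_origin(entry: str, origin: Optional[str]) -> Optional[str]:
    """Expand $ORIGIN in a search path entry; None means the entry is unusable."""
    if "$ORIGIN" not in entry:
        return entry
    if origin is None:
        return None
    return entry.replace("$ORIGIN", origin)


# With procfs available the entry resolves; without it the entry is dropped.
print(expand_origin("$ORIGIN", "/cmd-executable"))
print(expand_origin("$ORIGIN", resolve_origin("/nonexistent-proc-root")))
```

Under that assumption, an explicit LD_LIBRARY_PATH=/cmd-executable works in both cases because it does not depend on /proc at all.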
Hello,
I would like to know more about the default behavior for mounting /proc. I know that it is mounted by default, but what exactly are the default flags for it? I tried the verbose command line option, but all it says is that the MS_RDONLY flag is set. I suspect, however, that there is more to it than that, because if I inspect the /proc filesystem within the sandbox, I'm not able to read all processes running in the system. Are they hidden by default?
How does the --disable_proc option relate to the mounting of the /proc? I can see that with --disable_proc enabled, /proc is not mounted. I'm not a sandbox guy, so I don't understand why /proc is mounted in one case and not in the other.
Do environment variables work when --disable_proc is set? I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.
Thank you for the tool.