google / nsjail

A lightweight process isolation tool that utilizes Linux namespaces, cgroups, rlimits and seccomp-bpf syscall filters, leveraging the Kafel BPF language for enhanced security.
https://nsjail.dev
Apache License 2.0

Questions about /proc mount options and --disable_proc #188

Closed by nidoro 2 years ago

nidoro commented 2 years ago

Hello,

  1. I would like to know more about the default behavior for mounting /proc. I know that it is mounted by default, but what exactly are the default flags for it? I tried the verbose command line option, but all it says is that the MS_RDONLY flag is set. I suspect, however, that there is more to it, because if I inspect the /proc filesystem within the sandbox, I'm not able to see all processes running in the system. Are they hidden by default?

  2. How does the --disable_proc option relate to the mounting of /proc? I can see that with --disable_proc enabled, /proc is not mounted. I'm not a sandbox guy, so I don't understand why /proc is mounted in one case and not in the other.

  3. Do environment variables work when --disable_proc is set? I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.

Thank you for the tool.

disconnect3d commented 2 years ago

Hey,

Nsjail logs can tell you what mount options it passes to mount. For example, if you use the example from the Nsjail readme, you will see (in the last line below) that /proc is mounted with an empty string as the mount options:

$ ./nsjail -Mo --user 0 --group 99999 -R /bin/ -R /lib -R /lib64/ -R /usr/ -R /sbin/ -T /dev -R /dev/urandom --keep_caps -- /bin/bash -i
[2017-05-24T17:08:02+0200] Mode: STANDALONE_ONCE
[2017-05-24T17:08:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/bin/bash', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:true, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/' type:'tmpfs' flags:MS_RDONLY|0 options:'' isDir:True
[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/proc' type:'proc' flags:MS_RDONLY|0 options:'' isDir:True

Now, while procfs does have options that control who can see what (mainly hidepid=0|1|2 along with gid=N), this is not what hides "all processes running in the system" here. What limits the processes the jailed process can see is the fact that the procfs mount was created in a new PID namespace (which is enabled by the clone_newpid:true flag shown in the "Jail parameters" line).

FYI, namespaces are a Linux kernel feature that lets you isolate which resources a given process can see. For example, if you create a new PID namespace and start a process in it, that process will have PID 1 inside this namespace. Of course, in the outer namespace it will have a different PID.

You can see this clearly in the listing below, where we use the unshare program, which lets us create e.g. only a PID namespace and experiment with it (note: echo $$ displays the PID of the current shell in bash):

dc@jhtc:~$ bash
dc@jhtc:~$ echo $$
5522
dc@jhtc:~$ cp /bin/bash ./mybash && chmod a+x mybash
dc@jhtc:~$ sudo unshare --fork --pid ./mybash
root@jhtc:~# echo $$
1
root@jhtc:~# mkdir newprocfs
root@jhtc:~# mount -t proc proc ./newprocfs
root@jhtc:~# ls ./newprocfs
1          cmdline    driver       ioports    kmsg         meminfo       partitions   stat           tty
14         consoles   execdomains  ipmi       kpagecgroup  misc          sched_debug  swaps          uptime
acpi       cpuinfo    fb           irq        kpagecount   modules       schedstat    sys            version
asound     crypto     filesystems  kallsyms   kpageflags   mounts        scsi         sysrq-trigger  version_signature
buddyinfo  devices    fs           kcore      loadavg      mtrr          self         sysvipc        vmallocinfo
bus        diskstats  interrupts   keys       locks        net           slabinfo     thread-self    vmstat
cgroups    dma        iomem        key-users  mdstat       pagetypeinfo  softirqs     timer_list     zoneinfo
root@jhtc:~# cat ./newprocfs/1/cmdline
./mybash
root@jhtc:~# ls /proc/
1      11974  15668  2018   25     32329  42    483   519   56    7436  898        cmdline      keys          scsi
10     12     158    2019   2551   33     43    4831  52    57    76    9          consoles     key-users     self
10028  120    16     20889  25696  3351   4344  4832  520   58    77    9063       cpuinfo      kmsg          slabinfo
10355  121    16771  21     25897  3369   4350  4848  522   59    78    9064       crypto       kpagecgroup   softirqs
10356  122    18     22     26     34     4351  486   5222  6     8     907        devices      kpagecount    stat
10614  123    18144  22475  27     343    4352  4879  5223  60    8064  919        diskstats    kpageflags    swaps
10716  124    18162  22476  274    3442   4353  488   5277  61    8216  9493       dma          loadavg       sys
10753  125    1856   22485  27427  3443   4370  489   5328  6142  8345  9494       driver       locks         sysrq-trigger
11     12541  1887   22486  275    36     44    49    5346  62    8352  951        execdomains  mdstat        sysvipc
1132   13     1892   22487  28     37     45    490   54    63    857   9627       fb           meminfo       thread-self
11415  131    1896   22488  28456  3747   4539  491   549   6321  859   9628       filesystems  misc          timer_list
11599  1326   1897   22489  28688  38     46    494   55    638   860   9796       fs           modules       tty
11601  13875  1898   22490  28876  39     467   4988  5508  64    865   9797       interrupts   mounts        uptime
11602  13911  19     22491  30     4      4776  4997  5522  65    866   9812       iomem        mtrr          version
11603  13942  19174  22492  30526  40     48    50    5535  66    871   acpi       ioports      net           version_signature
11604  14     19201  22493  30849  401    4813  500   5537  67    873   asound     ipmi         pagetypeinfo  vmallocinfo
11605  140    19846  22528  30850  4013   4814  501   5538  6984  890   buddyinfo  irq          partitions    vmstat
11606  15     2      22610  31     402    4815  51    5539  7     891   bus        kallsyms     sched_debug   zoneinfo
1193   15243  20     24     32     4048   4816  5109  5556  7310  893   cgroups    kcore        schedstat

Now, if I do ps auxf in another console, I can see that in the outside (initial) PID namespace, the real PID of mybash is 5539 (but inside its new PID namespace it is always referred to or targeted as PID 1, as we saw in the listing): [image: ps auxf output]

Also note that the old /proc mount, created in the initial namespaces when my system booted, is still accessible within the new PID namespace, and it shows the PIDs of that initial PID namespace. However, a newly created procfs mount shows the PIDs of the new namespace:

dc@jhtc:~$ ls ./newprocfs/
ls: cannot read symbolic link './newprocfs/self': No such file or directory
ls: cannot read symbolic link './newprocfs/thread-self': No such file or directory
1          consoles   execdomains  ipmi       kpagecgroup  misc          sched_debug  swaps          uptime
acpi       cpuinfo    fb           irq        kpagecount   modules       schedstat    sys            version
asound     crypto     filesystems  kallsyms   kpageflags   mounts        scsi         sysrq-trigger  version_signature
buddyinfo  devices    fs           kcore      loadavg      mtrr          self         sysvipc        vmallocinfo
bus        diskstats  interrupts   keys       locks        net           slabinfo     thread-self    vmstat
cgroups    dma        iomem        key-users  mdstat       pagetypeinfo  softirqs     timer_list     zoneinfo
cmdline    driver     ioports      kmsg       meminfo      partitions    stat         tty
dc@jhtc:~$ cat ./newprocfs/1/cmdline
./mybash
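
To compare the two mounts without reading the full listings, one could list only the numeric entries of each procfs mount. The following is a minimal Python sketch, assuming the paths from the listings above (/proc and ./newprocfs); visible_pids is a hypothetical helper name, not part of any tool mentioned here:

```python
import os


def visible_pids(proc_root):
    """Return the sorted PIDs that a procfs mount at proc_root exposes."""
    pids = []
    for name in os.listdir(proc_root):
        # procfs names per-process directories after the decimal PID;
        # every other entry (cmdline, meminfo, self, ...) is not a process.
        if name.isdecimal():
            pids.append(int(name))
    return sorted(pids)


if __name__ == "__main__":
    # Paths are assumptions taken from the listings above.
    for root in ("/proc", "./newprocfs"):
        if os.path.isdir(root):
            print(root, visible_pids(root))
```

Run from the outside shell, this should print the long list of host PIDs for /proc and a much shorter list for ./newprocfs.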

If you want to learn more about namespaces, I would recommend the following material:

Now, to answer your other questions:

> How does the --disable_proc option relate to the mounting of /proc? I can see that with --disable_proc enabled, /proc is not mounted. I'm not a sandbox guy, so I don't understand why /proc is mounted in one case and not in the other.

Well, if you pass the --disable_proc flag, /proc is NOT mounted. Nsjail also uses a mount namespace and makes the whole filesystem that the jailed process sees different from the original filesystem. As a result, the /proc path within the jail is not the same /proc path as in the initial namespaces (often referred to, somewhat confusingly, as the "host" when talking about jails or containers).

> Do environment variables work when --disable_proc is set? I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.

Environment variables work; there is no reason they would not, as Nsjail passes them to the execve syscall when executing the jailed process. Note, however, that it does not copy the environment you are in and add variables on top of it; it only sets the envvars you explicitly passed. This can be observed here:

dc@jhtc:~$ nsjail -Mo -R /bin/ -R /lib -R /lib64/ -R /usr/ -R /sbin/ -T /dev --keep_caps --disable_proc --env MYENV=1234 -- /usr/bin/env
[I][2021-12-02T08:02:44+0000] Mode: STANDALONE_ONCE
[I][2021-12-02T08:02:44+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/env', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, keep_caps:true, disable_no_new_privs:false, max_cpus:0
[I][2021-12-02T08:02:44+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/bin/' -> '/bin/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/lib' -> '/lib' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/lib64/' -> '/lib64/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/usr/' -> '/usr/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/sbin/' -> '/sbin/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-12-02T08:02:44+0000] Mount: '/dev' flags: type:'tmpfs' options:'size=4194304' dir:true
[I][2021-12-02T08:02:44+0000] Uid map: inside_uid:1000 outside_uid:1000 count:1 newuidmap:false
[I][2021-12-02T08:02:44+0000] Gid map: inside_gid:1000 outside_gid:1000 count:1 newgidmap:false
[I][2021-12-02T08:02:45+0000] Executing '/usr/bin/env' for '[STANDALONE MODE]'
MYENV=1234
[I][2021-12-02T08:02:45+0000] pid=5737 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
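
The same "only what you pass explicitly" behavior can be reproduced without Nsjail. The sketch below is an analogy in Python, not Nsjail's implementation: subprocess.run with an env argument replaces the child's environment instead of extending the parent's. The helper name run_with_only is hypothetical, and the sketch assumes an env binary is available on the system:

```python
import shutil
import subprocess


def run_with_only(env):
    """Run `env` with exactly the given environment; return its output lines."""
    # Locate the env binary using the caller's PATH, since the child gets none.
    env_bin = shutil.which("env") or "/usr/bin/env"
    result = subprocess.run(
        [env_bin], env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # The child sees MYENV but none of the caller's variables, such as PATH.
    print(run_with_only({"MYENV": "1234"}))
```

A program that relies on PATH, HOME, or LD_LIBRARY_PATH from the calling shell will therefore not find them inside the jail unless each one is passed with --env.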

Regarding this part:

> I'm getting "Library not found" errors from processes within the sandbox, even though I'm properly setting the PATH variable with the --env option. If I don't use --disable_proc, everything works fine.

This is weird, and it is hard to say much without seeing your specific case. Could it be that one of your processes inspects its own environment variables, or those of another spawned process, through the /proc/$PID/environ file? Also, don't forget that the jailed process sees completely different filesystem mounts unless you mount everything explicitly within the jail (so the PATH envvar needs to be set accordingly).

nidoro commented 2 years ago

Thank you for taking your time to write such a complete answer. Your examples were very helpful. And thank you for pointing out resources for further reading.

First, the reason I got confused about what --disable_proc does is that I misunderstood one of the examples in the README.md page. The example says:

Execute echo command directly, **without a supervising process**
nsjail -Me --chroot / --disable_proc -- /bin/echo "ABC"

Then I thought that --disable_proc was the option that was making the process run without a supervising process, when in fact it was -Me. Now I understand that what --disable_proc does is disable procfs mounting.

Second, after reading your answer about environment variables, I tried to investigate why my sandbox configuration was working as I expected without --disable_proc and failing when I added --disable_proc. Now I got it working in both cases, but I'm still confused about how --disable_proc relates to my problem (see below).

The following command line is a minimal working example of my use case.

/path/to/nsjail \
  -Mo \
  --chroot "/home/davidoro/nsjail-test" \
  -R "/lib64" \
  -R "/lib/x86_64-linux-gnu" \
  -R "/home/davidoro/ampl_linux-intel64:/cmd-executable" \
  -- /cmd-executable/cplex

The executable I'm trying to run, cplex, works fine with that command. However, when I add --disable_proc, the execution of cplex fails with an error:

/cmd-executable/cplex: error while loading shared libraries: libcplex2010.so: cannot open shared object file: No such file or directory

The dynamic linker is having trouble finding libcplex2010.so, which is in the same directory as the executable, i.e., /cmd-executable. I fixed the problem by adding --env "LD_LIBRARY_PATH=/cmd-executable" to the command. But now I'm confused about why it worked in the first place.

  1. Shouldn't it have failed with or without --disable_proc?

Also, allow me to ask two other questions:

  2. What's the motivation for /proc being mounted by default?

  3. What are the advantages/disadvantages of having (or not) a supervising process? When should one choose -Me rather than -Mo?

happyCoder92 commented 2 years ago

  1. It's failing with --disable_proc because ld.so probably handles $ORIGIN by using readlink on /proc/self/exe.
  2. /proc is often needed by programs, and with the other default options it only exposes the processes in the sandbox.
  3. You get additional info/logging from the supervising process. You should choose -Me, for example, when you plug nsjail into something that is not aware of it and could e.g. send signals to the spawned process.
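
The first point can be modeled in a few lines. This is a simplified Python sketch, not glibc's code. It assumes, as suggested above, that the loader finds $ORIGIN by calling readlink on /proc/self/exe. It further assumes that cplex carries a library search path entry of $ORIGIN, which the thread does not confirm. The names origin_dir and expand_origin are hypothetical:

```python
import os


def origin_dir(proc_root="/proc"):
    """Directory of the running executable, or None if it cannot be read."""
    try:
        exe = os.readlink(os.path.join(proc_root, "self", "exe"))
    except OSError:
        # No procfs at proc_root, e.g. a jail started with --disable_proc.
        return None
    return os.path.dirname(exe)


def expand_origin(entry, origin):
    """Expand $ORIGIN in one search path entry; None if it cannot be expanded."""
    if "$ORIGIN" not in entry:
        return entry
    if origin is None:
        # Without a known origin the entry is unusable, so a library that
        # lives next to the executable is not found through it.
        return None
    return entry.replace("$ORIGIN", origin)


if __name__ == "__main__":
    print(expand_origin("$ORIGIN", origin_dir()))
```

Under these assumptions, an absolute LD_LIBRARY_PATH entry such as /cmd-executable keeps working without /proc because it needs no expansion.
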
nidoro commented 2 years ago

Ah, makes sense.

Ok, that is all for now. Thank you both for the help.