A lightweight process isolation tool that utilizes Linux namespaces, cgroups, rlimits and seccomp-bpf syscall filters, leveraging the Kafel BPF language for enhanced security.
Sys-V shared memory (shmget, shmat, etc.) is not cleaned up immediately by the kernel when the jailed process exits. Linux tears it down lazily using a workqueue, and until then the memory remains resident in RAM and cannot be reclaimed by other processes. Reclamation can take several seconds, especially if there are many shared memory regions or IPC namespaces to clean up. When jails are created several times per second (as is the case with LISTEN mode), they can easily reserve shared memory faster than it is cleaned up, consuming all of RAM (regardless of the per-jail cgroup limits) and eventually causing processes outside the jail to be killed by the oom-killer.
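To make the failure mode concrete, here is a rough sketch of what a jailed process would have to do to leave memory behind. This is not the original reproducer; the function name `leak_shm_segment` is made up for illustration, and the numeric constants are the Linux values of `IPC_PRIVATE` and `IPC_CREAT`, hard-coded because the sketch calls libc through `ctypes`.

```python
import ctypes
import ctypes.util
import os

# Linux values from <sys/ipc.h>, hard-coded because ctypes cannot read C headers.
IPC_PRIVATE = 0
IPC_CREAT = 0o1000

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.shmget.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_int]
libc.shmget.restype = ctypes.c_int
libc.shmat.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int]
libc.shmat.restype = ctypes.c_void_p
libc.shmdt.argtypes = [ctypes.c_void_p]
libc.shmdt.restype = ctypes.c_int


def leak_shm_segment(size):
    """Hypothetical helper: create a segment, touch it, and never remove it."""
    shmid = libc.shmget(IPC_PRIVATE, size, IPC_CREAT | 0o600)
    if shmid < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    addr = libc.shmat(shmid, None, 0)
    # shmat reports failure as (void *)-1.
    if addr is None or addr == ctypes.c_void_p(-1).value:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Write to every byte so the pages are actually backed by RAM.
    ctypes.memset(addr, 0xAA, size)
    libc.shmdt(addr)
    # No IPC_RMID here: the segment outlives this process.
    return shmid
```

Running something like this in each new jail, with a size just under the per-jail limit, is the pattern I would expect to trigger the behaviour described above.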
The shared memory regions can be reclaimed immediately by other processes if they are deliberately destroyed, e.g. with `ipcrm -a`.
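For reference, the shared memory part of that cleanup can be approximated with a few `shmctl` calls. The sketch below is my own approximation, not code from ipcrm or nsjail: `remove_all_shm_segments` is a hypothetical name, and the constants are the Linux values of `IPC_RMID`, `SHM_STAT` and `SHM_INFO`. It removes every segment the caller is allowed to remove in the current IPC namespace, so it should only be run inside the jail's namespace.

```python
import ctypes
import ctypes.util
import os

# Linux values from <sys/ipc.h> and <sys/shm.h>.
IPC_RMID = 0
SHM_STAT = 13
SHM_INFO = 14

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.shmctl.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p]
libc.shmctl.restype = ctypes.c_int


def remove_all_shm_segments():
    """Mark every removable Sys-V shm segment in this IPC namespace for destruction.

    Returns the number of segments for which IPC_RMID succeeded.
    """
    # Scratch space large enough for struct shm_info or struct shmid_ds;
    # only the return values of shmctl are used here.
    scratch = ctypes.create_string_buffer(512)

    # SHM_INFO returns the highest index in the kernel's segment table.
    max_index = libc.shmctl(0, SHM_INFO, scratch)
    if max_index < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

    removed = 0
    for index in range(max_index + 1):
        # SHM_STAT takes a table index and returns the segment id stored there.
        shmid = libc.shmctl(index, SHM_STAT, scratch)
        if shmid < 0:
            continue  # unused slot, or a segment we may not inspect
        if libc.shmctl(shmid, IPC_RMID, None) == 0:
            removed += 1
    return removed
```

A segment marked with `IPC_RMID` is destroyed once nothing is attached to it, which is already the case after the jailed process has exited.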
It's not clear exactly how this should be fixed within nsjail, because the only process running within the namespace is, by design, the target process. Once that process exits, we need to run cleanup inside the namespace. This seems a bit tricky. My thought is for nsjail to spawn another process, have it setns into the IPC namespace of the child before the child calls execve, and then, once the child exits, have it clean up the IPC resources (e.g. as ipcrm does).
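A minimal sketch of that idea follows, with one variation that is my assumption rather than part of the proposal: instead of joining the namespace before execve, the supervisor opens `/proc/<pid>/ns/ipc` while the child is still alive (an open descriptor keeps the namespace alive) and only forks the helper after the child has exited. All function names are hypothetical, `CLONE_NEWIPC` is the Linux constant value, and `setns` needs CAP_SYS_ADMIN, so the helper fails for unprivileged callers.

```python
import ctypes
import ctypes.util
import os

CLONE_NEWIPC = 0x08000000  # Linux value from <sched.h>

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
libc.setns.argtypes = [ctypes.c_int, ctypes.c_int]
libc.setns.restype = ctypes.c_int


def open_ipc_namespace(pid):
    """Open the IPC namespace of a live process; the fd pins the namespace."""
    return os.open(f"/proc/{pid}/ns/ipc", os.O_RDONLY | os.O_CLOEXEC)


def join_ipc_namespace(ns_fd):
    """Move the calling process into the IPC namespace referred to by ns_fd."""
    if libc.setns(ns_fd, CLONE_NEWIPC) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_ipc_after_exit(child_pid, ns_fd, cleanup):
    """Wait for the jailed child, then run cleanup() inside its IPC namespace.

    Must be called by the parent of child_pid. ns_fd has to be opened while
    the child is still running, e.g. while it is blocked before execve.
    Returns True if the helper joined the namespace and cleanup() succeeded.
    """
    os.waitpid(child_pid, 0)

    helper = os.fork()
    if helper == 0:
        # Helper process: join the jail's namespace and remove what is left.
        status = 1
        try:
            join_ipc_namespace(ns_fd)
            cleanup()
            status = 0
        finally:
            os._exit(status)

    _, wait_status = os.waitpid(helper, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

Here `cleanup` would be whatever removes the leftover IPC objects, for example a loop over the shared memory segments in the style of ipcrm. Forking a separate helper keeps the supervisor itself in its original IPC namespace.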
Disclaimer: This was reported using the process in the project's security.md, but was found to be "not severe enough for us to track it as a security bug". Therefore, I am filing it as a functional bug. My report and reproducer are duplicated here.