Open kolyshkin opened 3 days ago
PIDNS is disabled only when buildkitd
is executed with --oci-worker-no-process-sandbox
:
This was required for allowing to run the moby/buildkit
image without --privileged
, by eliminating the necessity of mounting /proc
for the executor containers (RUN
instruction containers).
Since now we have systemdpaths=unconfined
option in both Kube and Docker, probably we can deprecate --oci-worker-no-process-sandbox
in favor of it.
But I'm not sure if systemdpaths=unconfined
is really more secure than --privileged
.
BTW, having unkillable processes is not a serious concern for BuildKit, as long as the following conditions are satisfied:
buildkitd
process is containerized with PIDNS, andbuildkitd
is running in the (pseudo-)daemonless modeBecause all the RUN
processes are killed on the death of the buildkitd
process.
When runc 1.2.x is used by rootless buildkit, runc spits the following warning (the reproducer can be taken from here: https://github.com/moby/buildkit/issues/5491):
To reiterate, such config is unsound, here's why.
The lack of private pid namespace means that runc can not rely on kernel's special handling of pid 1 (when killed, all other processes in the same cgroup are also killed). In such case, in order to kill a container runc needs to collect all the pids from the cgroup, freeze the cgroup (so no new forks are happening), send sigkill to all processes and thaw the cgroup (so that those processes can actually be killed). Or, if cgroup.kill is available, it is used (which basically tells the kernel to kill all processes in the cgroup, and don't allow any new forks).
Now, if runc doesn't have neither private pid namespace nor cgroup for a container, there is no way to properly stop a container (kill all container processes).
Can you please fix this, or suggest how runc should handle such a case? I mean, I can think of other ways to figure out which processes belong to the container, but they are all rather expensive and not race-free.