moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.21k stars 1.16k forks source link

Using runc with no pid namespace nor cgroups #5512

Open kolyshkin opened 3 days ago

kolyshkin commented 3 days ago

When runc 1.2.x is used by rootless buildkit, runc spits the following warning (the reproducer can be taken from here: https://github.com/moby/buildkit/issues/5491):

Creating a rootless container with no cgroup and no private pid namespace. Such configuration is strongly discouraged (as it is impossible to properly kill all container's processes) and will result in an error in a future runc version.

To reiterate, such config is unsound, here's why.

The lack of private pid namespace means that runc can not rely on kernel's special handling of pid 1 (when killed, all other processes in the same cgroup are also killed). In such case, in order to kill a container runc needs to collect all the pids from the cgroup, freeze the cgroup (so no new forks are happening), send sigkill to all processes and thaw the cgroup (so that those processes can actually be killed). Or, if cgroup.kill is available, it is used (which basically tells the kernel to kill all processes in the cgroup, and don't allow any new forks).

Now, if runc doesn't have neither private pid namespace nor cgroup for a container, there is no way to properly stop a container (kill all container processes).

Can you please fix this, or suggest how runc should handle such a case? I mean, I can think of other ways to figure out which processes belong to the container, but they are all rather expensive and not race-free.

AkihiroSuda commented 3 days ago

PIDNS is disabled only when buildkitd is executed with --oci-worker-no-process-sandbox:

https://github.com/moby/buildkit/blob/0655923d7e2884a0d514313fd688178a6da57b43/docs/rootless.md?plain=1#L71-L80

This was required for allowing to run the moby/buildkit image without --privileged, by eliminating the necessity of mounting /proc for the executor containers (RUN instruction containers).

Since now we have systemdpaths=unconfined option in both Kube and Docker, probably we can deprecate --oci-worker-no-process-sandbox in favor of it. But I'm not sure if systemdpaths=unconfined is really more secure than --privileged.

AkihiroSuda commented 3 days ago

BTW, having unkillable processes is not a serious concern for BuildKit, as long as the following conditions are satisfied:

Because all the RUN processes are killed on the death of the buildkitd process.