just-containers / s6-overlay

s6 overlay for containers (includes execline, s6-linux-utils & a custom init)

s6-overlay-suexec: fatal: insufficient privileges (is the suid bit set?) #392

Closed jinnko closed 2 years ago

jinnko commented 2 years ago

The changes introduced in v3 have resulted in a regression when the docker daemon is running with the userns-remap option. This is similar to the conditions that were fixed in https://github.com/just-containers/s6-overlay/issues/309, however the failure is now in a different place.

When running dockerd with userns-remap, then starting up a container with --userns=host, I get the following error message:

s6-overlay-suexec: fatal: insufficient privileges (is the suid bit set?)

With the docker daemon running with userns-remap, the container build is run in a user namespace that differs from the host. Files created during the build are owned by the re-mapped IDs.

When those images are run in the remapped user namespace, the IDs map back correctly and the files appear to be owned by root.

However, if those containers are run with --userns host, we see the following:

/ # ls -l /package/admin/s6-overlay-helpers-0.0.1.0/command/s6-overlay-suexec
-rwsr-xr-x    1 493216   493216       37960 Jan 28 13:11 /package/admin/s6-overlay-helpers-0.0.1.0/command/s6-overlay-suexec

Note that the UID and GID should be 0 (root), but they are instead the remapped UID/GID configured in /etc/subuid and /etc/subgid, as shown below.
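The condition visible in that listing (suid bit set, but owner not 0) can be checked with a few lines of shell. This is only a minimal sketch: check_suid_root is a hypothetical helper name, not part of s6-overlay, and it assumes a stat that supports -c (GNU coreutils or busybox).

```shell
#!/bin/sh
# Hypothetical helper: report whether a file is setuid AND owned by uid 0,
# as seen from the user namespace the shell is running in.
check_suid_root() {
    file=$1
    # Numeric owner of the file; fails if the file does not exist.
    owner=$(stat -c '%u' "$file") || return 2
    if [ ! -u "$file" ]; then
        echo "$file: suid bit not set"
        return 1
    fi
    if [ "$owner" -ne 0 ]; then
        echo "$file: suid bit set, but owner is uid $owner, not 0"
        return 1
    fi
    echo "$file: suid root"
}
```

Run inside the container against the s6-overlay-suexec path from the listing above, it would be expected to take the second branch when the file is owned by the remapped UID.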

Relevant config files

Example /etc/docker/daemon.json:

{ "userns-remap": "dockremap" }

And the remap files:

$ grep dockremap /etc/subuid /etc/subgid
/etc/subuid:dockremap:493216:65536
/etc/subgid:dockremap:493216:65536
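To make the arithmetic behind those entries concrete, here is a small sketch of how a container UID translates to a host UID under such a range. container_to_host_uid is a hypothetical helper written for illustration; it assumes the usual userns-remap behaviour where container UID 0 maps to the first ID of the range.

```shell
#!/bin/sh
# Hypothetical helper: map a container uid to a host uid for a
# subuid entry of the form name:start:count.
container_to_host_uid() {
    cuid=$1
    start=$2
    count=$3
    # Only uids 0..count-1 exist inside the remapped namespace.
    if [ "$cuid" -lt 0 ] || [ "$cuid" -ge "$count" ]; then
        echo "uid $cuid is outside the mapped range" >&2
        return 1
    fi
    echo $((start + cuid))
}

container_to_host_uid 0 493216 65536     # prints 493216
container_to_host_uid 1000 493216 65536  # prints 494216
```

With the dockremap range above, container root is host UID 493216, which is exactly the owner shown in the ls output when the container runs in the host's user namespace.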

The Dockerfile (done in a RUN segment so the latest release files can be retrieved, unpacked, and sha256 verified in a single layer):

FROM alpine:latest

RUN set -x \
 && apk --no-cache add jq \
 && cd /tmp \
 \
 && BINARIES=$(wget -O - -q https://api.github.com/repos/just-containers/s6-overlay/releases/latest | jq -Mr '.assets[] | select(.name|match("s6-overlay-x86_64-[.0-9-]+.tar.xz$")) | .browser_download_url') \
 && wget -q "$BINARIES" \
 && wget -q "$BINARIES.sha256" \
 \
 && SCRIPTS=$(wget -O - -q https://api.github.com/repos/just-containers/s6-overlay/releases/latest | jq -Mr '.assets[] | select(.name|match("s6-overlay-noarch-[.0-9-]+.tar.xz$")) | .browser_download_url') \
 && wget -q "$SCRIPTS" \
 && wget -q "$SCRIPTS.sha256" \
 \
 && sha256sum -c *.tar.xz.sha256 \
 && ls *tar.xz | xargs -n1 tar -C / -Jxvf \
 && apk del jq \
 && rm -f /tmp/s6-overlay-*

ENTRYPOINT ["/init"]

Commands

The build command:

docker build -t s6-v3 .

The run command:

docker run -it --rm --userns host s6-v3

skarnet commented 2 years ago

You see, that's exactly the reason why I dislike kitchen sink tools like Docker that have zillions of ways to achieve the same thing in slightly different ways.

Normally, in a multiprocess container, the container is run with its entrypoint running as root, then the init system performs privilege separation and runs every subservice under a different uid. That's how s6 operates traditionally; that's what s6-overlay tries to recreate; s6 is secure enough that the supervision tree can run as root, provided that services drop their root privileges in the run script. It works and nobody has ever complained about it.

Then people wanted support for USER containers. In this mode, the entrypoint runs as an unprivileged user, and the whole process tree in the container runs as that user. Sure, more processes are unprivileged, but there is no uid separation, so it's arguable which is more secure - and it really depends on the number and nature of the subservices. Well, it's still a reasonable request, so we supported USER containers. Unfortunately, since we cannot be sure that the USER will be the same from one invocation to the next, we still need root privileges for a couple of operations (in preinit), and relinquish them forever afterwards. It was a lot of work, but it's now operational.

And now, you are reporting another way for containers to drop privileges - this time, uids are remapped on the fly, and what is supposed to be root isn't root anymore, by magic! Well, no surprise that it doesn't work with the mechanism we have for USER containers.

I suppose I can turn the error in s6-overlay-suexec into a warning, but we cannot guarantee that everything will work further on in the container init sequence. It should, but userns-remap is really not how Unix was supposed to work and it breaks a number of assumptions. Best effort is the best we can do.

Also, any service attempting to do privilege separation will fail with userns-remap as it fails with USER: typically, syslogd-overlay won't support userns-remap.

jinnko commented 2 years ago

@skarnet I think you've misunderstood either the purpose or mechanisms of userns-remap.

From the running container perspective, everything will be exactly as if it's a normal system with the init entrypoint running as root. From the above linked docs:

... without the running process being aware of the limitations

It's not a mechanism for running processes as non-root users inside the container and is not the same as "USER containers".

What's different is that from the host perspective those processes are entirely unprivileged and have no access to the host's root UID despite everything looking normal within the running container. This limits the impact to the host of any potential privilege escalation bugs within the container, such as last week's sudo CVE.

The thing that's breaking here is how SUID works when the container is built on a hardened build server and then run in the host's own user namespace.

skarnet commented 2 years ago

Yes, I understand. What I don't understand is how remapping the file ownerships, from the point of view of the container, is going to help in any way. If files belong to root, they are more restricted than if they belong to some normal user, so it's normally a good thing! Except for suid executables, like here, where remapping them to a normal user breaks everything, when keeping them owned by root would not have been dangerous since the root privilege gained is only inside the container and by definition cannot leak to the host.

In other words: I don't understand how --userns host makes any kind of sense.

If, as you say, pid 1 is still starting as root inside the container's user namespace, and it's only the files that have remapped ownership, then it's probably fixable at the cost of yet another workaround, but I do question the validity of the operation in the first place.

skarnet commented 2 years ago

I have modified s6-overlay-suexec so it should do the right thing even in the case of --userns host. Closing this; please reopen if it's still not working for you once 3.1.0.0 is out, or earlier if you happen to build from the latest source.

hasan4791 commented 1 year ago

@skarnet Where can I find more information about this comment, especially "services drop their root privileges in the run script"? Let's say that in my container I have a user "abc", and I would like to perform some init work as "root" and run the main service as user "abc" with limited privileges. Also, will that user have all the capabilities of the container?

Normally, in a multiprocess container, the container is run with its entrypoint running as root, then the init system performs privilege separation and runs every subservice under a different uid. That's how s6 operates traditionally; that's what s6-overlay tries to recreate; s6 is secure enough that the supervision tree can run as root, provided that services drop their root privileges in the run script. It works and nobody has ever complained about it.

skarnet commented 1 year ago

If the command running your main service is foo, then use s6-setuidgid abc foo instead, and foo will run as user abc. Don't change anything else, so your init is still performed as root.

User abc will keep having access to the whole container, except, obviously, what can only be done by root. So, for instance, if it needs to write files to a directory dir, you should ensure that dir belongs to abc beforehand.
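The advice above can be sketched as a run script that does its privileged setup first and then drops to abc. The snippet below writes such a script into a service directory. write_run_script and /var/lib/foo are hypothetical names chosen for illustration, while abc and foo are the placeholders from the discussion. Where the service directory lives depends on how your image is laid out, so it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: create <service-dir>/run for a service "foo"
# that must run as user "abc" and write to /var/lib/foo.
write_run_script() {
    svcdir=$1
    mkdir -p "$svcdir"
    cat > "$svcdir/run" <<'EOF'
#!/bin/sh
# Still root here: do the init work that needs privileges.
mkdir -p /var/lib/foo
chown abc /var/lib/foo

# Drop to user abc and replace this shell with the service.
exec s6-setuidgid abc foo
EOF
    chmod 0755 "$svcdir/run"
}
```

Using exec means the supervisor ends up supervising foo itself rather than an intermediate shell, and everything before that line still runs as root.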