opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

Rootless containers without uid mapping to root #1800

Open Madeeks opened 6 years ago

Madeeks commented 6 years ago

Hello,

would it be possible to use runc to create a rootless container with the following characteristics:

In other words, I would like to know if the use case described here for LXC is supported by runc as well.

I tried setting up the config.json with the following details:

{
    "ociVersion": "1.0.0",
    "process": {
        "terminal": true,
        "user": {
            "uid": 23689,
            "gid": 1000
        },
        [ ... ]
    }
    [ ... ]
    "linux": {
        "uidMappings": [
            {
                "hostID": 23689,
                "containerID": 23689,
                "size": 1
            }
        ],
        "gidMappings": [
            {
                "hostID": 1000,
                "containerID": 1000,
                "size": 1
            }
        ],
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "user"
            }
        ],
        [ ... ]
    }
}

However, runc run returns the following message: User namespaces enabled, but no user mapping found.

Thanks for any help provided.
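One plausible reading of that error, offered as an assumption rather than a description of runc's actual code: if the runtime tries to resolve the host ID that container uid 0 maps to, a table that only maps 23689 has no entry to return. A minimal C sketch of such a lookup follows; the `id_map` struct and `host_id_for` function are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical mirror of one uidMappings/gidMappings entry in config.json. */
struct id_map {
    unsigned int container_id;
    unsigned int host_id;
    unsigned int size;
};

/*
 * Translate a container ID to a host ID using the mapping table.
 * Returns 0 and stores the result in *host_id on success,
 * or -1 if no entry covers container_id.
 */
int host_id_for(const struct id_map *maps, size_t n,
                unsigned int container_id, unsigned int *host_id)
{
    for (size_t i = 0; i < n; i++) {
        const struct id_map *m = &maps[i];
        /* An entry covers the range [container_id, container_id + size). */
        if (container_id >= m->container_id &&
            container_id - m->container_id < m->size) {
            *host_id = m->host_id + (container_id - m->container_id);
            return 0;
        }
    }
    return -1;
}
```

With the single mapping from the config above, a lookup for 23689 succeeds while a lookup for 0 returns -1, which would be consistent with a "no user mapping found" style of failure.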

cyphar commented 6 years ago

At the moment this is not supported, though this is something that I agree would be useful. I haven't yet taken a look at how much work it would take (and whether LXC does anything special in this case which we would have to replicate). I can talk to @brauner out-of-band and see how he solved the "you need root in the namespace in order to set up the container" problem (maybe it was done using capabilities -- I'm not sure).

Madeeks commented 6 years ago

Thanks a lot for the reply @cyphar! This feature would be very useful to me, so I'll keep an eye out for it in the future.

llchan commented 6 years ago

This would also be useful to me. I'd like to move some existing processes into (rootless) containers, and would like for them to think they run as the same unprivileged user as before.

I'll experiment a bit, but this will likely take me out of my depth and I may need some guidance. Do we already have a general idea of how to get this to work?

cyphar commented 6 years ago

Basically, the core idea is to lift the current restrictions and see what breaks. The main breakage will likely be that runc currently assumes that running as a non-zero uid (in the container) means that you want to drop capabilities. We need to stop this from happening, and it will likely be the only really big pain point.

Aside from that, most of it ought to work (there isn't anything particularly special about mapping 1000->1000 versus 1000->0 as an unprivileged user).

llchan commented 6 years ago

I think I have something minimally functional, but one snag I've hit is that the RHEL 7.5 kernel 3.10 doesn't allow unprivileged devpts mounts (it returns EINVAL on mount). A likely relevant conversation is https://github.com/singularityware/singularity/issues/1186. As a workaround, I currently have to set "terminal": false and allow devpts mounts to fail, which is unfortunate but better than nothing. I could always do interactive work as container root if necessary. If you have any ideas for a better workaround let me know.

cyphar commented 6 years ago

As far as I am aware, this is something that we should have already fixed a long time ago by dropping gid=5 in the default mount options configuration (it was part of the original batch of changes in #774). Have you tried removing gid=5?
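To illustrate the suggestion, here is a small C helper that strips any `gid=` entry from a comma-separated devpts option string before it would be handed to mount(2). This is only a sketch: `strip_gid_option` is a hypothetical name, and nothing here is taken from runc.

```c
#include <string.h>

/*
 * Copy the comma-separated mount options in `opts` into `out`, leaving out
 * any "gid=..." entry. Returns 0 on success, -1 if `out` is too small.
 */
int strip_gid_option(const char *opts, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*opts != '\0') {
        /* Length of the current option, up to the next comma or the end. */
        size_t len = strcspn(opts, ",");

        if (len > 0 && strncmp(opts, "gid=", 4) != 0) {
            /* Space for an optional separator, the option and the terminator. */
            size_t need = len + (used > 0 ? 1 : 0);
            if (used + need + 1 > out_size)
                return -1;
            if (used > 0)
                out[used++] = ',';
            memcpy(out + used, opts, len);
            used += len;
            out[used] = '\0';
        }
        opts += len;
        if (*opts == ',')
            opts++;
    }
    return 0;
}
```

Passing "newinstance,ptmxmode=0666,mode=0620,gid=5" leaves the first three options in `out` and drops the last one.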

But the discussion you linked appears to argue that there is a kernel-side check for devpts mounts that is based on uid? That's a bit odd, I would've imagined it's purely based on whether you have CAP_SYS_ADMIN. I'll take a look at the relevant kernel code (hopefully it's not a RHEL-only patch because it's a nightmare to get usable RHEL kernel sources).

llchan commented 6 years ago

Yeah, saw some of the commits related to that. My config does not have a gid=5 option, and I verified via strace:

mount("devpts", "/path/to/bundle/rootfs/dev/pts", "devpts", MS_NOSUID|MS_NOEXEC, "newinstance,ptmxmode=0666,mode=0620") = -1 EINVAL (Invalid argument)

After re-reading that thread and peeking at the kernel source, I don't think this is RHEL-specific, it's just that the 3.10 kernel it comes with is fairly old and requires that uid=0 and gid=0 be valid in the user namespace. See the relevant 3.10 source at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/devpts/inode.c?h=v3.10#n249 and the commit that fixes this at https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=e98d41370392dbc3e94c8802ce4e9eec9efdf92e.

Is it possible for an unprivileged user to map host root to container root in the user namespace? I'm guessing not, for security reasons?

cyphar commented 6 years ago

Is it possible for an unprivileged user to map host root to container root in the user namespace? I'm guessing not, for security reasons?

No, that's not possible (as an unprivileged user) because inside a user namespace you can change to any user (if you created the user namespace). As an unprivileged user, you can only map yourself into either uid=0 or uid=parent_uid.

llchan commented 6 years ago

Right, yeah, that's what I thought. We may just have to accept that devpts won't work with nonzero uid/gid on older kernels. We can output a hint message in the logs when the devpts mount returns EINVAL and uid/gid are nonzero.
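The check for that hint could be as small as the following C sketch. The function name is hypothetical, and treating "uid/gid are nonzero" as "either one is nonzero" is an assumption made here, based on the 3.10 kernel requiring both uid=0 and gid=0 to be valid in the user namespace.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Decide whether a failed devpts mount deserves the hint.
 * `mount_errno` is the errno left behind by mount(2); `uid` and `gid` are
 * the container IDs the process is configured to run as.
 * Assumption: the hint applies when either ID is nonzero.
 */
bool devpts_needs_hint(int mount_errno, uid_t uid, gid_t gid)
{
    return mount_errno == EINVAL && (uid != 0 || gid != 0);
}
```

A caller would log the hint only when this returns true, and report any other mount failure as usual.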

I'll put together a PR at some point.

zokrezyl commented 5 years ago

Hi, almost created a new issue for the same thing, luckily found this one.

I think the basic solution is trivial and would cover lots of use cases (I will soon link to the way I solved it for myself). It involves three additional steps (the fourth is already present), in semi-pseudo-code:

->  unshare(CLONE_NEWUSER)
->  write("/proc/$$/uid_map", "1000 0 1")
->  write("/proc/$$/gid_map", "1000 0 1")
-> execve("user_process")

1000 is the original user's ID, which was mapped to 0 in the previous step.

I think it would be great if this made it into runc behind an additional flag.
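The steps above can be sketched in C roughly as follows. This is a minimal illustration, not the author's code and not runc code: the function names are hypothetical, it assumes the caller is already uid/gid 0 inside an outer user namespace, and it adds a write to /proc/self/setgroups, which kernels since Linux 3.19 require before an unprivileged gid_map write.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format a one-entry ID map line: "<inside> <outside> 1\n". */
int format_id_map(char *buf, size_t size, unsigned inside, unsigned outside)
{
    int n = snprintf(buf, size, "%u %u 1\n", inside, outside);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Write `data` to `path` with a single write(2) call. */
static int write_file(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t len = (ssize_t)strlen(data);
    ssize_t n = write(fd, data, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/*
 * Enter a nested user namespace in which the caller (uid/gid 0 in the
 * current namespace) appears as `uid`/`gid`, then exec `argv`.
 * Returns only on failure.
 */
int remap_and_exec(unsigned uid, unsigned gid, char *const argv[])
{
    char line[64];

    if (unshare(CLONE_NEWUSER) != 0)
        return -1;
    /* Needed before an unprivileged gid_map write; the file is absent on
     * kernels older than 3.19, in which case this sketch simply fails. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return -1;
    if (format_id_map(line, sizeof line, uid, 0) < 0 ||
        write_file("/proc/self/uid_map", line) != 0)
        return -1;
    if (format_id_map(line, sizeof line, gid, 0) < 0 ||
        write_file("/proc/self/gid_map", line) != 0)
        return -1;
    execvp(argv[0], argv);
    return -1;
}
```

Calling `remap_and_exec(1000, 1000, argv)` from a process that is root in its user namespace would run the target program as uid 1000 and gid 1000 in the nested namespace.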

zokrezyl commented 4 years ago

Just another note: a bunch of exploits (like https://seclists.org/oss-sec/2019/q1/119) could probably have been avoided, and could still be avoided, if better tooling were provided for unprivileged containers without reinventing the wheel...

Related to the technical solution I proposed in my previous post: before executing the second unshare, it would be great to have the opportunity to run an executable from the container's filesystem as initialisation. The steps would then be:

-> clone && execve user specific init process as uid 0
->  unshare(CLONE_NEWUSER)
->  write("/proc/$$/uid_map", "1000 0 1")
->  write("/proc/$$/gid_map", "1000 0 1")
-> execve("user_process")

zokrezyl commented 4 years ago

Found the solution. To implement it, one needs an additional step (sub-command) similar to "init"; let's call it "unsremap":

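#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Enter a new user namespace, but only when UNSHARE_MODE is set. */
void unsremap(void)
{
    const char *unshare_mode = getenv("UNSHARE_MODE");
    if (unshare_mode != NULL) {
        /* Bail out instead of silently continuing in the old namespace. */
        if (unshare(CLONE_NEWUSER) != 0) {
            perror("unshare(CLONE_NEWUSER)");
            exit(EXIT_FAILURE);
        }
    }
}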

I am happy to provide a patch if this approach is accepted.

cyphar commented 4 years ago

I'm not sure it's necessary to have a separate re-exec stage, you should just be able to add an extra CLONE_NEWUSER in the existing nsexec.c setup stages (though because you have to do the mappings this may require adding a new stage to setup...). In addition, doing the unshare after all the other namespaces are set up wouldn't be a good idea -- the new user namespace wouldn't own any of them and containers wouldn't function correctly.

Also changes to the configuration format of config.json require runtime-spec changes, ideally we would specify this separately (though I'd hate for it to be done through a new flag -- maybe it could be specified by saying that you only want a single mapping for a non-root user and the user to run as is set as the same user?).
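The detection rule floated in that last parenthesis could look something like the following C sketch. It is purely hypothetical: the struct, the function name and the exact rule are assumptions for illustration, and neither runc nor the runtime-spec defines such a check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of one uidMappings/gidMappings entry in config.json. */
struct id_map {
    unsigned int container_id;
    unsigned int host_id;
    unsigned int size;
};

/*
 * Return true when the configuration asks for "a single mapping for a
 * non-root user" and the process is set to run as that same user:
 * exactly one entry, covering one ID, whose container ID is nonzero and
 * equals the ID from process.user.
 */
bool wants_nonroot_identity(const struct id_map *maps, size_t n,
                            unsigned int process_id)
{
    return n == 1 &&
           maps[0].size == 1 &&
           maps[0].container_id != 0 &&
           maps[0].container_id == process_id;
}
```

The config.json from the first comment (one mapping for 23689, process uid 23689) would satisfy this rule, while the usual rootless mapping of the host user to container uid 0 would not.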

zokrezyl commented 4 years ago

Doing it in nsexec may be too early, since you cannot do any mounting or other initialisation as non-root, which I believe you are doing in the init subcommand.

In addition, doing the unshare after all the other namespaces are set up wouldn't be a good idea -- the new user namespace wouldn't own any of them and containers wouldn't function correctly.

Well, I am trading one thing against another. Obviously a lot of containers will not work, as they may assume that they are running as uid 0. However why would I need to own further the namespaces. The idea, at least as I understand it, is to run with the highest isolation and the lowest privileges, which assumes that the processes in the resulting context should not pretend to own anything significant.

My containers do not work for the opposite reason: some executable are assuming that they are not uid 0.

cyphar commented 4 years ago

However why would I need to own further the namespaces.

You cannot configure namespaces unless you own them (more specifically, have the correct capabilities in the user namespace which owns the namespace you're trying to configure), and since the configuration is done much later during setup you would need to do the unshare at the very end of setup which would make the logic much more complicated.
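The ownership point can be observed with a small experiment, sketched below in C under the assumption that unprivileged user namespaces are enabled on the host. All names are hypothetical. A child process creates a user namespace together with a UTS namespace, sets the hostname, then unshares a second user namespace and tries again. The second attempt is expected to fail with EPERM, because the UTS namespace is still owned by the first user namespace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write `data` to `path` with a single write(2) call. */
static int write_file(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t len = (ssize_t)strlen(data);
    ssize_t n = write(fd, data, (size_t)len);
    close(fd);
    return n == len ? 0 : -1;
}

/*
 * Body of the experiment, run in a child. Return value:
 *   0  hostname could be set before the second unshare, EPERM after it
 *   1  the environment did not allow the setup
 *   2  the second sethostname did not fail with EPERM
 */
static int child(uid_t uid, gid_t gid)
{
    char line[64];

    /* The new user namespace owns the new UTS namespace. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) != 0)
        return 1;
    /* Map ourselves to root; a mapping is required before nesting. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return 1;
    snprintf(line, sizeof line, "0 %u 1\n", (unsigned)uid);
    if (write_file("/proc/self/uid_map", line) != 0)
        return 1;
    snprintf(line, sizeof line, "0 %u 1\n", (unsigned)gid);
    if (write_file("/proc/self/gid_map", line) != 0)
        return 1;

    /* We hold CAP_SYS_ADMIN in the namespace that owns the UTS namespace. */
    if (sethostname("before", 6) != 0)
        return 1;

    /* The nested user namespace does not own the existing UTS namespace. */
    if (unshare(CLONE_NEWUSER) != 0)
        return 1;
    if (sethostname("after", 5) == 0 || errno != EPERM)
        return 2;
    return 0;
}

/* Run the experiment in a forked child so the caller is left untouched. */
int ownership_demo(void)
{
    uid_t uid = getuid();
    gid_t gid = getgid();
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 1;
    if (pid == 0)
        _exit(child(uid, gid));
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}
```

A result of 0 from `ownership_demo()` means the behaviour described above was observed; a result of 1 only means the host did not permit the setup, for example because unprivileged user namespaces are disabled.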

My containers do not work for the opposite reason: some executable are assuming that they are not uid 0.

There are some ghetto solutions for this problem which I helped develop some time ago -- https://github.com/rootless-containers/subuidless is the latest iteration of this idea.

zokrezyl commented 4 years ago

Not sure if you understood my initial proposal. The idea is that with some magic configuration, once everything is configured by runc (namespaces, mounts etc), instead of calling the process.args from config.json you would call

['/proc/self/exe', 'unsremap', '1000', '1000'] + process.args

the unsremap subcommand