Support idmapped mount (merged in kernel 5.12)

tuananh commented 3 years ago

ref: https://github.com/containerd/containerd/pull/4734

AkihiroSuda commented 3 years ago

What do we need in runc?

cyphar commented 3 years ago

This is something that will need to be implemented in the higher-level tools (containerd, docker, podman, ...) as part of their storage drivers. There's not really much for runc to do because the higher-level tool has all of the information about images and manages their storage.

mauriciovasquezbernal commented 3 years ago

What about mounts? The current specification mentions that the Linux options are the same listed the mount(8) man page, however id mapped mounts won't be there as it uses the new mount api and the mount_setattr syscall to activate them. Perhaps an option to set the idmapped attribute when user namespaces are used should be provided?

mauriciovasquezbernal commented 3 years ago

It seems like @AkihiroSuda is ahead of the game, the same approach using in https://github.com/opencontainers/runtime-spec/pull/1090 would also work for MOUNT_ATTR_IDMAP.

AkihiroSuda commented 3 years ago

Moving to v1.2 milestone

AkihiroSuda commented 2 years ago

crun now supports "idmap" mount option https://github.com/containers/crun/commit/827b8731899d4febea3f27907bcde5e6065bb65b

AlexeyPerevalov commented 2 years ago

I would like to raise and interface issue here, which related to idmapped mounts. Since we can have shared volumes, e.g. a directory mounted from host fs. Here idmap mount also could be usefull, as example in following scenario: Log grabber runs with user1 permission in a container, and 2 log producers with user2, user3 respectively in another containers (or even pods). These log producers have common mount with log grabber, in this case reasonable to apply id map mapping to avoid DAC operations with groups or MAC. Here is the question how to pass idmaps by OCI:

As a string in Spec.Mount.Options, e.g. entry would contain something like this "idmap=1000,1001,1", it feels it was considered (https://github.com/opencontainers/runc/blob/2436322fef995f60194554c5ce2947df2e18661d/libcontainer/specconv/spec_linux.go#L128) This way probably doesn't fit for idmap, since it's not in the list of available options for fsconfig or mount syscall - this is a separate entity.
As new field Spec.Mount.IDMap in this case it's not mixed with mount options.

WDYT @rata, @AkihiroSuda, @cyphar ?

rata commented 2 years ago

I think it is not trivial to use id mapped mounts for volumes, as we need to know if the filesystems support id mapped mounts. If we just focus on runc and use it when requested (fail if requested and not supported), it might work for those use cases. But I'd like to see how that is useful to use from Kubernetes, in other words the full picture.

I'm curious what others think.

cyphar commented 2 years ago

@rata

This would (in theory) be useful for Kubernetes because Kubernetes knows what volumes are used by what containers, and having ID mapped mounts would allow Kubernetes to run containers with a more secure user namespace configuration (right now they don't use user namespaces at all -- but it would allow them to run with "isolated" user namespaces where each pod has a unique mapping, and then any directories mounted into multiple containers can be remapped to match the container IDs).

To be honest, I somewhat doubt they'll use this feature (after all, they don't even use user namespaces even though it would improve the security of their containers) but it is something you could imagine them using and being quite useful.

As for filesystem support, @brauner is slowly hacking away at getting every filesystem to support it (last I spoke to him about it, he was working on btrfs -- that might've already been merged, I'm not sure). These days, the most commonly used ones support it. IMHO there's not much reason to worry about filesystem support at the spec level.

@AlexeyPerevalov

I would probably prefer

As new field Spec.Mount.IDMap in this case it's not mixed with mount options.

Because it would be quite inadvisable to put strings you need to parse within the spec (we did this with the Intel L3 cache thing and it's really less than ideal).

rata commented 2 years ago

@cyphar

Thanks, I'm unsure if they are useful for k8s as proposed here (but heh, we can make future improvements when k8s uses this if needed).

I'm working on a k8s KEP to slowly adopt userns, that is why I was wondering if this is really useful or not. I don't think k8s has the fs of each volume handy (maybe it can get it, I don't know much of that code parts), so while it knows the volumes, k8s probably lacks the fs and therefore can't know if it will work with id mapped mounts or not. That is why I suspect this might not be really useful for k8s. Maybe the fs is info k8s can gather at pod sandbox creation time, in which case this will be useful; maybe not, in which case we will need to find another way to use it.

Yes, btrfs support is already merged. But overlayfs is in the works, that is quite important for k8s at least (e.g. to launch new containers without storage/perf overhead of the chown of the file - metacopy overlayfs param can help while id mapped mounts is not supported, but id mapped mounts should be way better :)).

AlexeyPerevalov commented 2 years ago

@rata

This would (in theory) be useful for Kubernetes because Kubernetes knows what volumes are used by what containers, and having ID mapped mounts would allow Kubernetes to run containers with a more secure user namespace configuration (right now they don't use user namespaces at all -- but it would allow them to run with "isolated" user namespaces where each pod has a unique mapping, and then any directories mounted into multiple containers can be remapped to match the container IDs).

To be honest, I somewhat doubt they'll use this feature (after all, they don't even use user namespaces even though it would improve the security of their containers) but it is something you could imagine them using and being quite useful.

@cyphar For this feature user namespace is not necessary, yes interface implies you have to give userns fd, but it could be temporary user namespace, created by clone with appropriate flag for arbitrary binary, and you might do it for each mount point, so we can have different mappings for different mount points. That's why, I think, id mapping which we have to pass by OCI for mount points is not what we have in process.linux.UID/GIDMappings.

As for filesystem support, @brauner is slowly hacking away at getting every filesystem to support it (last I spoke to him about it, he was working on btrfs -- that might've already been merged, I'm not sure). These days, the most commonly used ones support it. IMHO there's not much reason to worry about filesystem support at the spec level.

I think here we might have a fallback to recursive chown if idmapping operation fails.

@AlexeyPerevalov

I would probably prefer

As new field Spec.Mount.IDMap in this case it's not mixed with mount options.

Because it would be quite inadvisable to put strings you need to parse within the spec (we did this with the Intel L3 cache thing and it's really less than ideal).

agree

AlexeyPerevalov commented 2 years ago

I think it is not trivial to use id mapped mounts for volumes, as we need to know if the filesystems support id mapped mounts. If we just focus on runc and use it when requested (fail if requested and not supported), it might work for those use cases. But I'd like to see how that is useful to use from Kubernetes, in other words the full picture.

I'm curious what others think.

For the whole picture which includes k8s, I think, we have to highlight 2 use cases we have now:

id mapping for rootfs
id mappng for shared volumes

For rootfs, it's clear if we dont' request id mapping fairly we still can make an assumption it's necessary, just by checking image (at pull stage tar's header is checking for uid/gid and then lchown applied for every file) and desired user. It could be optional behavior, configured by containerd option. For shared volumes containerd/runc could only know desired user which will be substituted, but not the uid to map, in this situation, it should be requested by end user and it requires Pod.Spec change which is a challenge.

neersighted commented 1 year ago

Linking the latest c8d PR: https://github.com/containerd/containerd/pull/5890

rata commented 1 year ago

@neersighted thanks, that should be unrelated in a way. Here in runc we will support idmap mounts on the OCI mounts, as the runtime-spec says, and in containerd we will do it for the rootfs (the oci-spec also says runc shouldn't change the ownership of the rootfs)

neersighted commented 1 year ago

Indeed, just leaving the link here for those following along on ecosystem-wide support.

rata commented 1 year ago

We can close this issue now that this PR is merged: https://github.com/opencontainers/runc/pull/3717

opencontainers / runc

Support idmapped mount (merged in kernel 5.12) #2821