opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0
11.89k stars 2.11k forks source link

consider unifying mountEntry.srcFD and mountEntry.idmapFD approach #3935

Open thaJeztah opened 1 year ago

thaJeztah commented 1 year ago

I'm somewhat curious why we use a string for srcFD and an int for idmapFD. Wondering if we should unify the approach (probably can be done in a follow-up)

_Originally posted by @thaJeztah in https://github.com/opencontainers/runc/pull/3717#discussion_r1259736922_

One uses a string with the path (/proc/self/fd/<FD>), whereas the other uses the bare FD (int).

We could consider using an int for both, and update mountViaFDs() (which is not exported) to internalize constructing the path if the value is not -1; https://github.com/opencontainers/runc/blob/05669c8aba9bd3ee4cb9935a74dbd96828f65fc7/libcontainer/mount_linux.go#L67-L71

cyphar commented 1 year ago

I suspect that the best solution would actually be to entirely remove the idmapping logic from C and instead implement it in Go (if we can use move_mount(2) entirely without needing to worry about backward compatibility, we can move the mountfd logic there too since clone_tree allocates an anonymous mount namespace that doesn't have the same source mount limitations as mount --bind when mounted using move_mount).

While it's not perfect, given that the mountfd stuff is really quite niche, I would be in favour of requiring the new mount API for it to be used (this would limit the feature to Linux 5.2 and newer -- @alban is this a deal-breaker?) -- but it would mean we could completely eliminate the mountfd code from the C code in runc and it would completely unify the implementations (you would open all the bind-mount sources on the host with clone_tree -- possibly adding the idmap mount attrs -- and pass them to the Go code in rootfs_linux.go. this would also critically let us avoid leaking file descriptors from the host namespace into containers entirely). It could also work for non-bind-mount mapped mounts (something that was also missed in the initial implementation) though I suspect this will have issues with sb->s_user_ns.

Unfortunately we cannot use CLONE_NEWUSER in a multi-threaded program so we would need to do some dodgy stuff to get that bit working, but I suspect that even with that complication, it would make things a lot simpler. We can probably spawn the processes using syscall.SpawnProcess since we just need the mappings, though there's the question of what process we want to spawn (and would it be more efficient to do a re-exec of ourselves and then do all the mappings in C? annoyingly you cannot temporarily create a user namespace and join the original one due to how permissions work... each userns has to be done in a separate threadgroup).

rata commented 1 year ago

@cyphar for idmap mounts forcing a minimum of 5.2 is not an issue, as we need newer kernels for idmap mounts anyways.

For opening the mount sources, I think it might be fine too. I'm changing the k8s userns implementation to idmap mounts (we hit several limitations and concerns from other SIGs when we tried to avoid it), and opening the mount sources is also a patch we added for userns support in k8s, because /var/lib/kubelet/pods has permissions for root:root and nothing for others.

I think this bug with what @thaJeztah reported is closed now that #3939 is merged (that part is unified). But do you want to open another issue with this idea of avoiding the C parts?