opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

mount ns problems #1315

Closed cpuguy83 closed 7 years ago

cpuguy83 commented 7 years ago

Docker (as I'm sure you know) works like this:

  1. Mount some stuff (probably in host NS)
  2. Pass mountpoint to executor (containerd)
  3. runc pivots to mountpoint

This fundamentally requires that runc (and containerd) and docker be in the same mount namespace. The problem with this model is mounts leaking to other namespaces, including, potentially, namespaces for new containers. When this happens we get nasty EBUSY errors when trying to unmount (depending on kernel version) or remove (every kernel version) these mountpoints later on (e.g. when cleaning up after a container).

One potential solution I was thinking about is being able to pass a pre-configured mount namespace down to runc. It seems like this should be doable without a ton of work (it is already supported for exec), but I haven't gone too far into this. I think in this case runc would still do the same thing (i.e., pivot_root to the passed-in rootfs path); the only difference would be that runc doesn't need to create the namespace but rather joins the provided mount ns (I know, much harder to do). This would allow runc to keep the actual containerisation logic rather than expecting the caller to handle it, allow the caller to isolate the mounts required for a new container from the rest of the system, and enable runc to be started from any mount ns without these kinds of issues.

I'm also open to other options, of course. Or perhaps there is already some way to work around this? Thanks!
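For illustration only, here is a rough sketch of what "passing down a mount namespace" could look like on the configuration side. The OCI runtime spec lets each entry in `linux.namespaces` carry an optional `path`; the struct below mirrors that shape locally rather than importing the spec package, and the path used in `main` is a made-up placeholder. Whether runc handles a mount namespace passed this way correctly is exactly the open question in this issue, so treat this as a sketch of the request, not of working behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Namespace mirrors the shape of one entry in the OCI runtime spec's
// linux.namespaces array. It is declared locally to keep the sketch
// self-contained.
type Namespace struct {
	Type string `json:"type"`
	Path string `json:"path,omitempty"`
}

// mountNamespaceFor returns a mount namespace entry. An empty path asks the
// runtime to create a new namespace; a non-empty path asks it to join the
// namespace found at that path.
func mountNamespaceFor(path string) Namespace {
	return Namespace{Type: "mount", Path: path}
}

// JSON renders the entry the way it would appear in config.json.
func (n Namespace) JSON() string {
	out, err := json.Marshal(n)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// Hypothetical path: a mount namespace the caller prepared beforehand.
	fmt.Println(mountNamespaceFor("/run/example/ns/mnt").JSON())
	// No path: the runtime creates a fresh mount namespace, as it does today.
	fmt.Println(mountNamespaceFor("").JSON())
}
```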

runcom commented 7 years ago

/cc @rhatdan @rhvgoyal

mrunalp commented 7 years ago

@cpuguy83 We do have support for joining existing namespaces, but it probably won't work for mount namespaces without some more checks and code to skip setup. :+1: to supporting this better.

crosbymichael commented 7 years ago

It's not that simple with a mount ns. A better solution, given what we are going for today, is mounting the overlay fs inside the container by passing it in the runtime spec and having the destination be /.

Also, this won't work with userns, so we will have to figure something else out. The common misconception is that mount ns == pivot_root (chroot), which is not true. So docker, containerd, and runc don't need to be in the same mount ns, just in the same parent/child relationship for how they are launched.

I think for containerd, to solve the userns issues, we are going to mount the overlay fs in the shim, in its own mount ns, and go that route.
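As a sketch of that idea, the rootfs could be described as an ordinary entry in the spec's `mounts` array with `/` as its destination. The `Mount` struct below mirrors the field names of the OCI runtime spec but is declared locally, the directory paths are hypothetical, and whether a given runc version accepts a mount with destination `/` is an assumption here, not something this thread confirms.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Mount mirrors the shape of one entry in the OCI runtime spec's mounts
// array. It is declared locally to keep the sketch self-contained.
type Mount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options,omitempty"`
}

// overlayRootMount describes an overlay filesystem to be mounted at "/"
// inside the container, using the kernel's lowerdir/upperdir/workdir options.
func overlayRootMount(lowers []string, upper, work string) Mount {
	return Mount{
		Destination: "/",
		Type:        "overlay",
		Source:      "overlay",
		Options: []string{
			"lowerdir=" + strings.Join(lowers, ":"),
			"upperdir=" + upper,
			"workdir=" + work,
		},
	}
}

func main() {
	// Hypothetical layer directories prepared by the rootfs provider.
	m := overlayRootMount(
		[]string{"/var/lib/example/layers/a", "/var/lib/example/layers/b"},
		"/var/lib/example/ctr1/upper",
		"/var/lib/example/ctr1/work",
	)
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this approach the host never sees the overlay mount at all, so there is nothing to leak into other namespaces.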

rhvgoyal commented 7 years ago

@cpuguy83 We were running the docker daemon in a separate mount namespace to avoid leaking mount points. If containerd and runc run as children of dockerd, they will share the same mount namespace, so mount points will be limited to the Docker ecosystem and there is a lower probability of leakage.

But now --live-restore and shared volumes assume that docker is in the same mount namespace as init. So slowly everybody wants to run docker in the host mount namespace.

IIUC, with the latest kernel, this should not be a problem? I mean, I can do a lazy unmount so that the mount disappears from the host mount namespace and then remove the directory. For the case of devices, the deferred removal and deferred deletion features of devicemapper should take care of this. That is, devices will be removed/deleted once no mount namespace is using them.

So I am not sure why this would continue to be a problem with the latest kernel.
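A minimal sketch of the "lazy unmount, then remove the directory" sequence, assuming Linux. `cleanupMountpoint` and `unmountFunc` are illustrative names, not Docker or runc code; the unmount call is passed in as a parameter only so the logic can be exercised without root, and in real use it would be `syscall.Unmount`.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// unmountFunc matches the signature of syscall.Unmount so that a fake can be
// substituted where a real unmount would need privileges.
type unmountFunc func(target string, flags int) error

// cleanupMountpoint lazily detaches whatever is mounted at path
// (umount2 with MNT_DETACH) and then removes the now-empty directory.
func cleanupMountpoint(path string, unmount unmountFunc) error {
	if err := unmount(path, syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("lazy unmount %s: %w", path, err)
	}
	if err := os.Remove(path); err != nil {
		return fmt.Errorf("remove %s: %w", path, err)
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: cleanup <mountpoint>")
		return
	}
	// Needs sufficient privileges (typically root) to unmount.
	if err := cleanupMountpoint(os.Args[1], syscall.Unmount); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Whether the final remove succeeds while another namespace still holds the mount depends on the kernel version, which is the point under discussion here.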

cyphar commented 7 years ago

@cpuguy83

> This fundamentally requires that runc (and containerd) and docker be in the same mount namespace. The problem with this model is mounts leaking to other namespaces, including, potentially, namespaces for new containers.

Can you elaborate on why MS_SLAVE and MS_PRIVATE are not sufficient to implement this? With MS_SLAVE (or MS_PRIVATE) you could set up all of your mountpoints with each container (even each containerd-shim, though I'm not sure why you'd want that) in a separate mount namespace and then just tell runC where it's mounted (which will be resolved according to the mount namespace that runC is in). While we have had problems with MS_PRIVATE in the past, we have fixed those issues.
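When debugging this kind of leak it helps to see which propagation mode each mountpoint actually has. The sketch below, assuming Linux, reads the optional fields of `/proc/self/mountinfo` (`shared:N`, `master:N`, `unbindable`) and reports the mode per mountpoint. `propagationOf` is a hypothetical helper written for this example, and it only inspects the state; it does not change any propagation flags.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagationOf parses one line of /proc/self/mountinfo and returns the
// mount point together with its propagation mode: "shared", "slave",
// "shared+slave", "unbindable", or "private". ok is false if the line
// does not look like a mountinfo entry.
func propagationOf(line string) (mountpoint, propagation string, ok bool) {
	fields := strings.Fields(line)
	// Fields 0-5 are fixed; optional fields follow until a lone "-".
	sep := -1
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep < 0 {
		return "", "", false
	}
	shared, slave, unbindable := false, false, false
	for _, opt := range fields[6:sep] {
		switch {
		case strings.HasPrefix(opt, "shared:"):
			shared = true
		case strings.HasPrefix(opt, "master:"):
			slave = true
		case opt == "unbindable":
			unbindable = true
		}
	}
	switch {
	case unbindable:
		propagation = "unbindable"
	case shared && slave:
		propagation = "shared+slave"
	case shared:
		propagation = "shared"
	case slave:
		propagation = "slave"
	default:
		propagation = "private"
	}
	return fields[4], propagation, true
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if mp, prop, ok := propagationOf(sc.Text()); ok {
			fmt.Printf("%-12s %s\n", prop, mp)
		}
	}
}
```

A mountpoint reported as "shared" in the namespace where a new container is started is one that can propagate into, or be pinned by, other namespaces.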

> When this happens we get nasty EBUSY errors when trying to unmount (depending on kernel version) or remove (every kernel version) these mountpoints later on (e.g. when cleaning up after a container).

If you're on 3.16 or later (IIRC -- or a kernel which backports this fix), then unmounting a mountpoint in a namespace context where the unmount won't affect another namespace that is using the mountpoint (mainly this is MS_SLAVE and MS_PRIVATE) will work without EBUSY.

As for "removing a mountpoint" -- I'm unclear what you mean. Do you mean removing the directory after it is no longer mounted? In that case, I'm wondering if it's actually an issue with the overlay filesystem you're using (you're not trying to remove one of the lowerdirs from overlayfs or something, right?). Otherwise removing a directory should work perfectly fine (that was also fixed in the kernel patch I mentioned).

crosbymichael commented 7 years ago

In the end, this is not a runc issue and has to be handled / configured by the rootfs provider.