moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Mount Point Plugins #33048

Open · dsheets opened this issue 7 years ago

dsheets commented 7 years ago

Problem

Container bind mounts cannot be intercepted to perform additional setup and teardown (e.g. in the case of cross-platform Docker products like Docker for Mac).

Proposal: Mount Point Plugins

Mount point plugins enable end users, systems developers, and Docker Inc. to develop plugins that interpose on mount point setup and teardown in containers. In particular, a number of mount point plugins can be installed in the engine; they are consulted, in order, to confirm or deny mounts and potentially change their source locations. On container shutdown or stop, the relevant mount point plugins would also be called, in reverse order, to tear down the mounts. These teardown (detachment) transactions block shutdown until they complete and may change the container exit code if teardown fails (e.g. if synchronizing state fails). Mount point plugins will also be able to consume the new consistency flags.

Initially, only bind mounts and volume mounts will be supported (i.e. not tmpfs, network mounts, secrets, or container layers). Mount point plugins will register filters on initialization so that only applicable plugins are consulted for any given mount. This improves performance by reducing the number of plugin round trips for un-interposed mounts.
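To make the filtering idea concrete, here is a minimal sketch in Go of how a plugin chain might skip plugins whose registered filters do not match a given mount. The `Mount` and `Plugin` types and the `pluginsFor` helper are hypothetical names invented for illustration, not part of any existing Moby API:

```go
package main

import "fmt"

// Mount is a hypothetical description of a container mount point.
type Mount struct {
	Type   string // "bind", "volume", "tmpfs", ...
	Source string
}

// Plugin is a hypothetical mount point plugin with its registered filters.
type Plugin struct {
	Name  string
	Types map[string]bool // mount-point types this plugin registered for
}

// pluginsFor returns, in chain order, only the plugins whose filters match
// the mount, avoiding round trips to plugins that did not register for
// this mount type.
func pluginsFor(chain []Plugin, m Mount) []string {
	var out []string
	for _, p := range chain {
		if p.Types[m.Type] {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	chain := []Plugin{
		{Name: "osxfs", Types: map[string]bool{"bind": true, "volume": true}},
		{Name: "secrets-helper", Types: map[string]bool{"tmpfs": true}},
	}
	// Only osxfs registered for bind mounts, so only it is consulted.
	fmt.Println(pluginsFor(chain, Mount{Type: "bind", Source: "/Users/me/src"}))
}
```

A bind mount here would trigger only the first plugin; a tmpfs mount (once supported) would trigger only the second, so un-interposed mounts cost no plugin round trips.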

This functionality is necessary to fix "bind mount inotify events not delivered after container restart" and "fs events not working for services", as well as "Cannot add, remove, add overlapping directory". It will remove one of the use cases of the current Docker API proxy in Docker for Mac and make componentization of that product more tractable.

Finally, by enabling container file system virtualization, mount point plugins will enable a number of interesting use cases like:

I've drafted an (unpublished) patchset implementing this functionality as a way to explore the design space and familiarize myself with the relevant Docker subsystems. I'm interested in your thoughts.

/cc @yallop @cpuguy83 @dnephin

cpuguy83 commented 7 years ago

to confirm or deny mounts

What is a denial here?

Initially, only bind mounts and volume mounts will be supported (i.e. not tmpfs, network mounts, secrets, or container layers)

Does this mean for volumes we'd only pass through volumes that are part of the local volume driver? What if the local driver has mounted a FS at the user's request?

Mount point plugins will register filters on initialization so that only applicable plugins are consulted for any given mount

Can you explain this? I'm kind of thinking this would be like a middleware chain, but not sure I understand this part.

dsheets commented 7 years ago

to confirm or deny mounts

What is a denial here?

"Confirm" and "deny" are probably not the best words to describe this capability. A mount can succeed or fail in the regular ways as well as via a mount point plugin. A mount point plugin can fail a mount, which simply stops the container from starting (unwinding any provisioned resources) and returns an informative error message. This is a similar type of error to attempting to mount a path under a file (ENOTDIR) and can be used, for example, to refuse to create a path or to indicate a system error. In Docker for Mac, bind mounts can fail if they would create new directories in the Linux VM, as data saved to these locations may be ephemeral and the user may have mistyped a macOS path that was then interpreted as a VM path.
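A "denial" could then be sketched as a plugin check that returns an error, aborting container start with an informative message. The `checkBindSource` helper and the shared-path heuristic below are purely illustrative assumptions, not Docker for Mac's actual logic:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// checkBindSource is a hypothetical attachment check: returning a non-nil
// error "fails" the mount, which stops the container from starting and
// surfaces the message to the user.
func checkBindSource(source string) error {
	// Illustrative heuristic: refuse sources that are not under a shared
	// host directory, since such paths would create new (possibly
	// ephemeral) directories inside the Linux VM.
	if !strings.HasPrefix(source, "/Users/") && !strings.HasPrefix(source, "/Volumes/") {
		return errors.New("mount denied: " + source + " is not a shared host path")
	}
	return nil
}

func main() {
	fmt.Println(checkBindSource("/Users/me/project")) // nil: mount proceeds
	fmt.Println(checkBindSource("/var/data"))         // error: container start aborted
}
```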

Initially, only bind mounts and volume mounts will be supported (i.e. not tmpfs, network mounts, secrets, or container layers)

Does this mean for volumes we'd only pass through volumes that are part of the local volume driver? What if the local driver has mounted a FS at the user's request?

In the design I'm currently toying with, all bind and volume mounts would be interposable (including the local driver). The filter system described below can be used to reduce unnecessary plugin consultation.

Mount point plugins will register filters on initialization so that only applicable plugins are consulted for any given mount

Can you explain this? I'm kind of thinking this would be like a middleware chain, but not sure I understand this part.

It would be like a middleware chain but with per-mount filters.

My current thinking is:

  1. Mount point plugins register the mount point types (bind, volume, tmpfs, network, etc.) they may want to interpose on. Plugins cannot register to interpose on all types (new types may be added that should not be interposed).

  2. Mount point plugins register volume drivers (including local) they want to maybe interpose on (they can also get requests for all volume mounts). This filter is only applicable when the volume mount point type has been selected above.

  3. Mount point plugins register local volume types (e.g. nfs, none, vfat) they want to maybe interpose on (they can also get requests for all local volume types). This filter is only applicable when the local volume driver type has been selected above.

  4. For each plugin in the chain, all mount points are filtered against the plugin's interests. If the resulting subset of container mount points is not empty, the subset is sent to the plugin as an attachment request (including a unique ID identifying the mount point set).

  5. In response to an attachment request, a mount point plugin can respond indicating whether it would like to participate in the mount point and what the new mount point path is (if any).

  6. Each mount is then annotated with the stack of plugins which are participating in the mount point. This annotation includes a mount point plugin clock value which is monotonically increasing.

  7. When it is time to tear down the mount points, the plugins participating in the mount points are sent detachment requests containing the mount point set ID (= container ID?), in reverse mount point clock order, for all mounts being torn down.

As an example, osxfs in Docker for Mac would register filters for bind mounts and volume mounts but only local volume mounts and only of mount type none (for local bind mounts). Other plugins could then be loaded which interpose on a different subset of mount points and only the mount point plugins that are applicable for any given mount point and container would be consulted.
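Steps 4 through 7 above could be sketched as follows. The `mountState` type, its clock field, and the method names are hypothetical illustrations of the "annotation stack plus monotonic clock" idea, not an existing implementation:

```go
package main

import "fmt"

// attachment records one plugin's participation in a mount point,
// stamped with a monotonically increasing clock value.
type attachment struct {
	plugin string
	clock  int
}

// mountState annotates a mount with the stack of participating plugins.
type mountState struct {
	stack []attachment
	clock int
}

// attach records a plugin's positive attachment response. A plugin may
// also rewrite the mount's source path at this point (ignored here).
func (m *mountState) attach(plugin, newPath string) {
	m.clock++
	m.stack = append(m.stack, attachment{plugin: plugin, clock: m.clock})
	_ = newPath
}

// detachOrder returns plugin names in reverse clock order, matching the
// proposal's requirement that detachment requests be issued in reverse.
func (m *mountState) detachOrder() []string {
	var out []string
	for i := len(m.stack) - 1; i >= 0; i-- {
		out = append(out, m.stack[i].plugin)
	}
	return out
}

func main() {
	var m mountState
	m.attach("osxfs", "/host/Users/me/src")
	m.attach("audit-plugin", "")
	// Teardown consults plugins in the reverse of attachment order.
	fmt.Println(m.detachOrder())
}
```

Reverse-order detachment mirrors how nested middleware unwinds: the last plugin to observe the mount is the first asked to release it.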

thaJeztah commented 7 years ago

In Docker for Mac, bind mounts can fail if they would create new directories in the Linux VM, as data saved to these locations may be ephemeral and the user may have mistyped a macOS path that was then interpreted as a VM path.

FWIW we changed this behaviour at some point, but had to revert because it was too much of a breaking change 😞

dsheets commented 7 years ago

#33375 contains an implementation for review.