vishvananda commented 7 years ago

Volume mountpoints in a container should have data explaining how they are used to help runtimes decide what should be done with them. The first and most obvious is a "readonly" boolean. There are a couple other flags which also could be useful.

readonly:
  true if the container only needs to read from the mountpoint (generally implemented by mounting a readonly volume)
  example: directory for process config files
temporary:
  true if the volume should be cleared between runs of the container process (generally implemented by mounting a tmpfs)
  example: directory for pid files or secrets that only exist for the current run of the process
data:
  true if the image has data in this directory that can be used to initialize the volume (generally implemented by copying data out of the container when creating the volume)
  example: directory with an initial database

The purpose of these flags might be illustrated with an example. Lets say I'm making a mysql container. I might have the following mountpoints defined:

"volumes": [
  # config files
  {"name": "config",
   "path": "/etc/mysql",
   "readonly": true, # config should be modified from the outside
   "data": true}, # because the default config lives here
  # pid files
  {"name": "run",
   "path": "/var/run/mysql",
   "temporary": true}, # for the pid file
  # uds socket
  {"name": "tmp",
   "path": "/tmp",
   "temporary": true}, # for the mysql socket file 
  # db files
  {"name": "db",
   "path": "/var/lib/mysql",
   "data": true}, # because a default db exists here
  # log files
  {"name": "log",
   "path": "/var/log/mysql"},
]

Clearly some of these could be consolidated using symlinks (tmp and run could share the same dir, as well as db and logs), but I wanted to give a verbose description.

wking commented 7 years ago

On Thu, Nov 03, 2016 at 02:18:21PM -0700, Vish Ishaya wrote:

readonly: true if the container only needs to read from the mountpoint (generally implemented by mounting a readonly volume) example: directory for process config files temporary: true if the volume should be cleared between runs of the container process (generally implemented by mounting a tmpfs) example: directory for pid files or secrets that only exist for the current run of the process data: true if the image has data in this directory that can be used to initialize the volume (generally implemented by copying data out of the container when creating the volume) example: directory with an initial database

I'd like to support all of these by distributing opaque directories with runtime-spec config.json inside (discussion in #87). Then you'd:

$ oci-unpack some-image.tar some-image $ cd some-image $ tree . ├── config.json ├── data │ └── seed.json ├── root └── rootfs

3 directories, 1 file $ cat config.json { … "mounts": [ { "type": "tmpfs", "source": "tmpfs", "destination": "/tmp", "options": ["nosuid", "strictatime", "mode=755", "size=65536k"] }, { "type": "bind", "source": "root" "destination": "/root", "options": ["rbind", "ro"] }, { "type": "bind", "source": "data" "destination": "/data", "options": ["rbind", "rw"] } ] }

Of course, my proposed sanitizer would clobber some of those ‘source’ values 1, but you could use your own sanitizer if you trusted the bundle already (e.g. because it was signed by someone you trust, #22,

176, #400).

stevvooe commented 7 years ago

These seems more like a runtime concern and there are good reasons it is the way it is today.

The Volumes field is really only used to say "this part of the filesystem is not part of the container". How that is implemented and the permissions and behavior associated with it are specified by mounts. Even in the docker API, mostly named volume mounts don't use the Volumes field on the container config, they use Binds (and now Mounts).

Volumes really only specifies what could be considered "ephemeral" locations. There are implications to how mounts and volume declarations would interact that may result in odd behavior.

Allowing the image to name volumes will lead to security problems in systems that have a single volume namespace or overlap in volume names within a namespace (imagine a malicious container that wants to mount the log volume and read your logs).

The other issue with such a design is that booleans without considering mutual exclusion create a number of combinations that don't make a lot of sense. As you add more booleans, the space of invalid option combinations increases. For example, what is the use of a temporary, readonly volume with data=false? What if the temporary space is required but a tmpfs won't provide enough storage (these are really "ephemeral", rather than tmpfs)?

More than just booleans, such a design would probably also call for parameterization. How would these parameters interact with mount specifications?

Given this, we may want to be more explicit about how directory permissions are maintained between unpacking the image and mounting the volume over a directory location.

NB. The term data used in the example has been historically called copy, which instructs the runtime to copy data from the image into the freshly minted volume.

philips commented 7 years ago

I agree, I don't think this is a good feature. And even if it might be we need more people asking for it.

philips commented 7 years ago

I am moving this to post v1.0

vishvananda commented 7 years ago

Volumes really only specifies what could be considered "ephemeral" locations. There are implications to how mounts and volume declarations would interact that may result in odd behavior.

Allowing the image to name volumes will lead to security problems in systems that have a single volume namespace or overlap in volume names within a namespace (imagine a malicious container that wants to mount the log volume and read your logs).

I'm not sure I follow your logic here. I'm not suggesting the image spec defines anything about what the runtime names the volumes. My goal here is for the builder of the image to specify locations in the image filesystem that represent a interfaces (either input-->config or output-->logs). The current definition is not rich enough. Either we need to put enough information into the image-spec so that a runtime could determine how to set up mounts properly, or (as @wking suggests) we allow a runtime spec to be shipped.

I personally like the separation between Binds defined by the image-spec and Mounts defined by the runtime-spec, but the Bind needs some hints to tell the runtime how the mount is going to be used. For example if the bind has configuration data, then the runtime likely wants to copy the data out into the volume it creates, and the mount should be read-only.

The other issue with such a design is that booleans without considering mutual exclusion create a number of combinations that don't make a lot of sense. As you add more booleans, the space of invalid option combinations increases. For example, what is the use of a temporary, readonly volume with data=false? What if the temporary space is required but a tmpfs won't provide enough storage (these are really "ephemeral", rather than tmpfs)?

I agree that a list of booleans might not be the best description, maybe we define a simple enum that describes how the volume should be used.

BindType {
  Config --> runtime could implement with copy out and read-only mount
  Tmp --> runtime could implement with tmpfs
  Data --> runtime could implement with a r/w volume
}

If we don't think that we can agree on a common set of bind types, then it probably would be enough to allow some arbitrary metadata in the Bind (or as it is currently written: Volume) definition.

opencontainers / image-spec

Volume mountpoints should have additional flags #443

176, #400).