@amitshukla In the POC, we separated the concepts of volumes and mounts (the original idea was [here](https://github.com/docker/cucaracha-design/issues/1#issuecomment-155652699) and we played with it [here](https://github.com/docker/swarm-v2-poc/blob/master/bundle/config.go#L104)).
Effectively, you end up with volume declarations and mounts that reference volumes:
services:
  redis:
    container:
      image: redis
      volumes:
        redis-data:
          mounts:
            - /foo/bar:/var/lib/redis

volumes:
  redis-data:
    driver: glusterfs
We depart from the `<volume>:<mount>` syntax for mount declarations. There is a lot of configuration that can go into a mount, especially in a cluster, and we don't want to be embedding that in microformats. The current syntax lends itself to mounting from one mount namespace to another, but the source of the mount may not even have a namespace in a cluster model. We would continue to support `<volume>:<mount>` for simple cases, but we need to expand this out when there is more specific configuration:
volumes:
  redis-data:
    mounts:
      - redis-data:
          source: /
          path: /var/lib/redis
          user: stevvooe
          group: redis
          mask: 750
We also need to consider that most service definition files won't actually declare their volumes, but rather reference them. Take the case of an application that may run in development or production. In development, I might create a local volume to store the database. When moving to production, the database may instead live on a volume managed by another team.
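For example (a sketch only; the `external` marker and names here are hypothetical, just to show referencing a volume rather than declaring it):

```yaml
# development: the service declares and uses a local volume
volumes:
  app-db:
    driver: local

# production: the same name only points at a volume managed elsewhere
volumes:
  app-db:
    external: true   # hypothetical: "defined by another team, do not create"
```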
Another thing to consider when declaring a volume for a cluster is visibility. Put differently, which of the nodes a job can be routed to will be able to mount the volume? For a local volume, this may be only a single node. For a storage cluster, the volume may be visible to all nodes. I think we can repurpose node attribute filters to support this:
volumes:
  redis-data:
    driver: glusterfs
    visible:
      network: gluster
      labels:
        - gluster
The above needs work, since much of this should simply belong to the driver. However, there should be enough to inform the scheduler of where the volume is accessible.
And yet another concept that I think will completely knock out configuration management is the read-only volume:
services:
  registry:
    container:
      image: registry:2.3
      volumes:
        myapp-conf:
          mounts:
            - config.yml:/etc/docker/registry/config.yml
            - path: ca.pem:/certs/ca.pem
              mask: 600
            - path: token-ca.pem:/certs/token.pem
              mask: 600
        myapp-secrets:
          mounts:
            - key.pem:/certs/key.pem

volumes:
  myapp-conf:
    driver: tar
    url: http://whereever.example.com/config.tar.gz
    digest: sha256:.... # verify it!
  myapp-secrets:
    driver: secrets
We demonstrate two ideas above. The first is downloading a tar from a URL and mounting it. We can verify the content with a hash, so it can be literally anywhere (registry, anyone?). We replace the configuration files with content from within the tar. How we distribute the content in the cluster is left to the cluster, but it may fetch it directly or distribute it internally.
The other concept is key distribution. Not too much to say here other than that it is secret!
I don't like the following, but there are some problems with embedded volumes that we have to solve:
services:
  redis:
    image: redis
    volumeMounts:
      - path: /var/lib/redis
        type: redisdata
      - path: /home
        type: tmp
      - path: /cgroups
        type: host-bind
        options:
          from: /cgroups

volumes:
  redisdata:
    driver: glusterfs
    opts:
      glusterserver: gluster.company.com
This is the same problem we have in networking. /cc @mrjana
Proposal updated to account for feedback
@amitshukla First reactions on naming:

- `targetPath` can be `target` and `sourcePath` can be `source`. These fields are of type path, so I think adding the Hungarian notation hurts clarity.
- `perTaskFolder` is odd. I know this is an active area of discussion and I don't have a better suggestion, other than maybe `arity`. We could actually move this to the volume declaration and then let the actual volume control the mounts (n==0 - unlimited, n>0, mounts n times); see the sketch after the example below.
- `opts` -> `options`. Abrevvs hlp dnt hep rdablty.

For behavior, we need to reference the volume name:
mount:
  name: host      # logical name, points to volume below
  source: /
  target: /data   # path mounted - MyClusterVolume:/glustervol/ to /data/
  mask: rw        # ro or rw
  # ...

volumes:
  name: host
  driver: host
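A rough sketch of the arity-on-the-volume idea from the list above (the `arity` field is purely a placeholder name, not settled syntax):

```yaml
volumes:
  redis-data:
    driver: glusterfs
    arity: 1   # placeholder: 0 = unlimited mounts, n > 0 = mounted at most n times
```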
Remote and local volumes should behave identically, except that you shouldn't have to declare a host volume unless it is homed to a specific node. Put differently, there can be an implicit volume named `host`. I have defined it in the example above, but that definition is not required.
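In other words, a mount should be able to reference `host` with no corresponding declaration at all, along these lines (a sketch of the example above with the `volumes:` section dropped):

```yaml
mount:
  name: host   # resolves to the implicit host volume; no volumes: entry required
  source: /
  target: /data
  mask: rw
```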
@anusha-ragunathan brought up the concept of a volume controller. Effectively, there are two elements to an operating volume plugin system. The first is the ability to deploy the plugins across cluster machines. This can be done with a `GlobalJob` that ensures plugins are running. The second would be some sort of `ServiceJob` running the volume controller for the cluster. This may be managed by swarm or by an external service.
I'll let @anusha-ragunathan expand on anything I may have missed.
@stevvooe: You got it almost right. Along with the plugins, there are agents that need to be deployed on each node.
Let's take the example of Flocker (our most popular volume plugin), which has several components:
> `perTaskFolder` is odd. I know this is an active area of discussion and I don't have a better suggestion, other than maybe `arity`. We could actually move this to the volume declaration and then let the actual volume control the mounts (n==0 - unlimited, n>0, mounts n times).
Agree with the oddness - and I think we shouldn't do it; it's up to the driver to decide if it wants a per-mount folder.
What I think we need is some sort of volume template in the task spec in order to create volumes on the fly when tasks are allocated (e.g. every mongo gets its own EBS volume).
However, I don't even know if the volumes API supports actual provisioning.
@icecrime?
CC: @bfirsh
@aluzzardi what do you think of @stevvooe's proposal to have even bind-mounted volumes defined cluster-wide?
@anusha-ragunathan I don't like the current Flocker model - they are working around the fact that Engine is not cluster-aware and thus wrapping it. A better model for them would be to plug into Swarm and accomplish the same goals, but without wrapping Docker.
@aluzzardi `perTaskFolder` is intended to solve the case of a sticky volume backed by a remote volume, where each task gets a new sub-directory. We can omit support for this case and leave it up to the user to implement.
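One way to picture that case (a sketch only; `${task.id}` is a hypothetical placeholder, in the spirit of the `${instance.id}` parameterization that comes up later in this thread):

```yaml
mount:
  name: shared-remote     # one remote volume shared by all tasks of the service
  source: /${task.id}     # hypothetical: each task mounts its own sub-directory
  target: /var/lib/app
```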
@anusha-ragunathan @amitshukla I think we should consider the whole thing a plugin, and the plugin has two components: the central controller running only in managers (as a plugin to the manager) and the agent running as a plugin to the engine. This is exactly what we are doing for networking plugins. This way there is no wrapping and we control the platform/framework.
@amitshukla: I used Flocker as an example to illustrate that there is a controller (along with agents and plugins). How swarm v2 will handle the wrapping is not the point here.
@mrjana: Agreed. Except that today, there are agents that run separately from the actual Docker plugin (at least in the Flocker case).
> Agreed. Except that today, there are agents that run separately from the actual Docker plugin (at least in the Flocker case).
Possible, and we don't mandate anything here. From our POV, there is a controller/manager plugin and an agent plugin, however the vendor chooses to architect their solution.
Yeah, I think we're good on that.
Something like flocker could potentially:
@aluzzardi That sounds like a reasonable plan, assuming these details would then be injected into the driver via options.
@amitshukla I don't think we should make a distinction between volumes based on their locality properties. The chief reasoning here is that we can't really make assumptions about locality of a volume.
In a discussion yesterday, we developed the concept of implicit volumes in the context of host-homed volumes. These are volumes that may be host bound but only require a volume definition. Effectively, there are three classes:
- `ephemeral`: Volumes are bound to a specific host and created as part of the task. This may be implemented on a host disk or as a ramfs. When the task is finalized, the volume is removed.
- `local`: A local volume is part of a particular host but may be created if not present on the task's assigned host. Repeatability of running a task can be provided by requiring a particular task to be created on hosts with the target volume (via label selectors).
- `bind`: A bind volume allows one to bind host paths into the container's filesystem. This provides support for plugins or other processes that require privileged access to host-homed resources.

When we say these are implicit volumes, we mean the following mount configuration is sufficient:
mount:
  name: postgres-data
  type: local
  source: /   # "/" is relative to volume root, and implied
  target: /var/lib/postgres
When assigned to a node, the volume would be created on the node, if it does not exist, and will be reported to the cluster. The name must be unique in the namespace. If the volume is already created, the task will be routed to that node.
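For comparison, the `bind` class from the list above could look something like this (a sketch reusing the /cgroups case from the earlier embedded-volume example; the field names are illustrative):

```yaml
mount:
  name: cgroups      # illustrative name
  type: bind
  source: /cgroups   # host path
  target: /cgroups   # path inside the container
```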
What if we want to have several tasks mount multiple copies of the same local volume? Glad you asked!
Let's say we have QFS chunkservers, where we have a bunch of instances with persistent data, but only care that each volume instance has some task. We can parameterize the volume name based on the instance id:
mount:
  name: chunkserver-${instance.id}-data
  type: local
  target: /var/lib/qfsm
Now, we have a simple solution to a very complex problem without the involvement of an external volume plugin. One can provision new chunk servers by adding instances and decommission them by removing the volumes.
To make this proposal sound, I suggest we come up with a Flocker real-world example.
services:
  flocker-control:
    container:
      image: flocker/control
    instances: 3
  flocker-agent:
    container:
      image: flocker/agent
    deployment:
      global: true
  flocker-plugin:
    plugin:
      image: flocker/plugin
    deployment:
      global: true
services:
  mongodb:
    container:
      image: mongodb
      volumeMounts:
        data:
          target: /var/lib/mongodb
          template:
            driver: flocker
            options:
              size: 10GB
              profile: gold
services:
  noidea:
    container:
      volumeMounts:
        data:
          target: /var/lib/mongodb
          type: foo

volumes:
  foo:
    driver: flocker
    options:
      size: 10GB
Please don't bikeshed on the syntax; this is not a proposal (I literally spent 2 minutes writing the examples and put no thought whatsoever into them). I'm just describing what a proposal might look like.
The Flocker website has many examples of combining that with Docker and Swarm - we should use those and convert them to our syntax.
@aluzzardi SGTM
I think my proposal in https://github.com/docker/swarm-v2/issues/187#issuecomment-204088053 is mostly compatible.
Considering this done
Assumption:
Cluster commands
swarmctl volume create
swarmctl volume rm
swarmctl volume ls
swarmctl volume inspect
We may support `swarmctl update` to rename a cluster wide volume.

Cases to handle
- Local Disk usage (does not need a cluster wide resource)
  - ScratchDir
  - LocalDir (will not be defined at the cluster level; scheduling will fail if the Dir does not exist)
- Remote volumes: need a cluster wide volume to be defined. Can use AWS etc.
  - `swarmctl volume <name> --operation {clone, snapshot, …} --opts "k-v pairs"`
- Elastic volumes (automatically request new volumes as new tasks get spun up)
YAML
ScratchDir {tmp}
No cluster wide volume needs to be created. Note: the volume is scoped to a single Task, and thus it need not be named. Issue: the Agent will have to create and delete Task-local volumes.
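A minimal sketch of what the ScratchDir case might look like in the service YAML (field names are illustrative, not settled syntax):

```yaml
services:
  worker:
    container:
      image: worker
      volumeMounts:
        - target: /scratch
          type: tmp   # created with the task, removed when the task is finalized
```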
LocalDir
No cluster wide volume needs to be created. Note: the volume is scoped to a single Task, and thus it need not be named. Issue: the Agent will have to create and delete Task-local volumes (mapping to the actual cadvisor `docker run` command).

Remote storage volume
Cluster wide volume equivalent to the below will be created:
swarmctl volume create RedisData --driver glusterfs --opts "k1=v1" --opts "k2=v2"
Workflow
At the Manager, volume mount information is stored with each Task. When a task is scheduled on a Node, this volume mounting information is communicated to the Node.
Local Disk Usage
Remote volume
When the task is rescheduled on a different node, steps 3 through 5 are repeated. CC: @aluzzardi @stevvooe @vieux