amitshukla commented 8 years ago

Assumption:

Leverage existing Engine Volume plugin infra and plugins
For the initial implementation, the volume plugin(s) will be required to be preinstalled on all nodes in the cluster.
In the longer term, we can store the node --> "volume plugins installed" information in Raft, and use it during scheduling decisions

Cluster commands

swarmctl volume create
- Create a new volume at the cluster wide volume
- Options
- Name - unique in the cluster [required]
- Volume driver [required]
- Driver options [optional]
- Implementation
- Insert into Raftstore
- Insert into memory store
swarmctl volume rm
- Remove a cluster level volume
- Options
- Name [required]
- Implementation
- Check if the volume is being used by any Tasks; return error if it is.
- Delete from Raftstore
- Delete from memory store
swarmctl volume ls
- List volumes defined at the cluster level
- Implementation
- Query the in-memory store and return the formatted information
swarmctl volume inspect
- Get information about a particular volume
- Input: Name or ID [required]
- Implementation
- Query the in-memory store and return the formatted information

We may support swarmctl update to rename a cluster wide volume.

Cases to handle

Ephemeral vol --> 1 task {ScratchDir}
Bind mount --> 1 task {LocalDir}
StickyVol --> task {Remote storage volume}
- 1 volume --> 1 task
- 1 volume --> N tasks

Local Disk usage (does not need a cluster wide resource)

ScratchDir
LocalDir -- will not be defined at the cluster level; scheduling will fail if Dir does not exist)

Remote volumes: need a cluster wide volume to be defined. Can use AWS etc

Support determined by drivers on Engine
Will support a swarmctl volume <name> --operation {clone, snapshot, …} --opts "k-v pairs"

Elastic volumes (automatically request new volumes as new tasks get spun up)

TODO
YAML

ScratchDir {tmp}

No cluster wide volume needs to be created. Note: the volume is scoped to a single Task, and thus it need note be named. Issue: the Agent will have to create and delete Task local volumes

services:
  redis:
    image: redis
    mount:
      - targetPath: /scratch          # Temp directory mounted to /scratch/
        type: tmp                          # tmp - cleared on task start/restart; always mounted rw

LocalDir

No cluster wide volume needs to be created. Note: the volume is scoped to a single Task, and thus it need note be named. Issue: the Agent will have to create and delete Task local volumes

(mapping to actual cadvisor docker run command)

sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

services:
  cadvisor:
    image: cadvisor
    mount:
      - targetPath: /rootfs    # path in container
        mask: ro               # ro or rw
        type: hostDir          # bind mount
        sourcePath: /          # path on host
      - targetPath: /var/run
        mask: rw
        type: hostDir
        sourcePath: /var/run
      - targetPath: /sys       
        mask: ro
        type: hostDir
        sourcePath: /sys
      - targetPath: /var/lib/docker
        mask: ro
        type: hostDir
        sourcePath: /var/lib/docker
# port mapping not shown

Remote storage volume

Cluster wide volume equivalent to the below will be created: swarmctl volume create RedisData --driver glusterfs --opts "k1=v1" --opts "k2=v2"

services:
  redis:
    image: redis
    mount:
      name: RedisData          # logical name, points to volume below
      targetPath: /data        # path mounted - MyClusterVolume:/glustervol/ to /data/
      mask: rw                 # ro or rw
      perTaskFolder: 1         # 0/1 - 0 => all tasks share same volume folder
                               #       1 => each task gets a private folder inside volume (taskid(?) used to create a folder)
      sourcePath: /            # path in the glusterfs volume

volume:
  name: RedisData              # Cluster volume name
  driver: glusterfs
    opts:
      k1: v1
      k2: v2

Workflow

At the Manager, volume mount information is stored with each Task. When a task is scheduled on a Node, this volume mounting information is communicated to the Node.

Local Disk Usage

User does not need to create a cluster wide volume
User runs a Job that references a local disk volume
Node uses the volume mounting information to mount the local host volume or to create a scratch dir
Remote volume
User creates a cluster wide volume
- For Dockercon, all nodes joining the cluster will need to contain all volume plugins (this will need to be fixed quickly afterwards)
- This volume information will be stored at the Manager
User runs a Job that references the cluster wide volume
When Task is assigned to a Node
Task information (containing volume mount info) is fetched by the Agent
Agent configures the Engine to mount the remote volume when running the container
- If perTask == 1, then the Agent creates a separate folder inside the volume for the task (perhaps named TaskID so it is portable across machines?)

When the task is rescheduled on a different node, step 3 through step 5 are repeated. CC: @aluzzardi @stevvooe @vieux

stevvooe commented 8 years ago

@amitshukla In the POC, we separated the concepts of volumes and mounts (the original idea was [here](https://github.com/docker/cucaracha-design/issues/1#issuecomment-155652699 and we played with it [here]%28https://github.com/docker/swarm-v2-poc/blob/master/bundle/config.go#L104%29).

Effectively, you end up with volume declaring and mounts that reference volumes:

services:
  redis:
    container:
      image: redis
      volumes:
        redis-data:
          mounts:
            - /foo/bar:/var/lib/redis
volumes:
  redis-data:
    driver: glusterfs

We depart from the : syntax for mount declarations. There is a lot of configuration that can go into a mount, especially when in a cluster and we don't want to be embedding that in microformats. The current syntax lends itself when mounting from one mount namespace to another, but the source of the mount may not even have a namespace in a cluster model. We would continue to support something where <volume>:<mount> for simple cases, but we need to expand this out when there is more specific configuration:

volumes:
  redis-data:
    mounts:
      - redis-data
        source: /
        path: /var/lib/redis
        user: stevvooe
        group: redis
        mask: 750

We also need to consider that most service definition files won't actually declare their volumes, but rather reference them. Let's take the case of an application that may run in development or production. In development, I might create a volume to store the database that is just local. When moving to production, it may be a volume managed by another team that is used for the database.

Another thing to be considered when declaring a volume for a cluster is visibility. Put differently, which nodes can a job be routed to will be able to mount the volume? For a local volume, this may only be a single node. For a storage cluster, we they may be visible to all nodes. I think we can repurpose node attribute filters to support this:

volumes:
  redis-data:
    driver: glusterfs
    visible:
      network: gluster
      labels:
        - gluster

The above needs work, since much of this should simply belong to the driver. However, there should be enough to inform the scheduler of where the volume is accessible.

And yet another concept that I think will completely knockout configuration management is the concept of a read-only volume:

services:
  registry:
    container:
      image: registry:2.3
      volumes:
        myapp-conf:
          mounts:
            - config.yml:/etc/docker/registry/config.yml
            - path: ca.pem:/certs/ca.pem
              mask: 600
            - path: token-ca.pem:/certs/token.pem
              mask: 600
        myapp-secrets:
          mounts:
            - key.pem:/certs/key.pem
volumes:
  myapp-conf:
    driver: tar
    url: http://whereever.example.com/config.tar.gz
    digest: sha256:.... # verify it!
  myapp-secrets:
    driver: secrets

We demonstrate two ideas above. The first is downloading a tar from a url and mounting it. We can verify the content with a hash, so it can be literally anywhere (registry, anyone?). We replace the configuration files with content from within the tar. How we distribute the content in the cluster is left to the cluster, but it may fetch it directly or distribute it internally.

The other concept is key distribution. Not too much to say here other than that it is secret!

aluzzardi commented 8 years ago

I don't like the following, but there are some problems of embedded volumes we have to solve:

services:
  redis:
    image: redis
    volumeMounts:
      - path: /var/lib/redis
        type: redisdata
      - path: /home
        type: tmp
      - path: /cgroups
        type: host-bind
        options:
          from: /cgroups

volumes:
  redisdata:
    driver: glusterfs
    opts:
      glusterserver: gluster.company.com

This is the same problem we have in networking. /cc @mrjana

amitshukla commented 8 years ago

Proposal updated to account for feedback

stevvooe commented 8 years ago

@amitshukla First reactions on naming:

Be very careful with the camelcase. I think targetPath can be target and sourcePath can be source. These fields are of type path, so I think adding the Hungarian notation hurts clarity.
perTaskFolder is odd. I know this is an active area of discussion and I don't have a better suggestion, other than maybe arity. We could actually move this to the volume declaration and then let the actual volume control the mounts (n==0 - unlimited, n>0, mounts n times).
opts - options. Abrevvs hlp dnt hep rdablty.

For behavior, we need to reference the volume name:

 mount:
      name: host          # logical name, points to volume below
      source: /
      target: /data        # path mounted - MyClusterVolume:/glustervol/ to /data/
      mask: rw                 # ro or rw
#...
volumes:
  name: host
  driver: host

Remote and local volumes should behave identically, except that you should have to declare a host volume unless it is home to a specific node. Put differently, there can be an implicit volume that is named host. I have defined it in the example above, but that definition is not required.

stevvooe commented 8 years ago

@anusha-ragunathan brought up the concept of a volume controller. Effectively, there are two elements to an operating volume plugin system. The first is the ability to deploy the plugins across cluster machines. This can be done with a GlobalJob that ensures plugins are running. The second would be some sort of ServiceJob running the volume controller for the cluster. This may be managed by swarm or external service.

I'll let @anusha-ragunathan expand on anything I may have missed.

anusha-ragunathan commented 8 years ago

@stevvooe : You got it almost right. Along with the plugins, there are agents that need to be deployed on each node.

Lets take the example of Flocker (our most popular volume plugin) which has several components:

The volume plugin which integrates with Docker-engine.
The Flocker workhorse agents (flocker-container-agent, flocker-dataset-agent) that run on each node of the cluster, alongside docker-engine.
The Flocker control service which is the brain of the Flocker cluster and that's typically installed on one node (and HA protected). This service can run standalone or integrate with a cluster scheduler like Swarm.

aluzzardi commented 8 years ago

perTaskFolder is odd. I know this is an active area of discussion and I don't have a better suggestion, other than maybe arity. We could actually move this to the volume declaration and then let the actual volume control the mounts (n==0 - unlimited, n>0, mounts n times).

Agree with the oddness - and I think we shouldn't do it, that's up to the driver to decide if it wants to do a per-mount-folder.

What I think we need is some sort of volume template in the task spec in order to create volumes on the fly when tasks are allocated (e.g. Every mongo gets its own ebs).

However, I don't even know if the volumes API supports actual provisioning.

@icecrime?

amitshukla commented 8 years ago

CC: @bfirsh

amitshukla commented 8 years ago

@aluzzardi what do you think of @stevvooe 's proposal to have even bind mounted volumes defined cluster wide?

@anusha-ragunathan I don't like the current Flocker model - they are working around the fact that Engine is not cluster aware and thus wrapping it. A better model for them will be to plug into Swarm and accomplish the same goals, but without wrapping Docker.

@aluzzardi perTaskFolderis intended to solve the Sticky volume with remote volume; but each task gets a new sub-directory. We can omit support for this case and leave it up to the user to implement.

mrjana commented 8 years ago

@anusha-ragunathan @amitshukla I think we should consider the whole thing a plugin and the plugin has two components: the central controller running only in managers(as a plugin to the manager) and the agent running as a plugin to the engine. This is exactly what we are doing for networking plugins. This way there is no wrapping around and we control the platform/framework

anusha-ragunathan commented 8 years ago

@amitshukla : I used Flocker as an example to illustrate that there is a controller (along with agents and plugins). How swarm v2 will handle the wrapping is not the point here.

@mrjana : Agreed. Except that today, there are agents that run separate from the actual Docker plugin (atleast in the Flocker case)

mrjana commented 8 years ago

Agreed. Except that today, there are agents that run separate from the actual Docker plugin (atleast in the Flocker case)

Possible and we don't mandate anything here. From our pov, there is a controller/manager plugin and an agent plugin, however the vendor may want to architect their solution.

aluzzardi commented 8 years ago

Yeah, I think we're good on that.

Something like flocker could potentially:

Install a "plugin" everywhere (using a GlobalJob)
Install an "agent" everywhere (using a GlobalJob)
Install a controller (using a ServiceJob). They get to decide how to scale/update it, like any other service

stevvooe commented 8 years ago

@aluzzardi That sounds like a reasonable plan. Assuming these details would then be injected to the driver via options.

@amitshukla I don't think we should make a distinction between volumes based on their locality properties. The chief reasoning here is that we can't really make assumptions about locality of a volume.

In a discussion yesterday, we developed the concept of implicit volumes in the context of host-homed volumes. These are volumes that may be host bound but only require a volume definition. Effectively, there are three classes:

ephemeral: Volumes are bound to a specific host and created as part of the task. This may be implemented on a host disk or as a ramfs. When the task is finalized, the volume is removed.
local: A local volume is part of a particular host but may be created if not present on the tasks assigned host. Repeatability of running a task can be provided by requiring a particular task to be created on hosts with the target volume (via label selectors).
bind: A bind volume allows one to bind host paths into the container's filesystem. This provides support for plugins or other processes that require privileged access to host-homed resources.

When we say these are implicit volumes, we mean the following mount configuration is sufficient:

mount:
  name: postgres-data
  type: local
  source: / # "/" is relative to volume root, and implied
  target: /var/lib/postgres

When assigned to a node, the volume would be created on the node, if it does not exist, and will be reported to the cluster. The name must be unique in the namespace. If the volume is already created, the task will be routed to that node.

What if we want to have several tasks mount multiple copies of the same local volume? Glad you asked!

Let's say we have QFS chunkservers, where we have a bunch instances with persistent data, but only care that each volume instance has some task. We can parameterize the volume name based on the instance id:

mount:
  name: chunkserver-${instance.id}-data
  type: local
  target: /var/lib/qfsm

Now, we have a simple solution to a very complex problem without the involvement of an external volume plugin. One can provision new chunk servers by adding instances and decommission them by removing the volumes.

aluzzardi commented 8 years ago

To make this proposal sound, I suggest we come up with a Flocker real-world example.

Define a .yml file to bootstrap flocker on the cluster. Pseudo-example:

services:
  flocker-control:
    container:
      image: flocker/control
    instances: 3
  flocker-agent:
    container:
      image: flocker/agent
    deployment:
      global: true
  flocker-plugin:
    plugin:
      image: flocker/plugin
    deployment:
      global: true

Define a .yml example for a 3 instances mongodb cluster where each receives its own volume. Pseudo:

services:
  mongodb:
    container:
      image: mongodb
      volumeMounts:
          data:
            target: /var/lib/mongodb
            template:
              driver: flocker
              options:
                size: 10GB
                profile: gold

Define a .yml where multiple instances get the same volume:

services:
  noidea:
    container:
      volumeMounts:
          data:
            target: /var/lib/mongodb
            type: foo

volumes:
  foo:
    driver: flocker
    options:
      size: 10GB

Please don't bike shed on the syntax, this is not a proposal (literally spent 2 minutes writing the examples and put no thoughts whatsoever on it), I'm just describing what a proposal might look like

The flocker website has many examples of combining that with docker and swarm - we should use those and convert them to our syntax.

stevvooe commented 8 years ago

@aluzzardi SGTM

I think the my proposal in https://github.com/docker/swarm-v2/issues/187#issuecomment-204088053 is mostly compatible.

aluzzardi commented 8 years ago

Considering this done

moby / swarmkit

Design Proposal: Volume support #187

Cases to handle

YAML

ScratchDir {tmp}

LocalDir

Remote storage volume

Workflow

Local Disk Usage

Remote volume