kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

Add support for ephemeral volumes #61

Closed: harche closed this issue 6 years ago

harche commented 6 years ago

Hi,

As of now, all volumes are created on the host and passed to the VM via 9pfs. But k8s allows you to create ephemeral volumes, and these volumes can be backed by a ramdisk. Ephemeral volumes, as the name indicates, live and die with the pod. There is no reason to use 9pfs for this type of volume.

Kata needs to support these volumes by creating a tmpfs-based volume inside the VM.

A possible approach that I can think of:

  1. Detect whether the volume being attached is backed by a ramdisk (tmpfs).
  2. When the VM boots, instruct init to create a tmpfs inside the VM (see the sketch below).
  3. Use the ramdisk created inside the VM in step 2 with the containers of the pod.
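
As a rough illustration of step 2, the guest-side mount that init (or the agent) would perform might look like the sketch below. This assumes golang.org/x/sys/unix; the mount point and tmpfs options are hypothetical placeholders, not the actual kata layout.

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical in-guest mount point for the ephemeral volume.
	mountPoint := "/run/ephemeral/cache-volume"
	if err := os.MkdirAll(mountPoint, 0755); err != nil {
		log.Fatal(err)
	}
	// Create a fresh tmpfs; a "size=" option could cap it so the pod
	// cannot consume unbounded guest memory.
	if err := unix.Mount("tmpfs", mountPoint, "tmpfs", 0, "size=64m,mode=0755"); err != nil {
		log.Fatal(err)
	}
	log.Println("tmpfs mounted at", mountPoint)
}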

Any thoughts?

bergwolf commented 6 years ago

The kata agent protocol already supports it. Please see https://github.com/kata-containers/agent/blob/master/protocols/grpc/agent.proto#L194

We can implement a new type of storage driver (e.g. tmpfs) that instructs the kata agent to set up a tmpfs mount point at pb.Storage.Mountpoint and reference it via the container OCI spec.

These need to be implemented in both the kata agent and runtime though.
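
For illustration, a minimal sketch of the runtime side, with the Storage fields mirrored in a local struct so the snippet stays self-contained; the "ephemeral" driver name and the guest path are hypothetical, not an existing part of the protocol:

package main

import "fmt"

// Storage mirrors the fields of the agent's pb.Storage message from
// agent.proto (driver, source, fstype, options, mount_point).
type Storage struct {
	Driver     string
	Source     string
	Fstype     string
	Options    []string
	MountPoint string
}

func main() {
	// The runtime would send something like this over gRPC so the
	// agent creates the tmpfs inside the guest instead of using 9pfs.
	ephemeral := Storage{
		Driver:     "ephemeral", // hypothetical name for the new driver
		Source:     "tmpfs",     // tmpfs has no backing device
		Fstype:     "tmpfs",
		Options:    []string{"rw", "relatime"},
		MountPoint: "/run/ephemeral/cache-volume", // hypothetical guest path
	}
	fmt.Printf("%+v\n", ephemeral)
}

The container's OCI spec would then reference that guest path as the source of a bind mount at the volume's destination.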

harche commented 6 years ago

@bergwolf Sounds good. I will try to submit a PR supporting ephemeral volumes.

egernst commented 6 years ago

@harche - checking to see if you've made any progress here.

harche commented 6 years ago

@egernst Sorry I was away for medical reasons. Just back today. I will start working on it.

harche commented 6 years ago

Docker and Kubernetes take different approaches to attaching ephemeral volumes (backed by tmpfs) to the container.

When an ephemeral volume is attached using kubernetes (by setting emptyDir.medium to "Memory" in the YAML as described here), the corresponding docker container's config.v2.json looks like this:

"MountPoints": {
        "/cache": {
            "Source": "/var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume",
            "Destination": "/cache",
            "RW": true,
            "Name": "",
            "Driver": "",
            "Type": "bind",
            "Propagation": "rprivate",
            "Spec": {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume",
                "Target": "/cache"
            }
        },

Just to make sure the backing volume is indeed backed by tmpfs:

# mount |grep 366c3a75
tmpfs on /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~empty-dir/cache-volume type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/366c3a75-4869-11e8-b479-507b9ddd5ce4/volumes/kubernetes.io~secret/default-token-2w6d4 type tmpfs (rw,relatime)

However, when you use docker directly to attach a tmpfs with something like:

docker run -it --mount type=tmpfs,destination=/app,tmpfs-mode=1770 busybox sh

the corresponding config.v2.json for that container looks like this:

"MountPoints": {
        "/app": {
            "Source": "",
            "Destination": "/app",
            "RW": true,
            "Name": "",
            "Driver": "",
            "Type": "tmpfs",
            "Spec": {
                "Type": "tmpfs",
                "Target": "/app",
                "TmpfsOptions": {
                    "Mode": 1016
                }
            }
        }
    }

As you can see, handling tmpfs-based volumes with docker is pretty simple (the Mode of 1016 above is just octal 1770 in decimal), but kubernetes doesn't let the container config know that the volume is of type tmpfs. Instead, it just presents it as a regular bind mount.

So, from the runtime's point of view, how do we come up with a solution that works well with kubernetes? Kubernetes doesn't put anything tmpfs-specific in the container's config.

One solution could be to parse the Mounts of the spec and filter by kubernetes.io~empty-dir. We could treat volumes whose source contains that string differently and instruct the agent to create that directory inside the VM's memory instead of passing it over 9pfs; a sketch of this filtering follows below. But this solution would be too specific to kubernetes.
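
To make that concrete, here is a minimal sketch of the filtering, using the OCI mount type from github.com/opencontainers/runtime-spec; the sample mounts are hypothetical stand-ins for the config dump above:

package main

import (
	"fmt"
	"strings"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// k8sEmptyDirMarker shows up in the host source path of every
// kubernetes emptyDir volume, as seen in the config.v2.json above.
const k8sEmptyDirMarker = "kubernetes.io~empty-dir"

// isEmptyDir reports whether an OCI mount looks like a kubernetes
// emptyDir volume, judging purely by its host source path.
func isEmptyDir(m specs.Mount) bool {
	return m.Type == "bind" && strings.Contains(m.Source, k8sEmptyDirMarker)
}

func main() {
	mounts := []specs.Mount{
		// Hypothetical entries mirroring the dumps above.
		{Destination: "/cache", Type: "bind", Source: "/var/lib/kubelet/pods/uid/volumes/kubernetes.io~empty-dir/cache-volume"},
		{Destination: "/data", Type: "bind", Source: "/mnt/data"},
	}
	for _, m := range mounts {
		if isEmptyDir(m) {
			fmt.Println("create inside guest memory:", m.Destination)
		} else {
			fmt.Println("pass through 9pfs:", m.Destination)
		}
	}
}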

What do you guys think?

@bergwolf @egernst @gnawux @jbryce

grahamwhaley commented 6 years ago

Thanks for the detailed info @harche! /cc @amshinde

amshinde commented 6 years ago

Yeah, we will need to skip these mounts, similar to what we do for "/dev/shm", which docker also chooses to pass as a bind mount instead of tmpfs. @harche You can take a look here: https://github.com/kata-containers/runtime/blob/master/virtcontainers/container.go#L300

linxiulei commented 5 years ago

My bad if I missed something. EmptyDir as defined in k8s is supposed to have three types of medium: default, tmpfs, and hugepage. The default medium should be node disk instead of a memdisk, so wouldn't it be better to check the mount options before passing ephemeral volumes into the guest, and fall back to 9pfs when the volume is backed by the default medium?

@harche @amshinde

mcastelino commented 5 years ago

@linxiulei your observation is correct. The default medium should be node disk. If we back it with memory instead, it will be incorrectly accounted for, and we will end up eating RAM where runc would not. This will also cause issues with the scheduler. I will open a bug.

/cc @harche @amshinde

amshinde commented 5 years ago

I passed in two empty directories, one with medium Memory and the other as default. In the config.json, I see that the two directories appear as:

                {
                        "destination": "/tmp/xchange",
                        "type": "bind",
                        "source": "/var/lib/kubelet/pods/d391df17-4698-11e9-b7d7-525400472345/volumes/kubernetes.io~empty-dir/xchange-kata",
                        "options": [
                                "rw",
                                "rbind",
                                "rprivate",
                                "bind"
                        ]
                },
                {
                        "destination": "/tmp/tmpemp",
                        "type": "bind",
                        "source": "/var/lib/kubelet/pods/d391df17-4698-11e9-b7d7-525400472345/volumes/kubernetes.io~empty-dir/tmpempty-kata",
                        "options": [
                                "rw",
                                "rbind",
                                "rprivate",
                                "bind"
                        ]
                }

There is no information about the medium passed on to the OCI layer. The only way to handle this correctly would be to actually check whether the directory is mounted as tmpfs or not; see the sketch after the mount output below.

$ mount | grep empty
tmpfs on /var/lib/kubelet/pods/d391df17-4698-11e9-b7d7-525400472345/volumes/kubernetes.io~empty-dir/tmpempty-kata type tmpfs (rw,relatime)
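
For illustration, a minimal sketch of that check using statfs, assuming golang.org/x/sys/unix (the path below is a hypothetical placeholder):

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

// isTmpfs reports whether path sits on a tmpfs filesystem by comparing
// the statfs magic number against TMPFS_MAGIC (Linux-only).
func isTmpfs(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == unix.TMPFS_MAGIC, nil
}

func main() {
	// Hypothetical emptyDir source path on the host.
	path := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/tmpempty-kata"
	backed, err := isTmpfs(path)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("tmpfs-backed:", backed)
}

With a check like this, medium "Memory" emptyDirs can be created as tmpfs inside the guest, while default-medium emptyDirs keep going through 9pfs.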