kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

Kubernetes EmptyDir should not be using 9p #1472

Closed dadux closed 3 years ago

dadux commented 5 years ago

Kubernetes EmptyDir performance is very slow (9p), while there is no real need to use 9pfs for the default medium. EmptyDir volumes are only intended to share data between containers within a pod, not with the host.

There was a recent related change in https://github.com/kata-containers/runtime/issues/1341, where tmpfs was not being handled correctly.

For the default medium type (disk), kata-agent should probably create a directory in the VM and mount it into the containers?

cc @mcastelino @amshinde ?

awprice commented 5 years ago

Ideally we think there should be a new driver type and storage handler in kata-agent that would create the empty directory in the VM to handle this.

See https://github.com/kata-containers/agent/blob/master/device.go#L26-L33 for the driver types and https://github.com/kata-containers/agent/blob/master/mount.go#L194-L201 for the storage handlers.
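For illustration, a guest-side handler along these lines might look roughly as follows. This is a minimal sketch only: the `Storage` type, the `ephemeral` driver name, and `ephemeralStorageHandler` are simplified stand-ins, not the agent's actual API.

```go
package ephemeral

import (
	"fmt"
	"os"
)

// Storage mirrors the shape of the storage object the agent receives over gRPC;
// it is a simplified stand-in, not the agent's real type.
type Storage struct {
	Driver     string
	MountPoint string
}

// ephemeralDriver is a hypothetical new driver type alongside the existing ones.
const ephemeralDriver = "ephemeral"

// ephemeralStorageHandler would create the empty directory inside the guest
// instead of expecting a 9p share from the host.
func ephemeralStorageHandler(s Storage) (string, error) {
	if err := os.MkdirAll(s.MountPoint, 0755); err != nil {
		return "", fmt.Errorf("creating ephemeral dir %q: %v", s.MountPoint, err)
	}
	return s.MountPoint, nil
}
```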

amshinde commented 5 years ago

@dadux Empty dirs with the default medium were implemented this way as, technically, a directory on the host is being shared with the pod. But if we can safely assume the directory on the host is never really accessed on the host side, I think we could go with the approach of creating this inside the guest instead.

@awprice Yes, we would need a new storage driver+handler to instead create the directory inside the guest and bind mount this to the container's mount namespace.
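As a sketch of the guest-side mechanics being described (the `/run/kata-containers/ephemeral` location and the helper name are made up for the example):

```go
package ephemeral

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// bindEphemeralDir creates a per-volume directory inside the guest and
// bind-mounts it at a container's mount point, so no host-side share is needed.
func bindEphemeralDir(volumeName, containerMountPoint string) error {
	src := filepath.Join("/run/kata-containers/ephemeral", volumeName)
	if err := os.MkdirAll(src, 0755); err != nil {
		return err
	}
	if err := os.MkdirAll(containerMountPoint, 0755); err != nil {
		return err
	}
	// MS_BIND makes the same guest directory visible in each container of the pod.
	return unix.Mount(src, containerMountPoint, "", unix.MS_BIND, "")
}
```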

mcastelino commented 5 years ago

The open question is what backs the emptyDir: a host-side ephemeral "volume" or just guest RAM? If we choose to use guest RAM then the medium is effectively ignored. To model this more or less correctly, we would need to create a large sparse file on the host to back the volume and pass it in via virtio-disk/scsi. That will give you performance without costing memory, and from a resource consumption point of view it will resemble runc, as the sparse file backing the volume will only grow in response to writes.

The only issue is deletes: if the files on the volume are deleted, that space may not be recovered.
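For reference, creating such a sparse backing file on the host could look roughly like this; formatting it and passing it in as a virtio block/SCSI device are separate steps not shown:

```go
package ephemeral

import "os"

// createSparseBacking creates a file whose apparent size is sizeBytes but which
// consumes host disk space only as the guest writes to it.
func createSparseBacking(path string, sizeBytes int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Truncate extends the file without allocating blocks, producing a sparse file.
	return f.Truncate(sizeBytes)
}
```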

awprice commented 5 years ago

@mcastelino What we are proposing is that instead of creating the ephemeral directory/volume on the host-side filesystem and then mounting it into the VM using 9p, we create the ephemeral directory inside the guest VM and bind mount that directory into the containers within the VM, on the rootfs. I guess the agent inside the VM would create the directories when creating the containers.

The ephemeral directory will reside on whatever filesystem the rootfs is, which in most cases I believe will be 9p. In our case we are using devicemapper for our rootfs and so will benefit from the performance of having the ephemeral directory on the rootfs.

As all containers in a Kubernetes pod are created in the same guest VM, I doubt there is anything else that is likely to access the files in the emptyDir on the host side.

This would also solve the cleanup issues you mentioned above: when the VM is terminated, the ephemeral directory inside the VM goes away too, as the rootfs is cleaned up.

mcastelino commented 5 years ago

@awprice do you mean the rootfs of the VM itself or that of the container? The container rootfs is backed by 9p/devicemapper. The VM rootfs today is an NVDIMM or initrd, so the VM rootfs is not backed by any host-side writable storage.

Placing the volume on the container rootfs is effectively the same as the user not using implicit ephemeral volumes https://github.com/docker-library/docker/blob/65fab2cd767c10f22ee66afa919eda80dbdc8872/18.09/dind/Dockerfile#L40

Here the implicit ephemeral volumes will end up being a directory within the container filesystem.

awprice commented 5 years ago

@mcastelino Yep, it doesn't sound like the rootfs of the VM is feasible, as it is NVDIMM/initrd as you said. The container's rootfs doesn't sound feasible either.

This isn't about handling docker volumes, this is about handling the specific case of Kubernetes EmptyDir where the medium != Memory.

We are thinking of storing the directory shared between the containers in the pod on the sandbox filesystem, i.e. /run/kata-containers/shared/containers/<sandbox id>. This is stored on devicemapper in our case; otherwise it is on 9p.
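Purely as an illustration of the layout being described, the shared path might be built along these lines (the `emptydir-` naming is invented for the example):

```go
package ephemeral

import "path/filepath"

// sharedEphemeralPath illustrates where the per-pod shared directory would live
// on the sandbox filesystem; the exact naming scheme here is only an example.
func sharedEphemeralPath(sandboxID, volumeName string) string {
	return filepath.Join("/run/kata-containers/shared/containers", sandboxID, "emptydir-"+volumeName)
}
```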

awprice commented 5 years ago

I've come up with a solution for this issue, see the following PRs:

kfox1111 commented 5 years ago

Would this allow you to use docker in kata backed with this type of emptydir?

awprice commented 5 years ago

> Would this allow you to use docker in kata backed with this type of emptydir?

This is specifically for emptyDir in Kubernetes - https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
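For context, these are the two emptyDir variants in play, expressed with the Kubernetes Go API types: the default medium (disk-backed, the case this issue is about) and medium `Memory` (tmpfs-backed):

```go
package ephemeral

import corev1 "k8s.io/api/core/v1"

// scratchVolumes shows the two emptyDir variants being discussed.
func scratchVolumes() []corev1.Volume {
	return []corev1.Volume{
		{
			// Default medium: disk-backed, the case this issue targets.
			Name:         "scratch-disk",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		},
		{
			// Medium "Memory": tmpfs-backed, handled separately.
			Name: "scratch-mem",
			VolumeSource: corev1.VolumeSource{
				EmptyDir: &corev1.EmptyDirVolumeSource{Medium: corev1.StorageMediumMemory},
			},
		},
	}
}
```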

kfox1111 commented 5 years ago

I was trying to see if docker in kata would work better with /var/lib/docker backed by an emptyDir.

Looks like no. It is still backing the emptyDir with 9p for some reason, rather than just using emptyDir storage inside the VM itself.

kfox1111 commented 5 years ago

Please reopen this issue. As mentioned in https://github.com/kata-containers/runtime/pull/1485, this doesn't avoid 9p at all.

jodh-intel commented 5 years ago

Re-opening on request.

@amshinde - could you take a look?

amshinde commented 5 years ago

@kfox1111 We create the empty-dir with the default medium on the sandbox rootfs. This helps when one is using devicemapper storage, which is what the solution was targeted at. In the case of other storage drivers, you still end up using 9p, since the rootfs itself is passed using 9p.

Do you have a proposal for solving this for other storage drivers? We can discuss possible solutions. PRs are welcome as well :)

awprice commented 5 years ago

Another option to look at is using Virtio-fs. If you switch to using Virtio-fs, emptyDirs will use virtio-fs instead of 9p.

kfox1111 commented 5 years ago

That might work as a workaround; I'll give it a try. Still, I think it wouldn't be as performant as having emptyDirs associated with the VM itself.

Why not make a qcow2 or raw file for the emptydir and map it into the vm?

amshinde commented 5 years ago

@kfox1111 There is ongoing work to implement empty-dirs using qcow2 images. There should be a PR for this soon. cc @egernst
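For a sense of the host-side step such an approach would involve, a thin-provisioned qcow2 image can be created like this (a sketch only, not the implementation in the upcoming PR; hotplugging it into the VM and formatting/mounting it in the guest are not shown):

```go
package ephemeral

import (
	"fmt"
	"os/exec"
)

// createQcow2Backing creates a thin-provisioned qcow2 image that could back an
// emptyDir volume.
func createQcow2Backing(path string, sizeGiB int) error {
	cmd := exec.Command("qemu-img", "create", "-f", "qcow2", path, fmt.Sprintf("%dG", sizeGiB))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("qemu-img create failed: %v: %s", err, out)
	}
	return nil
}
```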

kfox1111 commented 5 years ago

Awesome. Thanks for the heads up.