kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

[feature request] mount blockdevices in the guest VM instead of passing them through as volume #571

Closed hex2a closed 3 years ago

hex2a commented 6 years ago

Description of problem

When a block device is specified as a volume, it is passed through to the container as a block device. In many cases, however, the user wants the filesystem on the device to be mounted instead.

An implementation similar to runq [0] would also allow specifying the KVM cache mode. [0] https://github.com/gotoz/runq#storage

Expected result

The block device is mounted inside the VM and exposed to the container as a filesystem/bind mount.

Actual result

The block device is passed through as a block device.

bergwolf commented 6 years ago

The main question for supporting this is how to integrate it with the upper layers in the stack.

  1. For docker, I think we already support mounting a block device as a volume. E.g., you can do docker run -d -v /dev/sdx:/mnt/data busybox. This introduces behavior that is incompatible with docker's, and it requires disable_block_device_use = false in the kata containers' configuration.

  2. For kubernetes with containerd or cri-o, what we get from the upper layer is always a mounted host path for any volume. So we cannot pass the underlying block device to the guest; otherwise we risk a kernel crash or even silent data corruption.

  3. For kubernetes with frakti, there are a few block devices (cinder rbd, gce pd and ceph rbd) that can be passed directly to the guest as block devices. These are implemented via a special flexvolume driver that bypasses mounting the device on the host and translates the volume information into a block-device-based volume that frakti can understand.
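For reference, the option mentioned in point 1 lives in the Kata runtime's TOML configuration. A hedged sketch (the exact file path and section may vary by installation; `/usr/share/defaults/kata-containers/configuration.toml` is a common default):

```toml
[runtime]
# If set to false, a volume whose source is a block device may be passed
# to the guest as a device (virtio-blk) rather than shared over 9pfs.
disable_block_device_use = false
```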

WeiZhang555 commented 6 years ago

I think mounting the block device for the user inside the guest could be a useful use case; I have also been thinking about this for a long time. The only concern is that it's not fully compatible with mount semantics, and we may have other use cases that require a "-v /dev/sda:/dev/sda".

hex2a commented 6 years ago

@WeiZhang555 I added the runq approach for reference.

If it is passed as -v /dev/sda:/dev/sda, it is passed as a block device; if it should be mounted, it is specified like this:

--volume <image  name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
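To make the runq convention concrete, here is a hypothetical sketch of how such a spec could be decoded. The function name and error handling are illustrative only, not runq's or Kata's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseRunqVolume splits a runq-style volume spec of the form
// "<image>:/dev/disk/<cache>/<fstype>/<mountpoint>" into its parts.
// Purely illustrative; runq's real parser may differ.
func parseRunqVolume(spec string) (image, cache, fstype, mountpoint string, err error) {
	colon := strings.Index(spec, ":")
	if colon < 0 {
		return "", "", "", "", fmt.Errorf("missing ':' in %q", spec)
	}
	image = spec[:colon]
	dest := spec[colon+1:]
	const prefix = "/dev/disk/"
	if !strings.HasPrefix(dest, prefix) {
		return "", "", "", "", fmt.Errorf("destination must start with %s in %q", prefix, spec)
	}
	// Remaining path encodes cache type, filesystem type, and mount point.
	parts := strings.SplitN(strings.TrimPrefix(dest, prefix), "/", 3)
	if len(parts) != 3 {
		return "", "", "", "", fmt.Errorf("expected <cache>/<fstype>/<mountpoint> in %q", spec)
	}
	return image, parts[0], parts[1], "/" + parts[2], nil
}

func main() {
	img, cache, fs, mp, err := parseRunqVolume("data.img:/dev/disk/writeback/ext4/data")
	fmt.Println(img, cache, fs, mp, err)
}
```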

WeiZhang555 commented 6 years ago

@hex2a This is a bit hacky and relies too much on the guest mountpoint pattern; I don't think it's a good way.

amshinde commented 6 years ago

@WeiZhang555 @bergwolf What if we mount the volume whenever the destination mount point is any path other than "/dev/*"? I realized this will be useful for workloads (especially database workloads) that use volumes, where 9p is an issue and the Entrypoint depends on the volume being present.

WeiZhang555 commented 6 years ago

@amshinde Not sure if this is good.

To be honest, I've been looking forward to this feature for a long time; it would be quite useful to mount the filesystem into the container instead of exposing a raw block device. I just can't be sure how to make a good interface.

bergwolf commented 6 years ago

FYI, I created a new project to track the overall issue, since it would require changes in many components higher up in the stack (k8s/CRI/CSI/OCI/CRI-O/containerd, etc.).

WeiZhang555 commented 6 years ago

Nice! @bergwolf

WeiZhang555 commented 6 years ago

@amshinde

I thought about this again. For a device bind mount,

-v /dev/sdb:/mnt

how about this: if we detect that the source is a block device, we always mount it inside the container at a directory?

You know, if the user wants a raw block device, they can easily use --device /dev/sdb; this way the container gets a raw block device /dev/sdb. If the user passes it as a volume, then in my understanding they truly want a mounted directory.

What do you think?

cc @kata-containers/runtime
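The proposal above hinges on detecting whether a volume source is a block device. A minimal sketch of that check in Go (illustrative only; the function name is hypothetical and this is not Kata's actual implementation):

```go
package main

import (
	"fmt"
	"os"
)

// isBlockDevice reports whether path refers to a block device node.
// A block device has the device bit set but not the char-device bit.
func isBlockDevice(path string) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	mode := fi.Mode()
	return mode&os.ModeDevice != 0 && mode&os.ModeCharDevice == 0, nil
}

func main() {
	// /dev/null is a char device and /tmp is a directory, so both report false.
	for _, p := range []string{"/dev/null", "/tmp"} {
		ok, err := isBlockDevice(p)
		fmt.Printf("%s: block=%v err=%v\n", p, ok, err)
	}
}
```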

sboeuf commented 6 years ago

I'm not sure I follow where the need is here. If a user wants to use a block device inside the container, they will pass the block device using -d or -v and do the mounting themselves. I don't see why we would introduce implicit behavior for -v when runc does not do this, and I'm not sure about the real benefit for the user here.

WeiZhang555 commented 6 years ago

@sboeuf

I'll describe the case. For example, suppose a user wants to run an application which needs to persist some data, say:

# docker run -d -v /datavolume/userA/somedir:/opt/mysql  mysql 

For a native docker container, /opt/mysql is a directory which the mysql application can write to directly inside the container; everything works perfectly.

Now with kata containers, the cloud provider will run K8S + kata-containers for the user, and the user only needs to provide the image and the command he/she wants to run:

  1. the user uploads his mysql image, which was tested on his local machine with a docker container.
  2. the user specifies the image and command, and lets the cloud provider run the container for him with kata-containers.

Let's say, using the same command:

# docker run -d -v /datavolume/userA/somedir:/opt/mysql  mysql 

both /datavolume/userA/somedir and /opt/mysql are directories.

Yes, it works for kata-containers too: the volume is passed via 9pfs.

But since 9pfs has compatibility and performance issues, as you know, we will suggest that users avoid 9pfs. So the ideal docker command would be:

# docker run --cap-add SYS_ADMIN -d -v /dev/sdc:/dev/sdc  mysql sh -c "mount /dev/sdc /opt/mysql && mysqld ... "

/dev/sdc is a block device.

With this, we bypass 9p and instead use virtio-blk, right? But this leads to two other problems:

  1. users need to modify their image or command to adapt to kata-containers, and have to be aware of the block device; we then have to explain to users why they can't simply use a directory volume, and why this doesn't mean kata-containers is harder to use than a docker container.

  2. we need to give the user's container more privileges, SYS_ADMIN in this case. More privileges mean less security.

So what I suggest is:

# docker run -d -v /dev/sdc:/opt/mysql mysql

/dev/sdc is a block device and /opt/mysql is a directory.

  1. the user doesn't need to modify his image/command, and also doesn't need to care about the choice between 9pfs and virtio-blk.
  2. no new privileges are needed.

That's a lot of benefits, in my view.
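The proposed behavior can be sketched as a small decision table. This is a hypothetical illustration of the rule being discussed in this thread (names and the fallback choice are assumptions, not Kata's actual code): a block-device source with a directory destination is attached via virtio-blk and mounted in the guest, a destination under /dev/ keeps raw passthrough, and a non-block source falls back to the shared filesystem (9pfs).

```go
package main

import (
	"fmt"
	"strings"
)

// handling describes how a "-v source:dest" pair would be treated.
type handling string

const (
	shareFS     handling = "9pfs"                   // ordinary directory volume
	passthrough handling = "raw-block-passthrough"  // expose the device node itself
	guestMount  handling = "virtio-blk+guest-mount" // mount the filesystem in the guest
)

// decideVolumeHandling applies the rule proposed in this thread.
func decideVolumeHandling(srcIsBlock bool, dest string) handling {
	if !srcIsBlock {
		return shareFS
	}
	if strings.HasPrefix(dest, "/dev/") {
		// The user explicitly asked for the device node, e.g. -v /dev/sda:/dev/sda.
		return passthrough
	}
	// Block source, directory destination, e.g. -v /dev/sdc:/opt/mysql.
	return guestMount
}

func main() {
	fmt.Println(decideVolumeHandling(false, "/opt/mysql")) // 9pfs
	fmt.Println(decideVolumeHandling(true, "/dev/sdc"))    // raw-block-passthrough
	fmt.Println(decideVolumeHandling(true, "/opt/mysql"))  // virtio-blk+guest-mount
}
```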

amshinde commented 6 years ago

@sboeuf All the points that @WeiZhang555 mentioned, plus this scenario as well:

$ sudo docker run -v /dev/sdc:/dev/sdc --cap-add SYS_ADMIN mysql bash -c "mount /dev/sdc /opt/mysql && mysql"

Doing things this way, you need to modify the CMD that is run. Your Entrypoint script may need to perform setup that depends on the volume being present in the first place; in that case you have no option but to modify the image.

raravena80 commented 6 years ago

Can we make this work with k8s too? 😄

amshinde commented 6 years ago

@raravena80 We need to see how we can make this work with the k8s CSI. I haven't looked at it a whole lot, but it is on our roadmap. @bergwolf has created a github project to track this.

wilsonwang371 commented 4 years ago

What is the current status of this issue?