Closed hex2a closed 3 years ago
The main question to support this is how we can integrate it with upper layers in the stack.
For docker, I think we already support mounting a block device as a volume. E.g., you can do
docker run -d -v /dev/sdx:/mnt/data busybox
This introduces behavior that is incompatible with docker's, and it would require disable_block_device_use = false in the kata containers configuration.
For kubernetes w/ containerd or cri-o, what we get from upper layer is always a mounted hostpath for any volume. So we cannot pass the underlying block device to the guest, otherwise we'll face potential kernel crash or even silent data corruption.
For kubernetes w/ frakti, there are a few block devices (cinder rbd, gce pd and ceph rbd) that can be passed as block device directly to the guest. And these are implemented via special flexvolume driver that bypasses mounting the device on the host and translates the volume information into a block device based volume that frakti can understand.
I think mounting the block device for the user inside the guest could be a useful use case; I have also been thinking about this for a long time. The only concern is that it's not fully compatible with mount semantics, and we may have other use cases that require "-v /dev/sda:/dev/sda".
@WeiZhang555 I added the runq approach for reference.
If the volume is passed as -v /dev/sda:/dev/sda, it is passed as a block device; if it should be mounted, it is specified like this:
--volume <image name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
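To make the pattern above concrete, here is a minimal sketch of how a destination of that shape could be split into its parts. The function name, the returned keys, and the choice to treat the remainder as an absolute guest path are assumptions made for illustration; this is not runq's or kata's actual parsing code, and the cache and filesystem values are not validated.

```python
def parse_disk_destination(dest):
    """Split '/dev/disk/<cache type>/<filesystem type>/<mountpoint>'.

    Hypothetical helper: returns a dict with the three fields, or None
    when the destination does not follow the pattern.
    """
    prefix = "/dev/disk/"
    if not dest.startswith(prefix):
        return None
    # Split off cache type and filesystem type; the remainder (which may
    # itself contain slashes) is taken as the mountpoint.
    parts = dest[len(prefix):].split("/", 2)
    if len(parts) < 3 or not all(parts):
        return None
    cache, fstype, mountpoint = parts
    # Assumption: the mountpoint is interpreted as an absolute guest path.
    return {"cache": cache, "fstype": fstype, "mountpoint": "/" + mountpoint}


if __name__ == "__main__":
    # Example values only; "writeback" and "ext4" are illustrative.
    print(parse_disk_destination("/dev/disk/writeback/ext4/data"))
    print(parse_disk_destination("/dev/sda"))
```

A plain destination such as /dev/sda does not match the pattern, so under this scheme it would stay a raw block device.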
@hex2a this is a bit hacky and relies too much on the guest mountpoint pattern; I don't think it's a good way.
@WeiZhang555 @bergwolf What if we mount the volume whenever the destination mount point is any path other than "/dev/*"? I realized that this would be useful for workloads (especially database workloads) that use volumes, where 9p is an issue and the Entrypoint depends on the volume being present.
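The rule proposed here can be sketched as a small decision function. This is only an illustration of the idea under discussion, not anything kata-containers implements; the function name and the three result labels are made up for the example.

```python
import posixpath


def guest_action(dest, source_is_block):
    """Decide what to do with a volume under the proposed rule.

    Hypothetical labels:
      "share"       - source is not a block device, keep the shared-fs path
      "passthrough" - block device whose destination is under /dev
      "mount"       - block device whose destination is any other path
    """
    if not source_is_block:
        return "share"
    # Normalize so that e.g. "/opt/../dev/sdb" is still seen as under /dev.
    norm = posixpath.normpath(dest)
    if norm == "/dev" or norm.startswith("/dev/"):
        return "passthrough"
    return "mount"


if __name__ == "__main__":
    print(guest_action("/dev/sda", True))
    print(guest_action("/opt/mysql", True))
```

Note that a destination such as /device is not under /dev, so it would be mounted rather than passed through.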
@amshinde Not sure if this is good.
To be honest, I've been looking forward to this feature for a long time; it would be quite useful to mount the block device's filesystem into the container instead of exposing a raw block device. I just can't be sure how to make a good interface.
FYI, I created a new project to track the overall issue, since it would require changes to many components higher up in the stack (k8s/CRI/CSI/OCI/CRIO/containerd etc.).
Nice! @bergwolf
@amshinde
I thought about this again. For a device bind-mount,
-v /dev/sdb:/mnt
how about we detect whether the source is a block device and, if so, always mount it to a directory inside the container?
If the user wants a raw block device, he can easily use --device /dev/sdb, and the container then has a raw block device /dev/sdb. If the user uses a volume, then in my understanding he truly wants a mounted directory.
What do you think?
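As a rough sketch of this suggestion: the source path can be checked with a stat call, and the flag the user chose then decides between a raw device and a mounted directory. The function names and result labels below are hypothetical and only illustrate the proposal; they do not describe the kata runtime's code.

```python
import os
import stat


def is_block_device(path):
    """Return True if path exists and is a block device node."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISBLK(mode)


def plan_for(flag, source, is_block=is_block_device):
    """Map a docker-style flag and source to a hypothetical action label."""
    if flag == "--device":
        # The user explicitly asked for the raw device.
        return "raw-block"
    if flag == "-v" and is_block(source):
        # Volume backed by a block device: mount its filesystem in the guest.
        return "mount-in-guest"
    # Ordinary directory volume, shared with the guest as today.
    return "shared-directory"


if __name__ == "__main__":
    print(plan_for("-v", "/datavolume/userA/somedir"))
    print(plan_for("--device", "/dev/sdb"))
```

The is_block parameter is injectable only so the decision can be exercised on a machine without the device present.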
cc @kata-containers/runtime
I'm not sure I follow where the need is here. If a user wants to use a block device inside the container, he will pass the block device using -d or -v and do the mounting himself. I don't see why we would add implicit behavior to -v when runc does not do this.
And I'm not sure about the real benefit for the user here.
@sboeuf
I'll describe the case. For example, suppose a user wants to run an application that needs to persist some data, let's say:
# docker run -d -v /datavolume/userA/somedir:/opt/mysql mysql
For a native docker container, /opt/mysql is a directory that the mysql application can write to directly inside the container; everything works perfectly.
So now we have kata containers. The cloud provider runs K8S+kata-containers for the user, and the user only needs to provide the image and command he/she wants to run, e.g. a mysql image that was tested on his local machine with a docker container. The user hands over the image and command and lets the cloud provider run the container for him with kata-containers. Let's say, using the same command:
# docker run -d -v /datavolume/userA/somedir:/opt/mysql mysql
Both /datavolume/userA/somedir and /opt/mysql are directories.
Yes, it works well for kata-containers too, the volume is passed via 9pfs.
Since 9pfs has some compatibility and performance issues, as you know, we would advise users not to use 9pfs. So the ideal docker command would be:
# docker run --cap-add SYS_ADMIN -d -v /dev/sdc:/dev/sdc mysql sh -c "mount /dev/sdc /opt/mysql && mysqld ... "
Here /dev/sdc is a block device.
With this, we bypass 9p and use virtio-blk instead, right? This leads to two other problems:
1. The user needs to modify the image or command to adapt to kata-containers and has to be aware of the block device. We also have to explain to the user why they can't simply use a directory volume, and why this doesn't mean kata-containers is harder to use than a docker container.
2. We need to give the user's container more privilege, SYS_ADMIN in this case. More privilege means less security.
So what I suggest is:
# docker run -d -v /dev/sdc:/opt/mysql mysql
Here /dev/sdc is a block device and /opt/mysql is a directory.
That's a lot of benefits for me.
@sboeuf All the points that @WeiZhang555 mentioned, plus this scenario as well:
$ sudo docker run -v /dev/sdc:/dev/sdc --cap-add SYS_ADMIN mysql bash -c "mount /dev/sdc /opt/mysql && mysql"
Doing things this way, you need to modify the CMD that needs to be run. Your Entrypoint script may need to perform setup which depends on the volume being present in the first place. In that case you have no other option but to modify the image.
Can we make this work with k8s too? 😄
@raravena80 We need to see how we can make this work with k8s CSI. I haven't looked at it a whole lot, but this is in our roadmap. @bergwolf has created a github project to track this.
What is the current status of this bug now?
Description of problem
When a block device is specified as a volume, it is passed as a block device. However, in many cases a user might want the filesystem on the device to be mounted.
An implementation similar to runq [0] would also allow specifying the KVM cache mode. [0] https://github.com/gotoz/runq#storage
Expected result
block device is mounted inside the VM, and exposed to the container as filesystem/bind mount.
Actual result
block device is passed through as block device