kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0

[RFC] Multi virtio-fs devices for kata-containers #7380

Open Apokleos opened 11 months ago

Apokleos commented 11 months ago

Motivation

Kata Containers uses virtio-fs to create volumes that allow containers to share host resources with guests. Currently, a single virtio-fs device serves all the volumes in the guest, and its configuration is limited to the sandbox level. This greatly restricts scalability and applicability: in some scenarios, volumes require special configurations, and one volume's configuration may affect another.

The common feature in these scenarios is the need for a different configuration for each virtio-fs volume, without affecting other volumes. To that end, we propose a multi-virtiofs solution to address this issue.

Design/Goal

The multi-virtiofs solution is designed for devices with special configurations, providing the ability to use independent configurations without affecting other devices. It also supports placing functionally similar volumes on separate virtio-fs devices.

While there may be multiple devices in the solution, only one device is designated as the default device, and it remains constant. Any additional devices are added after the default device is set up. Users must specify information about the extra virtio-fs devices, which are created based on that information. Once the default and extra devices have been added, sandbox-level volumes and other special per-container volumes are created according to the user-specified information.

The goal of the multi-virtiofs solution is to provide a flexible and efficient way to handle virtio-fs volumes with separate configurations, ensuring that one volume's configuration does not affect the others, while also supporting multiple virtio-fs devices serving functionally similar volumes. To preserve compatibility with the default virtio-fs device, extra virtio-fs devices are handled separately, since additional steps are required to process the corresponding annotations. Once multiple virtio-fs devices with volumes have been set up in the guest, the following diagram illustrates their relationships:

[diagram: relationships between multiple virtio-fs devices and their volumes in the guest]

Pros/Cons

Pros

Cons

Implementation and Use Cases

extra-virtiofs

With the help of annotations, kata-containers knows how to set up extra virtio-fs devices and volumes.

1. Annotation: `io.katacontainers.config.hypervisor.extra_virtiofs`
2. Content format: `<virtiofs device 01>:<arg01,arg02,...>;<virtiofs device 02>:<arg01,arg02,...>;...`

more details:
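A minimal sketch of how such an annotation value could be parsed (illustrative only; the function name and return shape are assumptions of this sketch, not the actual kata-runtime code):

```go
package main

import (
	"fmt"
	"strings"
)

// parseExtraVirtiofs splits an annotation value of the form
// "<dev01>:<arg01,arg02,...>;<dev02>:<arg01,...>" into a map from
// device name to its virtiofsd argument string.
func parseExtraVirtiofs(v string) (map[string]string, error) {
	devices := make(map[string]string)
	for _, entry := range strings.Split(v, ";") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// The first ":" separates the device name from its arguments.
		name, args, ok := strings.Cut(entry, ":")
		if !ok || name == "" {
			return nil, fmt.Errorf("malformed extra_virtiofs entry: %q", entry)
		}
		devices[name] = args
	}
	return devices, nil
}

func main() {
	devs, err := parseExtraVirtiofs(
		"virtiofs_without_cache:-o open,cache=none --thread-pool-size=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(devs["virtiofs_without_cache"])
}
```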

Scenarios

Multi-virtiofs device use scenarios are decided by the combination of annotations, which mainly includes the following two cases:

Notice: kata-containers already implements sandbox bind mounts, so multi-virtiofs needs to be compatible with the existing configuration method. Each scenario has its own annotation combination:

1. Container Special Volume

annotation: `io.katacontainers.config.runtime.special_volumes`

content format: `<virtiofs device 01>:<container_path01,container_path02,...>;<virtiofs device 02>:<container_path03,container_path04,...>;...`

more details:

special volumes for Case2

--annotation "io.katacontainers.config.hypervisor.extra_virtiofs=virtiofs_without_cache:-o open,cache=none,no_writeback --thread-pool-size=1" \
--annotation "io.katacontainers.config.runtime.special_volumes=virtiofs_without_cache:container_path_01" 

# ctr run --mount type=bind,src=host_path,dest=container_path_01 ...

In this case, the annotation will be converted into the corresponding configuration.toml settings. Alternatively, users of ctr/nerdctl can fill in the same configuration directly in the extra_virtiofs and special_volumes items of configuration.toml, and then pass the corresponding container path container_path_01 as the dest argument of ctr run --mount src=host_path_01,dest=container_path_01. The dest path must be the same path as the one specified in special_volumes, and host_path_01 is the chosen host path.
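For illustration, the same settings could look roughly like this in configuration.toml (a sketch; the exact section and key names are part of this proposal and are assumptions here, not existing kata configuration keys):

```toml
[hypervisor.qemu]
# Extra virtio-fs devices, one "name:args" entry per device.
extra_virtiofs = ["virtiofs_without_cache:-o open,cache=none,no_writeback --thread-pool-size=1"]

[runtime]
# Map each extra device to the container paths it should serve.
special_volumes = ["virtiofs_without_cache:container_path_01"]
```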

2. Sandbox Bind Mount

annotation: `io.katacontainers.config.runtime.sandbox_bind_mounts`

content format: `<virtiofs device 01>:<host_path01@ro host_path02 ...>;<virtiofs device 02>:<host_path03@rw host_path04 ...>;<host_path05 host_path06@rw host_path07@ro ...>`

more details:

Sandbox bind mounts for Case1

io.katacontainers.config.hypervisor.extra_virtiofs="fs_cache_none:--drop-sys-resource --thread-pool-size=1 -o no_open,no_writeback,no_readdir"
io.katacontainers.config.runtime.sandbox_bind_mounts="fs_cache_none:/data/rafs/<uid>/shared_rafs"
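The device-prefix, space, and "@" conventions above could be parsed along these lines (an illustrative sketch, not the actual implementation; treating unannotated paths as read-only is an assumption based on the existing sandbox bind mount default):

```go
package main

import (
	"fmt"
	"strings"
)

// bindMount is one parsed sandbox bind mount.
type bindMount struct {
	device   string // "" means the default virtio-fs device
	hostPath string
	readOnly bool
}

// parseSandboxBindMounts handles values like
// "dev01:host_path01@ro host_path02;host_path03@rw host_path04".
// Entries are ";"-separated; an entry without a "name:" prefix
// targets the default device; "@ro"/"@rw" set the access mode.
func parseSandboxBindMounts(v string) []bindMount {
	var mounts []bindMount
	for _, entry := range strings.Split(v, ";") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		device, paths := "", entry
		// "@" (not ":") marks the rw/ro attribute, so the first ":"
		// can only be the device-name separator.
		if name, rest, ok := strings.Cut(entry, ":"); ok {
			device, paths = name, rest
		}
		for _, p := range strings.Fields(paths) {
			path, attr, _ := strings.Cut(p, "@")
			mounts = append(mounts, bindMount{
				device:   device,
				hostPath: path,
				// Unannotated paths are treated as read-only here (assumption).
				readOnly: attr != "rw",
			})
		}
	}
	return mounts
}

func main() {
	for _, m := range parseSandboxBindMounts("fs_cache_none:/data/a@ro /data/b;/data/c@rw") {
		fmt.Printf("dev=%q path=%s ro=%v\n", m.device, m.hostPath, m.readOnly)
	}
}
```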

There's a difference between the Special Volume and Sandbox Bind Mount configurations:

Limitations

container special volumes

If a Kata Pod contains multiple containers, and each container specifies a volume with the same intra-container path, but each volume requires a different virtio-fs configuration, this scenario is not supported. For instance, imagine a Pod with two containers named container_01 and container_02, both of which specify the same volume path, /container_path0. However, container_01 requires a separate device configured with "cache=always", while container_02 requires a separate device configured with "cache=none". In this case, an error will occur.

Sandbox bind mounts

The device name and its path list are separated by a colon, and originally the path and its read/write attribute were also separated by a colon. This can lead to confusion during parsing, so we change the separator between a path and its read/write attribute from a colon to "@". To remain as compatible as possible with the existing configuration, all bind mounts for one device use spaces (" ") as separators. Configurations without a device name are also allowed; for instance, "host_path04@rw host_path05@ro host_path06" will use the default virtio-fs device, and in this case a semicolon (;) separates the preceding named device entry from the default one. For example: "virtiofs_name01:host_path01@rw host_path02;host_path04@rw host_path05@ro host_path06".

References

related issues:

- https://github.com/kata-containers/kata-containers/issues/1464
- https://github.com/kata-containers/kata-containers/issues/6597

virtiofs

https://virtio-fs.gitlab.io/design.html

c3d commented 11 months ago

Neat proposal, @Apokleos.

> Setting a shared directory to no_readdir mode will disable users from seeing the list of files in the directory. This means that any readdir/readdirplus requests for that directory will return empty dirents. However, it's important to note that this setting will affect other virtio-fs' volumes.

Although it's pretty clear from context, I would add "affect other virtiofs volumes on the same virtiofs device".

> If the virtio-fs device is set up with a special configuration of cache=none,open, which does not support mmap, it can cause certain applications that rely on mmap to fail or behave incorrectly.

Could you explain why you'd want to do that? Maybe you can generalize the statement to say that cache settings that are appropriate for the system volumes are not necessarily the ones you'd want for data volumes.

> An application requires the use of the virtio-fs device's DAX to share files between the host and guest, enabling memory sharing or data cache construction

Is it possible to set different DAX windows for different virtiofs devices that can potentially share the same host cache pages? I must admit that I never tried that.

> `<virtiofs device 01>:<arg01,arg02,...>;<virtiofs device 02>:<arg01,arg02,...>;...`

Wouldn't you want to use standard arrays here so that this integrates better in yaml files?

Finally, a question on the usage model and the distinction between container special volumes and sandbox bind mounts. IIUC, the proposal only passes the virtiofs configuration via annotations, and it's disconnected from the standard volumes specification. IOW, I take my existing workload, and I can tag an annotation to state how the various volumes are dispatched across various virtiofs devices. Is that reading correct?

Assuming I got that right, this raises the question of what to do if the annotation talks about /my-volume/foo and the volume description refers to /my-v0lumes/f00. Notice the subtle typos between the two. On one hand, you could error out on that case, but that means an analysis of the various volume paths that is not necessarily super-straightforward, and more importantly, it disables use cases where you use a naming convention for your DAX volumes and some other process injects the annotation everywhere.

Apokleos commented 11 months ago

@c3d Thanks for your time and feedback, and sorry for the delayed response.

> Neat proposal, @Apokleos.

> > Setting a shared directory to no_readdir mode will disable users from seeing the list of files in the directory. This means that any readdir/readdirplus requests for that directory will return empty dirents. However, it's important to note that this setting will affect other virtio-fs' volumes.

> Although it's pretty clear from context, I would add "affect other virtiofs volumes on the same virtiofs device".

Yeah, I agree with the description "affect other virtiofs volumes on the same virtiofs device" and I'll correct it.

> > If the virtio-fs device is set up with a special configuration of cache=none,open, which does not support mmap, it can cause certain applications that rely on mmap to fail or behave incorrectly.

> Could you explain why you'd want to do that? Maybe you can generalize the statement to say that cache settings that are appropriate for the system volumes are not necessarily the ones you'd want for data volumes.

Here, it is indeed the case you describe, that settings "appropriate for the system volumes are not necessarily the ones you'd want for data volumes", and I will generalize it with that statement.

> > An application requires the use of the virtio-fs device's DAX to share files between the host and guest, enabling memory sharing or data cache construction

> Is it possible to set different DAX windows for different virtiofs devices that can potentially share the same host cache pages? I must admit that I never tried that.

It is theoretically possible, although we rarely use it, but that's not the point I'm trying to make. My point is that with a single virtiofs device, some workloads need DAX enabled with exclusive access to the host page cache, while others do not need this setting. Currently, that is not supported.

> > `<virtiofs device 01>:<arg01,arg02,...>;<virtiofs device 02>:<arg01,arg02,...>;...`

> Wouldn't you want to use standard arrays here so that this integrates better in yaml files?

Good question! The annotation format is just user-facing; it is translated into a standard array in the configuration. User annotation: `<virtiofs device 01>:<arg01,arg02,...>;<virtiofs device 02>:<arg01,arg02,...>;...`. Configuration: `["virtiofs_01:arg01,arg02,...", "virtiofs_02:arg01,arg02,..."]`
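The translation from the annotation string to the configuration array can be sketched in a few lines (illustrative only, not the actual runtime code):

```go
package main

import (
	"fmt"
	"strings"
)

// annotationToConfigArray turns the user-facing annotation string
// "virtiofs_01:arg01,arg02;virtiofs_02:arg01" into the array form
// stored in the configuration:
// ["virtiofs_01:arg01,arg02", "virtiofs_02:arg01"].
func annotationToConfigArray(v string) []string {
	var out []string
	for _, e := range strings.Split(v, ";") {
		if e = strings.TrimSpace(e); e != "" {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	fmt.Println(annotationToConfigArray("virtiofs_01:a,b;virtiofs_02:c"))
}
```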

> Finally, a question on the usage model and the distinction between container special volumes and sandbox bind mounts. IIUC, the proposal only passes the virtiofs configuration via annotations, and it's disconnected from the standard volumes specification. IOW, I take my existing workload, and I can tag an annotation to state how the various volumes are dispatched across various virtiofs devices. Is that reading correct?

I'm sorry that I left out some details in my previous message, which may have caused some confusion. I have added more explanation in Section 1, Container Special Volume, to help clarify it. The special volumes are implemented as standard volumes.

> Assuming I got that right, this raises the question of what to do if the annotation talks about /my-volume/foo and the volume description refers to /my-v0lumes/f00. Notice the subtle typos between the two. On one hand, you could error out on that case, but that means an analysis of the various volume paths that is not necessarily super-straightforward, and more importantly, it disables use cases where you use a naming convention for your DAX volumes and some other process injects the annotation everywhere.

If the case looks like the one below, it will give us the error message: "No virtiofs device was found for container special volume /my-v0lumes/f00"


--annotation "io.katacontainers.config.hypervisor.extra_virtiofs=virtiofs_device01:-o open,cache=none,no_writeback --thread-pool-size=1" \
--annotation "io.katacontainers.config.runtime.special_volumes=virtiofs_device01:/my-volume/foo"

ctr run --mount type=bind,src=host_path,dest=/my-v0lumes/f00 ...


Obviously, this approach is relatively direct and in line with user expectations. Do you have a better way to handle it?