Requirements for merged kata runtime

gnawux commented 6 years ago

Here is a draft of the requirements (work in progress).

https://docs.google.com/document/d/109pxj-90Ly58ma8CoeRKcMoPWBD0G911E53MeK2zhhA/edit?usp=sharing

Currently, it lists some of the working scenarios for the runtime; more sections should be added.

Once we have agreement on the requirements, the doc should be put into the repo as a markdown doc.

gnawux commented 6 years ago

I added @tallclair @WeiZhang555 @sameo in the doc with edit permission, and you can invite others.

sameo commented 6 years ago

@gnawux Maybe it's a good idea to start discussing those requirements here instead?

Here is my proposal:

[DRAFT v5] [See changelog at the bottom of this page]

Mandatory requirements

The Kata Containers runtime MUST fulfill all requirements below:

OCI compatibility

The Kata Containers runtime MUST implement the OCI runtime specification and support all the OCI runtime operations.

`runc` CLI compatibility

In theory, being OCI compatible should be enough. In practice, a Kata Containers runtime should comply with the latest stable runc CLI. In particular, it MUST implement the following runc commands:

create
delete
exec
kill
list
pause
ps
state
start
version

and the following command line options:

--console-socket
--pid-file
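
As an illustration of how this list could be checked mechanically, here is a minimal Go sketch. The command and option names come from the list above; the `missing` helper and the example "supported" sets are hypothetical and do not reflect how any existing runtime reports its capabilities.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredCommands lists the runc commands named in the draft requirements.
var requiredCommands = []string{
	"create", "delete", "exec", "kill", "list",
	"pause", "ps", "state", "start", "version",
}

// requiredOptions lists the runc command line options named in the draft.
var requiredOptions = []string{"--console-socket", "--pid-file"}

// missing returns the entries of required that are absent from supported,
// in sorted order. It is a hypothetical helper for this sketch.
func missing(required, supported []string) []string {
	have := make(map[string]bool, len(supported))
	for _, s := range supported {
		have[s] = true
	}
	var out []string
	for _, r := range required {
		if !have[r] {
			out = append(out, r)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up runtime that lacks "ps" and "state".
	supported := []string{"create", "delete", "exec", "kill", "list", "pause", "start", "version"}
	fmt.Println("missing commands:", missing(requiredCommands, supported))
	fmt.Println("missing options:", missing(requiredOptions, []string{"--pid-file"}))
}
```

A check like this could run as part of the functional tests, so that an omitted command is caught before review.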

CRI and Kubernetes support

The Kata Containers project MUST provide 2 interfaces for CRI shims to be able to manage hardware virtualization based Kubernetes pods and containers:

An OCI and runc compatible command line interface, as described in the previous section. This interface is used by e.g. the CRI-O and cri-containerd CRI implementations.
A hardware virtualization runtime library API for CRI shims to consume and provide a more CRI native implementation. The frakti CRI shim is one example of such a consumer.

Multiple hardware architectures support

The Kata Containers runtime MUST NOT be architecture specific. It should be able to support multiple hardware architectures and provide a pluggable and flexible design for adding support for additional ones.

Multiple hypervisor support

The Kata Containers runtime MUST NOT be tied to any specific hardware virtualization technology, hypervisor or virtual machine monitor implementation. It should support multiple hypervisors and provide a pluggable and flexible design for adding support for additional ones.
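
One common way to get a pluggable design of this kind is a small interface plus a registry of named backends. The Go sketch below is only an illustration under that assumption: the `Hypervisor` interface, `Register`, `New` and `fakeHypervisor` are hypothetical names, not an existing Kata Containers, runV or virtcontainers API.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypervisor is a hypothetical minimal interface a backend would implement.
type Hypervisor interface {
	Name() string
	StartVM(id string) error
}

// registry maps a hypervisor name to a constructor for that backend.
var registry = map[string]func() Hypervisor{}

// Register makes a backend available under the given name.
func Register(name string, ctor func() Hypervisor) {
	registry[name] = ctor
}

// New returns a backend by name, or an error if none is registered.
func New(name string) (Hypervisor, error) {
	ctor, ok := registry[name]
	if !ok {
		return nil, errors.New("unknown hypervisor: " + name)
	}
	return ctor(), nil
}

// fakeHypervisor is a stand-in backend used only for illustration.
type fakeHypervisor struct{ started []string }

func (f *fakeHypervisor) Name() string { return "fake" }

func (f *fakeHypervisor) StartVM(id string) error {
	f.started = append(f.started, id)
	return nil
}

func main() {
	Register("fake", func() Hypervisor { return &fakeHypervisor{} })
	h, err := New("fake")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected hypervisor:", h.Name(), "start error:", h.StartVM("sandbox-1"))
}
```

With this shape, adding support for another hypervisor means adding one more `Register` call rather than changing the callers.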

Nesting

The Kata Containers runtime MUST support nested virtualization environments.

Networking

The Kata Containers runtime MUST be able to support any CNI plugin.
The Kata Containers runtime MUST be able to support both legacy (IPv4) and IPv6 networks.

I/O

Devices direct assignment

In order for containers to directly consume host hardware resources, the Kata Containers runtime MUST provide containers with secure pass through to generic devices (e.g. GPUs, SRIOV, RDMA, QAT) by leveraging I/O virtualization technologies (IOMMU, interrupt remapping, etc.).

Acceleration

The Kata Containers runtime MUST support accelerated and user space based I/O operations for networking (e.g. DPDK) and storage through vhost-user sockets.

Scalability

The Kata Containers runtime MUST support scalable I/O through the SRIOV technology.

Virtualization overhead reduction

One of the compelling aspects of containers is their minimal overhead compared to bare metal applications. A container runtime should strive to keep that overhead to a minimum in order to provide the expected user experience. As a consequence, the Kata Containers runtime implementation should be optimized for:

Minimal workload boot and shutdown times
Minimal workload memory footprint
Maximal networking throughput
Minimal networking latency

Testing and debugging

Continuous Integration

Each Kata Containers runtime pull request MUST pass a set of container related tests:

Unit tests: (runtime unit tests coverage ?)
Functional tests: The entire runtime CLI and APIs
Integration tests: Docker and Kubernetes.

Debugging

The Kata Containers runtime implementation MUST use structured logging in order to namespace log messages and facilitate debugging.
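
To make "namespaced" concrete, here is a minimal Go sketch of a structured log line rendered as key=value pairs. It uses only the standard library as a stand-in for a real structured logging package; the `formatEntry` helper and the `source` field name are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatEntry renders a log entry as space-separated key=value pairs.
// The "source" field (a hypothetical name) namespaces the message by component.
func formatEntry(source, level, msg string, fields map[string]string) string {
	parts := []string{
		"level=" + level,
		"source=" + source,
		fmt.Sprintf("msg=%q", msg),
	}
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic field order
	for _, k := range keys {
		parts = append(parts, k+"="+fields[k])
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(formatEntry("runtime", "error", "failed to start container",
		map[string]string{"command": "start", "container": "abc123"}))
}
```

Because every line carries the component name as a field, a consumer of the system log can filter by component without parsing free-form text.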

Optional Requirements

TBD

ChangeLog

v1 -> v2

Changed the Runtime API section wording as suggested by @bergwolf
Split requirements into 2 sections: mandatory and optional
Made the CRI support section a little clearer

v2 -> v3

Merged the Runtime API section into the CRI support one

v3 -> v4

Added the omitted state runc command
Added runc cli options

v4 -> v5

Added a networking section: CNI + IPv6
Added an I/O section: Acceleration and scalability
Renamed the Performance optimization section to Virtualization overhead reduction
Added a logging subsection

sameo commented 6 years ago

cc @jessfraz @sboeuf @jodh-intel @egernst @devimc @amshinde @mcastelino @grahamwhaley

gnawux commented 6 years ago

@sameo you could just edit the Google doc; it is not easy to collaborate on writing a document in an issue.

sameo commented 6 years ago

@gnawux Your suggestion about opening a PR for adding this as a document would be good. Folks can then comment on it and it can be rebased to take input into account.

gnawux commented 6 years ago

@sameo I feel it is a bit early to open a PR; others may add content directly to the Google doc.

However, if you think it is the time, I could file a PR and merge both our drafts into it.

sameo commented 6 years ago

@gnawux Since this is going to be part of our documentation, using a google doc for this sounds like an avoidable additional step. If this document would for sure not be merged to our documentation then a google doc would make sense, but here a PR makes more sense imho.

gnawux commented 6 years ago

@sameo Though I think it is better to get the doc looking ready before converting it to a PR, it's not a big problem. I will submit the PR first for discussion.

gnawux commented 6 years ago

@sameo and all

Do you think the requirements doc should go into the docs directory of this repo, or into the documentation repo? If the latter, do we have a directory hierarchy for that repo yet?

jessfraz commented 6 years ago

I kinda think the last part of that doc is very accurate, can we remove as many shim layers as possible and try to keep it simple :)

tallclair commented 6 years ago

Can you clarify the section on CRI support? It sounds like it's not saying that the runtime should be a complete implementation of the CRI, but rather that an implementation should be possible using it? Is OCI not sufficient for that?

egernst commented 6 years ago

I think the document is trending towards describing architecture rather than defining requirements (ie, OCI compliance, provide an API/library suitable for a CRI like frakti, device hotplug, etc.). If/when I get permissions for editing I can help make some edits/add clarity.

gnawux commented 6 years ago

@jessfraz yes, this is what we do in production, and we should keep the scenario working in kata.

@egernst :+1: permission granted.

sameo commented 6 years ago

@tallclair Yes, that's what I was trying to say indeed. OCI is enough for CRI implementations like CRI-O or cri-containerd but e.g. Frakti does not rely on the OCI interface but rather on the runV API. The current docker-shim also does not rely on OCI. So for implementing a CRI server, OCI may be sufficient but it's not necessary.

bergwolf commented 6 years ago

@sameo @tallclair while OCI can be used to implement CRI support, its runtime spec may not be efficient enough for Kata Containers due to the lack of a proper storage description. When Kata Containers supports cri-containerd/CRI-O via OCI (aka the runc CLI interfaces), it relies on 9pfs (which itself is slow and problematic -- we even have to hack the 9pfs kernel module to reach POSIX compliance) to map local storage to the guest, and there is no description for remote storage in the spec.

OTOH, the runV library API is more native for VM-based containers and favors CRI by design. In fact it was designed together with CRI. In that sense, runV is compatible with the OCI spec and provides extended APIs to better suit the needs of CRI and VM-based containers. With the runV API, frakti is able to use both local block storage and remote storage more efficiently.

So to amend the last paragraph of @sameo 's requirements list, I would suggest we change the requirement of Runtime API from

Runtime API

Some CRI implementations (e.g. Frakti) may rely on the runtime API instead of the CLI. The Kata Containers should provide a runtime API definition and a runtime library to support those cases.

To "

Runtime Library API

While CRI-O and cri-containerd rely on runc compatible CLI, some CRI implementations like frakti rely on the runtime library API instead. The Kata Containers MUST provide a runtime library API favoring CRI design to support them. "

And it should be moved up right after the CRI support section, instead of being put at the end of the list, which might give the impression that it is a minor requirement that can be dismissed.

resouer commented 6 years ago

@tallclair since we mentioned CRI, I would like to present detailed research on the potential integration options for Kata and CRI shims at the next sig-node meeting.

Let's put all these problems on the table and see how they can be fixed.

sameo commented 6 years ago

@bergwolf

When Kata Containers supports cri-containerd/CRI-O via OCI (aka the runc CLI interfaces), it relies on 9pfs (which itself is slow and problematic -- we even have to hack the 9pfs kernel module to reach POSIX compliance) to map local storage to the guest.

That is not entirely correct I'm afraid. With cc-runtime we do hotplug block based devices as local storage when the container overlay filesystem allows for it. So 9pfs is a fallback, not the default.

there is no description for remote storage in the spec.

Yes, there is no such description in the OCI spec. But out of curiosity, is that specified in the current CRI spec? I don't see it but I may very well be missing something.

So to amend the last paragraph of @sameo 's requirements list, I would suggest we change the requirement of Runtime API from

Done, thanks for the input. I did not change the order because I think all those requirements are mandatory. Instead I split the list into Mandatory and Optional requirements sections to state explicitly which ones are mandatory, and the runtime API is one of the mandatory ones.

bergwolf commented 6 years ago

@sameo, thanks for updating the list! I still think we can put all CRI related requirements together instead of scattering them all over the doc. But that can be done in a future update.

That is not entirely correct I'm afraid. With cc-runtime we do hotplug block based devices as local storage when the container overlay filesystem allows for it. So 9pfs is a fallback, not the default.

Well, it's true that you can hotplug block devices if they are specified in the Spec.Linux.Devices section of the OCI spec. That is true for any pluggable device. And what you present to the container process is the device itself rather than any file system directory.

The problem is that with an OCI spec, the rootfs and volumes are specified in Spec.Root and Spec.Mounts. You then do not know whether the rootfs and volumes are block based devices or not, let alone which device you need to hotplug to the guest.
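
To illustrate the point, here is a small Go sketch that parses the `root` and `mounts` parts of an OCI `config.json`. The structs are a reduced subset of the runtime spec written for this example, and the sample configuration is made up. What the runtime gets is a set of paths and mount types; nothing in these fields says whether a path is backed by a block device.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reduced subset of the OCI runtime config, for illustration only.
type root struct {
	Path string `json:"path"`
}

type mount struct {
	Destination string `json:"destination"`
	Type        string `json:"type"`
	Source      string `json:"source"`
}

type spec struct {
	Root   root    `json:"root"`
	Mounts []mount `json:"mounts"`
}

// exampleConfig is a made-up configuration fragment.
const exampleConfig = `{
  "root": {"path": "rootfs"},
  "mounts": [
    {"destination": "/data", "type": "bind", "source": "/var/lib/volumes/data"}
  ]
}`

// parseSpec decodes the subset of config.json defined above.
func parseSpec(data string) (spec, error) {
	var s spec
	err := json.Unmarshal([]byte(data), &s)
	return s, err
}

func main() {
	s, err := parseSpec(exampleConfig)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rootfs path:", s.Root.Path)
	for _, m := range s.Mounts {
		fmt.Printf("mount %s -> %s (type %s)\n", m.Source, m.Destination, m.Type)
	}
}
```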

But out of curiosity, is that specified in the current CRI spec? I don't see it but I may very well be missing something.

That is not defined by the CRI spec but can be supported via flexvolume. And there is ongoing change to support it in the CSI spec.

sameo commented 6 years ago

@bergwolf

I still think we can put all CRI related requirements together instead of scattering them all over the doc.

Yes, that makes sense. I've merged the runtime api into the CRI support section. Please let me know how it looks now.

Well, it's true that you can hotplug block devices if they are specified in the Spec.Linux.Devices section of the OCI spec. That is true for any pluggable device. And what you present to the container process is the device itself rather than any file system directory.

The problem is that with an OCI spec, the rootfs and volumes are specified in Spec.Root and Spec.Mounts. You then do not know whether the rootfs and volumes are block based devices or not, let alone which device you need to hotplug to the guest.

Sorry for the confusion, I should not have mentioned the hotplugging side of things here. We do VMM hotplug, but only for efficiency reasons. But cc-runtime and virtcontainers dynamically detect whether a rootfs is a block based device or not, and present it as a disk inside the VM or as a 9pfs mount point respectively. As you pointed out, the performance and POSIX compatibility are significantly different between the 2. So my point was that you can do proper block based I/O with or without help from the current OCI spec.
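
The kind of decision described above can be sketched in Go as follows. This is not the actual cc-runtime or virtcontainers logic: it assumes a deliberately simple heuristic (the rootfs is itself a mount point whose source is a node under `/dev/`), and `parseMounts`, `rootfsStrategy` and the sample mount table are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountEntry holds the first three columns of a /proc/mounts style line.
type mountEntry struct {
	Source, MountPoint, FSType string
}

// exampleMounts is a made-up mount table in the /proc/mounts layout.
const exampleMounts = `/dev/mapper/docker-thin-1 /var/lib/docker/devicemapper/mnt/abc/rootfs ext4 rw 0 0
overlay /var/lib/docker/overlay2/def/merged overlay rw 0 0`

// parseMounts parses lines of the form
// "<source> <mount point> <fstype> <options> <dump> <pass>".
func parseMounts(table string) []mountEntry {
	var entries []mountEntry
	sc := bufio.NewScanner(strings.NewReader(table))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 3 {
			continue // skip malformed lines
		}
		entries = append(entries, mountEntry{f[0], f[1], f[2]})
	}
	return entries
}

// rootfsStrategy returns "block" when rootfs is a mount point backed by a
// /dev node (assumed heuristic), and "9pfs" as the fallback otherwise.
func rootfsStrategy(rootfs string, mounts []mountEntry) string {
	for _, m := range mounts {
		if m.MountPoint == rootfs && strings.HasPrefix(m.Source, "/dev/") {
			return "block"
		}
	}
	return "9pfs"
}

func main() {
	mounts := parseMounts(exampleMounts)
	for _, rootfs := range []string{
		"/var/lib/docker/devicemapper/mnt/abc/rootfs",
		"/var/lib/docker/overlay2/def/merged",
	} {
		fmt.Println(rootfs, "->", rootfsStrategy(rootfs, mounts))
	}
}
```

The detection happens on the host side, which is why it does not depend on anything in the OCI spec describing the storage backend.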

sameo commented 6 years ago

@egernst @tallclair I've updated the CRI support section. Hopefully it reads better now.

WeiZhang555 commented 6 years ago

I read this whole thread and I think there are some requirements none of you mentioned.

1. The first big part is "docker API" support. Although the "docker API" also goes through the "runc compatible CLI", there are still some tricks needed to make "kata-runtime" work from the docker command line. One example is "how to compose a POD": CRI uses specific labels to classify a POD or a Container, but docker doesn't. And "K8S->docker->kata" is still an important scenario even though we already have CRI-O as a replacement.
2. K8S ecosystem support:
2.1 CNI network support
2.2 Monitoring: how to use cAdvisor to monitor kata container resource usage.
2.3 Logs: K8S monitors container logs. As far as I know, it uses a volume as the log transfer channel; if so, kata-runtime should support this natively, but I can't be sure.

These are also important for working with K8S, and I know both cc and runv have experience with these. It would be good to add some illustrations for them.

sameo commented 6 years ago

@WeiZhang555

And "K8S->docker->kata" is still an important scenario

Do you mean we want to support the dockershim CRI implementation, i.e. "K8S->dockershim->kata" ?

jodh-intel commented 6 years ago

Looks good. I'm a little unclear about the OCI compat / runc compat sections, specifically why those particular runc commands are listed but not some others. For example, delete isn't mentioned, nor is state. There are other runc commands, but they do not form part of the OCI runtime spec.

Another point - runc has particular options (like --console-socket and --pid-file) which are not part of the OCI spec, so should those also be listed?

Also, how about adding something about logging / problem determination / debugging? Specifically the ability to determine easily that a problem originates in the runtime rather than one of the other components.

We might benefit from some input from @chavafg for thoughts on Testing requirements.

WeiZhang555 commented 6 years ago

@sameo dockershim could be enough as it will be the formal implementation in K8S.

@jodh-intel

Also, how about adding something about logging / problem determination / debugging? Specifically the ability to determine easily that a problem originates in the runtime rather than one of the other components.

We might benefit from some input from @chavafg for thoughts on Testing requirements.

Good point! This is also an important part that I really care about; it can be a strong guarantee of quality and give more confidence to kata container developers and users.

sameo commented 6 years ago

@jodh-intel delete was already there, and state was just an unintentional omission. I added it now. I also added the 2 options you mentioned.

Could you elaborate a little more on the logging/debugging ?

gnawux commented 6 years ago

@WeiZhang555 dockershim is moving out of kubelet, but I think Docker support is still a valid requirement.

jodh-intel commented 6 years ago

Thanks @sameo.

Regarding logging, ideally the runtime would use structured logging (as provided by https://godoc.org/github.com/sirupsen/logrus, for example), such that one of the log fields would specify "runtime" or "kata-runtime" to allow consumers of the system log to determine that an error was being generated by the runtime (as opposed to the shim/proxy/agent/hypervisor).

Also, every time an OCI command is called, it would be extremely useful if the runtime could also log its version and the commit it was built with. That shouldn't necessarily be required for every log call but since an OCI command delineates a block of log calls, that should be sufficient.

In summary: a quick sudo journalctl -a | grep level=error should be sufficient to establish:

a) if there were any errors.
b) if errors occurred, which component first detected them.
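
The two questions above can be answered by a very small log scanner. The Go sketch below assumes log lines made of space-separated key=value pairs and a hypothetical `source` field naming the component; the sample lines, including the `version` and `commit` values, are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// exampleLogs is a made-up log excerpt; field names are assumptions.
const exampleLogs = `level=info source=runtime msg="create" version=1.0.0 commit=abc123
level=error source=agent msg="mount failed"
level=error source=runtime msg="start failed"`

// field extracts the value of key from a line of space-separated key=value
// pairs. Simplification: quoted values that contain spaces are split, so a
// key=value lookalike inside a quoted message would also match.
func field(line, key string) (string, bool) {
	for _, tok := range strings.Fields(line) {
		if strings.HasPrefix(tok, key+"=") {
			return strings.TrimPrefix(tok, key+"="), true
		}
	}
	return "", false
}

// firstError reports whether any line has level=error and, if so, the
// source field of the first such line.
func firstError(logs string) (source string, found bool) {
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if lvl, ok := field(line, "level"); ok && lvl == "error" {
			src, _ := field(line, "source")
			return src, true
		}
	}
	return "", false
}

func main() {
	if src, found := firstError(exampleLogs); found {
		fmt.Println("first error detected by:", src)
	} else {
		fmt.Println("no errors")
	}
}
```

In this sample the agent logs the first error, so the later runtime error can be read as a consequence rather than the origin.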

sameo commented 6 years ago

@gnawux It seems that dockershim is indeed moving out, but afaik CRI is now mandatory for kubelet so the legacy Docker runtime is no longer supported. So I'm fine with supporting dockershim, but not Docker itself.

gnawux commented 6 years ago

@sameo what's the difference between dockershim and docker from kata's view?

One more thing: I think one of the most significant differences between different clients could be the networking part. @WeiZhang555 do you have some input on the networking related requirements?

egernst commented 6 years ago

1. For networking, I would call out support for all CNI and CNM plugins (some of these won't be feasible, but the key ones need to be compatible and regularly tested).
2. For networking, I would call out vhost-user (i.e. DPDK) and SRIOV support.
3. For networking (longer term?), we should support IPv6.
4. For advanced I/O: generic device handling via vhost-user sockets (SCSI, block device, for example).
5. For advanced I/O: generic device pass through (for devices like and including SRIOV, RDMA, QAT, Graphics, etc.).

gnawux commented 6 years ago

@egernst yes, at least we need CNI first, and items 2-5 look fine.

WeiZhang555 commented 6 years ago

@egernst This looks fine.

CNI should be more important for K8S integration. From a command line interface point of view, I saw that every "smart" or "auto-detect" implementation for CNI failed in cc and runv; it's hard to support both CNM and CNI with one implementation. So CNI first. As a suggestion, we can use a more direct way to support CNI: just provide something like a kata-runtime interface/route command to manipulate network interfaces and routes for a lite VM, and provide a new CNI plugin binary as an example in kata. This has proven to be efficient in my tests.

We can discuss more about networking part later, to find a good way for doing this.

gnawux commented 6 years ago

@WeiZhang555 yes, for Kubernetes-facing scenarios, we don't need CNM. And all the existing CNM implementations for runV or CC could be summarized as "workarounds" so far.

sameo commented 6 years ago

@gnawux

@sameo what's the difference between dockershim and docker from kata's view?

From kata's view it's almost identical as the kata runtime ends up being called as a Docker runtime. But from a development/integration perspective it's different: dockershim is a CRI implementation and by adding hardware virtualization awareness to the spec, we can have dockershim taking it into account and eventually calling Docker with e.g. the right annotations.

sameo commented 6 years ago

@egernst @WeiZhang555 @jodh-intel I think I captured all your input, please let me know if that's not the case yet.

jodh-intel commented 6 years ago

Thanks @sameo - lgtm

egernst commented 6 years ago

Adding this to the documentation repo; further review should happen there. Please see https://github.com/kata-containers/documentation/pull/17

kata-containers / runtime

Requirements for merged kata runtime #31