Closed gnawux closed 6 years ago
I added @tallclair @WeiZhang555 @sameo in the doc with edit permission, and you can invite others.
@gnawux Maybe it's a good idea to start discussing those requirements here instead?
Here is my proposal:
[DRAFT v5] [See changelog at the bottom of this page]
The Kata Containers runtime MUST fulfill all requirements below:
The Kata Containers runtime MUST implement the OCI runtime specification and support all the OCI runtime operations.
`runc` CLI compatibility: In theory, being OCI compatible should be enough. In practice, a Kata Containers runtime should comply with the latest stable `runc` CLI. In particular, it MUST implement the following `runc` commands:
create
delete
exec
kill
list
pause
ps
state
start
version
and the following command line options:
--console-socket
--pid-file
The Kata Containers project MUST provide two interfaces for CRI shims to be able to manage hardware virtualization based Kubernetes pods and containers:

A `runc`-compatible command line interface, as described in the previous section. This interface is used by e.g. the CRI-O and cri-containerd CRI implementations.

A runtime library API. The frakti CRI shim is one example of such a consumer.

The Kata Containers runtime MUST NOT be architecture specific. It should be able to support multiple hardware architectures and provide a pluggable and flexible design for adding support for additional ones.
The Kata Containers runtime MUST NOT be tied to any specific hardware virtualization technology, hypervisor or virtual machine monitor implementation. It should support multiple hypervisors and provide a pluggable and flexible design for adding support for additional ones.
The Kata Containers runtime MUST support nested virtualization environments.
In order for containers to directly consume host hardware resources, the Kata Containers runtime MUST provide containers with secure passthrough to generic devices (e.g. GPUs, SR-IOV, RDMA, QAT) by leveraging I/O virtualization technologies (IOMMU, interrupt remapping, etc.).
The Kata Containers runtime MUST support accelerated and user space based I/O operations for networking (e.g. DPDK) and storage through vhost-user sockets.

The Kata Containers runtime MUST support scalable I/O through the SR-IOV technology.
One of the compelling aspects of containers is their minimal overhead compared to bare metal applications. A container runtime should strive to keep that overhead to a minimum in order to provide the expected user experience. As a consequence, the Kata Containers runtime implementation should be optimized for:
Each Kata Containers runtime pull request MUST pass a set of container related tests:
The Kata Containers runtime implementation MUST use structured logging in order to namespace log messages and facilitate debugging.
TBD
Added the `state` runc command.

cc @jessfraz @sboeuf @jodh-intel @egernst @devimc @amshinde @mcastelino @grahamwhaley
@sameo you could just edit the google doc, it is not easy to collaborate writing a document in an issue.
@gnawux Your suggestion about opening a PR for adding this as a document would be good. Folks can then comment on it and it can be rebased to take input into account.
@sameo I feel it is a bit early to put up a PR; others may input content directly on the google doc.
However, if you think it is the time, I could file a PR and merge both our drafts into it.
@gnawux Since this is going to be part of our documentation, using a google doc for this sounds like an avoidable additional step. If this document would for sure not be merged to our documentation then a google doc would make sense, but here a PR makes more sense imho.
@sameo Though I think it is better to make the doc look ready before converting it to a PR, it's not a big problem. I will submit the PR first for discussion.
@sameo and all
Do you guys think the requirement docs should be put into docs directory of this repo, or the document repo? If the latter, do we have a dir hierarchy of the repo yet?
I kinda think the last part of that doc is very accurate, can we remove as many shim layers as possible and try to keep it simple :)
Can you clarify the section on CRI support? It sounds like it's not saying that the runtime should be a complete implementation of the CRI, but rather that an implementation should be possible using it? Is OCI not sufficient for that?
I think the document is trending towards describing architecture rather than defining requirements (i.e., OCI compliance, providing an API/library suitable for a CRI like frakti, device hotplug, etc.). If/when I get edit permissions I can help make some edits/add clarity.
@jessfraz yes, this is what we do in production, and we should keep the scenario working in kata.
@egernst :+1: permission granted.
@tallclair Yes, that's what I was trying to say indeed. OCI is enough for CRI implementations like CRI-O or cri-containerd but e.g. Frakti does not rely on the OCI interface but rather on the runV API. The current docker-shim also does not rely on OCI. So for implementing a CRI server, OCI may be sufficient but it's not necessary.
@sameo @tallclair While OCI can be used to implement CRI support, its runtime spec may not be efficient enough for Kata Containers due to the lack of a proper storage description. When Kata Containers supports cri-containerd/CRI-O via OCI (aka the runc CLI interface), it relies on 9pfs (which itself is slow and problematic -- we even had to hack the 9pfs kernel module to reach POSIX compliance) to map local storage to the guest, and there is no description of remote storage in the spec.
OTOH, the runV library API is more native for VM-based containers and favors CRI by design. In fact, it was designed together with CRI. In that sense, runV is compatible with the OCI spec and provides extended APIs to better suit the needs of CRI and VM-based containers. With the runV API, frakti is able to use both local block storage and remote storage more efficiently.
So to amend the last paragraph of @sameo 's requirements list, I would suggest we change the requirement of Runtime API from
Runtime API Some CRI implementations (e.g. Frakti) may rely on the runtime API instead of the CLI. The Kata Containers should provide a runtime API definition and a runtime library to support those cases.
To "
While CRI-O and cri-containerd rely on runc compatible CLI, some CRI implementations like frakti rely on the runtime library API instead. The Kata Containers MUST provide a runtime library API favoring CRI design to support them. "
And it should be moved up right after the CRI support section, instead of being put at the end of the list, which might give the impression that it is a minor requirement that can be dismissed.
@tallclair as we mentioned CRI things, I would like to provide a detailed research of potential integration options of Kata and CRI shims in next sig-node meeting.
Let's put all these problems on the table and see how they can be fixed.
@bergwolf
When Kata Containers support cri-containerd/CRI-O via OCI (aka. the runc cli interfaces), it relies on 9pfs (which itself is slow and problematic -- we even have to hack 9pfs kernel module to reach POSIX compliance) to map local storage to the guest.
That is not entirely correct I'm afraid. With cc-runtime we do hotplug block based devices as local storage when the container overlay filesystem allows for it. So 9pfs is a fallback, not the default.
there is no description for remote storage in the spec.
Yes, there is no such description in the OCI spec. But out of curiosity, is that specified in the current CRI spec? I don't see it but I may very well be missing something.
So to amend the last paragraph of @sameo 's requirements list, I would suggest we change the requirement of Runtime API from
Done, thanks for the input. I did not change the order because I think all those requirements are mandatory. Instead I created an Optional requirements section to explicitly state which ones are mandatory and the runtime API is part of it.
@sameo, thanks for updating the list! I still think we can put all CRI related requirements together instead of scattering them all over the doc. But that can be done in a future update.
That is not entirely correct I'm afraid. With cc-runtime we do hotplug block based devices as local storage when the container overlay filesystem allows for it. So 9pfs is a fallback, not the default.
Well, it's true that you can hotplug block devices if they are specified in the Spec.Linux.Devices section of the OCI spec. That is true for any pluggable device. And what you present to the container process is the device itself rather than any file system directory.

The problem is that with an OCI spec, rootfs and volumes are specified in Spec.Root and Spec.Mounts. Then you do not know if the rootfs and volumes are block based devices or not, not to mention which device you need to hotplug to the guest.
But out of curiosity, is that specified in the current CRI spec? I don't see it but I may very well be missing something.
That is not defined by the CRI spec but can be supported via flexvolume. And there are ongoing changes to support it in the CSI spec.
@bergwolf
I still think we can put all CRI related requirements together instead of scattering them all over the doc.
Yes, that makes sense. I've merged the runtime api into the CRI support section. Please let me know how it looks now.
Well, it's true that you can hotplug block devices if they are specified in the Spec.Linux.Devices section of the OCI spec. That is true for any pluggable device. And what you present to the container process is the device itself rather than any file system directory.

The problem is that with an OCI spec, rootfs and volumes are specified in Spec.Root and Spec.Mounts. Then you do not know if the rootfs and volumes are block based devices or not, not to mention which device you need to hotplug to the guest.
Sorry for the confusion, I should not have mentioned the hotplugging side of things here. We do VMM hotplug, but only for efficiency reasons. But cc-runtime and virtcontainers dynamically detect if a rootfs is a block based device or not, and present it as a disk inside the VM or as a 9pfs mount point respectively. As you pointed out, the performance and POSIX compatibility are significantly different between the two. So my point was that you can do proper block based I/O with or without the current OCI spec's help.
@egernst @tallclair I've updated the CRI support section. Hopefully it reads better now.
I read this whole thread and I think there are some requirements none of you mentioned.
The first big part is "docker API" support. Though the "docker API" also goes through the "runc compatible CLI" API, there are still some tricks to make "kata-runtime" work from the docker command line. One example is how to compose a POD: CRI uses specific labels to classify a POD or Container, but docker doesn't. And "K8S->docker->kata" is still an important scenario, though we already have CRI-O as a replacement.
K8S Ecosystem support:
2.1 CNI network support
2.2 Monitoring: how to use cAdvisor to monitor kata container resource usage.
2.3 Logs: K8S monitors container logs. As far as I know, it uses a volume as the log transfer channel; if so, kata-runtime should support this natively, but I can't be sure.
These are also important for working with K8S, and I know both cc and runv have experience with these. It would be better to add some illustrations for them.
@WeiZhang555
And "K8S->docker->kata" is still an important scenario
Do you mean we want to support the dockershim CRI implementation, i.e. "K8S->dockershim->kata" ?
Looks good. I'm a little unclear about the OCI compat / runc compat sections, specifically why those particular runc commands are listed but not some others. For example, delete isn't mentioned, nor is state. There are other runc commands, but they do not form part of the OCI runtime spec.
Another point - runc has particular options (like --console-socket and --pid-file) which are not part of the OCI spec, so should those also be listed?
Also, how about adding something about logging / problem determination / debugging? Specifically the ability to determine easily that a problem originates in the runtime rather than one of the other components.
We might benefit from some input from @chavafg for thoughts on Testing requirements.
@sameo dockershim could be enough, as it will be the formal implementation in K8S.
@jodh-intel
Also, how about adding something about logging / problem determination / debugging? Specifically the ability to determine easily that a problem originates in the runtime rather than one of the other components.
We might benefit from some input from @chavafg for thoughts on Testing requirements.
Good point! This is also an important part I really care about, and it can be a strong guarantee of quality and give more confidence to kata container developers and users.
@jodh-intel delete was already there, and state was just an unintentional omission. I added it now.

I also added the two options you mentioned.
Could you elaborate a little more on the logging/debugging ?
@WeiZhang555 dockershim is moving out from kubelet, but I think a docker support is still a valid requirement.
Thanks @sameo.
Regarding logging, ideally the runtime would use structured logging (as provided by https://godoc.org/github.com/sirupsen/logrus, for example), such that one of the log fields would specify "runtime" or "kata-runtime" to allow consumers of the system log to determine that an error was being generated by the runtime (as opposed to the shim/proxy/agent/hypervisor).
Also, every time an OCI command is called, it would be extremely useful if the runtime could also log its version and the commit it was built with. That shouldn't necessarily be required for every log call but since an OCI command delineates a block of log calls, that should be sufficient.
In summary: a quick sudo journalctl -a | grep level=error should be sufficient to establish:
a) if there were any errors.
b) if errors occurred, which component first detected them.
@gnawux It seems that dockershim is indeed moving out, but afaik CRI is now mandatory for kubelet so the legacy Docker runtime is no longer supported. So I'm fine with supporting dockershim, but not Docker itself.
@sameo what's the difference between dockershim and docker from kata's view?
One more thing, I think one of the most significant differences between different client could be the networking part. @WeiZhang555 could you have some input on the networking related requirements?
@egernst yes, at least we need CNI first, and 2-5 look fine.
@egernst This looks fine.
CNI should be more important for K8S integration.
From a command line interface point of view, I saw that every "smart" or "auto-detect" implementation for CNI failed in cc and runv; it's hard to support both CNM and CNI with one implementation.
So CNI first. As a suggestion, we can use a more direct way to support CNI: just provide something like a kata-runtime interface/route command to manipulate network interfaces and routes for a lite VM, and provide a new CNI plugin binary as an example in kata. This has been proven to be efficient in my tests.
We can discuss more about networking part later, to find a good way for doing this.
@WeiZhang555 yes, for kubernetes-facing scenarios we don't need CNM. And all the existing CNM implementations for runV or CC could be summarized as "workarounds" so far.
@gnawux
@sameo what's the difference between dockershim and docker from kata's view?
From kata's view it's almost identical, as the kata runtime ends up being called as a Docker runtime. But from a development/integration perspective it's different: dockershim is a CRI implementation, and by adding hardware virtualization awareness to the spec, we can have dockershim take it into account and eventually call Docker with e.g. the right annotations.
@egernst @WeiZhang555 @jodh-intel I think I captured all your input, please let me know if that's not the case yet.
Thanks @sameo - lgtm
Adding to the documentation repo - further review should happen there. Please see https://github.com/kata-containers/documentation/pull/17
Here is a draft of the requirements (work in progress).
https://docs.google.com/document/d/109pxj-90Ly58ma8CoeRKcMoPWBD0G911E53MeK2zhhA/edit?usp=sharing
Currently it lists some of the working scenarios for the runtime, and more sections should be added.
Once we have agreement on the requirements, the doc should be put into the repo as a markdown doc.