opencontainers / runtime-spec

OCI Runtime Specification
http://www.opencontainers.org
Apache License 2.0

vm based args in spec?? #964

Open crosbymichael opened 6 years ago

crosbymichael commented 6 years ago

There was an existing comment in the VM PR located here that was not resolved before merge:

https://github.com/opencontainers/runtime-spec/pull/949#discussion_r178318374

The overall issue is why VM args need to be specified in the spec when the hypervisor is the one being invoked to read/process the spec.

vbatts commented 6 years ago

so these args are for that container runtime instance. If the args are changed, then it's a new/different runtime instance, right? It seems for audit and introspection you'd want to see the args used to start that VM.

crosbymichael commented 6 years ago

What would be the difference between this and the args used to exec runc then? It's weird.

crosbymichael commented 6 years ago

@sameo Could you take a look at this?

vbatts commented 6 years ago

Args to runc would be equivalent to args to kata-runtime. But these args are the args to whatever backing hypervisor is used. Maybe not a hard requirement, but good for reproducibility and audit.

On Thu, May 24, 2018, 08:38 Michael Crosby notifications@github.com wrote:

What would be the difference between this and args used to exec runc then?


crosbymichael commented 6 years ago

So these VM runtimes wrap another thing?

sameo commented 6 years ago

@crosbymichael

the hypervisor is the one being invoked to read/process the spec.

The hypervisor (KVM, Xen, ESX, etc.) does not read and process the spec. The spec is processed by the runtime itself, exactly like runc. The hypervisor creates and manages the VM that's going to host the container workload/process. You could think of the hypervisor as a different isolation and resource-sharing API than namespaces and cgroups, respectively. So instead of calling into a set of host kernel APIs, you call into a hypervisor API.

OCI VM runtimes carry a set of default hypervisor arguments (static and dynamic) for each hypervisor they support. They're different from the set of arguments you'd pass to runc, as they only specify how the hypervisor should create the VM that the runtime is going to control in order to manage the container workload inside it. Here the arguments are optional, because I don't think you'd want to specify them outside of tracing/debugging/auditing use cases.
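To make this concrete, a sketch of the spec's vm object as merged from #949 (paths and flags below are illustrative placeholders, not defaults of any particular runtime):

```json
{
  "hypervisor": {
    "path": "/usr/bin/qemu-system-x86_64",
    "parameters": ["-machine", "q35,accel=kvm"]
  },
  "kernel": {
    "path": "/path/to/vmlinuz",
    "parameters": ["console=hvc0"]
  },
  "image": {
    "path": "/path/to/rootfs.img",
    "format": "raw"
  }
}
```

The `hypervisor.parameters` array above is the field this issue is about.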

Does that clarify things a little?

egernst commented 6 years ago

How useful are these args given that in many cases most of the parameters are dynamic, added via QMP (in the qemu case)?

So long as this is optional, it seems reasonable to me.

vbatts commented 6 years ago

@egernst @sameo so if these hypervisors have known flags, or flags that relate to values already in the config (i.e. memory, cpuset shares, etc.), then they would be known to the VM runtime, right? Could this 'args' field perhaps be more abstracted, like into labels or annotations?

vbatts commented 6 years ago

Also, this resolution is needed to prep for a release

egernst commented 6 years ago

@sameo - For me it'd be helpful to have a more specific example use-case for this field. I'll try to add this here, PTAL.

@vbatts @crosbymichael -- In the kata case, there are many items which we end up configuring on a per-node basis through a configuration.toml. Example of this is at [1].

Some potentially relevant items which could optionally be configured on a per-container basis:

- machine type
- machine accelerators
- iothreads
- memory prealloc
- huge pages

etc.

These could be configured on a per workload basis.

[1] - https://github.com/kata-containers/runtime/blob/master/cli/config/configuration.toml.in
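For illustration, the items above map onto QEMU command-line flags roughly like this (values are placeholders; in practice the runtime assembles the invocation, with many parameters added dynamically over QMP):

```sh
qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -object iothread,id=iothread0 \
  -mem-prealloc \
  -mem-path /dev/hugepages
```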

egernst commented 6 years ago

@bergwolf PTAL.

bergwolf commented 6 years ago

@egernst If we think of highly customized guest configs for different workloads/needs on a per pod sandbox basis, I'm afraid there are just too many of them for each hypervisor type. E.g., the list you gave is just part of the configuration surface for QEMU. It does not make sense for some other hypervisors, which would have a different set of configurations.

IOW, I tend to agree with @vbatts that we put them in labels or annotations. In kata, we can define and check for those labels/annotations, and override the default per-node configuration with the provided ones.
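As a sketch of that approach, a workload could carry its overrides in the config.json annotations map (the annotation keys here are made up for illustration; each runtime would define and document its own):

```json
{
  "annotations": {
    "com.example.hypervisor.machine_type": "q35",
    "com.example.hypervisor.enable_iothreads": "true",
    "com.example.hypervisor.memory_prealloc": "true"
  }
}
```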

egernst commented 6 years ago

@bergwolf I agree.

sameo commented 6 years ago

I'm sure we could put some effort into abstracting some common arguments across most hypervisors, but we would need to handle a labels based overriding mechanism anyway. This is a very powerful mechanism for customizing your virtualizer per pod/workload. So bottom line for me: I agree with @vbatts and @bergwolf here.

jterry75 commented 4 years ago

We (Microsoft/hcsshim) have been pretty much exclusively using annotations to override any default behavior. But we do try to honor the spec itself if it has the relevant fields. So for example, a hypervisor-isolated container that has a Memory.Limit would use that value as the hypervisor's memory limit as well. As you can see from this approach, however, it does change the container's actual memory limit and affect its capacity, because the VM itself uses more memory than a true process environment. The Kube concept of a "runtime overhead" can help with that one by default, but there are other examples that don't fit there. I am OK with making a common set of typed fields for any hypervisor to implement, but I don't think we can ever do away with the use of annotations for customizations between all the different hypervisor implementations.