opencontainers / runc

CLI tool for spawning and running containers according to the OCI specification
https://www.opencontainers.org/
Apache License 2.0

discuss: we should not expose runtime-created container-specified resources back to upper layer #1137

Open · laijs opened 7 years ago

laijs commented 7 years ago

(partially copied from https://github.com/opencontainers/runc/pull/1018#issuecomment-255156763)

Examples of runtime-created, container-specific resources: the pid of the container process, the ptymaster, etc.

The upper layer should only control and fetch info on the container via the runtime API (the command line, opencontainers/runtime-spec#513, or state.json).

Detachable stdio is a strongly required feature; we can and should implement it in the runtime rather than in an upper layer such as containerd-shim or ocid/conmon.

We can add runc connect-stdio or something similar for connecting/re-connecting the stdio stream, and runc set-win-size or something similar for changing the window size when tty=true.
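
For illustration, the operation behind such a hypothetical runc set-win-size subcommand is just a TIOCSWINSZ ioctl on the ptymaster. A minimal Go sketch, assuming the subcommand exists and that the ptymaster fd is handed to it as inherited fd 3 (both of which are assumptions, not current runc behavior):

```go
// Hypothetical sketch of a "set-win-size" operation: given the
// container's ptymaster fd (here assumed to be inherited as fd 3),
// update the terminal window size. Only the TIOCSWINSZ ioctl is
// standard kernel API; everything else is illustrative.
package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/sys/unix"
)

func setWinSize(ptyFd, rows, cols int) error {
	ws := &unix.Winsize{Row: uint16(rows), Col: uint16(cols)}
	// The kernel records the new size and delivers SIGWINCH to the
	// tty's foreground process group inside the container.
	return unix.IoctlSetWinsize(ptyFd, unix.TIOCSWINSZ, ws)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: set-win-size <rows> <cols>")
		os.Exit(1)
	}
	rows, _ := strconv.Atoi(os.Args[1])
	cols, _ := strconv.Atoi(os.Args[2])
	if err := setWinSize(3, rows, cols); err != nil {
		fmt.Fprintln(os.Stderr, "set-win-size:", err)
		os.Exit(1)
	}
}
```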

cyphar commented 7 years ago

As I said in #1018, if you can explain why it's a bad design that would be great. Namely, can you describe a use-case or other issue that the current design cannot support?

cyphar commented 7 years ago

Ah, okay, I've read the discussion that happened in #ocid. Let me respond to that. The reason we need --console-socket for runC is that we need the ability to create the pty inside the container without keeping runC around as a daemon (which is against runC's design principles).

Now, you say that this won't work for runV because the ptymaster is a process inside the VM which is managed by an external process. However, from what I can see it should be possible to translate everything using --console-socket -- nothing says that the fd sent over --console-socket has to be the actual pty master (it just has to act like one). So you could do something like this:

  1. runV creates a container and starts it.
  2. ptymaster sets up a PTY and communicates with some runV server on the host that allows you to manage the PTY. It then creates a fake PTY and sends it to runV.
  3. runV returns the fake PTY to the --console-socket consumer.

The trick is that by creating a fake PTY you can essentially relay everything that the consumer of --console-socket is doing to the ptymaster without breaking how runC's console handling has to work. There might be nicer ways of solving it without creating a second PTY, but I haven't looked into it (you'd need to have an FD that accepts the TTY ioctls -- which would require a kernel module and a bit too much magic for me).

Would that be acceptable? Your server that creates a fake PTY could receive SIGCHLD (you'd need to create a subprocess that has TIOCSCTTY for the fake PTY). It's not pretty and if you can come up with a better idea (that still solves the problem while maintaining runC's design principles) I'd be really happy to hear it!
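
A rough Go sketch of that relay, under stated assumptions: github.com/creack/pty stands in for whatever allocates the fake pty pair, backend is a placeholder for the runV/hyperstart stream, and the one-byte payload is illustrative; only the SCM_RIGHTS fd-passing mirrors what --console-socket actually does:

```go
// Rough sketch of the "fake PTY" relay for a VM-based runtime:
// allocate a second pty on the host, pass its master end to the
// --console-socket consumer via SCM_RIGHTS, and pump bytes between
// the fake pty's slave end and the real console backend.
package console

import (
	"io"
	"net"

	"github.com/creack/pty" // assumed helper around openpty(3); not runc's code
	"golang.org/x/sys/unix"
)

// backend is a placeholder for the stream that actually reaches the
// in-VM ptymaster (e.g. a hyperstart connection).
func relayConsole(consoleSocket *net.UnixConn, backend io.ReadWriter) error {
	master, slave, err := pty.Open() // the "fake" pty pair
	if err != nil {
		return err
	}
	defer slave.Close()

	// Hand the fake master to the consumer, just as runc hands over
	// the real ptymaster on --console-socket.
	oob := unix.UnixRights(int(master.Fd()))
	if _, _, err := consoleSocket.WriteMsgUnix([]byte("pty"), oob, nil); err != nil {
		return err
	}
	master.Close() // the consumer now owns its own copy of the fd

	// Everything the consumer writes to the fake pty shows up on the
	// slave side; relay it to the VM, and relay VM output back.
	go io.Copy(backend, slave)
	_, err = io.Copy(slave, backend)
	return err
}
```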

cyphar commented 7 years ago

Here's the IRC discussion:

<feisky>    hello, everyone
<nalind>    hello!
<mrunalp>    feisky, hey
<harry>    hi there
<feisky>    I think hooks works better for runv
<mrunalp>    feisky, I was asking if hooks work better for networking in hyper?
<mrunalp>    feisky, great!
<mrunalp>    feisky, I will check with our networking guys to see how we can get CNI working that way. I think it should be fine.
<gnawux>    Did we ever talk about the storage today?
<mrunalp>    gnawux, Yes, if you have any questions, nalind can help answer.
<dwalsh>    Not yet. nalind is working on the storage on our end.
<dwalsh>    In github.com/containers/storage
<gnawux>    OCI does not specify storage, and it assumes that we have a directory for the rootfs
<mrunalp>    gnawux, Yep, and containers/storage will do that for ocid
<feisky>    Great. What's the plan of integrating the storage driver?
<Guest74450>    for now, it looks like github.com/containers/storage is a fork of part of docker?
<nalind>    it started there, yes
<nalind>    most of the rest has been stripped out
<gnawux>    But actually, a block device works better for a hypervisor. And it does not conflict with containers. Could we have an extension on the containers/storage side to support passing a block device to the runtime?
<Guest74450>    I'm harry, btw ...
<dwalsh>    gnawux, Yes, we want to support different container storage than just the COW that is currently in docker's graphdriver.
<dwalsh>    We would look at Block Drivers as well as ReadOnly images on disk.
<dwalsh>    I would like to see support for remote storage also.
<nalind>    does the spec support notions other than 'mounted filesystem path'?
<gnawux>    The hyperd had the graph engine for block device and ceph/rbd, and we could contribute it if PR is welcome.
<mrunalp>    nalind, Nope, there are mounts and there is path to the rootfs.
<nalind>    gnawux: at the graphdriver level, it should be the same, so that could work
<nalind>    mrunalp: okay, so if i work back from there to a device node, is there a safe way to pass that block device to a VM if both of them want to use it?
<gnawux>    we pass the device path or rbd description to the hypervisor w/o mounting them.
<mrunalp>    nalind, Not sure how hyper is doing it but annotations are one way.
<mrunalp>    gnawux, How do you pass that information to the hyper runtime?
<gnawux>    mrunalp, currently we use our internal data structure, but I think we may have an extension of oci spec
<feisky>    CRI doesn't expose rbd metadata to runtimes, so it's hard for ocid to support rbd directly. But we could support rbd graph driver
<mrunalp>    gnawux, by extension you mean something that you added in your code?
<feisky>    mrunalp, hypernetes extends cinder network plugin
<gnawux>    mrunalp, I mean some extra field in the spec
<nalind>    so instead of a filesystem path, enough info to pass to the VM? yeah, that could work. ocid would have to "know" whether it wanted to mount on the host or leave it to the runtime, right?
<mrunalp>    feisky, gnawux okay
<dwalsh>    My question is does k8s pass this information in so that OCID could do the right thing. Or does the spec need to change?
<gnawux>    when we specify the root path, add an annotation to declare it is a block device or rbd instead of a vfs path
<nalind>    dwalsh: if i'm remembering the runtime api right, k8s doesn't care about where or how things are stored, only that somebody can bring up a container based on a specified image, so it's more about ocid and its storage capabilities and how it passes information to the runtime
<feisky>    dwalsh, k8s CRI doesn't pass this info now
<gnawux>    dwalsh: If we store the images in a graph engine based on block devices, then we could pass a block device to a vm-based runtime. I think it doesn't need k8s to pass any information
<mrunalp>    gnawux, okay, so it will be a matter of configuration. If ocid is started with vm backend, then it stores and uses images from block device?
<dwalsh>    Ok, great.
<gnawux>    mrunalp, I think the goal of ocid is to support multiple runtimes. Then it should be: if we store images in a block device, we pass the block device to the vm, and mount+pass the fs to the container
<harryz>    Referring to the collaboration, we will send some PRs first on both ocid and runv sides to fix some inconsistency issues
<mrunalp>    gnawux, yep, should work
<mrunalp>    harryz, sounds good
<dwalsh>    gnawux, Yes the goal of OCID is to support any OCI Compliant Runtime, support the OCI Bundle Image Format, and support the Kubernetes CRI.
<harryz>    the blocker for now is that the runc binary and the runv binary have differences, and we are not sure which one to follow
<dwalsh>    Specification differences?
<harryz>    not spec, just api differences
<mrunalp>    harryz, Are you following runtime spec #513?
<mrunalp>    harryz, It is working on standardizing the OCI runtime CLI API
<feisky>    mrunalp, is there any doc/issue tracking this standard?
<mrunalp>    https://github.com/opencontainers/runtime-spec/pull/513
<mrunalp>    feisky, ^
<gnawux>    We keep updating for the spec; laijs has more detail. The most significant difference should be on the daemon side, where there is some code dedicated to namespace/process handling
<gnawux>    and it's really good to have a command line spec
<harryz>    ok https://github.com/opencontainers/runtime-spec/pull/513/files is a good start
<mrunalp>    gnawux, harryz Yeah, it will make it easier to swap runtimes.
<laijs>    runc commands did change fast; how long until 513 is merged?
<gnawux>    mrunalp, can't agree more
<mrunalp>    laijs, Should be in 2-3 weeks.
<laijs>    good, thanks
<feisky>    mrunalp, has ocid been tested with kubelet CRI?
<mrunalp>    feisky, That's what I am working on now.
<feisky>    cool
<laijs>    which process will own the ptymaster? ocid or runc?
<mrunalp>    laijs, conmon
<feisky>    mrunalp, would you like to setup a slack channel so we don't need to change between those chat platforms?
<mrunalp>    feisky, yeah, sure
<laijs>    conmon is launched by ocid on the host. but in runv, hyperstart owns the ptymaster
<mrunalp>    laijs, okay and where does that run? inside the vm or on the host?
<feisky>    mrunalp, thanks. I think that will also benefit people from sig-node
<gnawux>    laijs, So conmon may need to support proxying the tty stream to hyperstart, right?
<laijs>    we can add a proxy layer, but we hope the runtime owns the ptymaster
<gnawux>    mrunalp, hyperstart run inside the vm as init process or an independent daemon
<mrunalp>    gnawux, okay
<mrunalp>    laijs, The reason we use conmon is because we want to be able to restart ocid without affecting running containers
<gnawux>    mrunalp, is there one conmon process per sandbox?
<mrunalp>    gnawux, No, there is one per container
<gnawux>    oh. even more than I expected :)
<mrunalp>    gnawux, Yeah, that makes it simplest for us right now.
<laijs>    sorry, my questions were wrong. which process owns the right to set the winsize of the container? we hope that process is controlled by the runtime.
<mrunalp>    laijs, Are you talking about SIGWINCH handling?
<mrunalp>    laijs, conmon will be relaying it to the pty slave that is duped to container process's stdio
<laijs>    the process do the ioctl TIOCSWINSZ on the pty
<mrunalp>    laijs, It will be conmon
<feisky>    So do we have more questions?
<laijs>    so we need a proxy for the setwinsize in this architecture. the proxy owns a pty slave, receives the changes from conmon, and sends them to hyperstart via the hyperstart api
<laijs>    in hyper container, every process on the host except qemu can be restarted or unexpectedly destroyed.
<mrunalp>    laijs, okay
<laijs>    qemu+hyperstart acts as conmon
<mrunalp>    laijs, could you open an issue detailing this and what will have to be added to conmon?
<laijs>    is it possible to skip conmon in this runv mode?
<mrunalp>    laijs, I am open to skipping it but if possible want to avoid special casing runv vs runc.
<gnawux>    I think it's better to discuss in issue/pr
<laijs>    ok
<harryz>    +1, for now conmon is a little bit runc-specific; let's open an issue instead
<gnawux>    mrunalp, I think it's a good start today. And thank you, dwalsh nalind
<mrunalp>    gnawux, laijs, feisky, harryz: Thanks! Was good talking to you all.
<feisky>    mrunalp, nalind, dwalsh, thanks.
<nalind>    feisky, gnawux, harryz, laijs, good to speak with you as well
<mrunalp>    harryz, laijs: FWIW, conmon itself could be replaced by another process. It is configurable so you might want to look into that.
<dwalsh>    Good looking forward to working together.
<feisky>    mrunalp, btw, CRI is still not feature complete now, ping me at slack if you have any problem
<mrunalp>    feisky, Yep, I know it is still changing a lot. I will ping you.
<feisky>    đź‘Ť
<mrunalp>    bbiab
<feisky>    bye

laijs commented 7 years ago

@cyphar Thank you for giving us a workaround, and I will answer your previous question:

As I said in #1018, if you can explain why it's a bad design that would be great. Namely, can you describe a use-case or other issue that is not usable with the current design?

The reason is that the runtime spec SHOULD define EVERY way that is allowed to control or access the container, including all the ways that can be created by the APIs in the runtime spec.

1) If the runtime exposes runtime-created, container-specific resources back to upper-layer tools, it creates a tunnel (a wound?) of APIs that are outside the spec.

2) The runtime cannot know how the upper-layer tools use the resources; it is uncontrollable.

3) I don't mind too much if the OCI maintainers add more and more APIs and specs to the runtime spec, provided they also add a spec for each exposed resource. Example: --pid-file <PATH>: the runtime MUST write the container PID to this path; the host SHOULD only use the pid to find the container id, [list other allowed operations/APIs here], and all other operations are NOT allowed; the host SHOULD NOT find the host process from the pid, SHOULD NOT kill it, and SHOULD NOT fetch any other info from it. If the runtime spec had such specs for the pid of the container process and the ptymaster, I would stop pressing this issue. Otherwise it deserves to be discussed among the maintainers until solved, doesn't it?

4) Adding a daemon to runc VS using more and more APIs on containers that are outside the OCI spec (and may or may not be docker-container-specific): which one hurts more?

5) As I said, detachable stdio is a strongly required feature; we can and should implement it in the runtime rather than in an upper layer such as containerd-shim or ocid/conmon. Why force all the upper-layer tools to implement it?

6) The reason from runv is as I said in #ocid. containerd-shim and ocid/conmon are actually workarounds, and the fake-pty proxy is another one. No one likes too many workarounds.

add @crosbymichael @mrunalp @wking

laijs commented 7 years ago

add @philips @resouer

wking commented 7 years ago

On Fri, Oct 21, 2016 at 01:30:20AM -0700, Aleksa Sarai wrote:

The trick is that by creating a fake PTY you can essentially relay everything that the consumer of --console-socket is doing to the ptymaster without breaking how runC's console handling has to work. There might be nicer ways of solving it without creating a second PTY…

We could solve this more cleanly by allowing an attached create which:

It would be up to the caller to decide whether to use the attached or detached create, so supporting both is more work for runtimes, but it would give callers access to each runtime's preferred approach.

On Fri, Oct 21, 2016 at 08:50:23AM -0700, Lai Jiangshan wrote:

2) runtime can not know how the upper layer tools use the resources, it is uncontrollable.

There's nothing inherently wrong with that. The runtime is a tool for managing containers, not the only gateway to those containers. The create/start split from opencontainers/runtime-spec#384 exists because callers can reach in and tweak containers on their own after the runtime's create has exited and before the runtime's start is called. As long as the OCI specifies runtime interfaces (which runtimes are free to extend), callers can use runtimes and other tools to sanely manage containers.

… the host SHOULD only find the container id from the pid…

I'm having trouble parsing all of your point (3), but this is not true. The container ID and PID are two orthogonal things (the container ID is set by the create caller, the PID is allocated by the kernel).

… the host SHOULD NOT find the host-process from the pid, SHOULD NOT kill it, SHOULD NOT fetch any other info from …

I think this is overly restrictive. What do you gain through this restriction? Of course, container PIDs may not be a portable approach (opencontainers/runtime-spec#459), but I think addressing that issue is a better approach than faking container PIDs and then loading their usage with a lot of restrictions.
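
For reference, the pid under discussion is the one the runtime reports through its documented interface. A minimal Go sketch of a caller obtaining it via runc state, which emits the OCI state as JSON; the container id is a placeholder and the struct mirrors only the state fields used here:

```go
// Sketch: fetch a container's pid through the runtime's documented
// interface (`runc state <id>`) instead of scraping host process
// tables. The container id "mycontainer" is a placeholder.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ociState mirrors the subset of OCI runtime state fields we need.
type ociState struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Pid    int    `json:"pid"`
	Bundle string `json:"bundle"`
}

func containerPid(id string) (int, error) {
	out, err := exec.Command("runc", "state", id).Output()
	if err != nil {
		return 0, err
	}
	var st ociState
	if err := json.Unmarshal(out, &st); err != nil {
		return 0, err
	}
	return st.Pid, nil
}

func main() {
	pid, err := containerPid("mycontainer")
	if err != nil {
		panic(err)
	}
	fmt.Println("container init pid:", pid)
}
```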

cyphar commented 7 years ago

To answer your points one-by-one:

  1. I don't agree. First of all, a PTY is not a "resource", it's an interface for communicating with a container process. Most importantly, it's actually a fairly complicated interface (with lots of different ioctl knobs). I don't think it's a "hole" to tell a consumer of a runtime "you get a bog-standard PTY which you can control using the standard controls that everyone knows how to use". I do think it's a bit silly to redefine the entirety of tty_ioctl(4) as a CLI interface just so that everything lives inside a single spec file. Hyperlinks exist for a reason, after all.
  2. As I mentioned above, you can MITM the console management. But I want to bring up the more important point -- why does a runtime care about what the manager is reading/writing to the console? The whole point of having a higher-level manager is that the runtime can just deal with its job -- starting and running containers -- and the higher levels can actually manage the more complicated things (such as figuring out where the IO is coming from and how to interpret requests from consumers).
  3. I don't really have an opinion about --pid-file. However, most process managers have such a flag so you'd need to provide a pretty compelling reason why it's critical to remove it. I don't think you've explained your argument well enough for me to be convinced.
  4. You're acting like "how consoles work on Linux" is unspecified. It's not. The kernel code is free software, the kernel interfaces have documentation (and are guaranteed never to change), and ultimately there is enough history in the usage of those interfaces that any attempt by us to "simplify" or even re-document them won't be helpful. It becomes a case of everyone wanting their special behaviour implemented in the spec -- Linux already implements it, and there's no reason for us to reimplement over 25 years of kernel development.
  5. I don't understand. Detachable stdio is implementable with --console-socket. In fact, the reason we went with --console-socket is that you need it in order to implement daemon-less detachable stdio; otherwise you need a long-running daemon that keeps hold of the TTY. Just to be clear (maybe you misunderstood what --console-socket actually does): all that --console-socket does is send a file descriptor from the runtime to the consumer of the runtime -- a file descriptor that allows you to read/write to the TTY of the container (in the runC case that file descriptor is a bona fide PTY; a minimal receive-side sketch follows at the end of this comment). I don't understand which part of that makes detachable stdio impossible to implement -- in fact it makes it possible to implement in daemonless scenarios.
  6. I don't agree that conmon / containerd-shim are workarounds. They're a separation of concern. Yes, you could implement all of conmon into runC -- in fact that's basically what runc run is! However, implementing everything in a single binary is a bad idea because it means that you can't build systems on top of runC -- composability is a virtue and we shouldn't throw it out of the window unless absolutely necessary. The fact that we can even think about implementing OCID as a kubelet runtime should be seen as a positive.

All of that being said, if you have a concrete proposal that can solve both the daemonless concerns of runC (and potentially other runtimes, we are not the only players in town) as well as your concerns about runV I would be very happy to hear it. I'm open to being convinced, but I really would like some more explanation for your assertions (X is bad, Y is bad, which is less bad -- A or B).
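
As referenced in point 5 above, here is a hedged Go sketch of the receive side. The socket path and one-byte-plus payload are illustrative rather than runc's exact wire format; the SCM_RIGHTS handling is the standard mechanism a --console-socket consumer would use:

```go
// Sketch of a --console-socket consumer: listen on a unix socket,
// accept the runtime's connection, and receive the pty master fd
// it sends via SCM_RIGHTS. Path and payload size are illustrative.
package main

import (
	"fmt"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

func recvConsole(path string) (*os.File, error) {
	l, err := net.ListenUnix("unix", &net.UnixAddr{Name: path, Net: "unix"})
	if err != nil {
		return nil, err
	}
	defer l.Close()

	conn, err := l.AcceptUnix() // the runtime connects and sends the fd
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	buf := make([]byte, 64)
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(msgs) == 0 {
		return nil, fmt.Errorf("no control message: %v", err)
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) == 0 {
		return nil, fmt.Errorf("no fd in control message: %v", err)
	}
	// From here the consumer owns a real pty master: it can read and
	// write the container's tty and use tty_ioctl(4) on it.
	return os.NewFile(uintptr(fds[0]), "ptymaster"), nil
}

func main() {
	master, err := recvConsole("/tmp/console.sock") // illustrative path
	if err != nil {
		panic(err)
	}
	defer master.Close()
	fmt.Println("received console fd:", master.Fd())
}
```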

cyphar commented 7 years ago

To be clear, I'm not trying to be a grumpy maintainer :wink:. I'm just trying to explain why we went with the solutions we went with and am hopeful we can find a happy medium that matches both of our projects' requirements. However, it should be noted that --console-socket came from a lot of thinking on our part and trying to figure out a solution that provided the same feature set as the previous --console API while also fixing some huge issues we had with our previous "set up the console outside the container" implementation. So any solution that we try to integrate into both of our projects has two strict requirements from our side (at least from what I can see, @crosbymichael might have a different opinion):

  1. It must not require having a long-running daemon. One of the biggest benefits of runC is that it doesn't require a daemon, and we've spent quite a lot of time making sure that this works.
  2. It has to provide adequate control by upper layers that are built on top of the runtime. Adequate means that you have to be able to use most of tty_ioctl(4) with the TTY, as well as have the ability to pass around control of the TTY.

For us, passing around the PTY master fd solves both of these problems -- but if you can come up with another solution that can handle this better for runV that would be great!

crosbymichael commented 7 years ago

runc will never be a long-running daemon that manages multiple containers. What makes the runc solution elegant is its simplicity, and you lose that when you start talking about daemons and making it do too much.

Today I can run runc in my shell, in a systemd unit, or detached in a systemd unit by using --pid-file and telling systemd where that pid file is. runc exposes simple primitives that give the upper layers the flexibility to manage containers how they see fit, while runc keeps a clean separation between performing actions on the container and reporting information back when requested; how containers are aggregated and managed is left to those upper layers.

This project is about the spec and the implementation of containers, so we are going to lean on the feature set that we have available to build the best container spec and runtime that we can. If you are not using container technology, then the burden is on you to conform to the interface. That means you have to work around some of the container-specific features, because that is the entire goal of the project.