Separate container sandbox lifecycle from that of the processes inside it

vishh commented 8 years ago

As of now, the runtime lifecycle states that there is a Start action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes. To be more precise, I'm proposing introducing the following oci actions:

Create - This action will create a container sandbox. On Linux, this would be all cgroups and namespaces, excepting pid namespace (assuming all cgroups and namespaces were requested as part of the Spec).
Start - This action will spawn a new process in the existing container sandbox. On Linux, a new pid namespace will be created for the first process that is being spawned. Subsequent processes will run the same pid namespace.
Cleanup - This action will kill all processes in a container sandbox. In the case of Linux, this will result in loss of pid namespace.
Delete - This action will delete the container sandbox.

This split will obviate the need for supporting hooks. Users are free to run hooks outside of the runtime as they please.

The existing Spec needs to be split into a Sandbox configuration and a Process configuration. The process configuration can be reused for the Exec use-case.

FYI: This was discussed during the OCI meeting on 1/12/16 and everyone present agreed to this proposal.

We need to prototype this separation in one or more Operating Systems before requiring it in the Spec.

Input from runtime authors on non-linux platforms will be helpful here.

ashahab-altiscale commented 8 years ago

I'd suggest the prototype should test with usernamespaces, especially in light of bugs like this one: https://github.com/opencontainers/runc/issues/101 Related patch: https://github.com/opencontainers/runc/pull/105/files

The init process of the container needs to be able to join a pre-created user namespace. If this does not work, it would not possible to create any namespaces, as user namespaces require all other namespaces to be children of the user namespace.

vishh commented 8 years ago

@ashahab-altiscale: As long as user namespaces don't require a process to pin it, I don't see any issue here.

wking commented 8 years ago

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.

This seems like a high-level change. Should discussion happen on the list 1? There are earlier notes on this split approach in 2, as well as some previous discussion along these lines in #226. If we decide to have this conversation here (instead of the list or earlier issue), I can write up a summary of past points.

This split will obviate the need for supporting hooks. Users are free to run hooks outside of the runtime as they please.

Yeah, if we have a separate create and exec, we can drop hooks and close #265 3.

The existing Spec needs to be split into a Sandbox configuration and a Process configuration.

Not necessarily. ccon-ced (discussed in #226) wraps a runtime that uses a single spec to do this. To pull that off, you need a spec that makes it easy to turn on only the functionality you want (e.g. “create namespaces” vs. “join, [create a new PID namespace,] and exec a process”).

 Subject: Documenting container lifecycles
 Date: Mon, 14 Sep 2015 20:47:17 -0700
 Message-ID: <20150915034717.GO18018@odin.tremily.us>

jonboulle commented 8 years ago

s/Support/Separate in issue title

It would be better to separate lifecycle of the container sandbox from that of container processes.

(emphasis mine) more justification/use cases here would be useful, at the very least for posterity

This split will obviate the need for supporting hooks.

To be clear, are you suggesting removing hooks entirely from the spec?

Subsequent processes will run the same pid namespace.

It makes me pretty uneasy to have the start operation be nondeterministic - e.g. if I run two start commands simultaneously it's a race which one's process becomes PID1 (which has non-trivial implications).

wking commented 8 years ago

On Tue, Jan 12, 2016 at 03:56:28PM -0800, Jonathan Boulle wrote:

Subsequent processes will run the same pid namespace.

It makes me pretty uneasy to have the start operation be nondeterministic - e.g. if I run two start commands simultaneously it's a race which one's process becomes PID1 (which has non-trivial implications).

Yeah, -1 to having PID namespace be implicit. If you run starts (parallel or not) that both create PID namespaces, then they both get to be PID 1 in their own PID namespace. If you run a start (parallel or not) that does not create a PID namespace, it does not get to be PID 1. So I don't see a problem with racing if we have explicit PID-namespace creation.

On naming, see ‘exec’ vs. ‘start’ here in opencontainers/runc#210.

vbatts commented 8 years ago

On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King notifications@github.com wrote:

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.

This seems like a high-level change. Should discussion happen on the list [1]?

This is because we're sitting in the face-2-face and working through this presently.

wking commented 8 years ago

On Tue, Jan 12, 2016 at 04:51:36PM -0800, Vincent Batts wrote:

On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King wrote:

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.

This seems like a high-level change. Should discussion happen on the list [1]?

This is because we're sitting in the face-2-face and working through this presently.

Yeah, that's fine. But when dumping from the face-to-face into the archives, I think you should dump into a list post, not into a new issue (like this one).

wking commented 8 years ago

On Tue, Jan 12, 2016 at 03:29:28PM -0800, Vish Kannan wrote:

@ashahab-altiscale: As long as user namespaces don't require a process to pin it, I don't see any issue here.

The issue is the order in which you join or create namespaces, which impacts permissions. For example, an unprivileged user can join a user namespace they created earlier, and then create a new PID namespace underneath. But they can't create a new PID namespace before joining a user namespace they created (or creating a new user namespace). There are a number of comments to this effect in opencontainers/runc#105, and also some notes in the “Interaction of user namespaces and other types of namespaces” section of user_namespaces(7). For a command line example:

Make a new user namespace: — $ unshare --user --map-root-user sh sh-4.3# echo $$ 21171

In a different terminal, trying to create a new PID namespace and join the user namespace fails: — $ unshare --pid nsenter --user=/proc/21171/ns/user --preserve-credentials sh unshare: unshare failed: Operation not permitted

But trying to join the user namespace and then create a new PID namespace succeeds: — $ nsenter --user=/proc/21171/ns/user --preserve-credentials unshare --pid sh sh-4.3#

I think @ashahab-altiscale is just suggesting we avoid that issue by proper namespace join/create ordering when implementing create/exec tooling.

ashahab-altiscale commented 8 years ago

@wking Exactly. The order of namespace creation is important.

duglin commented 8 years ago

s/Cleanup/Stop/ but aside from that I like the idea of the split.

duglin commented 8 years ago

re: multiple "starts": based on what I've heard I don't think multiple concurrent "starts" would be allowed. I think I'm hearing that people still want a single main process in the container, which means once you've started it your only choice is to use "exec" to get a 2nd process. Once the main proc ended then you can issue "start" again tho.

wking commented 8 years ago

On Fri, Jan 22, 2016 at 07:11:42AM -0800, Doug Davis wrote:

re: multiple "starts": based on what I've heard I don't think multiple concurrent "starts" would be allowed. I think I'm hearing that people still want a single main process in the container, which means once you've started it your only choice is to use "exec" to get a 2nd process. Once the main proc ended then you can issue "start" again tho.

Is the only difference between ‘start’ and ‘exec’ whether you're creating a new PID namespace (and optionally mounting a new /proc?) or inheriting/joining an existing one? It sounds like it might be, and in that case, I'd rather just use our existing namespace create/join schema and have a single ‘exec’ command. Trying to bake the distinction into two separate commands seems like more trouble than the usability gain is worth.

Or are we just splitting commands for portability (e.g. to allow hypervisor and microkernel systems where it's difficult/impossible for the host to execute a new process in an existing PID-namespace/container)?

duglin commented 8 years ago

Changing what 'exec' will do based on whether its the first call is a bit non-deterministic. Imagine I do "runc exec" thinking it'll generate a secondary proc and as such I have different management logic for it. If it turns out it ends up being the main proc (because I didn't know the main proc just died) I might end up with very different (incorrect) results than what I expected.

wking commented 8 years ago

On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:

Changing what 'exec' will do based on whether its the first call is a bit non-deterministic.

Yeah, I think we're all on board with that being a bad idea. I'm suggesting changing what ‘exec’ will do based on explicit JSON configuration (e.g. if linux.namespaces.pid.path is true or false/unset). That's how we handle create-or-join currently with the “just ‘start’” approach.

vishh commented 8 years ago

+1 on splitting the init process from secondary process.

On Fri, Jan 22, 2016 at 9:52 AM W. Trevor King notifications@github.com wrote:

On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:

Changing what 'exec' will do based on whether its the first call is a bit non-deterministic.

Yeah, I think we're all on board with that being a bad idea. I'm suggesting changing what ‘exec’ will do based on explicit JSON configuration (e.g. if linux.namespaces.pid.path is true or false/unset). That's how we handle create-or-join currently with the “just ‘start’” approach.

— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-173991635 .

crosbymichael commented 8 years ago

What should the full UI be for this change?


cd redis-bundle/

# create the container
container create redis

# start the init process defined in the bundle
container start redis

# start additional processes
contianer exec redis -- bash

# kill the container

container kill redis

# delete the container
container delete redis

??

vishh commented 8 years ago

Yeah. That's the UI I'm thinking as well.

On Fri, Jan 22, 2016 at 12:06 PM Michael Crosby notifications@github.com wrote:

What should the full UI be for this change?

cd redis-bundle/

create the container

container create redis

start the init process defined in the bundle

container start redis

start additional processes

contianer exec redis -- bash

kill the container

container kill redis

delete the container

container delete redis

??

— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-174032625 .

duglin commented 8 years ago

yes but I'd prefer "container stop redis" since "stop" feels like a better opposite/book-end to "start" than "kill". Yes I know under the covers we send a kill signal, but that's a detail the end user doesn't need to know or care about.

crosbymichael commented 8 years ago

@vishh great minds think alike ;)

wking commented 8 years ago

On Fri, Jan 22, 2016 at 12:09:24PM -0800, Doug Davis wrote:

Yes I know under the covers we send a kill signal, but that's a detail the end user doesn't need to know or care about.

Unless they're putting in a SIGTERM handler for a graceful exit, and are wondering why ‘stop’ isn't triggering that ;). So I think we do want to be explicit about which singnals the runtime MUST use to achieve a given action.

duglin commented 8 years ago

I'm distinguishing between runc stop <id> and runc kill <id> <signal #>

wking commented 8 years ago

On Fri, Jan 22, 2016 at 12:06:51PM -0800, Michael Crosby wrote:

create the container

container create redis

start the init process defined in the bundle

container start redis

start additional processes

contianer exec redis -- bash

Because start/exec will use some currently-configured settings that aren't needed at create time (e.g. config.json's process, runtime.json's linux.namespaces.pid, and any config around mounting a PID-specific /proc), I recommend splitting our config entries between a create.json (only the stuff needed by ‘create’) and exec.json (only the stuff needed by start/exec). To give an explicit example of my suggestion in 1 for the above, I'd have something like:

$ cat exec.json { "process": { "terminal": false, "user": { "uid": 1000, "gid": 1000 }, "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", ], "cwd": "/", "args": ["redis-server", "--bind", "0.0.0.0"] }, "linux": { "namespaces": [ { "type": "pid" } ] }, "hooks": { "prestart": [ { "args": ["cgroup-manager", "add"] } ] } }

to handle and initial:

$ container exec redis

And a:

$ cat bash.json { "process": { "terminal": true, "user": { "uid": 1000, "gid": 1000 }, "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", ], "cwd": "/", "args": ["bash"] }, "linux": { "namespaces": [ { "type": "pid", "path": "/proc/1234/ns/pid", } ] } }

to handle a subsequent: — $ container exec --config bash.json redis

You can always write separate ‘start’ and ‘exec’ wrappers on top of an interface like that (which would work with any OCI-compliant runtime). Of course, with sufficient opt-out flexibility, you'd be able to write wrappers for create, start, and exec all based on a single underlying runtime command (which is how ccon-ced works 2).

If we go with the separate, runtime-level start/exec approach, I think we'll have trouble adding flags for all the options like “do you want this to go in the container cgroups?”, which you might want in some cases, but not in others.

wking commented 8 years ago

On Fri, Jan 22, 2016 at 12:22:17PM -0800, Doug Davis wrote:

I'm distinguishing between runc stop <id> and runc kill <id> <signal #>

If you have the latter (which I prefer), what good is the former?

ashahab-altiscale commented 8 years ago

What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die. What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds(after a stop/kill)?

wking commented 8 years ago

On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:

What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die.

After a SIGTERM they should all exit cleanly. But SIGTERM is catchable, and one person's “exiting cleanly” is another's “not dying fast enough”. After SIGKILLing a container's PID 1 (assuming it has a container-specific PID namespace), they will all be dead.

What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds?

Everything except the PID namespace should have been created by the ‘create’ step, so you should be able to start another PID-namespace-creating process that still uses those non-PID namespaces (and presumably also reuses cgroups, etc.).

vishh commented 8 years ago

Re-using the sandbox should be possible, except that the sandbox might contain some remnants from the previous run, for example filesystem.

On Fri, Jan 22, 2016 at 2:13 PM, W. Trevor King notifications@github.com wrote:

On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:

What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die.

After a SIGTERM they should all exit cleanly. But SIGTERM is catchable, and one person's “exitting cleanly” is anothers “not dying fast enough”. After SIGKILLing a container's PID 1 (assuming it has a container-specific PID namespace), they will all be dead.

What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds?

Everything except the PID namespace should have been created by the ‘create’ step, so you should be able to start another PID-namespace-creating process that still uses those non-PID namespaces (and presumably also reuses cgroups, etc.).

— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-174069791 .

duglin commented 8 years ago

I sure hope so :-)

hqhq commented 8 years ago

Create - This action will create a container sandbox. On Linux, this would be all cgroups and namespaces, excepting pid namespace (assuming all cgroups and namespaces were requested as part of the Spec).

AFAIK, there are only two APIs which can be used to create new namespaces:

clone
unshare

They either create a new process or called by a process which will be a member of the new namespaces. So how can we create new namespaces without creating new processes? And create empty cgroups dirs without putting any processes in doesn't make much sense to me, they should be attached to start stage IMHO.

duglin commented 8 years ago

See https://github.com/opencontainers/runc/pull/506 we do create a new process during create(), but we hold on to the namespaces (except PID) so that subsequent processes can reuse those other namespaces

wking commented 8 years ago

On Mon, Jan 25, 2016 at 06:24:39AM -0800, Doug Davis wrote:

See https://github.com/opencontainers/runc/pull/506

For a more compact reference, see 1 and references to bind mounting in “The /proc/[pid]/ns/ directory” section of 2.

crosbymichael commented 7 years ago

@vishh do you think there are any actionable items left in this issue? We did the create/start/delete split and you have an option to never actually start the container's process but keep adding exec processes after it is created. Because of pid1 and the pid namespace issues on linux I don't think we will ever have a fully robust persistent sandbox on linux without higher level tooling adding to the functionality ( bind mounting namespaces and sharing between containers )

opencontainers / runtime-spec

Separate container sandbox lifecycle from that of the processes inside it #299

create the container

start the init process defined in the bundle

start additional processes

kill the container

delete the container

create the container

start the init process defined in the bundle

start additional processes