Open vishh opened 8 years ago
I'd suggest the prototype should test with usernamespaces, especially in light of bugs like this one: https://github.com/opencontainers/runc/issues/101 Related patch: https://github.com/opencontainers/runc/pull/105/files
The init process of the container needs to be able to join a pre-created user namespace. If this does not work, it would not possible to create any namespaces, as user namespaces require all other namespaces to be children of the user namespace.
@ashahab-altiscale: As long as user namespaces don't require a process to pin it, I don't see any issue here.
On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:
As of now, the runtime lifecycle states that there is a
Start
action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.
This seems like a high-level change. Should discussion happen on the list 1? There are earlier notes on this split approach in 2, as well as some previous discussion along these lines in #226. If we decide to have this conversation here (instead of the list or earlier issue), I can write up a summary of past points.
This split will obviate the need for supporting hooks. Users are free to run hooks outside of the runtime as they please.
Yeah, if we have a separate create and exec, we can drop hooks and close #265 3.
The existing Spec needs to be split into a Sandbox configuration and a Process configuration.
Not necessarily. ccon-ced (discussed in #226) wraps a runtime that uses a single spec to do this. To pull that off, you need a spec that makes it easy to turn on only the functionality you want (e.g. “create namespaces” vs. “join, [create a new PID namespace,] and exec a process”).
Subject: Documenting container lifecycles
Date: Mon, 14 Sep 2015 20:47:17 -0700
Message-ID: <20150915034717.GO18018@odin.tremily.us>
s/Support/Separate in issue title
It would be better to separate lifecycle of the container sandbox from that of container processes.
(emphasis mine) more justification/use cases here would be useful, at the very least for posterity
This split will obviate the need for supporting hooks.
To be clear, are you suggesting removing hooks entirely from the spec?
Subsequent processes will run the same pid namespace.
It makes me pretty uneasy to have the start
operation be nondeterministic - e.g. if I run two start commands simultaneously it's a race which one's process becomes PID1 (which has non-trivial implications).
On Tue, Jan 12, 2016 at 03:56:28PM -0800, Jonathan Boulle wrote:
Subsequent processes will run the same pid namespace.
It makes me pretty uneasy to have the
start
operation be nondeterministic - e.g. if I run two start commands simultaneously it's a race which one's process becomes PID1 (which has non-trivial implications).
Yeah, -1 to having PID namespace be implicit. If you run starts (parallel or not) that both create PID namespaces, then they both get to be PID 1 in their own PID namespace. If you run a start (parallel or not) that does not create a PID namespace, it does not get to be PID 1. So I don't see a problem with racing if we have explicit PID-namespace creation.
On naming, see ‘exec’ vs. ‘start’ here in opencontainers/runc#210.
On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King notifications@github.com wrote:
On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:
As of now, the runtime lifecycle states that there is a
Start
action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.This seems like a high-level change. Should discussion happen on the list [1]?
This is because we're sitting in the face-2-face and working through this presently.
On Tue, Jan 12, 2016 at 04:51:36PM -0800, Vincent Batts wrote:
On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King wrote:
On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:
As of now, the runtime lifecycle states that there is a
Start
action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes.This seems like a high-level change. Should discussion happen on the list [1]?
This is because we're sitting in the face-2-face and working through this presently.
Yeah, that's fine. But when dumping from the face-to-face into the archives, I think you should dump into a list post, not into a new issue (like this one).
On Tue, Jan 12, 2016 at 03:29:28PM -0800, Vish Kannan wrote:
@ashahab-altiscale: As long as user namespaces don't require a process to pin it, I don't see any issue here.
The issue is the order in which you join or create namespaces, which impacts permissions. For example, an unprivileged user can join a user namespace they created earlier, and then create a new PID namespace underneath. But they can't create a new PID namespace before joining a user namespace they created (or creating a new user namespace). There are a number of comments to this effect in opencontainers/runc#105, and also some notes in the “Interaction of user namespaces and other types of namespaces” section of user_namespaces(7). For a command line example:
Make a new user namespace: — $ unshare --user --map-root-user sh sh-4.3# echo $$ 21171
In a different terminal, trying to create a new PID namespace and join the user namespace fails: — $ unshare --pid nsenter --user=/proc/21171/ns/user --preserve-credentials sh unshare: unshare failed: Operation not permitted
But trying to join the user namespace and then create a new PID namespace succeeds: — $ nsenter --user=/proc/21171/ns/user --preserve-credentials unshare --pid sh sh-4.3#
I think @ashahab-altiscale is just suggesting we avoid that issue by proper namespace join/create ordering when implementing create/exec tooling.
@wking Exactly. The order of namespace creation is important.
s/Cleanup/Stop/ but aside from that I like the idea of the split.
re: multiple "starts": based on what I've heard I don't think multiple concurrent "starts" would be allowed. I think I'm hearing that people still want a single main process in the container, which means once you've started it your only choice is to use "exec" to get a 2nd process. Once the main proc ended then you can issue "start" again tho.
On Fri, Jan 22, 2016 at 07:11:42AM -0800, Doug Davis wrote:
re: multiple "starts": based on what I've heard I don't think multiple concurrent "starts" would be allowed. I think I'm hearing that people still want a single main process in the container, which means once you've started it your only choice is to use "exec" to get a 2nd process. Once the main proc ended then you can issue "start" again tho.
Is the only difference between ‘start’ and ‘exec’ whether you're creating a new PID namespace (and optionally mounting a new /proc?) or inheriting/joining an existing one? It sounds like it might be, and in that case, I'd rather just use our existing namespace create/join schema and have a single ‘exec’ command. Trying to bake the distinction into two separate commands seems like more trouble than the usability gain is worth.
Or are we just splitting commands for portability (e.g. to allow hypervisor and microkernel systems where it's difficult/impossible for the host to execute a new process in an existing PID-namespace/container)?
Changing what 'exec' will do based on whether its the first call is a bit non-deterministic. Imagine I do "runc exec" thinking it'll generate a secondary proc and as such I have different management logic for it. If it turns out it ends up being the main proc (because I didn't know the main proc just died) I might end up with very different (incorrect) results than what I expected.
On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:
Changing what 'exec' will do based on whether its the first call is a bit non-deterministic.
Yeah, I think we're all on board with that being a bad idea. I'm suggesting changing what ‘exec’ will do based on explicit JSON configuration (e.g. if linux.namespaces.pid.path is true or false/unset). That's how we handle create-or-join currently with the “just ‘start’” approach.
+1 on splitting the init process from secondary process.
On Fri, Jan 22, 2016 at 9:52 AM W. Trevor King notifications@github.com wrote:
On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:
Changing what 'exec' will do based on whether its the first call is a bit non-deterministic.
Yeah, I think we're all on board with that being a bad idea. I'm suggesting changing what ‘exec’ will do based on explicit JSON configuration (e.g. if linux.namespaces.pid.path is true or false/unset). That's how we handle create-or-join currently with the “just ‘start’” approach.
— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-173991635 .
What should the full UI be for this change?
cd redis-bundle/
# create the container
container create redis
# start the init process defined in the bundle
container start redis
# start additional processes
contianer exec redis -- bash
# kill the container
container kill redis
# delete the container
container delete redis
??
Yeah. That's the UI I'm thinking as well.
On Fri, Jan 22, 2016 at 12:06 PM Michael Crosby notifications@github.com wrote:
What should the full UI be for this change?
cd redis-bundle/
create the container
container create redis
start the init process defined in the bundle
container start redis
start additional processes
contianer exec redis -- bash
kill the container
container kill redis
delete the container
container delete redis
??
— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-174032625 .
yes but I'd prefer "container stop redis" since "stop" feels like a better opposite/book-end to "start" than "kill". Yes I know under the covers we send a kill signal, but that's a detail the end user doesn't need to know or care about.
@vishh great minds think alike ;)
On Fri, Jan 22, 2016 at 12:09:24PM -0800, Doug Davis wrote:
Yes I know under the covers we send a kill signal, but that's a detail the end user doesn't need to know or care about.
Unless they're putting in a SIGTERM handler for a graceful exit, and are wondering why ‘stop’ isn't triggering that ;). So I think we do want to be explicit about which singnals the runtime MUST use to achieve a given action.
I'm distinguishing between runc stop <id>
and runc kill <id> <signal #>
On Fri, Jan 22, 2016 at 12:06:51PM -0800, Michael Crosby wrote:
create the container
container create redis
start the init process defined in the bundle
container start redis
start additional processes
contianer exec redis -- bash
Because start/exec will use some currently-configured settings that aren't needed at create time (e.g. config.json's process, runtime.json's linux.namespaces.pid, and any config around mounting a PID-specific /proc), I recommend splitting our config entries between a create.json (only the stuff needed by ‘create’) and exec.json (only the stuff needed by start/exec). To give an explicit example of my suggestion in 1 for the above, I'd have something like:
$ cat exec.json { "process": { "terminal": false, "user": { "uid": 1000, "gid": 1000 }, "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", ], "cwd": "/", "args": ["redis-server", "--bind", "0.0.0.0"] }, "linux": { "namespaces": [ { "type": "pid" } ] }, "hooks": { "prestart": [ { "args": ["cgroup-manager", "add"] } ] } }
to handle and initial:
$ container exec redis
And a:
$ cat bash.json { "process": { "terminal": true, "user": { "uid": 1000, "gid": 1000 }, "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", ], "cwd": "/", "args": ["bash"] }, "linux": { "namespaces": [ { "type": "pid", "path": "/proc/1234/ns/pid", } ] } }
to handle a subsequent: — $ container exec --config bash.json redis
You can always write separate ‘start’ and ‘exec’ wrappers on top of an interface like that (which would work with any OCI-compliant runtime). Of course, with sufficient opt-out flexibility, you'd be able to write wrappers for create, start, and exec all based on a single underlying runtime command (which is how ccon-ced works 2).
If we go with the separate, runtime-level start/exec approach, I think we'll have trouble adding flags for all the options like “do you want this to go in the container cgroups?”, which you might want in some cases, but not in others.
On Fri, Jan 22, 2016 at 12:22:17PM -0800, Doug Davis wrote:
I'm distinguishing between
runc stop <id>
andrunc kill <id> <signal #>
If you have the latter (which I prefer), what good is the former?
What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die. What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds(after a stop/kill)?
On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:
What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die.
After a SIGTERM they should all exit cleanly. But SIGTERM is catchable, and one person's “exiting cleanly” is another's “not dying fast enough”. After SIGKILLing a container's PID 1 (assuming it has a container-specific PID namespace), they will all be dead.
What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds?
Everything except the PID namespace should have been created by the ‘create’ step, so you should be able to start another PID-namespace-creating process that still uses those non-PID namespaces (and presumably also reuses cgroups, etc.).
Re-using the sandbox should be possible, except that the sandbox might contain some remnants from the previous run, for example filesystem.
On Fri, Jan 22, 2016 at 2:13 PM, W. Trevor King notifications@github.com wrote:
On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:
What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die.
After a SIGTERM they should all exit cleanly. But SIGTERM is catchable, and one person's “exitting cleanly” is anothers “not dying fast enough”. After SIGKILLing a container's PID 1 (assuming it has a container-specific PID namespace), they will all be dead.
What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds?
Everything except the PID namespace should have been created by the ‘create’ step, so you should be able to start another PID-namespace-creating process that still uses those non-PID namespaces (and presumably also reuses cgroups, etc.).
— Reply to this email directly or view it on GitHub https://github.com/opencontainers/specs/issues/299#issuecomment-174069791 .
I sure hope so :-)
- Create - This action will create a container sandbox. On Linux, this would be all cgroups and namespaces, excepting pid namespace (assuming all cgroups and namespaces were requested as part of the Spec).
AFAIK, there are only two APIs which can be used to create new namespaces:
They either create a new process or called by a process which will be a member of the new namespaces. So how can we create new namespaces without creating new processes? And create empty cgroups dirs without putting any processes in doesn't make much sense to me, they should be attached to start stage IMHO.
See https://github.com/opencontainers/runc/pull/506 we do create a new process during create(), but we hold on to the namespaces (except PID) so that subsequent processes can reuse those other namespaces
@vishh do you think there are any actionable items left in this issue? We did the create/start/delete split and you have an option to never actually start the container's process but keep adding exec
processes after it is created. Because of pid1 and the pid namespace issues on linux I don't think we will ever have a fully robust persistent sandbox on linux without higher level tooling adding to the functionality ( bind mounting namespaces and sharing between containers )
As of now, the runtime lifecycle states that there is a
Start
action which results in creation of a container sandbox and spawning the first process inside the container. It would be better to separate lifecycle of the container sandbox from that of container processes. To be more precise, I'm proposing introducing the following oci actions:This split will obviate the need for supporting hooks. Users are free to run hooks outside of the runtime as they please.
The existing Spec needs to be split into a Sandbox configuration and a Process configuration. The process configuration can be reused for the
Exec
use-case.FYI: This was discussed during the OCI meeting on 1/12/16 and everyone present agreed to this proposal.
We need to prototype this separation in one or more Operating Systems before requiring it in the Spec.
Input from runtime authors on non-linux platforms will be helpful here.