kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Support adding an init process to containers #84210

Open segevfiner opened 5 years ago

segevfiner commented 5 years ago

What would you like to be added: Add support for adding an init process (PID 1) to containers started by Kubernetes, across the supported container runtimes. This should be toggled by an appropriate flag on the Container spec.
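For illustration, such a toggle might look like this on the Container spec (the `init` field is purely hypothetical and does not exist in the Kubernetes API; the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      init: true   # hypothetical field: runtime prepends a bundled init (e.g. tini) to the entrypoint
      command: ["/app/server"]
```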

A possible implementation is basically bind mounting a statically built init executable (e.g. tini) into the container and passing the container's entrypoint+cmd to it, which I think is what Docker's implementation of this does. Of course, the exact place to prepend the init command depends on the runtime, with Docker having its entrypoint/cmd separation and others possibly having different models.

Why is this needed: Linux containers have a well-known problem: because the first process started in the container runs as PID 1, it assumes the roles and responsibilities of init. As such, it has the default signal handling behavior disabled unless it establishes its own signal handlers, and in addition has zombies reparented to it for reaping.

Well-behaved Docker images take this into account and either:

- install signal handlers (and reap zombies) in the entrypoint process itself, or
- run a lightweight init (e.g. tini or dumb-init) as the entrypoint.

Sadly, there are plenty of badly behaving container images out there, and there are also plenty of use cases where requiring the user to add an init to the image is a problem. For example, systems where users submit jobs using their own custom images and commands, which might be less attentive to such issues.

To help with this, Docker added an `--init` flag to its API that adds an init process backed by tini (compiled as a vendored static executable named `docker-init`). As far as I can tell, this is implemented by simply bind mounting this binary read-only into the container and passing the command that would otherwise be executed as PID 1 to it.

This is of course Docker specific and for something like this to exist in Kubernetes it likely needs to be re-implemented by Kubernetes for all supported container runtimes. Though assuming you can bind mount such a binary in all runtimes, and can distribute it, it should be possible... I think...

Note that tini, specifically as used by Docker, also supports some extra features that can be configured via environment variables, which can sometimes be useful for certain kinds of containerized workloads. Whether that's a good or a bad thing...

Of course, this entire thing is quite up for debate...

neolit123 commented 5 years ago

/sig node

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

segevfiner commented 4 years ago

/remove-lifecycle stale

mltsy commented 4 years ago

Yes! Please make this an option... docker has the --init flag for this purpose. It's very necessary 😄

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

segevfiner commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

segevfiner commented 4 years ago

/remove-lifecycle stale

jgowdy commented 4 years ago

Has there been any consideration of this feature request? I believe this would have value.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

segevfiner commented 3 years ago

/remove-lifecycle stale

arnaudlacour commented 3 years ago

It's greatly surprising to find this still open. Though I do understand that this makes Docker even more of a special snowflake case for Kubernetes, `--init` does offer a consistent mechanism for apps that need robust zombie reaping and signal handling. Without support for it, many will implement it on their own, likely poorly or completely improperly. Consistency and convention would be great here. We add tini to our images for this reason even though it's now embedded in Docker. I greatly dislike the redundancy, but it's currently needed for k8s deployment. This issue would solve that.

jcayzac commented 3 years ago

To recap the current available workarounds:

  1. If you don't care much about container isolation inside your pod, switch to shared process namespaces and let pause be PID 1 for everything else.
  2. Manually add tini as an entry point to the Dockerfile of an image with a special kubernetes tag, based on your regular image.
  3. Manually add tini as an entry point to the Dockerfile of your regular image, despite it not being needed at all when used with docker run, docker compose, …
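For reference, workaround 1 is a single Pod-level field; a minimal sketch (the image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-example
spec:
  shareProcessNamespace: true   # pause becomes PID 1 and reaps zombies for all containers
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
```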

arnaudlacour commented 3 years ago
  1. If you don't care much about container isolation inside your pod, switch to shared process namespaces and let pause be PID 1 for everything else.

If this was satisfactory, Thomas Orozco wouldn't have spent the time writing tini in the first place, I reckon.

  2. Manually add tini as an entry point to the Dockerfile of an image with a special kubernetes tag, based on your regular image.

That sounds brittle.

  3. Manually add tini as an entry point to the Dockerfile of your regular image, despite it not being needed at all when used with docker run, docker compose, …

This is what we do because it is a tiny redundancy but I'd much rather we didn't need it at all.

jcayzac commented 3 years ago

it is a tiny redundancy

I see what you did here.

swatisehgal commented 3 years ago

/triage accepted /area kubelet /priority important-longterm

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

vaibhav2107 commented 3 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

segevfiner commented 2 years ago

/remove-lifecycle stale

hholst80 commented 2 years ago

Is this really needed? The pause container is executed for all pods. It has init signaling and reaps zombie processes.

http://en/almighty-pause-container


segevfiner commented 2 years ago

Only with PID namespace sharing enabled, which is not the default AFAIK and has other implications.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

hholst80 commented 2 years ago

For the sake of sanity, please disable the bot on issues that have been marked as important? It just adds noise and makes it very hard to read up on what is actually important.

vaibhav2107 commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mitar commented 2 years ago

/remove-lifecycle stale

redbaron commented 2 years ago

@liggitt , any chance accepting and adding frozen label to this issue, so that bots wont close it?

liggitt commented 2 years ago

Thanks for the feature request, but this is unlikely to make progress as an issue. A proposal that works through the design along with the implications of such a change can be opened as a KEP.

I don't think perpetually pinging an issue no one is working on is fruitful.

sanmai-NL commented 2 years ago

@liggitt I think there's a disconnect between your logic and what a community member (@redbaron) expects. The issue was accepted as important by the maintainer team, so I'd expect them to pick this backlog issue up, not say it's important and then wait for the community to plan and do the work. Moreover, the community member didn't only ping, but proposed a backlog management action.

For what it's worth, I disagree this feature is important and I advise against it. A container image should be general to multiple container runtimes, and the init executable should be specified by the image builder (as in, the developer). Adding this shortcut to Kubernetes adds to the maintenance burden while absolving developers of their own responsibility for, or awareness of, this design point. Moreover, a Container spec is currently implicitly declarative and configures an immutable image; adding some binary into the container by means of a spec field seems incongruent to me. If this is deemed important because developers mostly forget about the init process, then Containerfile linter tools could be enhanced to warn in advance about entrypoints that are dubious as init processes.

redbaron commented 2 years ago

Motivation is the same as for the docker run --init switch's existence: everyone needs it, so why make image authors do it over and over in every image when it can be provided out of the box? Also, defaults matter; too many devs are not aware of signal handling and it is unrealistic to educate them all.

k8s-ci-robot commented 2 years ago

@segevfiner: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

liggitt commented 2 years ago

I think being marked important and then not revisited for 18 months is perhaps a sign it is not actually important. I have similar concerns around making init process part of the declarative spec, but will defer to node/kubelet maintainers to weigh in.

I re-marked this as needing triage, to surface it to that queue, but my comment about this being unlikely to make progress as an issue and needing a KEP still stands.

sanmai-NL commented 2 years ago

@redbaron

Motivation is the same as for docker run --init switch existence - everyone needs it, why make image authors do it over and over in every image, when it can be provided out of the box. Also defaults matter, too many devs are not aware of signal handling and it unrealistic to educate them all.

It could be off-limits due to broadening the scope of the Container spec, as argued. You mention setting a default, but that would be unrealistic anyway, since injecting an init process before the Command would be a breaking change.

segevfiner commented 2 years ago

How is it a breaking change to introduce a new optional property that doesn't change anything unless enabled, and that affects how the container is run like every other property on the container spec? You can literally emulate it now by modifying the command and volumes yourself; it's just inconvenient.
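A sketch of that manual emulation (image names and paths are illustrative, not from any particular project): an init container copies a statically linked tini into a shared emptyDir, and the main container prepends it to its command.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-init-example
spec:
  volumes:
    - name: init-bin
      emptyDir: {}
  initContainers:
    - name: install-tini
      image: example.com/tini-static:latest   # illustrative image containing a static tini at /tini
      command: ["cp", "/tini", "/init-bin/tini"]
      volumeMounts:
        - name: init-bin
          mountPath: /init-bin
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      command: ["/init-bin/tini", "--", "/app/server"]   # tini becomes PID 1 and reaps zombies
      volumeMounts:
        - name: init-bin
          mountPath: /init-bin
```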

It's an unfortunate historical mistake IMHO that containers don't include this by default. Containers started out as a full OS before Docker introduced the concept of running a single app, but Docker didn't add this at first, and by the time they found out it was a problem, it was too late to change the default, so they introduced an option instead.

sanmai-NL commented 2 years ago

If you inject the init executable by default, then it would be breaking, as I stated.

Agree with your second point.

hholst80 commented 2 years ago

The workaround is the same as the solution. Just slam this into the Pod spec:

```yaml
shareProcessNamespace: true
```

segevfiner commented 2 years ago

The workaround is the same as the solution. Just slam this into the Pod spec:

```yaml
shareProcessNamespace: true
```

This has other side effects though. It's much more likely to break stuff than just injecting an init process.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mitar commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mitar commented 1 year ago

/remove-lifecycle stale

babaMar commented 1 year ago

Any progress on this? It seems adding `shareProcessNamespace: true` is not a solution.

hholst80 commented 1 year ago

Can you explain why it doesn't work for your use cases? I have been using this option by default on all deployments since my comment and haven't had any problems. knock on wood

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mitar commented 9 months ago

/remove-lifecycle stale

andrewchen5678 commented 7 months ago

+1, many container orchestration solutions like Docker and AWS ECS support it, but not Kubernetes.

BlaineEXE commented 6 months ago

This would be really great for using complex scripts in liveness probes. We sometimes need to run a somewhat complex bash script in a liveness probe because the container images we run don't have built-in probes. Without process reaping, we have to make our bash scripts do signal handling themselves, which is messy boilerplate and hard to maintain in a large open source project where people don't always know to add the signal handling. Being able to use an `init: true` boolean without enabling PID sharing would be great.
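The kind of trap boilerplate being described might look roughly like this in a probe (a sketch; the check script path is illustrative, not a real file):

```yaml
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Without an init in the container, the probe script has to handle
        # termination itself instead of relying on default signal behavior.
        trap 'exit 1' TERM INT
        /opt/checks/health-check.sh   # illustrative check script
```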

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

segevfiner commented 3 months ago

/remove-lifecycle stale