
Sidecar Containers #753

Open Joseph-Irving opened 5 years ago

Joseph-Irving commented 5 years ago

Enhancement Description

/sig node

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Joseph-Irving commented 5 years ago

@enisoc @dchen1107 @fejta @thockin @kow3ns @derekwaynecarr, opened this tracking issue so that we can discuss.

kow3ns commented 5 years ago

/assign

Joseph-Irving commented 5 years ago

@derekwaynecarr I've done some scoping out of the kubelet changes required for next week's sig-node meeting. I believe changes are only needed in the kuberuntime package, specifically in kuberuntime_manager.go and kuberuntime_container.go.

In kuberuntime_manager.go you could modify computePodActions to implement the shutdown triggering (kill sidecars once all non-sidecars have permanently exited) and to start the sidecars up first.

In kuberuntime_container.go you could modify killContainersWithSyncResult to terminate the sidecars last and to send the preStop hooks (the preStop hooks part was somewhat debatable; it wasn't settled whether that should be done or not. @thockin had a good point about why you might not want to encourage that behaviour, see comment).

Let me know if you want me to investigate any further.

resouer commented 5 years ago

@kow3ns The discussion would make more sense to me if we could define a full description of the container sequence in the Pod spec (sig-apps), and how to handle that sequence in the kubelet for start, restart, and cascading considerations (sig-node). Let's use the Feb 5 sig-node meeting to give more input.

cc @Joseph-Irving

luksa commented 5 years ago

The proposal says that sidecars only run after the init containers have run. But what if the use case requires the sidecar to run while/before the init containers run? For example, if you'd like to route the pod's traffic through a proxy running as a sidecar (as in Istio), you probably want that proxy to be in place while the init containers run, in case an init container itself makes network calls.
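
To make the use case concrete, here is a minimal sketch (the image names, the config-service URL, and the curl call are all illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-init-call
spec:
  initContainers:
  - name: fetch-config
    image: curlimages/curl
    # this call should go through the mesh proxy, but the proxy below is an
    # ordinary container and is not running yet while init containers execute
    command: ["curl", "-sf", "http://config-service/bootstrap"]
  containers:
  - name: app
    image: my-app
  - name: istio-proxy
    image: istio/proxyv2
    # only starts once all init containers have completed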

Joseph-Irving commented 5 years ago

@luksa I think there's the possibility of looking at having sidecars that run in the init phase at some point, but currently the proposal is not going to cover that use case. There is currently no way to have concurrent containers running in the init phase, so that would potentially be a much larger/messier change than what is being suggested here.

Joseph-Irving commented 5 years ago

Update on this KEP: I've spoken to both @derekwaynecarr and @dchen1107 from sig-node about this and they did not express any major concerns about the proposal. I will raise a PR to the KEP adding some initial notes around implementation details and clarifying a few points that came up during the discussion.

We still need to agree on the API. It seems there is consensus that a simple way of marking containers as sidecars is preferred over more in-depth ordering flags. Having a bool is somewhat limiting, though, so perhaps something more along the lines of containerLifecycle: Sidecar would be preferable, so that we have the option of expanding in the future.
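
For illustration only (neither field name is final; image and container names are made up), the two shapes being compared would look roughly like this on a container entry:

containers:
- name: logging-agent
  image: my-logging-agent
  sidecar: true                 # plain bool: no room to grow

versus

containers:
- name: logging-agent
  image: my-logging-agent
  containerLifecycle: Sidecar   # enum-style field that could gain more values later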

luksa commented 5 years ago

@Joseph-Irving Actually, neither the boolean nor containerLifecycle: Sidecar is appropriate for proper future extensibility. Instead, containerLifecycle should be an object, just like deployment.spec.strategy, with type: Sidecar. This would then allow us to introduce additional fields. For the "sidecar for the whole lifetime of the pod" solution, it would be expressed along these lines:

containerLifecycle: 
  type: Sidecar
  sidecar:
    scope: CompletePodLifetime

as opposed to

containerLifecycle: 
  type: Sidecar
  sidecar:
    scope: AfterInit

Please forgive my bad naming - I hope the names convey the idea.

But there is one problem with the approach of introducing containerLifecycle to pod.spec.containers. Namely, it's wrong to have sidecars that run in parallel with init containers specified under pod.spec.containers. So if you really want to be able to extend this to init containers eventually, you should find an alternative solution - one that would allow you to mark containers as sidecars at a higher level, i.e. not under pod.spec.containers or pod.spec.initContainers, but something like pod.spec.sidecarContainers, which I believe you already discussed but dismissed. The init containers problem definitely calls for a solution along these lines.
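
To make that alternative concrete, such a hypothetical top-level list (not what this KEP proposes, and previously dismissed) might look something like:

spec:
  initContainers:
  - name: init-schema
    image: my-init-image      # illustrative
  sidecarContainers:          # hypothetical field
  - name: istio-proxy
    image: istio/proxyv2
  containers:
  - name: app
    image: my-app             # illustrative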

Joseph-Irving commented 5 years ago

@luksa You could also solve the init problem by just allowing an init container to be marked as a sidecar and have that run alongside the init containers. As I understand it, the problem is that init containers sometimes need sidecars, which is different from needing a container that runs for the entire lifetime of the pod.

The problem with pod.spec.sidecarContainers is that it's a far more complex change: tooling would need to be updated and the kubelet would require a lot of modification to support another set of containers. The current proposal is far more modest; it's only building on what's already there.

luksa commented 5 years ago

@Joseph-Irving We could work with that, yes. It's not ideal for the sidecar to shut down after the init containers run and then have the same sidecar start up again, but it's better than not having the option at all. The bigger problem is that older Kubelets wouldn't handle init-sidecar containers properly (as is the case with main-sidecar containers).

I'd just like you to keep init-sidecars in mind when finalizing the proposal. In essence, you're introducing the concept of "sidecar" into k8s (previously, we basically only had a set of containers that were all equal). Now you're introducing actual sidecars, so IMHO, you really should think this out thoroughly and not dismiss a very important sidecar use-case.

I'd be happy to help with implementing this. Without it, Istio can't provide its features to init containers (actually, in a properly secured Kubernetes cluster running Istio, init containers completely lose the ability to talk to any service).

Joseph-Irving commented 5 years ago

In relation to the implementation discussion in https://github.com/kubernetes/enhancements/pull/841, I've opened a WIP PR containing a basic PoC for this proposal https://github.com/kubernetes/kubernetes/pull/75099. It's just a first draft and obviously not perfect but the basic functionality works and gives you an idea of the amount of change required.

cc @enisoc

Joseph-Irving commented 5 years ago

I put together a short video just showing how the PoC currently behaves https://youtu.be/4hC8t6_8bTs. Seeing it in action can be better than reading about it. Disclaimer: I'm not a pro youtuber.

Joseph-Irving commented 5 years ago

I've opened two new PRs:

Any thoughts or suggestions will be much appreciated.

currankaushik commented 5 years ago

@Joseph-Irving Sorry if I'm commenting late in the design process, but I have a potential use case for sidecar containers which may not be supported in the current design proposal. I just wanted to raise it for consideration. The gist is that I have a scenario where on pod termination, 1 sidecar should be terminated before the main container, while another sidecar should be terminated after the main container.

A concrete example might be a pod with a Django app container, a consul sidecar for service registration, and a pgbouncer sidecar for managing connections to the database. When the pod is terminated, I'd like the consul sidecar to be stopped first (so no more traffic is routed to the pod), then the app container (ideally after a short grace period), and then the pgbouncer sidecar. The current proposal looks great for handling the app <-> pgbouncer container dependency, but doesn't seem expressive enough to capture the case where I'd like to tear down a sidecar before the primary container.

Joseph-Irving commented 5 years ago

@currankaushik, in the scenario you described you could potentially use a preStop hook to tell the consul container to prepare for shutdown and stop routing requests to you (assuming it supports something like that). PreStop hooks will be sent to sidecars first, before the termination of containers begins.

The motivation for this was so that proxy sidecars like Istio could enter a state where they're not routing traffic in to your application but still allow traffic out while it finishes up and shuts down.
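
As a rough sketch of what this could look like for the consul example (the images, the service id, and the deregister command are assumptions, and the field for actually marking the container as a sidecar is omitted since that API was still being finalised), the hook itself uses the existing container lifecycle API:

containers:
- name: app
  image: my-app               # illustrative
- name: consul-agent
  image: consul               # illustrative
  lifecycle:
    preStop:
      exec:
        # hypothetical: deregister the service so no new traffic is
        # routed to this pod while the app container shuts down
        command: ["consul", "services", "deregister", "-id=my-app"]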

currankaushik commented 5 years ago

Sounds good, thanks @Joseph-Irving. So just to confirm my understanding at a high level: pre-stop hooks will be sent to sidecars first, followed by pre-stop hooks to the non-sidecars, SIGTERM to non-sidecars, and then (after all non-sidecars have exited) SIGTERM to sidecars? The design proposal (https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/sidecarcontainers.md) seems to imply this but also says:

PreStop Hooks will be sent to sidecars and containers at the same time.

Joseph-Irving commented 5 years ago

@currankaushik yeah what you described is the intended behaviour.

That line you quoted needs rewording. I had some misconceptions about how the prestop hooks were sent to the containers when I wrote that. Thanks for pointing it out.

kacole2 commented 5 years ago

@Joseph-Irving is this feature targeting alpha inclusion for 1.15?

Joseph-Irving commented 5 years ago

@kacole2 yeah that is the plan, assuming we can get the KEP to implementable in time for enhancement freeze (April 30th). Once the API has been finalised (https://github.com/kubernetes/enhancements/pull/919) and the test plan agreed (https://github.com/kubernetes/enhancements/pull/951), I think we should be all set.

kacole2 commented 5 years ago

/milestone v1.15
/stage alpha

mrbobbytables commented 5 years ago

@Joseph-Irving Kubernetes 1.15 Enhancement Freeze is 4/30/2019. To be included in the Kubernetes 1.15 milestone, KEPs are required to be in an "Implementable" state with proper test plans and graduation criteria. Please submit any PRs needed to make this KEP adhere to inclusion criteria. If this will slip from the 1.15 milestone, please let us know so we can make appropriate tracking changes.

Joseph-Irving commented 5 years ago

@mrbobbytables unfortunately the PRs opened to get this to an implementable state have not had much movement on them so I think we will need to delay this until 1.16.

mrbobbytables commented 5 years ago

No worries. Thanks for being so quick to respond and letting us know!

/milestone clear

thomschke commented 5 years ago

Please keep in mind that this KEP is very important for Istio!

It's a showstopper for all projects using service frameworks with coordinated bootstrap/shutdown (Akka Cluster, Lagom, etc.) together with the Istio service mesh.

cc @jroper

zhan849 commented 5 years ago

@Joseph-Irving sorry about the late comment, but I don't see the following covered in the design doc, and I was wondering what the intended behaviour is:

If we see a sidecar failure, do we always restart it if the main container has not finished (disregarding the pod's restartPolicy)? This would be useful since sidecars often serve proxy, load-balancing, or housekeeping roles, and it doesn't matter if one fails a couple of times as long as the main container can continue to work.

Also, when computing pod phase, if all main containers succeeded and a sidecar failed (which is very common, since if the sidecar doesn't catch SIGTERM the exit code will be something like 143), is the pod phase still "Succeeded"?

Joseph-Irving commented 5 years ago

@zhan849 currently sidecar containers obey the pod restart policy and are counted when computing pod phases such as Succeeded.

We did debate this quite a bit earlier in the process, but the general feeling was that we should diverge from a normal container as little as possible, only doing so if it enables the described use cases.

In regards to pod phase: I would argue that all applications running in Kubernetes should be handling SIGTERM (especially sidecars), but also sometimes you would want to know if your sidecars exited in a bad way, and that should be reflected in the pod phase; hiding that info could lead to confusion.

For restart policy, it only seems like that would be an issue if the restart policy is Never and your sidecar is prone to crashing. I'm not sure the complication of diverging sidecars from the pod restart policy is worth it, especially as some people may want their sidecars to obey the pod restart policy.

Both of these things are just in line with what a normal container does and what currently happens. Changing them didn't seem to be required to achieve the goals listed in the KEP.

If you have some widespread use cases showing why changing them is needed to achieve those goals, that would be useful, as it makes it easier to justify a more complicated change to the code base.

zhan849 commented 5 years ago

@Joseph-Irving we have some simpler sidecar implementations that have been running internally for some immediate needs (we did not contribute them as this is already in progress in the community), and here is what we learned.

Regarding pod phase:

  1. Container exit status is already reflected in pod.status.containerStatuses, so we don't lose the information. Also, since a big use case for sidecars is in Jobs (or whatever run-to-finish pods, such as those in Kubeflow), meaningful work is done only by the main container, and if the pod phase is marked Failed due to a sidecar failure, the result will be unnecessary retries and other misleading consequences such as Job failure, etc.

  2. Although it is ideal for sidecars to handle SIGTERM, in production there can be plenty of sidecars that are simply built upon open-source software and do not handle SIGTERM nicely, including kube-proxy, postfix, rsyslogd, and many others (and even if SIGTERM is handled, the exit code for an uncatchable SIGKILL will certainly not be 0).

Regarding restart policy (it is arguable, but having sidecars strictly obey restartPolicy is not realistic in production):

  1. Forcing sidecars to restart while main containers are still running by setting "OnFailure" is not an option, as this would also restart failed main containers and is confusing in combination with the Job-level retry limit.

  2. Main containers usually have plenty of retry logic for when a sidecar is unavailable, written before the community had sidecar support with an explicit container start order. Such historical error handling is not easy to change given its scope. Not restarting the sidecar will cause main containers to hang and retry.

  3. Propagating failures to higher-level controllers will trigger chains of reconciliation and a lot of API calls, so unnecessary escalation of errors can make Kubernetes less scalable. A more specific example: if a Job's main containers are still running and a sidecar fails, restarting the sidecar requires just one PATCH of the pod status and at most one event-related API call. Failing the pod entirely, however, results in reconciliation of the Job and of higher-level controllers such as CronJob and other CRDs, and there can be many more API calls.

I'd also like to see whether other people have seen similar issues (/cc @kow3ns).

JacobHenner commented 5 years ago

Would this change incorporate the behavior desired in https://github.com/kubernetes/community/pull/2342, such that there'd be a way to configure the entire pod (or just the non-sidecar container) to restart if a sidecar fails?

Joseph-Irving commented 5 years ago

@JacobHenner there are currently no plans to implement that kind of mechanism in this KEP. We did discuss incorporating it, but it doesn't really have much dependency on this KEP and could be developed independently, so it seems better suited to having its own KEP.

zhan849 commented 5 years ago

@Joseph-Irving just to share, for your reference, our implementation that addresses the above-mentioned pitfalls (https://github.com/zhan849/kubernetes/commits/kubelet-sidecar). Since our goal is to wait for official support, we tried to keep the change as local as possible in this commit.

So for a Job with restartPolicy == Never, 1 main container, 1 bad sidecar that constantly crashes, and 1 good sidecar that keeps running, the pod status will look like this after the main container quits with the above implementation:

containerStatuses:
  - containerID: xxxxx
    image: xxxxx
    imageID: xxxxx
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: xxxxx
        exitCode: 0
        finishedAt: "2019-05-24T17:59:53Z"
        reason: Completed
        startedAt: "2019-05-24T17:59:43Z"
  - containerID: xxxxx
    image: xxxxxx
    imageID: xxxxx
    lastState: {}
    name: sidecar-bad
    ready: false
    restartCount: 1
    state:
      terminated:
        containerID: xxxxx
        exitCode: 1
        finishedAt: "2019-05-24T17:59:46Z"
        reason: Error
        startedAt: "2019-05-24T17:59:45Z"
  - containerID: xxxxx
    image: xxxxxxx
    imageID: xxxxx
    lastState: {}
    name: sidecar-healthy
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: xxxxx
        exitCode: 137
        finishedAt: "2019-05-24T18:00:24Z"
        reason: Error
        startedAt: "2019-05-24T17:59:44Z"
hostIP: 10.3.23.230
phase: Succeeded
podIP: 192.168.1.85
qosClass: BestEffort
startTime: "2019-05-24T17:59:41Z"

smarterclayton commented 5 years ago

In general I agree that a sidecar KEP needs to take into account pod phase and restart policy before it can go to an implementable state. I don't care whether it's this KEP or not, but I agree in general with @zhan849's arguments, and it needs to be dealt with here.

zhan849 commented 5 years ago

Thanks @smarterclayton! @Joseph-Irving, let us know if there is anything else you'd like us to share about sidecars in practice.

Joseph-Irving commented 5 years ago

@smarterclayton @zhan849, I don't particularly disagree with the points you're making; I'm just trying to give some counterpoints. It was a conscious choice not to change pod phase/restart policy, as that would further increase the scope of this proposal and nobody felt very strongly about it.

I will take this feedback back to sig-apps/sig-node and see what they think. sig-node in particular were keen on keeping the sidecars as close to normal containers as possible, so if @derekwaynecarr or @dchen1107 want to chime in, that would be appreciated.

Joseph-Irving commented 5 years ago

The test plan https://github.com/kubernetes/enhancements/pull/951 and API design https://github.com/kubernetes/enhancements/pull/919 PRs have now been merged.

I've opened https://github.com/kubernetes/enhancements/pull/1109 to get the KEP marked as implementable, once everyone is happy with that we should be able to start development for this as alpha in 1.16 🤞

Joseph-Irving commented 5 years ago

This KEP has been marked implementable, so I will be raising PRs to get this into 1.16 starting next week!

Joseph-Irving commented 5 years ago

I've raised https://github.com/kubernetes/kubernetes/pull/79649 to implement the API, I will have a separate PR for the Kubelet changes.

kacole2 commented 5 years ago

Hi @Joseph-Irving, I'm the 1.16 Enhancement Lead. Is this feature going to be graduating through alpha/beta/stable stages in 1.16? Please let me know so it can be added to the 1.16 Tracking Spreadsheet. If it's not graduating, I will remove it from the milestone and change the tracked label.

Once coding begins or if it already has, please list all relevant k/k PRs in this issue so they can be tracked properly.

Milestone dates are Enhancement Freeze 7/30 and Code Freeze 8/29.

Thank you.

mhuxtable commented 5 years ago

@Joseph-Irving If you want/need some extra people to implement this, I have a lot of interest in this landing, so I'm happy to lend a hand.

Joseph-Irving commented 5 years ago

Hi @kacole2, this is targeting alpha for 1.16, and the KEP has been marked implementable. The only PR for this currently is kubernetes/kubernetes#79649 for the API.

@mhuxtable I will be raising the PR for the kubelet changes fairly soon; I'm just finishing off some things. I would greatly appreciate some help having a look at that. I will link it here when it's raised.

Joseph-Irving commented 5 years ago

I've opened https://github.com/kubernetes/kubernetes/pull/80744 which implements the kubelet changes.

Please note that kubernetes/kubernetes#79649 (API) is still open, so this PR contains commits from it, making it seem large. I've broken it down into commits that each implement a different bit of functionality, so it should be easy to review it that way.

I've not quite finished all the test cases for this, but the first draft of a working implementation is done, so I'd like people to take a look.

cc @kacole2

daminisatya commented 5 years ago

@Joseph-Irving

I'm one of the v1.16 docs shadows. Does this enhancement (or the work planned for v1.16) require any new docs (or modifications to existing docs)? If not, can you please update the 1.16 Enhancement Tracker Sheet (or let me know and I'll do so)?

If so, just a friendly reminder that we're looking for a PR against k/website (branch dev-1.16) due by Friday, August 23rd; it can just be a placeholder PR at this time. Let me know if you have any questions!

Joseph-Irving commented 5 years ago

Hi @daminisatya, yes, this will need updates to the docs; I've raised https://github.com/kubernetes/website/pull/15693 as a placeholder PR. I'd be interested to know if anyone has opinions on where the docs should go; I've put something in content/en/docs/concepts/workloads/pods/pod-lifecycle.md for now.

Joseph-Irving commented 5 years ago

With less than one week to go before Code Freeze, it's looking very unlikely that this will be able to make it into 1.16. We've still got two relatively large open PRs kubernetes/kubernetes#80744 and kubernetes/kubernetes#79649 which have struggled to get any reviews. Hopefully there will be more reviewer bandwidth next release cycle to look at these.

khenidak commented 5 years ago

/assign

Ciantic commented 5 years ago

Could this allow writing a sidecar that can start the actual service on demand (and tear it down again)?

Like scale-to-zero, but the only thing running while idle is the sidecar. When a request comes in, it spins up the actual service, and some time after the last response (e.g. 30s) it shuts it down again. This could be a simple way to scale to nearly zero (with only the sidecars left running).

janosroden commented 5 years ago

@Ciantic With the Operator Framework you can do that and much more. Take a look.

Ciantic commented 5 years ago

@janosroden I looked, but it's pretty difficult to see how I would turn running services into something scalable to zero.

The problem is not that there aren't available options, e.g. Osiris, KEDA, or Knative. I tried the last one, but it hogs 8 GB of memory; it's hard to call that 'serverless'.

The problem is that most of those implementations need new resources, etc. It's much easier to think of it as injecting a sidecar which can control the whole lifecycle (including starting and restarting on demand), so that it can control the service beyond just sitting there.

Why would this be beneficial? It's really useful in low-utilisation and low-memory situations, e.g. k3s on a Raspberry Pi, or a DigitalOcean droplet for hobby projects. Many of us have lots of web services that don't need to be running all the time; just having a sidecar which can wake them up on demand would be enough.

kfox1111 commented 5 years ago

Not sure this really works for your use case. I totally see the desire to do what you want to do on such resource-constrained systems. But to be really stable, you need to use resource requests to help schedule the workload. These would need to be specified up front, so regardless of whether the workload is running or not, it would be reserving the resources.

To work around this, you pretty much need a pod of its own that receives the initial connection, makes a new pod request to k8s, waits for it to spin up, and then sends the traffic to it. Sidecar container enhancements aren't needed in this case, I think. You need something more like an xinetd for k8s.
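
For example, a minimal sketch of what would have to be declared up front regardless of whether the wrapped service is actually running (image name and values are illustrative):

containers:
- name: on-demand-service
  image: my-web-service
  resources:
    requests:
      cpu: 100m        # reserved on the node even while the service is idle
      memory: 128Mi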

mrbobbytables commented 5 years ago

Hey there @Joseph-Irving -- 1.17 Enhancements lead here. I wanted to check in and see if you think this Enhancement will be graduating to alpha in 1.17?

The current release schedule is:

If you do, once coding begins please list all relevant k/k PRs in this issue so they can be tracked properly. 👍

Thanks!

Joseph-Irving commented 5 years ago

Hi @mrbobbytables, assuming we can get everything reviewed in time the plan is to graduate to alpha in 1.17.

The current open PRs are:

https://github.com/kubernetes/kubernetes/pull/79649 - API Changes
https://github.com/kubernetes/kubernetes/pull/80744 - Kubelet Changes

Let me know if you need anything else!