[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list

edouardKaiser commented 8 years ago

Is there a way to tell the NGINX Ingress controller to use the Service Virtual IP Address instead of maintaining the Pods IP addresses in the upstream configuration ?

I couldn't find it. If not, I think it would be good. Because with the current situation, when we scale down a service. the Ingress controller does not work in harmony with the Replication Controller of the service.

That means, some requests to the Ingress Controller will fail while waiting for the Ingress Controller to be updated.

If we use the Service Virtual IP address, we can let kube-proxy do its job in harmony with the replication controller and we have a seamless down scaling.

edouardKaiser commented 8 years ago

I guess it has been implemented that way for session stickiness. But for applications who don't need that, it could be a good option.

aledbf commented 8 years ago

when we scale down a service. the Ingress controller does not work in harmony with the Replication Controller of the service.

What do you mean? After a change in the number of replicas in a rc it takes a couple of seconds to receive the update from the api server.

In the case of scaling down the number of replicas you need to tune the upstream check defaults docs

Besides this I'm testing the module lua-upstream-nginx-module to avoid reloads and be able to add/remove servers in an upstream

edouardKaiser commented 8 years ago

Ok, I'll try to explain with another example:

When you update a deployment resource (like changing the docker image), depending your configuration (rollingUpdate strategy, max surge, max unavailable), the deployment controller will bring down some pods, and create new one. All of this, in a fashion way where there is no downtime if you use the Service VIP to communicate with the pods.

Because first, when it wants to kill a pod, it removes the pod IP address from the service to avoid any new connection, and it follow the termination grace period of the pod to drain the existing connections. Meanwhile, it also creates a new pod, with the new docker image, and wait for the pod to be ready, and add the pod behind the service VIP.

By maintaining the pod list yourself in the Ingress Controller, at a certain point, during a deployment resource update, some requests will be redirected to pods which are shutting down. Because the Ingress Controller, does not know a RollingUpdate Deployment is happening. It will know maybe 1 second later. But for services, with a lots of connection/sec, it's potentially a lots of requests failing.

I personally don't want to tune the upstream to handle this scenario. Kubernetes is already doing an amazing job to update pods with no downtime. That only if you use the Service VIP.

Did I miss something ? If it's still not clear, or there is something I'm clearly not understanding, please don't hesitate.

edouardKaiser commented 8 years ago

The NGINX Ingress Controller (https://github.com/nginxinc/kubernetes-ingress) used to use the service VIP. But they changed recently to a system like yours (pod list in the upstream).

Before they changed this behaviour, I did some test. I was continuously spamming requests to the Ingress Controller (5/sec). Meanwhile, I updated the Deployment resource related to those requests (new docker images):

Kubernetes/Contrib: You can clearly see some requests failing at the time of the update
NGINX/Controller: It looks like nothing happened behind the scene, perfect deployment, with no downtime (all because of Kubernetes doing a great job behind the scene)

aledbf commented 8 years ago

@edouardKaiser how are you testing this? The request are GET or POST? Can you provide a description of the testing scenario?

aledbf commented 8 years ago

I personally don't want to tune the upstream to handle this scenario.

I understand that but your request is the contradicts what other users requested (control over the upstream checks). Is hard to find a balance in the configuration that satisfies all the user scenarios.

edouardKaiser commented 8 years ago

I understand some people might want to tweak the upstream configuration, but on the other side Kubernetes is doing a better job at managing deployment without downtime with the concept of communicating with pods through the Service VIP.

To reproduce, I just used the Chrome App Postman, and their Runner feature (you can specify some requests to run to a particular endpoint, with a number of iteration, delay....). And while the runner was running, I just updated the Deployment resource and watched the behaviour of the runner.

When the request is GET and it fails, Nginx automatically passes the request to the next server. But for non-idempotent method like POST, it does not (and I think it's the right behavior), and then we have failure.

aledbf commented 8 years ago

But for non-idempotent method like POST, it does no

This is documented scenario https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx#retries-in-no-idempotent-methods NGINX changed this behavior in 1.9.13

Please add the option retry-non-idempotent=true in the nginx configmap to restore the old behavior

edouardKaiser commented 8 years ago

But it doesn't change the root of the problem: Ingress Controller and Deployment Controller don't work together.

Your pod might have accepted the connection and started to process it, but what the Ingress Controller does not know, is that this pod is gonna get killed the next second by the Deployment Controller.

For the Deployment Controller it's fine, the Deployment Controller did its job, removed the pod from the service and waited the termination grace period.
On the Ingress Controller: it's not fine, your connection will suddenly be aborted because the pod died. If you're lucky, the pod got removed by the Ingress Controller before any request gets in. If you're not, you'll experiment some failure. If it's a POST request, most of the time you really don't want to retry. If it's a GET request, but the pod gets killed in the middle of transferring a response, NGINX won't retry.

I know this is not a perfect world, and we need to embrace failure. Here, we have a way to potentially avoid that failure by using Service VIP.

I'm not saying it should be the default behaviour, but an option to use Service VIP instead of Pod endpoint would be awesome.

glerchundi commented 8 years ago

I'm with @edouardKaiser because:

You (as devops or operations guy) cannot guarantee that the final developer will follow the best practices regarding to keeping, dropping connections whenever it gets a SIGTERM for example. However, if services were used the responsibility for site reliability falls completely in concrete person/team.
Although retry-no-idempotent can mitigate some problems others could arise, choosing to retry those is not an option in much cases.

IMO the controller should expose a parameter or something to choose between final endpoints or services, that would cover all the use cases.

edouardKaiser commented 8 years ago

I couldn't have explained it better. Thanks @glerchundi

thockin commented 8 years ago

If you go through the service VIP you can't ever do session affinity. It also incurs some overhead, such as conntrack entries for iptables DNAT. I think this is not ideal.

To answer the questions about "coordination" this is what readiness and grace period are for. What is supposed to happen is:

RC creates 5 pods A, B, C, D, E all 5 pods become ready endpoints controller adds all 5 pods to the Endpoints structure ingress controller sees the Endpoints update ... serving ... RC deletes 2 pods (scaling down) pods D and E are marked unready kubelet notifies pods D and E endpoints controller sees readiness change, removes D and E from endpoints ingress controller sees the Endpoints update and removes D and E termination grace period ends kubelet kills pods D and E

It is possible that your ingress controller falls so far behind that the grace expires before it has a chance to remove endpoints, but that is the nature of distrubited systems. It's equally possible that kube-proxy falls behind - they literally use the same API.

On Mon, Jun 27, 2016 at 6:51 PM, Edouard Kaiser notifications@github.com wrote:

I couldn't have explained it better. Thanks @glerchundi https://github.com/glerchundi

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-228926948, or mute the thread https://github.com/notifications/unsubscribe/AFVgVLVBHdUQdCIRD6uVDhs8Icwg5k8tks5qQH4hgaJpZM4IuliB .

edouardKaiser commented 8 years ago

I do understand this is not ideal for everyone, this is why I was talking about an option for this behaviour.

thockin commented 8 years ago

But I don't see how it is better for anyone? It buys you literally nothing, but you lose the potential for affinity and incur performance hit for no actual increase in stability.

On Mon, Jun 27, 2016 at 8:50 PM, Edouard Kaiser notifications@github.com wrote:

I do understand this is not idea for everyone, this is why I was talking about an option for this behaviour.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-228941142, or mute the thread https://github.com/notifications/unsubscribe/AFVgVEExdHf-NncMom53MS7914scnzJjks5qQJnrgaJpZM4IuliB .

edouardKaiser commented 8 years ago

Correct me if I'm wrong, I probably misunderstood something in the termination of pods flow:

When scaling down, the pod is removed from endpoints list for service and, at the same time, a TERM signal is sent.

So, for me, at this exact moment, there is an opened window. Potentially, this pod (which is shutting down gracefully), might still get some requests forwarded by the nginx ingress-controller (just the time it needs for the ingress-controller to notice the changes, update and reload the conf).

thockin commented 8 years ago

On Mon, Jun 27, 2016 at 9:53 PM, Edouard Kaiser notifications@github.com wrote:

Correct me if I'm wrong, I probably misunderstood something in the termination of pods flow:

When scaling down, the pod is removed from endpoints list for service and, at the same time, a TERM signal is sent.

Pedanticly, "at the same time" has no real meaning. It happens asynchronously. It might happen before or after or at the same time.

So, for me, at this exact moment, there is an opened window. Potentially, this pod (which is shutting down gracefully), might still get some requests forwarded by the nginx ingress-controller (just the time it needs for the ingress-controller to notify the changes, update and reload the conf).

The pod can take as long as it needs to shut down. Typically O(seconds) is sufficient time to finish or cleanly terminate open connections and ensure no new connections arrive. So, for example, you could request 1 minute grace, keep accepting connections for max(5 seconds since last connection, 30 seconds), drain any open connections, and then terminate.

Note that the exact same thing can happen with the service VIP. kube-proxy is just an API watcher. It could happen that kube-proxy sees the pod delete after kubelet does, in which case it would still be routing service VIP traffic to th epod that had already been signalled. There's literally no difference. That's my main point :)

edouardKaiser commented 8 years ago

True, "at the same time" doesn't mean that much here, it's more like those operations are triggered in parallel.

I wanted to point out that possibility because I ran some tests before opening this issue (continuously sending requests to an endpoint backed by multiple-pods while scaling down). And when ingress-controller was using VIP, the down-scaling was happening more smoothly (no failure, no request passed to the next server by nginx), contrary to when the ingress-controller is maintaining the endpoint list (I noticed some requests were failing for that short time-window, and passed to the next server for the GET, PUT type...).

I'm surprised the same thing can happen with the service VIP. I supposed that Kubelet would start the shutdown only once the pod was removed from the iptable entries, but I was wrong.

So your point is, I got lucky during my tests, because depending the timing, I might have ended up with the same situation even with Service VIP.

thockin commented 8 years ago

On Mon, Jun 27, 2016 at 10:33 PM, Edouard Kaiser notifications@github.com wrote:

I'm surprised the same thing can happen with the service VIP. I supposed that Kubelet would start the shutdown only once the pod was removed from the iptable entries, but I was wrong.

Nope. kube-proxy is replaceable, so we can't really couple except to the API.

So your point is, I got lucky during my tests, because depending the timing, I might have ended up with the same situation even with Service VIP.

I'd say you got UNlucky - it's always better to see the errors :)

If termination doesn't work as I described (roughly, I may get some details wrong) we should figure out why

edouardKaiser commented 8 years ago

Thanks for the explanation Tim, I guess I can close this one.

thockin commented 8 years ago

Not to impose too much, but since this is a rather frequent topic, I wonder if you want to write a doc or an example or something? A way to demonstrate the end-to-end config for this? I've been meaning to do it, but it means so much more when non-core-team people document stuff (less bad assumptions :).

I'll send you a tshirt...

On Mon, Jun 27, 2016 at 11:29 PM, Edouard Kaiser notifications@github.com wrote:

Thanks for the explanation Tim, I guess I can close this one.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-228960154, or mute the thread https://github.com/notifications/unsubscribe/AFVgVL1tS9yhMYvUh8MM6UKypMtCJRYEks5qQL9YgaJpZM4IuliB .

edouardKaiser commented 8 years ago

Happy to write something.

Were you thinking about updating the README of the Ingress Controllers (https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx)?

We could add a new paragraph about the choice of using endpoint list instead of service VIP (advantages like upstream tuning, session affinity..) and showing that there is no guarantee of synchronisation even by using the service VIP.

glerchundi commented 8 years ago

@thockin thanks for the explanation, it's very water clear now.

edouardKaiser commented 8 years ago

I'm glad I have a better understanding on how it works, it makes sense if you think about the kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't handle very well the SIGTERM (no graceful stop..). Even if the application is in the middle of a request, a SIGTERM would shutdown the app immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding was this flow when scaling down/deploying new version:

Remove the pod from the endpoint list
Wait for the terminationGracePeriod (to wait for any request in progress to finish)
Then shutdown with SIGTERM

We need to rethink about how to deploy or see if we can adapt our application to handle SIGTERM.

thockin commented 8 years ago

wrt writing something, I was thinking a doc or a blog post or even an example with yaml and a README

On Mon, Jun 27, 2016 at 11:45 PM, Edouard Kaiser notifications@github.com wrote:

Happy to write something.

Were you think about updating the README of the Ingress Controllers ( https://github.com/kubernetes/contrib/tree/master/ingress/controllers/nginx )?

We could add a new paragraph about the choice of using endpoint list instead of service VIP (advantages like upstream tuning, session affinity..) and showing that there is no guarantee of synchronisation even by using the service VIP.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-228964234, or mute the thread https://github.com/notifications/unsubscribe/AFVgVIOlLSwqcb-Gq7h7JcXUutoVQjmzks5qQMMCgaJpZM4IuliB .

thockin commented 8 years ago

You also have labels at your disposal.

If you make your Service select app=myapp,active=true, then you can start all your pods with that set of labels. When you want to do your own termination, you can remove the active=true label from the pod, which will update the Endpoints object, and that will stop sending traffic. Wait however long you think you need, then delete the pod.

Or you could teach your apps to handle SIGTERM.

Or you could make an argument for a configurable signal rather than SIGTERM (if you can make a good argument)

Or ... ? other ideas welcome

On Tue, Jun 28, 2016 at 4:47 AM, Edouard Kaiser notifications@github.com wrote:

I'm glad I have a better understanding on how it works, it makes sense if you think about the kube-proxy as just an API watcher.

But to be honest, now I'm kind of stuck. Some of our applications don't handle very well the SIGTERM (no graceful stop..). Even if the application is in the middle of a request, a SIGTERM would shutdown the app immediately.

Using Kubernetes, I'm not sure how to deploy without downtime now. My initial understanding was this flow when scaling down/deploying new version:

Remove the pod from the endpoint list

Wait for the terminationGracePeriod (to wait for any request in progress to finish)

Then shutdown with SIGTERM

We need to rethink about how to deploy or see if we can adapt our application to handle SIGTERM.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-229026299, or mute the thread https://github.com/notifications/unsubscribe/AFVgVMo0zaGLSpVIPgp_E8TwYDsNX-Rcks5qQQnUgaJpZM4IuliB .

edouardKaiser commented 8 years ago

Thanks for the advice, I tend to forget how powerful labels can be.

Regarding writing something, I can write a blog to explain why using and endpoint list is better. But I'm not sure what kind of example (YAML) you are talking about.

thockin commented 8 years ago

I guess there's not much YAML to write up. :) I just want to see something that I can point the next person who asks this at and say "read this"

On Tue, Jun 28, 2016 at 8:48 PM, Edouard Kaiser notifications@github.com wrote:

Thanks for the advice, I tend to forget how powerful labels can be.

Regarding writing something, I can write a blog to explain why using and endpoint list is better. But I'm not sure what kind of example (YAML) you are talking about.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-229249421, or mute the thread https://github.com/notifications/unsubscribe/AFVgVPjK5UBgJeyzfyR0VczUdDbpIw6Dks5qQesegaJpZM4IuliB .

edouardKaiser commented 8 years ago

No worries Tim, I keep you posted.

thockin commented 8 years ago

Fantastic!!

On Tue, Jun 28, 2016 at 11:05 PM, Edouard Kaiser notifications@github.com wrote:

No worries Tim, I keep you posted.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-229264761, or mute the thread https://github.com/notifications/unsubscribe/AFVgVDx-gvq8eGKsrHoOrAgMbN0R2WXZks5qQgsXgaJpZM4IuliB .

edouardKaiser commented 8 years ago

I just created this blog entry:

http://onelineatatime.io/ingress-controller-forget-about-the-service/

I hope it will help some people. Feel free to tell me if there is anything wrong, anything that I could do to improve this entry.

Cheers,

thockin commented 8 years ago

Great post!!

Small nit:

1. Replication Controller deletes 1 pod
2. Pod is marked unready and shows up as Terminating
3. TERM signal is sent
4. Pod is removed from endpoints
5. Pod receives SIGKILL after grace period
6. Kube-proxy detects the change of the endpoints and update iptables

should probably be:

1. Replication Controller deletes 1 pod
2. Pod is marked as Terminating
3. Kubelet observes that change and sends SIGTERM
4. Endpoint controller observes the change and removes the pod from
Endpoints
5. Kube-proxy observes the Endpoints change and updates iptables
6. Pod receives SIGKILL after grace period

3 and 4 happen roughly in parallel. 3 and 5 are async to each other, so it's just as likely that 5 happens first as vice-versa. Your Ingress controller would be 4.1. Ingress controller observes the change and updates the proxy :)

On Wed, Jul 6, 2016 at 10:45 PM, Edouard Kaiser notifications@github.com wrote:

I just created this blog entry:

http://onelineatatime.io/ingress-controller-forget-about-the-service/

I hope it will help some people. Feel free to tell me if there is anything wrong, anything that I could do to improve this entry.

Cheers,

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-230984547, or mute the thread https://github.com/notifications/unsubscribe/AFVgVGMuCT6-ArM6-SodThc8wrsEFfSDks5qTJKRgaJpZM4IuliB .

edouardKaiser commented 8 years ago

Thanks Tim, I will update it !

timoreimann commented 8 years ago

If you make your Service select app=myapp,active=true, then you can start all your pods with that set of labels. When you want to do your own termination, you can remove the active=true label from the pod, which will update the Endpoints object, and that will stop sending traffic. Wait however long you think you need, then delete the pod.

I was wondering if the above approach could potentially be built into Kubernetes directly. The benefit I see is that people won't need to create custom routines which effectively bypass all standard tooling (e.g., kubectl scale / delete).

If labels aren't the right thing for this case, I could also think of a more low-levellish implementation: Introduce a new state called Deactivating that precedes Terminating and serves as a trigger for the Endpoint controller to remove a pod from rotation. After (yet another) configurable grace period, the state would switch to Terminating and cause kubelet to SIGTERM the pod as usual.

@thockin would that be something worth pursuing or rather be out of question?

thockin commented 8 years ago

I'm very wary of adding another way of doing the same thing as a core feature. For the most part, graceful termination should do the right thing for most people.

I could maybe see extending DeploymentStrategy to offer blue-green rather than rolling, but that's not really this.

On Sat, Jul 9, 2016 at 5:53 AM, Timo Reimann notifications@github.com wrote:

If you make your Service select app=myapp,active=true, then you can start all your pods with that set of labels. When you want to do your own termination, you can remove the active=true label from the pod, which will update the Endpoints object, and that will stop sending traffic. Wait however long you think you need, then delete the pod.

I was wondering if the above approach could potentially be built into Kubernetes directly. The benefit I see is that people won't need to create custom routines which effectively bypass all standard tooling (e.g., kubectl scale / delete).

If labels aren't the right thing for this case, I could also think of a more low-levellish approach: Introduce a new state called Deactivating that precedes Terminating and serves as a trigger for the Endpoint controller to remove a pod from rotation. After (yet another) configurable grace period, the state would switch to Terminating and cause kubelet to SIGTERM the pod as usual.

@thockin https://github.com/thockin would that be something worth pursuing or rather be out of question?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/contrib/issues/1140#issuecomment-231533093, or mute the thread https://github.com/notifications/unsubscribe/AFVgVABppSzvMyBPqIy2Xuyw38h0L3poks5qT5m0gaJpZM4IuliB .

timoreimann commented 8 years ago

@thockin If I understand correctly, the way to allow for a non-interruptive transition using graceful termination is to have a SIGTERM handler that (in the most simplistic use case) just delays termination for a safe amount of time.

Is there a way to reuse such a handler across various applications, possibly through a sidecar container? Otherwise, I see the problem that the handler must be implemented and integrated for each and every application (at least per language/technology) over and over again. For third-party applications, it may even be impossible to embed a handler directly.

thockin commented 8 years ago

On Sun, Jul 10, 2016 at 3:01 PM, Timo Reimann notifications@github.com wrote:

@thockin If I understand correctly, the way to allow for a non-interruptive transition using graceful termination is to have a SIGTERM handler that (in the most simplistic use case) just delays termination for a safe amount of time.

There's no use to handle SIGTERM from a sidecar, if the main app dies upon receiving it. It doesn't "just" delay - it notifies the app that its end-of-life is near, and that it should wrap up and exit soon, or otherwise be prepared.

Is there a way to reuse such a handler across various applications, possibly through a sidecar container? Otherwise, I see the problem that the handler must be implemented and integrated for each and every application (at least per language/technology) over and over again. For third-party applications, it may even be impossible to embed a handler directly.

The problem is that "handling" SIGTERM is really app-specific. Even if you just catch it and ignore it, that's a decision we shouldn't make for you.

Now, we have a proposal in flight for more more generalized notifications, including HTTP, so maybe we can eventually say that, rather than SIGTERM being hardcoded, that is merely the default handler, but someone could override that. But that spec is not fully formed yet, and I don't want to build on it just yet.

I'm open to ideas, but I don't see a clean way to handle this. Maybe a pod-level field that says "don't send me SIGTERM, but pretend you did" ? That's sort of ugly..

timoreimann commented 8 years ago

What I meant by delaying termination is that a custom SIGTERM handler could keep the container alive (i.e., time.Sleep(reasonablePeriod)) long enough until the point in time where it's safe to believe that the endpoint controller has taken the pod out of rotation so that requests won't hit an already dead pod. I don't think this is an ideal approach for reasons I have mentioned -- my assumption was that this is what you meant when you said "graceful termination [as a core feature] should do the right thing for most people". Maybe I misunderstood you; if so, I'd be glad for some clarification.

To repeat my intention: I'm looking for the best way to prevent request drops when scaling events / rolling-upgrades occur (as the OP described) without straying too far away from what standard tooling (namely kubectl) gives. My (naive) assessment is that the Kubernetes control plane is best suited for doing the necessary coordinative effort.

Do you possibly have any issue/PR numbers to share as far as that generalized notification proposal is concerned?

bprashanth commented 8 years ago

you should fail your readiness probe when you receive a sigterm. The nginx controller will health check endpoint readiness every 1s and avoid sending requests. Set termination grace to something high and keep nginx (or whatever webserver you're running in your endpoint pod) alive till existing connections drain. Is this enough? (I haven't read through previous conversation, so apologies if this was already rejected as a solution).

It sounds like what you're really asking for is to use the service vip in the nginx config and cut out the racecondition that springs from going: kubelt readiness -> apiserver -> endpoints -> kubeproxy, we've discussed various ways to achieve this (https://github.com/kubernetes/kubernetes/issues/28442), but right now the easiest way is to health check endpoints from the ingress controller.

thockin commented 8 years ago

On Sun, Jul 10, 2016 at 3:57 PM, Timo Reimann notifications@github.com wrote:

What I meant by delaying termination is that a custom SIGTERM handler could keep the container alive (i.e., time.Sleep(reasonablePeriod)) long enough until the point in time where it's safe to believe that the endpoint controller has taken the pod out of rotation so that requests won't hit an already dead pod. I don't think this is an ideal approach for reasons I have mentioned -- my assumption was that this is what you meant when you said "graceful termination [as a core feature] should do the right thing for most people". Maybe I misunderstood you; if so, I'd be glad for some clarification.

I'm a little confused, I guess. What you're describing IS the core functionality. When a pod is deleted, we notify it and wait at least grace-period seconds before killing it. During that time window (30 seconds by default), the Pod is considered "terminating" and will be removed from any Services by a controller (async). The Pod itself only needs to catch the SIGTERM, and start failing any readiness probes. Assuming nothing is totally borked in the cluster, the load-balancers should stop sending traffic and the pod will be OK to terminate within the grace period. This is, truthfully, a race and a bit of wishful thinking. If something is borked in the cluster, it is possible that load-balancers don't remove pods "in time" and when the pod dies it kills live connections.

The alternative is that we never kill a pod while any service has the pod in it's LB set. In the event of brokenness we trade a hung rolling update for the above-described early termination. Checking which Services a pod is in is hard, and we just don't do that today. Besides that, it's an unbounded problem. Today it is Services, but it is also Ingresses. But ingresses are somewhat opaque in this regard, so we can't actually check. And there may be arbitrary other frontends to a Pod. I don't think the problem is solvable this way.

So you're saying that waiting some amount of time is not "ideal", and I am agreeing. But I think it is less bad than the other solutions.

The conversation turned to "but my app doesn't handle SIGTERM", to which I proposed a hacky labels-based solution. It probably works, but it is just mimicing graceful termination

To repeat my intention: I'm looking for the best way to prevent request drops when scaling events / rolling-upgrades occur (as the OP described) without straying too far away from what standard tooling (namely kubectl) gives. My (naive) assessment is that the Kubernetes control plane is best suited for doing the necessary coordinative effort.

Graceful termination. This is the kubernetes control plane doing the coordinative effort. "wait some time" is never a satisfying answer, but in practice it is often sufficient. In this case, the failures that would cause "wait" to misbehave probably cause worse failures if you try to close the loop.

Do you have any issue/PR numbers to share as far as that generalized notification proposal is concerned?

https://github.com/kubernetes/kubernetes/pull/26884

timoreimann commented 8 years ago

@thockin @bprashanth sorry for not getting back on this one earlier. I did intend to follow up on your responses.

First, thanks for providing more details to the matter.

I'm fine with accepting the fact that graceful termination involves some timely behavior which also provides a means to set upper bounds in case things start to break. My concerns are more about the happy path and the circumstance that presumably a lot of applications running on Kubernetes will have no particular graceful termination needs but want the necessary coordination between the shutdown procedure and load-balancing adjustments to take place. As discussed, these applications need to go through the process of implementing a signal handler to switch off the readiness probe deliberately.

To add a bit of motivation on my end: We plan to migrate a number of applications to Kubernetes where the vast majority of them serves short-lived requests only and has no particular needs with regards to graceful termination. When we want to take instances down in our infrastructure, we just remove them from LB rotation and make sure in-flight requests are given enough time to finish. Moving to Kubernetes, we'll have to ask every application owner to implement and test a custom signal handler, and in the case of closed third-party applications resort to complicating workarounds with workflows/tooling separate from the existing standards. My impression is that this represents an undesirable coupling between the applications running on Kubernetes and an implementation detail on the load balancing routing part of the cluster manager.

That's why I think having a separate mechanism exclusively implemented in the control plane could contribute to simplifying running applications on Kubernetes by removing some of the lifecycle management boilerplate. To elaborate a bit on my previous idea: Instead of making each application fail its readiness probe, make Kubernetes do that "externally" and forcefully once it has decided to take a pod down, and add another grace period (possibly specified with the readiness probe) to give the system sufficient time for the change in readiness to propagate. This way, custom signal handlers for the purpose of graceful termination become limited in scope to just that: Giving an application the opportunity to execute any application-specific logic necessary to shut down cleanly, while all the load balancing coordination happens outside and up front. I'm naively hopeful that by reusing existing primitives of Kubernetes like readiness probes, timeouts, and adjustments to load-balancing, we can avoid dealing with the kind of hard problems that you have mentioned (checking which services a pod is in, unbound number of service frontends).

I'm wondering if it might be helpful to open a separate proposal issue and discuss some of these things in more detail. Please let me know if you think it's worthwhile carrying on.

Thanks.

ababkov commented 7 years ago

Sorry for pinging an old tread but i'm struggling a tad to find concrete answers in the core docs on this and this is the best thread i've found so far which explains what's going on... so can I clarify a few things in relation to termination? If this should be posted somewhere else, please let me know.

A: The termination cycle:

Before a pod is terminated, it enters a "terminating" phase that lasts the duration of a configurable grace period (which by default is 30 seconds). Once this period concludes, it enters a "terminated" phase while resources are cleaned up and is then eventually deleted (or does it just get deleted after the terminating phase?).
As soon as the pod enters the "terminating" phase, each container is sent a SIGTERM or a custom command (if the preStop lifecycle hook is configured on a container) and at the same time the pod immediately automatically advertises an "unready" state.
The rules in 1 and 2 are followed regardless of the reason for termination i.e. if the pod is exceeding max memory / cpu usage, node is OOM etc. In no case will the Pod be SIGKILL'd without first entering the "terminating" phase with the grace period.
Services, ingress etc. will see the change of the pod to the "unready" state via a subscription to the state store and start removing the pod from their own configurations. Because this entire process is async, the pod may still get some additional traffic for a few seconds after it's received a SIGTERM.
An ingress and / or service will not by default (and cannot otherwise be configured to) retry its request with another pod if it receives some kind of pod is terminating (or other) status code or response.

B: Handling the termination

If you want 100% throughput, with no dropped traffic it is not recommended that containers actually terminate themselves after being sent a SIGTERM command. Instead, they should clean up what they can and stick around to handle any remaining requests that may trickle through until the grace period expires or at least for the period of time that you guess it would take for services / ingress etc. to update their configurations.
If you're happy to drop some traffic, you can instead actually terminate your processes / containers when you get your SIGTERM signal or have them start failing their liveness probes. Even if marked as "essential", containers will not be restarted if the pod is in a terminating phase. If all containers stop prior to the grace period expiring, the pod immediately enters the "terminated" phase.

timoreimann commented 7 years ago

@ababkov From what I can see, your description is pretty much correct. (Others are needed to fully judge though.)

Two remarks:

Re: A.3.: I'd expect an OOM'ing container to receive a SIGKILL right away in the conventional Linux sense. For sure it exits with a 137 code, which traditionally represents a fatal error yielding signal n where n = 137 - 128 = 9 = SIGKILL. Re: A.5.: Ingress sends traffic to the endpoints directly, which means that in principle it's possible for it to retry requests to other pods. Whether that's an actual default or something that can be configured depends on the concrete Ingress controller used. AFAIK, both Nginx and Traefik support retrying. As far as Services are concerned, you need to distinguish between the two available modes: In userspace mode, requests can be retried. In iptables mode (the current default), they cannot. (There's a ticket to try out IPVS as a third option which, as far as I understood, would bring the benefits of the two modes together: high performance while also being able to retry requests.)

Here's a recommendable read to better understand Ingresses and Services: http://containerops.org/2017/01/30/kubernetes-services-and-ingress-under-x-ray/

ababkov commented 7 years ago

Thanks very much for your reply @timoreimann - re A.5 will watch the IPVS item, also the post you linked is really good - helped me to get a better understanding of kube-proxy - had i not spent days gradually coming to the realisation how services and ingress work it probably would have helped for that as well.

Re A.3 - Is your explanation based on a pod that's gone over its memory allocation or a node that is out of memory and killing pods so it can continue running? An immediate sigkill might be a little frustrating if you're trying to ensure your apps have a healthy shut down phase.

If i could get a few more opinions on my post above from one or two others and / or some links to relevant documentation where you think these scenarios are covered in detail (understanding I have done quite a lot of research before coming here) that would be great.

I know I can just experiment with this myself and "see what happens", but if there's someway to shortcut that and learn from others and / or the core team, that would be awesome.

timoreimann commented 7 years ago

@ababkov re: A.3: Both: There's an OOM killer for the global case (node running out of memory) and the local one (memory cgroup exceeding its limit). See also this comment by @thockin on a Docker issue.

I think that if you run into a situation where the OOM killer has selected a target process, it's already too late for a graceful termination: After all, the (global or local) system took this step in order to avoid failure on a greater scale. If "memory-offending" processes were given an additional opportunity to lengthen this measure arbitrarily, the consequences could be far worse. I'd argue that it's better to prevent an OOM situation from occurring in the first place by means of monitoring/alerting on memory consumption continuously such that you still have enough time to react.

While doing a bit of googling, I ran across kubelet soft eviction thresholds. With these, you can define an upper threshold and tell kubelet to shut down pods gracefully in time before a hard limit is reached. From what I can tell though, the eviction policies operate on a node level, so it won't help in the case where a single pod exceeds its memory limit.

Again, chances are there's something I'm missing. Since this issue has been closed quite a while ago, my suggestion to hear a few more voices would be to create a new issue / mailing list thread / StackOverflow question. I'd be curious to hear what others have to say, so if you decide to follow my advice please leave a reference behind. :-)

ababkov commented 7 years ago

@timoreimann the soft eviction thresholds add another piece to the puzzle - thanks for linking.

Happy to open another topic - i'm still new to the project but presuming i'd open this in this repo in particular?

Topic would be along the lines of trying to get some clarity in place around the nature of the termination lifecycle in every possible scenario that a pod can be terminated.

timoreimann commented 7 years ago

@ababkov I'd say that if the final goal is to contribute the information you will gain back to Kubernetes project (supposedly in form of better documentation), an issue in the main kubernetes/kubernetes repo seems in order.

OTOH, if this is "only" about getting your questions answered, StackOverflow is probably the better place to ask.

Up to you. :-)

ababkov commented 7 years ago

@timoreimann more than happy to contribute - thanks for your help mate.

domino14 commented 7 years ago

It is very odd to me that in order to not drop any traffic, SIGTERM should not actually terminate my app, but instead let it hang around for a bit? (Until Ingress updates its configurations). If I wanted actual 100% uptime during this time period, it's not possible with the default k8s? I really would rather not drop traffic if I can help it, and testing with ab definitely shows 502s with an nginx ingress controller.

I think this kind of issue should be prioritized. Otherwise I can try something like that label-based solution mentioned earlier, but then it feels like re-inventing the wheel and seems quite complex.

thockin commented 7 years ago

On Thu, Mar 30, 2017 at 9:39 PM, César Del Solar notifications@github.com wrote:

It is very odd to me that in order to not drop any traffic, SIGTERM should not actually terminate my app, but instead let it hang around for a bit? (Until Ingress updates its configurations). If I wanted actual 100% uptime during this time period, it's not possible with the default k8s? I really would rather not drop traffic if I can help it, and testing with ab definitely shows 502s with an nginx ingress controller.

I'm not sure what you're expressing here? Are you saying that the SIGTERM/SIGKILL sequence is distasteful? Or are you saying it doesn't work in some case?

I think this kind of issue should be prioritized. Otherwise I can try something like that label-based solution mentioned earlier, but then it feels like re-inventing the wheel and seems quite complex.

This was prioritized, and what resulted was the SIGTERM/SIGKILL, readiness flip, remove from LB pattern. Can you please clarify the problem you are experiencing?

domino14 commented 7 years ago

@thockin What I am saying is that after SIGTERM is sent, the Ingress still sends traffic to my dying pods for a few seconds, which then causes the end users to see 502 Gateway errors (using for example an Nginx ingress controller). A few people in this thread have mentioned something similar. I don't know of any workarounds, or how to implement that "labels" hack mentioned earlier.

How do I get a zero-downtime deploy?

kubernetes-retired / contrib

[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list #1140