kubernetes-retired / contrib


[ingress/controllers/nginx] Use Service Virtual IP instead of maintaining Pod list #1140

Closed edouardKaiser closed 8 years ago

edouardKaiser commented 8 years ago

Is there a way to tell the NGINX Ingress controller to use the Service Virtual IP address instead of maintaining the Pod IP addresses in the upstream configuration?

I couldn't find one. If it doesn't exist, I think it would be a good addition, because in the current situation, when we scale down a service, the Ingress controller does not work in harmony with the service's Replication Controller.

That means some requests to the Ingress controller will fail while waiting for the Ingress controller to be updated.

If we use the Service Virtual IP address, we can let kube-proxy do its job in harmony with the Replication Controller, and we get seamless downscaling.
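
(For reference: later releases of the kubernetes/ingress-nginx controller added an opt-in annotation that does roughly this, proxying to the Service's ClusterIP and letting kube-proxy pick a pod instead of listing pod IPs in the upstream block. A minimal sketch, assuming a controller version that supports the annotation; it was not available in the contrib-era controller this issue was filed against, and the names below are hypothetical.)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: my-app                     # hypothetical name
      annotations:
        # Proxy to the Service's ClusterIP and let kube-proxy choose the pod,
        # instead of listing the pod IPs in the nginx upstream block.
        nginx.ingress.kubernetes.io/service-upstream: "true"
    spec:
      rules:
      - host: my-app.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80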

thockin commented 7 years ago

On Fri, Mar 31, 2017 at 2:25 PM, César Del Solar notifications@github.com wrote:

@thockin What I am saying is that after SIGTERM is sent, the Ingress still sends traffic to my dying pods for a few seconds, which then causes the end users to see 502 Gateway errors (using for example an Nginx ingress controller). A few people in this thread have mentioned something similar. I don't know of any workarounds, or how to implement that "labels" hack mentioned earlier.

I'm not sure I understand, still, sorry. The expected timeline is:

1) A user or scaler or something has decided a particular pod needs to die. It sends a DELETE call.

2) The pod is marked as "being deleted".

3) Now you enter distributed-systems land. In parallel and totally asynchronously:

3a) kubelet sends SIGTERM to the Pod; the Pod ignores it, or starts counting connections, or something

3b) a controller removes the Pod from Endpoints

3c) kube-proxy removes the Pod from virtual LBs

3d) possibly external LBs are updated

4) Now time passes, generally O(10-seconds)

5) Kubelet sends SIGKILL if the pod is still alive

6) The pod goes away.

Where in there are you getting a 502?

How do I get a zero-downtime deploy?

The above should be zero-downtime, at the cost of non-zero realtime. You have to simply wait - wait for everyone upstream to shut their traffic flows off, wait for connections to die, etc.
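
(A minimal sketch of the main knob involved, with illustrative values only: the window between the DELETE call and the SIGKILL in step 5 is the pod's termination grace period, 30 seconds by default, and a preStop sleep is a common way to hold the SIGTERM back while the asynchronous updates in step 3 converge.)

    spec:
      # Total budget between the DELETE call and SIGKILL (step 5); the default is 30.
      terminationGracePeriodSeconds: 60
      containers:
      - name: app                       # hypothetical container
        image: my-app:latest
        lifecycle:
          preStop:
            exec:
              # Delay SIGTERM so Endpoints, kube-proxy and external LBs
              # (steps 3b-3d) have time to drop this pod first.
              command: ["sh", "-c", "sleep 15"]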

domino14 commented 7 years ago

Step 3. Kubelet sends SIGTERM to the Pod. Let's say the pod is running Gunicorn, which, upon receiving a SIGTERM, stops accepting new connections and shuts down gracefully, finishing its current requests up to a 30-second timeout.

During this time, since all of 3 is async, Ingress is still sending some traffic to this Gunicorn, which is now refusing all connections. The nginx Ingress controller I am using then returns a 502 Bad Gateway.

thockin commented 7 years ago

On Fri, Mar 31, 2017 at 3:42 PM, César Del Solar notifications@github.com wrote:

Step 3. Kubelet sends SIGTERM to the Pod. Let's say the pod is running Gunicorn, which, upon receiving a SIGTERM, stops accepting new connections and shuts down gracefully, finishing its current requests up to a 30-second timeout.

That's the wrong behavior. It's a distributed system. It should not stop accepting; it should just consider itself on notice. Or you can use a different signal: build your image with a different STOPSIGNAL and simply trap-and-ignore it.

During this time, since all of 3 is async, Ingress is still sending some traffic to this Gunicorn, which is now refusing all connections. The nginx Ingress controller I am using then returns a 502 Bad Gateway.

Yeah, as soon as you said the above, I got it. Closing the socket is the wrong behavior.

To do what I think you really want would require a different state of existence for pods - where we have received the DELETE command but haven't told the pod about it yet, but HAVE told everyone else about it...

domino14 commented 7 years ago

So then, what is the point of the SIGTERM? My preStop could literally just have a bash -c sleep 60?

I'm not sure how to make it so that gunicorn basically ignores the SIGTERM. I set up a hack with a readinessProbe that just tries to cat a file, and in the preStop, I delete the file and sleep 20 seconds. I can see that the pod sticks around for a while, but I still drop some requests. Should I sleep longer? Would it have to do with an Http Keep-Alive? I'm kind of lost as to the best way to implement this.

domino14 commented 7 years ago

I added a simple preStop hook:

        lifecycle:
          preStop:
            exec:
              # A hack to try to get 0% downtime during deploys. This should 
              # help ensure k8s eventually stops giving this node traffic.
              command: ["sh", "-c", "sleep 75"]

I watch my pods when I do an apply and deploy a new container version. Since I have maxUnavailable: 0 my pod doesn't start terminating until another pod is ready. Then, it starts terminating and the countdown timer above starts. Around this time is when I get the 502s:

192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 200 8 "-" "ApacheBench/2.3" "-"
2017/03/19 12:54:56 [notice] 1321#1321: signal process started
2017/03/19 12:54:56 [error] 1322#1322: *127179 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.64.1, server: vm.example.org, request: "POST /healthz/ HTTP/1.0", upstream: "http://172.17.0.4:8000/healthz/", host: "vm.example.org"
192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 502 173 "-" "ApacheBench/2.3" "-"
192.168.64.1 - - [19/Mar/2017:12:54:56 +0000] "POST /healthz/ HTTP/1.0" 200 8 "-" "ApacheBench/2.3" "-"

How come I get a connection refused? My container didn't get a SIGTERM until 75 seconds later (I saw that being logged), which was way after these 502s.

domino14 commented 7 years ago

By the way, I think I figured out how to do this; thanks for your help in showing me the rough sequence of events. I think what was happening is that between the new container coming up and it being ready to accept requests, a small amount of time passes, and nginx was directing requests to the new container too early. I added a hack. My readinessProbe now looks like:

        readinessProbe:
          exec:
            command:
            - cat
            - /opt/myapp/test_requirements.txt
          successThreshold: 1
          failureThreshold: 2
          periodSeconds: 5
          initialDelaySeconds: 5   # Important so that the container doesn't get traffic too soon

and the preStop looks like:

        lifecycle:
          preStop:
            exec:
              # A hack to try to get 0% downtime during deploys. This should 
              # help ensure k8s eventually stops giving this node traffic.
              command: ["sh", "-c", "rm /opt/myapp/test_requirements.txt && sleep 20"]

This can probably be tweaked a bit; for example, I should be using a real readinessProbe that hits my API, and then in the preStop, run a script that tells my app to start failing the probe, rather than deleting a file. The sleep is required too, so that the old container doesn't die before the Ingress stops directing traffic to it. The numbers are probably too high, but this works: I ran a bunch of tests and saw no dropped requests or "connection refused" logs in nginx, so I think it's doing the right thing. Thanks!
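
(A consolidated sketch of the same approach, with the pieces in one place; paths and durations are illustrative. Note that the preStop hook counts against terminationGracePeriodSeconds, so that value has to comfortably exceed the sleep plus the app's own shutdown time, or the kubelet will cut the hook short when the grace period runs out.)

    spec:
      terminationGracePeriodSeconds: 60          # > preStop sleep + app shutdown time
      containers:
      - name: myapp                              # hypothetical container
        image: myapp:latest
        readinessProbe:
          exec:
            command: ["cat", "/opt/myapp/ready"] # hypothetical flag file
          periodSeconds: 5
          failureThreshold: 2
          initialDelaySeconds: 5                 # don't take traffic too soon
        lifecycle:
          preStop:
            exec:
              # Fail readiness first, then wait for the Ingress / endpoints
              # to stop sending traffic before the kubelet sends SIGTERM.
              command: ["sh", "-c", "rm /opt/myapp/ready && sleep 20"]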

ababkov commented 7 years ago

What's wrong with the current process: the main issue is that SIGTERM conventionally means "finish what you're doing, don't start any new work, and gracefully exit."

At the moment it instead means: finish what you're doing, continue accepting and processing new work for 0-10 seconds (or however long the load balancers take to update), and somehow work out an appropriate time to gracefully exit before you're sent a SIGKILL.

I think the ideal situation here would be to:

  1. have each load balancing service register itself as listening to a container.
  2. modify the termination flow to first put the containers into an unready state.
  3. after each load balancing service sees the unready state of the containers and removes them from its pool, it should also remove its "I'm listening to this container" marker.
  4. When all markers are gone (or after some max configurable amount of time), continue with the termination cycle as per usual.

All of that being said, the above is probably overly intricate and not a great idea... Perhaps the best we can do is:

  1. Educate people better about how this works in the documentation (it really needs a page that diagrams all the different ways a container can be terminated, how termination state flows and potential repercussions).
  2. Add something to the config to allow for the equivalent of what @domino14 is doing above: delay the SIGTERM + subsequent SIGKILL by a load-balancer grace period, e.g. go into an immediate unready state 10 seconds before the attempted SIGTERM and then continue the lifecycle as expected (a hypothetical sketch of this follows below).
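
(A purely hypothetical sketch of what such a knob could look like in the pod spec, only to make the proposal concrete; no such field exists, and the closest approximation today is the preStop sleep shown earlier.)

    spec:
      terminationGracePeriodSeconds: 30
      # HYPOTHETICAL field, not part of any Kubernetes API: keep the pod unready
      # and out of Endpoints for this long before SIGTERM is sent.
      preTerminationDrainSeconds: 10
      containers:
      - name: app
        image: my-app:latest                     # hypothetical image
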
timoreimann commented 7 years ago

@ababkov I think your proposed ideal is very similar to what I described above a few months ago. Effectively, it's request draining, and I believe it's not too uncommon to have that with other networking components. I'd still think that an implementation could primarily reuse already existing concepts (state transitions/observations, LB rotation changes, timers, etc.) and be fairly loosely coupled. It would be "yet another waiting layer" on top (or in place) of the existing SIGTERM logic.

We are currently transitioning 100+ services over to Kubernetes. Having to ask every service owner to provide the kind of SIGTERM procedure needed by Kubernetes is quite an effort, especially for 3rd party software which we need to wrap in custom scripts running pre-stop hooks.

At this stage, opening a separate feature request to discuss the matter seems worthwhile to me.

ababkov commented 7 years ago

@timoreimann I agree though I'd ask that you take the lead on that given that you're in the process of transitioning and can probably more clearly explain the use case. I fully understand your current situation (we're about to be in the same position) and am happy to contribute.

The only thing I'm not fully sure of from a proposal standpoint is whether the feature should or can try to address anything beyond a preliminary "draining" status that lasts for a fixed, configurable period of time. The alternative, of course, is implementing a solution where things that route traffic to the container register themselves as listeners (via state updates) and acknowledge (via state updates) when a container has gone into draining status; once all acknowledgements have been logged, the container transitions to terminating status and everything proceeds as usual.

edouardKaiser commented 7 years ago

@timoreimann we have the same issue, we have to ask every service owner to implement a proper SIGTERM handler to make sure deployment is transparent to the users.

It's true that it would make things easier if the pod were simply flagged as not-ready first: give it time to be removed from behind the service, drain requests, and then SIGTERM...

timoreimann commented 7 years ago

@ababkov I'd be happy to be the one who files the feature request.

My understanding is that any solution requiring some means of extended coordination will be much harder to push through. @thockin and other members of the community have expressed in the past that too much coupling is rather dangerous in a distributed system like Kubernetes, and it's the reason why Kubernetes was designed differently. Personally, I think that does make a lot of sense.

I think we will have time to delve into these and other implementation details on the new ticket. Will drop pointers here once I can find some time to file it (after I made sure no one else had put forward a similar request yet).

Thanks for your input!

ababkov commented 7 years ago

@timoreimann based on that, I'd suggest that the request be logged as the simpler alternative of allowing a configurable amount of time during which the container sits in an unready/draining state prior to the remainder of the termination flow taking place (maybe in reference to this conversation).

That alone would make everything 10 times clearer and more straightforward.

domino14 commented 7 years ago

I think at the very least a few notes should be added in the documentation, as it's not clear that a special set of procedures is needed to get actual zero downtime.

timoreimann commented 7 years ago

@domino14 did you see the blog post @edouardKaiser referenced earlier?

timoreimann commented 7 years ago

@ababkov sounds good to me!

domino14 commented 7 years ago

Yep, I did see the post and it was helpful in showing me what was going on behind the scenes, but it's still a bit tough to go from that to knowing we need delays in several places, etc., to get zero-downtime deploys. That's why I suggested notes in the documentation.

In any case I'm a newcomer to K8S and I appreciate greatly the work done here. I'd submit a documentation patch if I knew the terminology better.

thockin commented 7 years ago

On Mar 31, 2017 11:52 PM, "César Del Solar" notifications@github.com wrote:

So then, what is the point of the SIGTERM? My preStop could literally just have a bash -c sleep 60?

SIGTERM is notification that your grace period has started. There are LOTS of kinds of apps out there - more than we can predict.

I'm not sure how to make it so that gunicorn basically ignores the SIGTERM. I set up a hack with a readinessProbe that just tries to cat a file, and in the preStop, I delete the file and sleep 20 seconds. I can see that the pod sticks around for a while, but I still drop some requests. Should I sleep longer? Would it have to do with an Http Keep-Alive? I'm kind of lost as to the best way to implement this.

There's a way to change the STOPSIGNAL in a Dockerfile, but we don't re-expose that in a pod. Not sure if that is configurable in Docker outside of the Dockerfile.

A preStop hook is maybe better, as you seem to have found, though still a little clumsy.

I'm open to exploring more intricate solutions, but I don't have bandwidth to drive the design right now. The proposed back-and-forth between LB providers would be quite a challenge to debug. I bet we can find something simpler...

keegancsmith commented 7 years ago

With regard to the original issue, could it be related to differences in connection strategies between the nginx ingress and kube-proxy? At least the userspace kube-proxy retries if dialing an endpoint fails (https://sourcegraph.com/github.com/kubernetes/kubernetes@v1.6.1/-/blob/pkg/proxy/userspace/proxysocket.go#L94:1-95:1). I'm not sure if the nginx ingress or the other kube-proxy modes have a similar strategy.

When gunicorn receives a SIGTERM, does it stop listening on the socket but continue to drain the open requests? In that case it should gracefully drain with kube-proxy userspace since kube-proxy will move on to the next endpoint if gunicorn does not accept the connection.

thockin commented 7 years ago

There's a race still, even with graceful termination.

t0: client deletes pod

t1: apiserver sets pod termination time

t2: kubelet wakes up and marks the pod not-ready

t3:

t4: kube-proxies wake up and remove pod from iptables

In between t3 and t4, an external LB could route traffic. The pod could have stopped accept()ing, which will make the external LB unhappy. Pods should continue to accept() traffic, but be ready to terminate when connections reach 0 (or whatever the heuristic is).

The alternative would be to force a sequencing between LBs and pod signalling, which we don't have today.


n1koo commented 7 years ago

Even if servers like (g)unicorn are doing the wrong thing, there's no guarantee which is faster: a) the endpoint update or b) the SIGTERM on the process. I think the endpoint update should be synchronous, and only after it's finished should we continue with SIGTERM (as discussed above). Any other solution has this timing problem.

Some notes about our testing and workarounds for now:

aledbf commented 7 years ago

There's another issue in nginx where a full upstream flip (e.g. pool [a,b] -> [c,d]) causes proxy_next_upstream to not work. Testable with, e.g., a rolling update with replicas: 3, maxSurge: 25%, maxUnavailable: 50%, where it ends up going from 4 pods (2 old in the pool, 2 new getting ready) to 2 new in the pool.

The focus after the next nginx controller release (0.9) will be avoiding nginx reloads for upstream updates.

warmchang commented 6 years ago

👍

Diggsey commented 6 years ago

Just ran into this issue - at the very least this should be better documented; I'm sure the vast majority of people using rolling updates believe that Kubernetes waits until a pod is out of rotation before sending SIGTERM, and this is not the case!

I also think that, even if it isn't implemented right away, sequencing SIGTERM to happen after a pod has been taken out of rotation should be the eventual goal. I don't think saying "gunicorn is doing the wrong thing" is a useful response; that's just Kubernetes redefining what it thinks these signals mean. Gunicorn and other servers behave in a fairly standard way with respect to signals, and Kubernetes should aim to be compatible with them rather than offloading the problem onto users.

simonkotwicz commented 5 years ago

@thockin you mentioned in an earlier comment:

During that time window (30 seconds by default), the Pod is considered “terminating” and will be removed from any Services by a controller (async). The Pod itself only needs to catch the SIGTERM, and start failing any readiness probes.

Why do you need to start failing the readiness probe?

When a Pod is marked as Terminating, the Endpoint controller observes the change and removes the pod from Endpoints.

Since this already happens as a result of the pod terminating, why do we also need to start failing the readiness probe, which (after it fails a certain number of times) will also result in the pod being removed as an endpoint of the service?

timoreimann commented 5 years ago

@simonkotwicz you do not need to start failing the readiness probe (at least not these days). I'm not sure if it used to be necessary, but it's certainly not the case anymore. Things work fine without that part.