kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

kubectl cordon causes downtime of ingress(nginx) #65013

Closed ChikkannaSuhas closed 4 years ago

ChikkannaSuhas commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In order to migrate to a new Kubernetes version on GKE, I cordoned (`kubectl cordon`) all nodes in the old node pool (1.8.9) so I could drain it and have the pods land on a new node pool (1.10.2). As soon as I cordoned all the nodes in the old node pool, the nginx ingress went down. The nginx controller pods were running fine on the cordoned node pool, but the cordon caused the nodes to be removed from the target pool of the ingress load balancer. The documents below recommend this same method as a best practice for a zero-downtime upgrade.

1. https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool
2. https://cloudplatform.googleblog.com/2018/06/Kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime.html

What you expected to happen:

I expected "kubectl cordon " not to have the node/s removed from the target pool of the loadbalancer because the nginx controller pods where running absolutely fine on these nodes.

How to reproduce it (as minimally and precisely as possible):

1. Cordon all the nodes in an old node pool that runs the nginx controller (see the sketch below).
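A minimal sketch of that step, assuming a GKE cluster where the old pool is selected by its `cloud.google.com/gke-nodepool` label; the pool name is a placeholder:

```bash
# Cordon every node in the old pool in one go. This is what triggers the outage:
# the cordoned nodes are dropped from the LB target pool on the next sync, even
# though the nginx controller pods on them are still Ready.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
done
```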

Anything else we need to know?:

Environment:

jhorwit2 commented 6 years ago

Technically, this isn't a Kubernetes bug, as it is working as intended; however, it does raise the question of why we remove unschedulable nodes even when they are marked Ready, given the checks we perform today to determine LB backend nodes (roughly: the node is Ready and is not marked unschedulable).

If any of those checks fail, the node is removed from the pool on the next sync.

Perhaps someone from @kubernetes/sig-network-bugs can shed some light.

/sig network

jhorwit2 commented 6 years ago

cc @MrHohn

This problem extends beyond GCP/ingress controllers to anyone using pools to upgrade nodes. A typical workflow is to cordon the whole node pool so that, as you drain each node, the pods don't land on another node in the pool that will shortly be drained as well. The current logic does not support that, and it complicates cluster administration because cordoning does not respect PDBs.

jhorwit2 commented 6 years ago

/remove-sig apps
/remove-sig gcp

(since the service controller falls under sig networking)

k8s-ci-robot commented 6 years ago

@jhorwit2: Those labels are not set on the issue: sig/

In response to [this](https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-396554801):

> /remove-sig apps
> /remove-sig gcp
>
> (since the service controller falls under sig networking)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
MrHohn commented 6 years ago

Ref https://github.com/kubernetes/kubernetes/issues/44997.

I thought the general practice for draining the old pool is to move over one node at a time in a rolling fashion. Shouldn't there always be some backend pods able to receive and serve traffic?

cc @nicksardo @thockin

jhorwit2 commented 6 years ago

@MrHohn WRT draining, it is recommended here to do it one node at a time; however, in real-world scenarios that can cause a lot of avoidable container churn if you don't first cordon off the nodes you plan to drain, since pods may otherwise be scheduled onto a node you'll be draining next.

According to the comments in the service controller, the unschedulable check seems to exist because it was how we differentiated masters in the past; however, we have labels for that now. We also have the service exclusion label (I can't find when/if that will be promoted, though, as I don't think there's a ticket for it).

pkdetlefsen commented 6 years ago

I'm having problems with this issue as well. The fact that nodes are removed from target pools when cordoned means that I sometimes get brief interruptions of the traffic to a service when I cordon a node.

I'm guessing the node isn't gracefully removed from the load balancer which results in some traffic being cut off trying to route through the recently cordoned node.

The same problem occasionally happens when deleting nodes / node pools or when the cluster autoscaler deletes a node.

We are running GKE ver. 1.10.2-gke.1. Our best defense against these issues has so far been to limit node changes as much as possible and not to use any autoscaling.

erkez commented 6 years ago

I posted a question about this on SO some weeks before this issue was created.

https://stackoverflow.com/questions/50488507/why-does-http-load-balancer-forwarding-rules-exclude-cordoned-gke-nodes/50515954

The suggested action was to use node taints (NoSchedule) instead of cordon; that way nodes will still be marked as Ready (a sketch follows below).
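A minimal sketch of that suggestion; the taint key and node name are placeholders:

```bash
# Taint instead of cordon: new pods stay off the node, but the node is not marked
# unschedulable, so the service controller keeps it as a load balancer backend.
kubectl taint nodes gke-old-pool-node-1 pool-rotation=true:NoSchedule

# Remove the taint again later:
kubectl taint nodes gke-old-pool-node-1 pool-rotation:NoSchedule-
```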

ChikkannaSuhas commented 6 years ago

@erkez Thank you for the link, it certainly helped. If that's the case (as mentioned in the accepted answer at the link above), then cordoning all the nodes of an old node pool should not be a recommended action, at least in the GKE upgrade docs.

joekohlsdorf commented 6 years ago

`kubectl help cordon` only says: "Mark node as unschedulable." I can't find anything more about what it does in the documentation, so I would expect it to behave exactly like a NoSchedule taint.

Why is removing running pods from the balancer working as intended? If it isn't a bug then the documentation is certainly unclear in this regard.

dwradcliffe commented 6 years ago

This just caused a major outage for us too.

I agree with the previous posts, while this may be working as the code intended, it does not follow the patterns and documentation k8s provides. When I cordon a node it should not impact the traffic going to the node or any pods on it. I should be able to gradually remove traffic by draining nodes/pods and respecting PDBs.

jhorwit2 commented 6 years ago

@thockin @smarterclayton this issue touches on the fact that we are removing masters as well as inadvertently removing cordoned nodes. The logic in the service controller mentions we only filter out unschedulable nodes because that's how masters were set up.

Chili-Man commented 6 years ago

We wanted to move all of the Kubernetes cluster nodes to different node instance types, so we started off by cordoning all of the nodes, which surprisingly caused an outage for us.

As @joekohlsdorf pointed out, the only documentation around cordoning is "Mark node as unschedulable", which I took to mean that new pods cannot be scheduled on those nodes. It was surprising to find out that cordoning not only prevents new pods from being scheduled on those nodes, it also causes the service controller to remove the unschedulable nodes from load-balancing pools. I understand that it's working as intended, but if that had also been documented as part of the cordon operation, I would have been able to avoid an outage.

thockin commented 6 years ago

I finally caught up to this bug. Fun one.

I agree that this is working-as-intended and that it stinks. We can do better.

As a short-term workaround, you can taint the nodes before cordoning one-by-one. That will stop new work from arriving, but will leave old work there. If we agree on an appropriately named taint, we can have nginx tolerate it by default. Am I missing any reason why this would not work? @bsalamat @mml

If that flies, we should fix docs (@chenopis those are GCP docs). I am happy to help with that.

Longer term maybe we can do something smarter. I am not sure what that would be, just now. Ideas?

dwradcliffe commented 6 years ago

Is there any reason we can't just leave the nodes in the load balancer pool when the node is cordoned? And when the pods are terminated the nodes can be removed?

I may be missing some context here?

thockin commented 6 years ago

The presumption is that a node that is unschedulable is going away, or needs some other repair. Using it as a target for load-balancers means there could be sessions routed THROUGH that node. Even as the pods are drained. Right up until the node suddenly dies, causing an outage of a different flavor.

What should really happen is that a node which becomes unschedulable starts draining connections. Once all the pods are moved off, the node can be removed from the pool entirely. I don't think any cloud providers, even GCP, handle this just right today.


thockin commented 6 years ago

Wait, if you brought up a new node pool, all those new nodes should be in the TargetPool. Cordoning the old node pool should only terminate the existing connections; new connections should be fine. Something doesn't add up...

thockin commented 6 years ago

Was this using `ExternalTrafficPolicy: Local`?

dwradcliffe commented 6 years ago

> Was this using `ExternalTrafficPolicy: Local`?

Yes

thockin commented 6 years ago

Right, well, that explains it.

So even draining won't do what you want - you need new connections to be allowed until you have capacity in the new pool.

The good news is that you can do this entirely yourself, with taints. We just need nginx to tolerate the taint, which you can also specify yourself. The only real alternative I see would be, as you said, to leave nodes in the LB pool until the node dies. Given that we never specified this, it is probably different on every provider, which would make it hard to change (especially if the provider already provides clean drains while unschedulable).

bsalamat commented 6 years ago

> As a short-term workaround, you can taint the nodes before cordoning one-by-one. That will stop new work from arriving, but will leave old work there. If we agree on an appropriately named taint, we can have nginx tolerate it by default. Am I missing any reason why this would not work?

@thockin IIUC, we don't want new pods to land on the tainted/cordoned nodes. In that case, we can place "NoSchedule" taint on the nodes. We don't need any tolerations on nginx. This will prevent new instances of nginx from landing on the tainted nodes, but the existing ones will keep running there.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

ChikkannaSuhas commented 5 years ago

/remove-lifecycle stale

jkemp101 commented 5 years ago

This just surprised us also. I understand the technical reasons behind it, but the command name cordon and the associated documentation are misleading. To me it isn't intuitive that cordon does not remove compute (pods) but does remove some networking. I would expect cordon to keep new pods from getting scheduled onto the node and to keep the node from being added to new services/ELBs/etc. The drain command would then cause pods to be removed and also take the node out of networking paths. We almost need a new, clearly defined "offline" state for a node.

olvesh commented 5 years ago

My main issue is with the docs, including the GKE docs: they should explain this pitfall. Once I became aware of the issue, I understood why it works the way it does.

A simple workaround is to taint the nodes you are going to rotate first.

crunchie84 commented 5 years ago

Just chiming in; as a GKE customer we have today been bitten by this issue in production. The documentation of kubectl cordon (in the context of GKE) is not extensive enough for us to have been aware of this issue. This should be addressed.

Luckily the outage was not severe because we were using the node-by-node approach (and accepted the container churn this would cause). We explicitly took this approach because our previous nodepool upgrade was a total outage also caused by kubectl cordon in a different way.

(For those interested in our earlier total outage, not related to this issue: kubectl cordon also causes the kube-dns-autoscaler to recalculate the number of schedulable nodes, which in our case scaled the number of kube-dns pods from 110 back to 2, resulting in a major internal DNS outage.)

Be warned that kubectl cordon for a nodepool upgrade can have a lot of unwanted & unexpected side effects.

kevinkim9264 commented 5 years ago

It looks like this happens only on AWS and GCP but not on Azure? Can anyone confirm this?

I just tested in both AWS and Azure, and while I see logs like

aws_loadbalancer.go:1361] Instances removed from load-balancer

in AWS Kubernetes, I do not see any such logs in Azure. Is that intended? If so, isn't the behavior different for each cloud provider?

I.e., cordoning a node in AWS Kubernetes prevents the load balancer from routing traffic to the affected node, but cordoning a node in Azure Kubernetes still lets the load balancer route traffic to it. Is that intentional?

tklovett commented 5 years ago

Besides running a test like @kevinkim9264's above, how would I determine what the behavior is on AWS? I haven't found anything in the node administration docs or the aws-load-balancer Service docs that implies cordon would shift traffic away from a Ready Pod. In fact, the doc for Manual Node Administration explicitly states, emphasis mine:

> Marking a node as unschedulable prevents new pods from being scheduled to that node, but does not affect any existing pods on the node. This is useful as a preparatory step before a node reboot, etc. For example, to mark a node unschedulable, run this command: `kubectl cordon $NODENAME`

If I hadn't come across this issue due to reports from GKE, I suppose it would only have been a matter of time until we had a high-impact production outage. There's a disconnect between the Kubernetes API and the LoadBalancer as to whether Pods running on a cordoned node are Ready. From all the docs I've seen, I would expect either that a) cordon would evict Pods to a Ready node before setting the instance as OutOfService, or b) the LoadBalancer would not equate Ready,SchedulingDisabled with OutOfService.

imduffy15 commented 5 years ago

Hi All,

What approaches are people using to work around this?

From reading the comments I was hoping to follow the custom taint approach. However, I've noticed that kubectl drain executes a cordon, making it difficult to drain the node safely with this approach.

brianirish commented 5 years ago

I was lucky in that I experienced this issue while cordoning our testing environment, so we didn't experience a production outage.

@imduffy15 I think what I'm going to try is the following... it's going to be time-consuming but I think it's the safest bet.

I'm hopeful to try this approach this afternoon, I'll comment back here if I do!

imduffy15 commented 5 years ago

> I was lucky in that I experienced this issue while cordoning our testing environment, so we didn't experience a production outage.

Same as myself! Pity neither of us have a story for https://k8s.af/

> I'm hopeful to try this approach this afternoon, I'll comment back here if I do!

Since my comment this morning I've been trying a few approaches, and I've ended up with the following:

Ensure your nginx ingress is deployed with tolerations:

```yaml
nginx-ingress:
  rbac:
    create: true
  controller:
    kind: DaemonSet
    service:
      externalTrafficPolicy: Local
    publishService:
      enabled: true
    stats:
      enabled: true
    metrics:
      enabled: true
    tolerations:
    - effect: NoExecute
      operator: Exists
    - effect: NoSchedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/network-unavailable
      operator: Exists
```
imduffy15 commented 5 years ago

Hi All,

I'm at a bit of a loss on this one.

I've started using custom taints to drain the nodes without getting them removed from the load balancer.

My application can hold an HTTP connection open for up to 600 seconds. I'm using nginx as a DaemonSet, have it configured with a keepalive timeout of 620s, and have ExternalTrafficPolicy: Cluster.

I drain the application first using a node-upgrade=true:NoExecute taint on each node. I have nginx on each node tolerate this taint, and I remove nginx separately at a later point using a different taint.

I'm waiting 660s between drains of each node for each taint to ensure nothing is dropped; this all works fine as expected (a sketch of the sequence is below).
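A sketch of that sequence as I understand it; the taint keys, node name, and timings are placeholders, not values taken from this thread:

```bash
NODE="gke-old-pool-node-1"   # placeholder node name

# 1) Evict the application pods with a taint that nginx tolerates, so nginx keeps
#    serving on this node while the app pods move elsewhere.
kubectl taint nodes "$NODE" node-upgrade=true:NoExecute

# Wait longer than the LB/nginx keepalive (620s) plus in-flight request time.
sleep 660

# 2) Later, push nginx itself off with a second taint it does not tolerate.
kubectl taint nodes "$NODE" nginx-upgrade=true:NoExecute
sleep 660
```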

However, once I kill the node pool or do a kubectl drain after the node is empty of all my workloads plus the nginx ingress, I start seeing connection issues. I'm a bit lost as to where to start debugging them; my best guess is something to do with kube-proxy.

Can anyone shed some light on this? @thockin is there anything on the Google Layer 4 load balancer I'm missing here? Is there some way to correctly drain kube-proxy (or does that even make sense?)

mml commented 5 years ago

This doesn't address the network issues you're seeing, but for those trying to "roll their own cordon" using taints:

Using a NoExecute taint to "drain" the node won't be orderly and won't respect any PDBs. You want to replicate the logic from kubectl drain, which tries to evict every pod in a loop (a rough sketch follows below). (I realize this is a burdensome idea, but it is also the safest idea.)
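For those rolling their own, a rough sketch of such an eviction loop. It assumes `kubectl create --raw` is available and accepts stdin, uses the `policy/v1beta1` Eviction API of that era (check the API version on your cluster), and a real implementation would also skip DaemonSet/mirror pods and retry when a PDB temporarily blocks an eviction, the way `kubectl drain` does:

```bash
NODE="gke-old-pool-node-1"   # placeholder node name

# Evict (not delete) every pod on the node; the eviction subresource is what
# enforces PodDisruptionBudgets, unlike a NoExecute taint or a plain delete.
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns pod; do
  cat <<EOF | kubectl create -f - --raw "/api/v1/namespaces/${ns}/pods/${pod}/eviction"
{"apiVersion": "policy/v1beta1", "kind": "Eviction", "metadata": {"name": "${pod}", "namespace": "${ns}"}}
EOF
done
```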

metral commented 5 years ago

To add on to https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-515228948:

One approach that does not use node taints to force a migration is to change the nodeSelectorTerms in the node affinity requirements of your app. A change in node selector (e.g. based on instance type), along with a PDB (e.g. 66%, i.e. a minimum of 2 out of 3 replicas available), should migrate you over fine; a sketch follows below. This worked out better for me than draining or adding taints to nodes to force apps out.

After the migration completes, you should be able to kubectl drain & delete the nodes, and delete the node pool.
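A rough sketch of that approach under assumed names; the app label, instance type, and PDB numbers are placeholders:

```bash
# A PDB that keeps at least 2 of 3 replicas available while pods are rescheduled.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
EOF

# Then add a node-affinity requirement to the Deployment's pod template so new pods
# only land on the new instance type, and let the resulting rollout migrate them:
#
#   affinity:
#     nodeAffinity:
#       requiredDuringSchedulingIgnoredDuringExecution:
#         nodeSelectorTerms:
#         - matchExpressions:
#           - key: beta.kubernetes.io/instance-type   # instance-type node label of that era
#             operator: In
#             values: ["n1-standard-4"]
```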

> However, once I kill the node pool or do a kubectl drain after the node is empty of all my workloads plus the nginx ingress, I start seeing connection issues. I'm a bit lost as to where to start debugging them; my best guess is something to do with kube-proxy.

FWIW, I too saw connection issues during load testing of a migration, and at least on my setup it was due to using an AWS NLB for the NGINX Ingress Controller. Switching to a classic ELB instead worked, as the NGINX LB Service seems, by my account, to hit some sort of sync issue when using NLBs.

self-plug: I happened to recently cover a similar topic in a migration blog post that you may find relevant. Hope it helps

imduffy15 commented 5 years ago

> One approach that does not use node taints to force a migration is to change the nodeSelectorTerms in the node affinity requirements of your app.

For better or worse, in my case the pods behind an ingress resource use node affinity to preferably avoid being scheduled onto the same node. Originally this was done to ensure traffic is balanced across the different nginx ingresses. It has had a positive effect in this case too.

> FWIW, I too saw connection issues during load testing of a migration, and at least on my setup it was due to using an AWS NLB for the NGINX Ingress Controller. Switching to a classic ELB instead worked, as the NGINX LB Service seems, by my account, to hit some sort of sync issue when using NLBs.

Very interesting! Thank you for sharing, I'm not sure that swapping out the Layer4 GCE load balancer is an option here, but you have me thinking. I'm wondering if the TCP keepalive between the LB and nginx is potentially keeping connections open. https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340 , https://cloud.google.com/load-balancing/docs/https/#timeouts_and_retries and https://cloud.google.com/load-balancing/docs/https/?source=post_page--------------------------- suggest that there may be a 600 second TCP keepalive going on. Will retest doing a sleep of 620s after nginx removal and before node drain.

daenney commented 5 years ago

Has anyone come up with a strategy to safely move workloads between two pools? Reading this bug, it seems there's no way to reliably do this without causing an outage if you follow the recommendation of cordon + drain. In my case I'm not using an nginx ingress but services with type LoadBalancer.

From what I can see, adding NoSchedule to a node causes the service controller to consider the node not ready and thus to remove it from a target group. This would mean that between NoSchedule and NoExecute there is no way to add a taint to the node that will leave it in the LB until it's drained. Using PreferNoSchedule doesn't work in this case, since the thing I'm draining might just come back; I'm also not sure whether the service controller still considers the node ready in that case, but I believe it does.

The one strategy I can think of is using labels on the node and an anti-affinity on the pod: you add a label to the node that pushes the pod away, and then delete the pod. As long as the affinity rule uses requiredDuringSchedulingIgnoredDuringExecution, this should be safe to do in preparation for a migration (a sketch follows below). Once the pods have been deleted you can then drain the nodes.
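A sketch of that label-plus-affinity idea; the label key, node name, and pod name are made up for illustration:

```bash
# The pod template requires nodes WITHOUT an "evacuate" label. Because the rule is
# requiredDuringScheduling*IgnoredDuringExecution*, labeling a node afterwards does
# not evict anything; it only steers replacement pods to other nodes.
#
#   affinity:
#     nodeAffinity:
#       requiredDuringSchedulingIgnoredDuringExecution:
#         nodeSelectorTerms:
#         - matchExpressions:
#           - key: evacuate
#             operator: DoesNotExist

kubectl label node old-node-1 evacuate=true   # future scheduling avoids this node
kubectl delete pod my-app-abc123              # the replacement lands on an unlabeled node
# Once all pods have moved, the now-empty node can be drained and deleted safely.
```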

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

thockin commented 4 years ago

/lifecycle frozen
/remove-lifecycle stale
/remove-lifecycle rotten

olvesh commented 4 years ago

@daenney - As I mentioned in https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-455222929, I set up a script that first taints all nodes in the old node pool and then cordons and drains one node at a time (a sketch is below). If your ingress services are redundant this should work well.
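A sketch of that kind of script, with placeholder pool label, taint key, and timings:

```bash
#!/usr/bin/env bash
set -euo pipefail

NODES=$(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool \
  -o jsonpath='{.items[*].metadata.name}')

# 1) Taint every old node first so evicted pods cannot land on a node that is
#    itself about to be rotated. The nodes stay Ready and stay behind the LB.
for node in $NODES; do
  kubectl taint nodes "$node" pool-rotation=true:NoSchedule --overwrite
done

# 2) Then rotate one node at a time; drain respects PDBs, and only one node at a
#    time drops out of the load balancer (fine if the ingress is redundant).
for node in $NODES; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data --timeout=10m
  sleep 120   # give the cloud LB time to settle before touching the next node
done
```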

blakebarnett commented 4 years ago

Is there a plan for cordon behavior to improve on GCP? We have relied on this behavior for years on our AWS clusters and it was very surprising to find out.

smarterclayton commented 4 years ago

This is a bug with service load balancers, not something we should document. We defined a label in https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/2019-07-16-node-role-label-use.md#service-load-balancer that cluster admins can use; Unschedulable has nothing to do with SLB inclusion.

That label is node.kubernetes.io/exclude-from-external-load-balancers, which is beta and on by default in 1.19.
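For reference, a sketch of using that label; the node name is a placeholder, and on clusters before 1.19 this sits behind the ServiceNodeExclusion feature gate (older versions used a different alpha label):

```bash
# Exclude the node from external load balancer backends without cordoning it.
kubectl label node gke-old-pool-node-1 node.kubernetes.io/exclude-from-external-load-balancers=true

# Put it back in rotation later by removing the label.
kubectl label node gke-old-pool-node-1 node.kubernetes.io/exclude-from-external-load-balancers-
```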

smarterclayton commented 4 years ago

The presumption is that a node that is unschedulable is going away, or needs some other repair.

Disagree. Unschedulable means "no new workloads". It has nothing to do with existing workloads; that is why we have drain (but we should have something in kube that communicates that nodes are draining).

I have opened #90823 to remove this check - we have an alternate approach now and the old logic is wrong (the comment in the code even says "this is for masters!" and that's wrong now too as part of the kep).

danehans commented 4 years ago

> It has nothing to do with existing workloads; that is why we have drain (but we should have something in kube that communicates that nodes are draining).

A node status condition?

smarterclayton commented 4 years ago

Possibly, or a taint, or a label. Something concrete.

droslean commented 4 years ago

/milestone v1.19

AlphaWong commented 4 years ago

@dictcp