Closed: ChikkannaSuhas closed this issue 4 years ago.
Technically this isn't a Kubernetes bug, as it's working as intended; however, it does beg the question of why we remove unschedulable nodes if they're marked as Ready, since today we perform the following checks to determine LB backend nodes: is the node Ready, is it schedulable, and is it not a master? If any of those is a no, the node is removed from the pool on the next sync.
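For anyone triaging, a rough way to eyeball those three properties from kubectl (this is only an illustration of what the controller looks at, not its actual code; `<node>` is a placeholder):

```sh
# Ready condition must be True
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# spec.unschedulable must be empty/false (kubectl cordon sets it to true)
kubectl get node <node> -o jsonpath='{.spec.unschedulable}'
# the node should not carry a master role label
kubectl get node <node> --show-labels
```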
Perhaps someone from @kubernetes/sig-network-bugs can shed some light.
/sig network
cc @MrHohn
This problem extends beyond GCP/the ingress controller to anyone using node pools to upgrade nodes. A typical workflow is to cordon a node pool so that, as you drain each node, pods don't land on another node in the pool that will shortly be drained as well. The current logic does not support that and complicates cluster administration, because cordoning does not respect PDBs.
/remove-sig apps /remove-sig gcp
(since the service controller falls under sig networking)
@jhorwit2: Those labels are not set on the issue: sig/
Ref https://github.com/kubernetes/kubernetes/issues/44997.
I thought a general practice for draining the old pool is to move over one node at a time in a rolling fashion. Shouldn't there always be certain backend pods being able to receive & serve traffic?
cc @nicksardo @thockin
@MrHohn WRT draining, it is recommended here to do it one node at a time; however, in real-world scenarios draining one node at a time causes a lot of avoidable container churn unless you first cordon off the nodes you plan to drain, so that pods aren't scheduled onto a node you'll be draining next.
According to the comments in the service controller, the unschedulable check exists because it was how we differentiated masters in the past; however, we have labels for that now. We also have the service exclusion label (I can't find when/if that'll be promoted, though, as I don't think there's a ticket for it).
I'm having problems with this issue as well. The fact that nodes are removed from target pools when cordoned means that I sometimes get brief interruptions of the traffic to a service when I cordon a node.
I'm guessing the node isn't gracefully removed from the load balancer which results in some traffic being cut off trying to route through the recently cordoned node.
The same problem occasionally happens when deleting nodes / node pools or when the cluster autoscaler deletes a node.
We are running GKE ver. 1.10.2-gke.1. Our best defence against these issues has so far been to limit node changes as much as possible and not use any autoscaling.
I posted a question regarding this issue on SO some weeks before this issue was created.
The suggested action was to use node taints (NoSchedule) instead of cordon; that way nodes will still be marked as Ready.
@erkez Thank you for the link, it certainly helped. Perhaps, if that's the case (as mentioned in the accepted answer of the above link), then cordoning all the nodes of the old node pool should not be a recommended action, at least in the GKE upgrade docs.
kubectl help cordon only says: "Mark node as unschedulable." I can't find what more it does in the documentation, so I would expect it to have exactly the same behaviour as a NoSchedule taint.
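For what it's worth, here's a quick way to see the difference on a test node (a sketch only; maintenance is an arbitrary taint key and my-node a placeholder). Cordon flips spec.unschedulable, which is the field the service controller keys off, while a plain taint does not:

```sh
NODE=my-node

kubectl cordon "$NODE"
kubectl get node "$NODE" -o jsonpath='{.spec.unschedulable}'   # -> true, so the node drops out of the LB pool
kubectl uncordon "$NODE"

kubectl taint nodes "$NODE" maintenance=true:NoSchedule        # scheduler stops placing new pods
kubectl get node "$NODE" -o jsonpath='{.spec.unschedulable}'   # -> empty, so the node stays in the LB pool
kubectl taint nodes "$NODE" maintenance:NoSchedule-            # clean up
```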
Why is removing running pods from the balancer working as intended? If it isn't a bug then the documentation is certainly unclear in this regard.
This just caused a major outage for us too.
I agree with the previous posts: while this may be working as the code intended, it does not follow the patterns and documentation k8s provides. When I cordon a node, it should not impact the traffic going to the node or any pods on it. I should be able to gradually remove traffic by draining nodes/pods while respecting PDBs.
@thockin @smarterclayton this issue touches on the fact that we filter out masters and, inadvertently, cordoned nodes as well. The logic in the service controller notes that we only filter out unschedulable nodes because that's how masters used to be set up.
We wanted to replace all of the Kubernetes cluster nodes with different node instance types, so we started by cordoning all of the nodes, which surprisingly caused an outage for us.
As @joekohlsdorf pointed out, the only documentation around cordoning is "Mark node as unschedulable", which I took to mean that new pods cannot be scheduled on those nodes. It was surprising to find out that it not only stops new pods from being scheduled on those nodes, but also causes the service controller to remove the unschedulable nodes from load-balancing pools. I understand that it's working as intended, but if that had also been documented as part of the cordon operation, I would have been able to avoid an outage.
I finally caught up to this bug. Fun one.
I agree that this is working-as-intended and that it stinks. We can do better.
As a short-term workaround, you can taint the nodes before cordoning one-by-one. That will stop new work from arriving, but will leave old work there. If we agree on an appropriately named taint, we can have nginx tolerate it by default. Am I missing any reason why this would not work? @bsalamat @mml
If that flies, we should fix docs (@chenopis those are GCP docs). I am happy to help with that.
Longer term maybe we can do something smarter. I am not sure what that would be, just now. Ideas?
Is there any reason we can't just leave the nodes in the load balancer pool when the node is cordoned? And when the pods are terminated the nodes can be removed?
I may be missing some context here?
The presumption is that a node that is unschedulable is going away, or needs some other repair. Using it as a target for load-balancers means there could be sessions routed THROUGH that node. Even as the pods are drained. Right up until the node suddenly dies, causing an outage of a different flavor.
What should really happen is that a node which becomes unschedulable starts draining connections. Once all the pods are moved off, the node can be removed from the pool entirely. I don't think any cloud providers, even GCP, handle this just right today.
Wait, if you brought up a new nodepool, all those new nodes should be in the TargetPool. Cordoning the old nodepool should only terminate the existing connections, new connections should be fine. Something doesn't add up...
Was this using ExternalTrafficPolicy: Local?
Was this using ExternalTrafficPolicy: Local?
Yes
Right, well, that explains it.
So even draining won't do what you want - you need new connections to be allowed until you have capacity in the new pool.
The good news is that you can do this entirely yourself, with taints. We just need nginx to tolerate the taint, which you can also specify yourself. The only real alternative I see would be, as you said, to leave nodes in the LB pool until the node dies. Given that we never specified this, it is probably different on every provider, which would make it hard to change (especially if the provider already provides clean drains while unschedulable).
As a short-term workaround, you can taint the nodes before cordoning one-by-one. That will stop new work from arriving, but will leave old work there. If we agree on an appropriately named taint, we can have nginx tolerate it by default. Am I missing any reason why this would not work?
@thockin IIUC, we don't want new pods to land on the tainted/cordoned nodes. In that case, we can place "NoSchedule" taint on the nodes. We don't need any tolerations on nginx. This will prevent new instances of nginx from landing on the tainted nodes, but the existing ones will keep running there.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
This just surprised us also. I understand the technical reasons behind it, but the command name cordon and its associated documentation are misleading. To me it isn't intuitive that cordon does not remove compute (pods) but does remove some networking. I would expect cordon to keep new pods from being scheduled on the node and to keep the node from being added to new services/ELBs/etc. The drain command would then cause pods to be removed and also take the node out of networking paths. We almost need a new, clearly defined offline state for a node.
My main issue is with the docs, and also the GKE docs: they should explain this pitfall. After I became aware of the issue I understand why it works the way it does.
A simple workaround is to taint the nodes you are going to rotate first.
Just chiming in; we as a GKE customer were bitten by this issue in production today. The documentation of kubectl cordon (in the context of GKE) is not extensive enough for us to have been aware of this issue. This should be addressed.
Luckily the outage was not severe, because we were using the node-by-node approach (and accepted the container churn this would cause). We explicitly took this approach because our previous node pool upgrade was a total outage, also caused by kubectl cordon in a different way.
(For those interested in our total outage, not related to this issue:)
kubectl cordon also results in the kube-dns-autoscaler recalculating the number of schedulable nodes, which scaled the number of kube-dns pods in our cluster down (in our case from 110 back to 2), resulting in a major internal DNS outage for us.
Be warned that kubectl cordon for a node pool upgrade can have a lot of unwanted & unexpected side effects.
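If you want to check your exposure before cordoning a whole pool, the autoscaler's parameters are visible in its ConfigMap (the kube-dns-autoscaler name below is what GKE ships; adjust if your deployment differs):

```sh
# The "linear" key holds JSON like {"coresPerReplica":256,"nodesPerReplica":16,...},
# which determines how far kube-dns scales down once nodes stop being counted.
kubectl -n kube-system get configmap kube-dns-autoscaler -o yaml
```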
It looks like this happens only on AWS and GCP but not on Azure? Can anyone confirm this?
I just tested in both AWS and Azure, and while I see logs like
aws_loadbalancer.go:1361] Instances removed from load-balancer
in AWS Kubernetes, I do not see any logs like that in Azure. Is that intended? If so, isn't the behavior different for each cloud provider?
I.e., cordoning a node in AWS Kubernetes prevents the load balancer from routing traffic to the affected node, but cordoning a node in Azure Kubernetes will still let the load balancer route traffic to the affected node. Is that intentional?
Besides running a test like @kevinkim9264's above, how would I determine what the behavior is on AWS? I haven't found anything in the node administration docs or the aws-load-balancer Service docs that would imply that cordon shifts traffic away from a Ready Pod. In fact, the doc for Manual Node Administration explicitly states, emphasis mine:
Marking a node as unschedulable prevents new pods from being scheduled to that node, but does not affect any existing pods on the node. This is useful as a preparatory step before a node reboot, etc. For example, to mark a node unschedulable, run this command: `kubectl cordon $NODENAME`
If I hadn't come across this issue due to the reports from GKE, I suppose it would only have been a matter of time until we had a high-impact production outage. There's a disconnect between the Kubernetes API and the LoadBalancer as to whether Pods running on a cordoned node are Ready. From all the docs I've seen, I would expect that either a) cordon would evict Pods over to a Ready Node before setting the instance as OutOfService, or b) the LoadBalancer would not equate Ready,SchedulingDisabled with OutOfService.
Hi All,
What approaches are people using to work around this?
From reading the comments I was hoping to follow the custom-taint approach. However, I've noticed kubectl drain executes a cordon, making it difficult to drain the node safely with this approach.
I was lucky in that I experienced this issue while cordoning our testing environment, so we didn't experience a production outage.
@imduffy15 I think what I'm going to try is the following... it's going to be time-consuming but I think it's the safest bet.
kubectl drain executes a cordon first, so that's a no-go. This is the time-consuming part: for each "legacy" node I am going to kubectl delete pod each pod, and let my PDBs take care of maintaining the minimum number of pods. I'm hopeful to try this approach this afternoon; I'll comment back here if I do!
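For reference, a sketch of that per-node loop (node name is a placeholder; note that plain kubectl delete pod is not gated by PDBs, which is what the drain/eviction discussion further down is about):

```sh
NODE=legacy-node-1
# list every pod on the legacy node, then delete them one by one and let the
# owning controllers (Deployments/ReplicaSets) recreate them on schedulable nodes
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns pod; do
  kubectl -n "$ns" delete pod "$pod"
done
```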
I was lucky in that I experienced this issue while cordoning our testing environment, so we didn't experience a production outage.
Same as myself! Pity neither of us have a story for https://k8s.af/
I'm hopeful to try this approach this afternoon, I'll comment back here if I do!
Since my comment this morning I've been trying a few approaches, and I've ended up with the following.
Taint the nodes in the old pool:
kubectl get nodes | grep "pool-1" | awk '{print $1}' | xargs -I {} kubectl taint nodes {} node-upgrading=true:NoSchedule
Ensure your nginx ingress is deployed with tolerations:
nginx-ingress:
  rbac:
    create: true
  controller:
    kind: DaemonSet
    service:
      externalTrafficPolicy: Local
    publishService:
      enabled: true
    stats:
      enabled: true
    metrics:
      enabled: true
    tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/disk-pressure
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/memory-pressure
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/unschedulable
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/network-unavailable
        operator: Exists
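Once the replacement pool has capacity and traffic has moved over, the taint can be removed again (mirroring the taint command at the start, same node-upgrading key):

```sh
kubectl get nodes | grep "pool-1" | awk '{print $1}' | xargs -I {} kubectl taint nodes {} node-upgrading:NoSchedule-
```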
Hi All,
I'm at a bit of a loss on this one.
I've started using custom taints to drain the nodes without getting them removed from the load balancer.
My application can hold an HTTP connection open for up to 600 seconds. I'm using nginx as a DaemonSet, have it configured with a keepalive timeout of 620s, and have ExternalTrafficPolicy: Cluster.
I drain the application first, using a node-upgrade=true:NoExecute taint on each node. I have nginx on each node tolerate this taint and remove it separately at a later point using a different taint.
I'm waiting 660s between drains of each node for each taint to ensure nothing is dropped; this all works fine as expected.
However, once I kill the node pool, or do a kubectl drain after the node is empty of all my stuff plus the nginx ingress, I start seeing connection issues. I'm a bit lost on where to start debugging them; my best guess is something to do with kube-proxy.
Can anyone shed some light on this? @thockin is there anything on the Google layer 4 load balancer I'm missing here? Is there some way to correctly drain kube-proxy (or does that even make sense)?
This doesn't address the network issues you're seeing, but for those trying to "roll their own cordon" using taints: using a NoExecute taint to "drain" the node won't be orderly and won't respect any PDBs. You want to replicate the logic from kubectl drain, which tries to evict every pod in a loop. (I realize this is a burdensome idea, but it is also the safest idea.)
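For anyone rolling this by hand, a minimal sketch of evicting a single pod through the Eviction API, which, unlike kubectl delete pod, is blocked by PDBs (namespace and pod name are placeholders; newer clusters use policy/v1 instead of policy/v1beta1):

```sh
kubectl proxy --port=8001 &
curl -s -X POST \
  http://localhost:8001/api/v1/namespaces/default/pods/my-pod/eviction \
  -H 'Content-Type: application/json' \
  -d '{"apiVersion":"policy/v1beta1","kind":"Eviction","metadata":{"name":"my-pod","namespace":"default"}}'
# A 429 response means a PDB is currently blocking the eviction; retry later.
```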
To add on to https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-515228948:
One approach that does not use node taints to force a migration is to change the nodeSelectorTerms in your app's node affinity requirements. A change in the node selector (e.g. based on instance type), along with a PDB (e.g. 66%, or min 2 out of 3 replicas available), should migrate you over fine. This worked out better for me than draining or adding taints to nodes to force-push apps out.
After the migration completes, you should be able to kubectl drain & delete the nodes, and delete the node pool.
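A hedged sketch of what that looks like (label key/values and names are examples; on GKE each node carries a cloud.google.com/gke-nodepool label you can select on):

```yaml
# In the app's pod template: require the new pool.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-nodepool
              operator: In
              values: ["new-pool"]
---
# And a PDB so the rollout never dips below the floor you can tolerate.
apiVersion: policy/v1beta1   # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```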
However, once I kill the node pool or do a kubectl drain after the node is empty of all my stuff plus nginx ingress I start seeing connection issues. I'm a bit lost where to go at debugging them, best guess is something to do with kube-proxy
FWIW, I too saw connection issues during load testing of a migration, and at least on my setup it was due to using an AWS NLB for the NGINX Ingress Controller. Switching to a classic ELB instead worked, as the NGINX LB Service seems to hit some sort of sync issue, by my account, when using NLBs.
Self-plug: I happened to recently cover a similar topic in a migration blog post that you may find relevant. Hope it helps.
One approach that does not use node taints to force migrate, is to change nodeSelectorTerms for the node affinity requirements of your app.
For better or worse, in my case the pods that are behind an ingress resource use node affinity to preferably not be scheduled onto the same node. Originally this was done to ensure traffic is balanced across the different nginx ingresses. It has had a positive effect in this case too.
FWIW, I too saw connection issues during load testing of a migration, and at least on my setup it was due to using an AWS NLB for the NGINX Ingress Controller. Switching to a classic ELB instead worked, as the NGINX LB Service seems to hit some sort of sync issue by my accounts when using NLBs.
Very interesting! Thank you for sharing. I'm not sure that swapping out the layer 4 GCE load balancer is an option here, but you have me thinking. I'm wondering if the TCP keepalive between the LB and nginx is potentially keeping connections open. https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340, https://cloud.google.com/load-balancing/docs/https/#timeouts_and_retries and https://cloud.google.com/load-balancing/docs/https/ suggest that there may be a 600-second TCP keepalive going on. I will retest with a sleep of 620s after the nginx removal and before the node drain.
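For reference, the nginx side of that keepalive can be raised via the ingress-nginx controller ConfigMap (a sketch; the ConfigMap name and namespace depend on how the chart was installed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
data:
  keep-alive: "620"   # client keepalive_timeout in seconds, above the LB's ~600s
```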
Has anyone come up with a strategy to safely move workloads between two pools? Reading this bug, it seems there's no way to reliably do this without causing an outage if you follow the recommendation of cordon + drain. In my case I'm not using an nginx ingress but Services of type LoadBalancer.
From what I can see, adding NoSchedule to a node causes the Service Controller to consider the node not ready and thus remove it from a target group. This would mean that between NoSchedule and NoExecute there is no way to add a taint to the node that will leave it in the LB until it's drained. Using PreferNoSchedule doesn't work in this case, since the thing I'm draining might just come back, and I'm not sure whether the Service Controller still considers the node ready in that case, though I believe it does.
The one strategy I can think of is using labels on the node and an anti-affinity on the pod. You'd add the label to a node, which would push the pod away, and then delete the pod. As long as the anti-affinity uses requiredDuringSchedulingIgnoredDuringExecution this should be safe to do in preparation for a migration. Once the pods have been deleted you could then drain the nodes.
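A sketch of that, using a hypothetical label key; you'd kubectl label the node (e.g. example.com/draining=true) before deleting or, better, evicting its pods, and the scheduler will keep them away:

```yaml
# In the pod template: refuse nodes that carry the (hypothetical) draining label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/draining
              operator: DoesNotExist
```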
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/lifecycle frozen /remove-lifecycle stale /remove-lifecycle rotten
@daenney - As I mentioned in https://github.com/kubernetes/kubernetes/issues/65013#issuecomment-455222929, I set up a script that first taints all nodes in the old node pool and then cordons+taints one node at a time. If your ingress services are redundant this should work well.
Is there a plan for cordon behavior to improve on GCP? We have relied on this behavior for years on our AWS clusters and it was very surprising to find out.
This is a bug with service load balancers - not something we should document. We defined a label in https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/2019-07-16-node-role-label-use.md#service-load-balancer that cluster admins can use, node.kubernetes.io/exclude-from-external-load-balancers, which is beta and on by default in 1.19. Unschedulable has nothing to do with SLB inclusion.
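For example, to pull a node out of external load balancers without touching schedulability (label name per the KEP above; the exclusion is triggered by the label's presence, the value doesn't matter):

```sh
kubectl label node <node> node.kubernetes.io/exclude-from-external-load-balancers=true
# and to put it back:
kubectl label node <node> node.kubernetes.io/exclude-from-external-load-balancers-
```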
The presumption is that a node that is unschedulable is going away, or needs some other repair.
Disagree. Unschedulable means "no new workloads". It has nothing to do with existing workloads, that is why we have drain (but we should have something in kube that communicates nodes are draining).
I have opened #90823 to remove this check; we have an alternate approach now and the old logic is wrong (the comment in the code even says "this is for masters!", and that's wrong now too as part of the KEP).
It has nothing to do with existing workloads, that is why we have drain (but we should have something in kube that communicates nodes are draining).
A node status condition?
Possibly, or a taint, or a label. Something concrete.
/milestone v1.19
@dictcp
Is this a BUG REPORT or FEATURE REQUEST?:
What happened: In order to migrate to a new version of Kubernetes on GKE, I cordoned (kubectl cordon) all nodes in the old node pool (1.8.9) so that I could drain the node pool (the pods on the nodes), with the intention of causing the pods to run on a new version of the node pool (1.10.2). As soon as I cordoned all the nodes in the old node pool, it caused downtime of the ingress (nginx). The nginx controller pods were running fine on the cordoned node pool, but the cordon caused the nodes to be removed from the target pool of the ingress load balancer. Below are the documents that suggest the same method I followed as best practice for a zero-downtime upgrade:
1) https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool
2) https://cloudplatform.googleblog.com/2018/06/Kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime.html
What you expected to happen:
I expected "kubectl cordon" not to remove the node(s) from the target pool of the load balancer, because the nginx controller pods were running absolutely fine on these nodes.
How to reproduce it (as minimally and precisely as possible):
1) Cordon all the nodes in an old node pool that runs the nginx controller.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.10.2
- Kernel (e.g. uname -a): 4.14.22+