Closed: Stono closed this issue 6 years ago.
kube-dns-autoscaler is a different thing, unrelated to cluster-autoscaler. From your list of pods it looks like you're not running CA at all. On the other hand, the configmap is coming from CA. Perhaps CA was only started after you ran kubectl get pods? Either way, it's hard to tell anything for sure without actual cluster-autoscaler logs.
That being said, my guess would be that you have kube-system pods spread across your nodes, and by default CA will never touch those. We have a section on how to deal with this kind of issue (including how to allow CA to move kube-system pods) in our FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why
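For example, a minimal PDB along the lines of that FAQ entry could look like the sketch below (the k8s-app: kube-dns label and the minAvailable value are assumptions; adjust them to the actual labels and replica counts in your cluster):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns

With two kube-dns replicas, minAvailable: 1 still lets CA evict one of them while draining a node.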
@maciaszczykm Apologies for some of the errors in my post; it was at the end of a long day and I blindly looked at the kube-dns autoscaler.
I'm running on GKE, and it seems the autoscaler in use is hidden from me: cluster-autoscaler is updating the ConfigMap, but I can't see it (presumably it's running on the master, which I'm abstracted from).
I'll try creating a few PDBs for things in kube-system and see how I get on! Ta!
Is kube-proxy created by a DaemonSet? See https://github.com/openai/kubernetes-ec2-autoscaler/issues/23. It could be useful to have https://github.com/allenai/kubernetes-ec2-autoscaler/pull/7 integrated upstream.
@bhack Those issues are related to a completely different component, unrelated to Cluster Autoscaler. CA ignores kube-proxy (along with other pods that run on every node) in scale-down.
@aleksandra-malinowska Is this also valid for a kube-system pod without a DaemonSet or ReplicaSet? Because you cannot set a valid PDB on kube-proxy in that deployment case.
I'm not sure what deployment case you mean? By default, kube-proxy runs as a DaemonSet. If you have a kube-system pod that isn't part of any Deployment/ReplicaSet/other controller, it won't be removed, because Cluster Autoscaler doesn't evict such pods (whether kube-system or not). You can add a safe-to-evict annotation to such a pod to override this behavior. This is described in the CA FAQ.
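For example, something like this should work (a sketch; substitute the actual pod name):

kubectl -n kube-system annotate pod <pod-name> cluster-autoscaler.kubernetes.io/safe-to-evict=true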
@aleksandra-malinowska I was referring exactly to https://github.com/openai/kubernetes-ec2-autoscaler/issues/23#issuecomment-284370307, which describes the default Kops deployment used by many users to deploy Kubernetes clusters, and it is related to this CA behaviour.
@bhack https://github.com/openai/kubernetes-ec2-autoscaler looks like a completely different, unrelated autoscaler. If you have questions regarding it, can you please ask them in that component's repo?
@aleksandra-malinowska It was just a reference, but the problem is the same here, because kube-proxy will not be evicted by this CA implementation either. I've tested this scenario with a PDB, this CA, and a standard Kops deployment (which is quite common to have).
If we don't want to handle a workaround here, we probably need to change the Kops kube-proxy deployment.
I've opened an issue at https://github.com/kubernetes/kops/issues/4419.
I still cannot tell from the user's logs how kube-proxy was deployed, but I can see kube-proxy pods.
@Stono Do you have a DaemonSet for kube-proxy in your deployment, or have you used Kops?
> It was just a reference, but the problem is the same here, because kube-proxy will not be evicted by this CA implementation either. I've tested this scenario with a PDB, this CA, and a standard Kops deployment (which is quite common to have).
Can you please provide relevant CA logs? There should be a reason for every scale-up and scale-down decision there, including why a node wasn't removed.
And if kube-proxy isn't running as a DaemonSet in your setup, can you please include your kube-proxy spec and PDB?
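Something along these lines should capture both (assuming the kube-proxy pod name from your cluster):

kubectl -n kube-system get pod <kube-proxy-pod-name> -o yaml
kubectl -n kube-system get pdb -o yaml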
> If we don't want to handle a workaround here, we probably need to change the Kops kube-proxy deployment.
By standard, do you mean this? I don't know much about Kops, but maybe you can get a better answer in that repo.
Since kube-proxy runs on every node in the cluster, if it were preventing scale-down, Cluster Autoscaler would never remove any node. There are multiple reports of Cluster Autoscaler scaling down clusters, including those running on AWS, without any hacks. I don't know what the most commonly used Kubernetes configuration in that environment is, but it would be surprising if this happened as a result of it.
The PDB on kube-proxy of course gives the warning "No controllers: found no controllers for pod".
And yes, the default deployment is this.
In the CA FAQ I read:
> By default, kube-system pods prevent CA from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere.
So I haven't checked the code implementation, but from the CA documentation it seems that without a working PDB it cannot remove kube-system pods.
> So I haven't checked the code implementation, but from the CA documentation it seems that without a working PDB it cannot remove kube-system pods.
This is true except for DaemonSets, which are always ignored.
Can you provide Cluster Autoscaler's logs from this test?
Yes, but per the Kops manifest, kube-proxy is not a DaemonSet. I don't have any specific entry in the CA log, but what is the expected CA behavior if we agree that kube-proxy is not a DaemonSet in Kubernetes clusters created with kubernetes/kops, and that a PDB on kube-proxy cannot work because it doesn't have a controller?
OK, disregard what I said before; it seems it doesn't always run as a DaemonSet by default yet.
> I don't have any specific entry in the CA log, but what is the expected CA behavior if we agree that kube-proxy is not a DaemonSet in Kubernetes clusters created with kubernetes/kops, and that a PDB on kube-proxy cannot work because it doesn't have a controller?
In my test cluster, kube-proxy has the kubernetes.io/config.mirror annotation. This makes Cluster Autoscaler ignore it when considering nodes for scale-down. Can you run kubectl describe pod <one-of-kube-proxy-pods> -n kube-system and see what annotations it has?
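If it's easier, a jsonpath query will print just the annotations:

kubectl -n kube-system get pod <one-of-kube-proxy-pods> -o jsonpath='{.metadata.annotations}'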
Yes, I have the kubernetes.io/config.mirror annotation on the kube-proxy pods. So do we just need to update the FAQ?
> Yes, I have the kubernetes.io/config.mirror annotation on the kube-proxy pods.
Cool. So this doesn't explain why CA wasn't able to remove nodes in your test?
> So do we just need to update the FAQ?
I don't mind mentioning it in the FAQ, as long as it's explicitly stated that for manually marking a pod as safe to evict, using cluster-autoscaler.kubernetes.io/safe-to-evict is preferred.
Yes, but I think that kubernetes.io/config.mirror is not exactly the same as cluster-autoscaler.kubernetes.io/safe-to-evict, and the FAQ still tells you to create a PDB for every pod in the kube-system namespace.
My scale-down test was meant to verify low utilization of the node, but it seems the autoscaler ignored that specific node.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
@bhack is this resolved?
Yes, I needed to set up all the required PDBs.
Great! Closing this as resolved then.
Thanks for the insightful conversation here. Is this the right place to ask a question? If not, please point me to where I should go. I have the following annotations on my kube-proxy pods:
kubernetes.io/config.hash: 94c3cb3691d60d09c8e90d0f28b1e46c
kubernetes.io/config.mirror: 94c3cb3691d60d09c8e90d0f28b1e46c
kubernetes.io/config.seen: 2018-12-26T11:06:14.309457745Z
kubernetes.io/config.source: file
scheduler.alpha.kubernetes.io/critical-pod:
And just scheduler.alpha.kubernetes.io/critical-pod: on my fluentd-gcp pods.
My cluster on GKE never seems to scale down, although the minimum size is set to 0. Two of the hypotheses I have are that either the absence of "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" or the presence of scheduler.alpha.kubernetes.io/critical-pod: is preventing my cluster from down-scaling. The second one seems incorrect per se, but I noticed that if I set Minimum size (per zone) = 0 and Maximum size (per zone) < 4, I get the warning on the GKE interface: "The current node pool size is now outside the autoscaling limits."
Note: There are a total of 3 zones, hence 4 nodes per zone would mean 12 nodes.
I do not understand what "all the required PDBs" means in @bhack's last message in the thread.
Can someone point me in the right direction to resolve this? Thanks!
Is this what you are looking for? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
Although 13 nodes seems quite a lot for just system pods. Do you have any workloads actually running in the cluster? Here is a comprehensive list of pods that will block scale down: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
I don't understand the comment about the warning on the GKE interface, "The current node pool size is now outside the autoscaling limits." If you set the maximum to be 12 and there are currently 13, the warning seems to be correct?
@bskiba Thanks for your response. Re: number of nodes: sorry, I should have mentioned earlier that there was 1 node running from a different node pool.
The warning about autoscaling limits appears when I reduce the Maximum size (per zone) field value from 4 to 3.
I have had a look at the list of pods that will block scale-down and examined the pods running on the nodes in my cluster. I noticed that all the running nodes (11 as of now) have some kube-system pods that are not part of standard Kubernetes, such as fluentd-gcp, metadata, heapster, etc. (full list below, towards the end of this post). This is apart from some of our application-specific pods that are also running; however, total node pool capacity usage is much lower than 50% as of now.
Since I noticed our cluster wasn't auto-scaling down, I enabled preemptible on my node pool, such that (and this is my hypothesis) every 24h each node is automatically deleted, and if GCP doesn't need to run more of its kube-system pods, a new node does not spin up.
Now what I want to learn is: how do I mark these non-standard kube-system pods (fluentd-gcp, heapster, metadata, etc.) as safe to evict, so that the nodes get scaled down when not in use by our application?
Node 1
kube-system fluentd-gcp-v3.1.0-b6pm7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-52f08173-q2sk 100m (2%) 0 (0%) 0 (0%) 0 (0%) 11h
kube-system metadata-agent-shp2s 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 11h
Node 2
kube-system event-exporter-v0.2.1-b4b4dbddf-rtrwv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12h
kube-system fluentd-gcp-v3.1.0-2b6tv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 36h
kube-system heapster-v1.5.3-557f5c8d68-krp8t 138m (3%) 138m (3%) 304456Ki (1%) 304456Ki (1%) 31h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-52f08173-ws75 100m (2%) 0 (0%) 0 (0%) 0 (0%) 36h
kube-system metadata-agent-knln2 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 36h
kube-system tiller-deploy-895d57dd9-77drn
Node 3
kube-system fluentd-gcp-v3.1.0-hbqzd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-52f08173-z05q 100m (2%) 0 (0%) 0 (0%) 0 (0%) 31h
kube-system kubernetes-dashboard-7b9c7bc8c9-rxps6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31h
kube-system metadata-agent-blld8 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 31h
Node 4
kube-system fluentd-gcp-scaler-7c5db745fc-vq8ns 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5h55m
kube-system fluentd-gcp-v3.1.0-824st 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5h59m
kube-system kube-dns-788979dc8f-2ksjv 260m (6%) 0 (0%) 110Mi (0%) 170Mi (0%) 5h55m
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-bdf21dee-74z3 100m (2%) 0 (0%) 0 (0%) 0 (0%) 5h59m
kube-system metadata-agent-7jbrd 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 5h59m
kube-system metrics-server-v0.2.1-7486f5bd67-hgjxn 53m (1%) 148m (3%) 154Mi (0%) 404Mi (1%) 5h55m
Node 5
kube-system fluentd-gcp-v3.1.0-8nvhs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-bdf21dee-c0t5 100m (2%) 0 (0%) 0 (0%) 0 (0%) 11h
kube-system metadata-agent-4zqvr 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 11h
Node 6
kube-system fluentd-gcp-v3.1.0-tc87g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d5h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-bdf21dee-lfcv 100m (2%) 0 (0%) 0 (0%) 0 (0%) 2d5h
kube-system metadata-agent-qt88k 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 2d5h
Node 7
kube-system fluentd-gcp-v3.1.0-9bv7m 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system kube-dns-autoscaler-79b4b844b9-qsgtd 20m (0%) 0 (0%) 10Mi (0%) 0 (0%) 8h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-bdf21dee-mfjj 100m (2%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system l7-default-backend-5d5b9874d5-7j6rh 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%) 8h
kube-system metadata-agent-cluster-level-7b467d554f-6ggj4 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 8h
Node 8
kube-system fluentd-gcp-v3.1.0-89jqj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d9h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-fce77507-flgm 100m (2%) 0 (0%) 0 (0%) 0 (0%) 2d9h
kube-system metadata-agent-z9kfg 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 2d9h
Node 9
kube-system fluentd-gcp-v3.1.0-cpskg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-fce77507-lb1v 100m (2%) 0 (0%) 0 (0%) 0 (0%) 8h
kube-system metadata-agent-rkrjj 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 8h
Node 10
kube-system fluentd-gcp-v3.1.0-jpxf7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30h
kube-system kube-dns-788979dc8f-nzxpv 260m (6%) 0 (0%) 110Mi (0%) 170Mi (0%) 12h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-fce77507-prgb 100m (2%) 0 (0%) 0 (0%) 0 (0%) 30h
Node 11
kube-system fluentd-gcp-v3.1.0-l8kgr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d12h
kube-system kube-proxy-gke-cluster-2-pool-4-high-mem-fce77507-rls2 100m (2%) 0 (0%) 0 (0%) 0 (0%) 2d12h
kube-system metadata-agent-xpjxv 40m (1%) 0 (0%) 50Mi (0%) 0 (0%) 2d12h
Can you provide output of:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
This should give us an idea about what CA thinks about your cluster.
Regarding the data about what's running on the nodes, just to clarify, are the numbers you provided resource requests and limits of the pods? What Cluster Autoscaler looks at is resource requests.
AFAIK fluentd, kube-proxy and metadata-agent run on all nodes by default, so they shouldn't block a node from being scaled down. The only things from this list that would block it are metrics-server, kubernetes-dashboard and heapster.
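If you want CA to be able to move those, a PDB per component roughly like the sketch below should do (the k8s-app: kubernetes-dashboard label is an assumption; check the actual labels first with kubectl -n kube-system get pods --show-labels):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kubernetes-dashboard-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kubernetes-dashboard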
@bskiba - Thanks, here's the output:
apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2019-01-03 16:28:02.668759958 +0000 UTC:
    Cluster-wide:
      Health: Healthy (ready=6 unready=0 notStarted=0 longNotStarted=0 registered=6 longUnregistered=0)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2018-12-25 13:02:58.519428566 +0000 UTC
      ScaleUp: NoActivity (ready=6 registered=6)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-02 21:56:48.353282865 +0000 UTC
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-03 13:33:03.63949879 +0000 UTC
    NodeGroups:
      Name: https://content.googleapis.com/compute/v1/projects/playground-206205/zones/us-central1-a/instanceGroups/gke-cluster-2-pool-4-high-mem-fce77507-grp
      Health: Healthy (ready=3 unready=0 notStarted=0 longNotStarted=0 registered=3 longUnregistered=0 cloudProviderTarget=3 (minSize=0, maxSize=4))
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2018-12-26 04:18:37.983142165 +0000 UTC
      ScaleUp: NoActivity (ready=3 cloudProviderTarget=3)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-02 07:32:04.746181534 +0000 UTC
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-03 13:33:03.63949879 +0000 UTC
      Name: https://content.googleapis.com/compute/v1/projects/playground-206205/zones/us-central1-c/instanceGroups/gke-cluster-2-pool-4-high-mem-52f08173-grp
      Health: Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=4))
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2018-12-26 04:18:37.983142165 +0000 UTC
      ScaleUp: NoActivity (ready=1 cloudProviderTarget=1)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-02 21:56:48.353282865 +0000 UTC
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-02 21:56:48.353282865 +0000 UTC
      Name: https://content.googleapis.com/compute/v1/projects/playground-206205/zones/us-central1-b/instanceGroups/gke-cluster-2-pool-4-high-mem-bdf21dee-grp
      Health: Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0 cloudProviderTarget=2 (minSize=0, maxSize=4))
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2018-12-26 04:18:37.983142165 +0000 UTC
      ScaleUp: NoActivity (ready=2 cloudProviderTarget=2)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-02 06:29:51.507640337 +0000 UTC
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2019-01-03 16:27:58.934485238 +0000 UTC
        LastTransitionTime: 2019-01-03 10:49:09.930139074 +0000 UTC
kind: ConfigMap
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/last-updated: 2019-01-03 16:28:02.668759958 +0000 UTC
  creationTimestamp: "2018-11-16T22:59:41Z"
  name: cluster-autoscaler-status
  namespace: kube-system
  resourceVersion: "53683691"
  selfLink: /api/v1/namespaces/kube-system/configmaps/cluster-autoscaler-status
  uid: 4d2ed921-e9f3-11e8-8c2d-42010a8000e0
@ProProgrammer maybe this might be the issue for you: https://github.com/kubernetes/kubernetes/issues/69696
I appreciate you're probably getting sick of these sorts of questions, but I can't work out from the logs available to me why the cluster is not scaling down:
Pods:
No restrictive PDBs:
Nodes:
Autoscaler logs:
Autoscaler configmap: