jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html

Cluster autoscaler debugging #840

Open betatim opened 5 years ago

betatim commented 5 years ago

Question: is there a way to see what the CA is thinking in order to find out what is preventing a down-scale? The generic advice is to look at the CA logs on the master node, however on GKE we can't access those and my googling hasn't brought up an alternative.

Over the last few days/weeks/times I've looked it never seems to go below three nodes, even if one of them is essentially empty. I had a moment to poke around today and left feeling like I don't understand why the third node doesn't get removed.

The node I think should be removed from the pool is `gke-prod-a-user-eab507e4-8hb2` and the pods on it are:

$ kubectl get pods -o wide --all-namespaces | grep gke-prod-a-user-eab507e4-8hb2  
kube-system   calico-node-ss8ww                                            2/2       Running     0          1d        10.128.0.4     gke-prod-a-user-eab507e4-8hb2
kube-system   fluentd-gcp-v3.1.0-7775b                                     1/1       Running     0          6d        10.12.3.3      gke-prod-a-user-eab507e4-8hb2
kube-system   heapster-v1.5.3-c854dfc94-rcmc9                              2/2       Running     0          4d        10.12.3.99     gke-prod-a-user-eab507e4-8hb2
kube-system   ip-masq-agent-8hbhp                                          1/1       Running     0          6d        10.128.0.4     gke-prod-a-user-eab507e4-8hb2
kube-system   kube-proxy-gke-prod-a-user-eab507e4-8hb2                     1/1       Running     0          6d        10.128.0.4     gke-prod-a-user-eab507e4-8hb2
prod          events-archiver-5757dff777-4cwlc                             1/1       Running     0          2d        10.12.3.150    gke-prod-a-user-eab507e4-8hb2
prod          matomo-mysqld-exporter-f9cd5c6b7-4kcz8                       2/2       Running     0          4d        10.12.3.100    gke-prod-a-user-eab507e4-8hb2
prod          prod-dind-gr7z6                                              1/1       Running     0          6d        10.12.3.2      gke-prod-a-user-eab507e4-8hb2
prod          prod-image-cleaner-xwg46                                     1/1       Running     0          4d        10.12.3.206    gke-prod-a-user-eab507e4-8hb2
prod          prod-prometheus-node-exporter-rrb2f                          1/1       Running     0          6d        10.128.0.4     gke-prod-a-user-eab507e4-8hb2

None of these should prevent the node from being removed, because they are all controlled by a Deployment or similar. I didn't dig into the kube-system namespace pods as they look like pods that would be present on all nodes, and in general cluster autoscaling (CA) works.
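A quick way to confirm what controls a given pod is to look at its ownerReferences. This is just a sketch, using one of the pod names from the listing above:

kubectl get pod events-archiver-5757dff777-4cwlc -n prod \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
# a Deployment-managed pod reports ReplicaSet here; daemonset pods report DaemonSet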

I know how to look at the current status of CA with:

kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

and that shows it has been checking recently to decide if it needs to scale up/down.
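For what it's worth, the scale-up/scale-down details live in the status key of that configmap. Assuming the layout matches the upstream cluster-autoscaler, something like this pulls out just the scale-down part:

kubectl get configmap cluster-autoscaler-status -n kube-system \
  -o jsonpath='{.data.status}' | grep -A 5 "ScaleDown"
# prints the ScaleDown status (e.g. NoCandidates or CandidatesPresent) for the cluster and each node group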

consideRatio commented 5 years ago

@betatim a deployment's pods can be moved, but some criteria need to be met:

  1. They must fit on another node.
  2. They must be allowed to be disrupted for a short duration by any PDB that covers them. If a PDB says "I always need 1 running!" then the CA won't move the pod, but if the PDB says "I can always accept one single pod being temporarily disrupted" then the CA could move it to another node.

Don't worry about any pods from a DaemonSet though (and I think the prometheus node exporter is such a pod, btw); they are ignored when this is considered.
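A quick way to check criterion 2 is to list the PDBs and see whether any of them currently allows zero disruptions:

kubectl get pdb --all-namespaces
# a PDB whose ALLOWED DISRUPTIONS column reads 0 will block the CA from evicting the pods it selects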

So, I'd verify that those two criteria hold for the pods on this node.

For more documentation, see:

(Some traces of my earlier issues with this are available in https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/503 - it won't be a very focused read though.)

betatim commented 5 years ago

"Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc)."

I read that as "if there is no PDB for this it will not hold up scale-down". Is this also how you see it?

The node is definitely underutilized (8 cores and 52GB should be far too much for those pods :)) and the second of the three nodes in the cluster was also mostly empty. Only matomo and the events-archiver would actually need relocating, as all the other pods in the prod namespace are from daemonsets.
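One way to sanity-check the utilization claim (node name taken from the pod listing above) is to compare what is actually requested on the node with its allocatable resources:

kubectl describe node gke-prod-a-user-eab507e4-8hb2 | grep -A 8 "Allocated resources"
# shows the summed CPU/memory requests and limits as a percentage of the node's allocatable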

Overall it would be nice to find a way to have the CA tell you why it thinks it can/can't do something, mostly because it is tedious to look at all the PDBs and the controller heritage of the running pods, and because I might misread a PDB, after which we are back to "Tim thinks this node should be removed but the CA doesn't" :-/

consideRatio commented 5 years ago

Yepp! Access to the cluster autoscaler's logs isn't something we can get on GKE, at least that was still the case a few months ago.

I'd look for PDBs for those two pods; perhaps you have one saying that one pod is required at all times. I don't know what would happen with no PDB at all and only a deployment running a single pod: would it move? Would it move if there were five pods in the deployment? What the default behaviour for disruptions is, is what I'm asking, hmmm. / erik from mobile

consideRatio commented 5 years ago

It seems a bit extreme that the CA would evict the only pod of a deployment in order to scale down, so I assume what you cited is one criterion rather than the only criterion.


Oh, regarding the question: I read the quoted text as "pods not controlled by a deployment etc. will always block scale-down, unless a PDB makes an exception". Our JH user pods are such pods, spawned by kubespawner rather than a deployment etc.

Try, without making a big chart change, simply adding a PDB with kubectl; they are quite simple objects.
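For example, something along these lines creates a PDB that tolerates one disrupted pod and therefore shouldn't block scale-down. This is a sketch: the PDB name and the app=matomo label selector are made up, so use whatever labels the matomo deployment actually carries.

kubectl create poddisruptionbudget matomo-pdb -n prod \
  --selector=app=matomo \
  --max-unavailable=1
# --max-unavailable=1 tells the CA it may evict one matching pod at a time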