lwolf / kube-cleanup-operator

Kubernetes Operator to automatically delete completed Jobs and their Pods
MIT License

GKE cluster autoscaler scale down issue #22

Closed tomlev closed 5 years ago

tomlev commented 5 years ago

Great work ;)

I'm experiencing an issue with the autoscaler on GKE: the autoscaler does not scale down while the cleanup-operator is running. When I delete it, the autoscaler scales down quickly.

Env: GKE, Kubernetes 1.10.11-gke.1, node pool with autoscaling enabled

I'm testing the autoscaler with an empty deployment that requests resources:

apiVersion: v1
kind: Namespace
metadata:
  name: test-autoscale
---
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: test-autoscale
  namespace: test-autoscale
spec:
  selector:
    matchLabels:
      app: test-autoscale
  replicas: 3 # tells the deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: test-autoscale
    spec:
      containers:
      - name: test-autoscale
        image: nginx
        # Resource requests
        resources:
          requests:
            cpu: 500m

Depending on the replica count, the requested CPU, and the machine type of the instances in the pool, the autoscaler scales up by creating new nodes.

Then I delete the deployment. The autoscaler should scale down by removing some nodes, but when the cleanup-operator is running it does not.
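
For reference, my test loop looks roughly like this (the manifest filename is illustrative):

# Apply the test deployment and watch the autoscaler add nodes
kubectl apply -f test-autoscale.yaml
kubectl get nodes -w

# Then remove it and watch whether nodes are drained (roughly 10 minutes later)
kubectl delete namespace test-autoscale
kubectl get nodes -w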

Be careful: the cluster autoscaler only scales down about 10 minutes later, so it is useful to check its status with the following command (the configmap is updated every minute, I think):

kubectl describe -n kube-system configmap cluster-autoscaler-status
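
If you only want to watch the ScaleDown line, a rough polling loop (assuming the status text is stored under the "status" key of that configmap, which is the case as far as I know):

# Print the ScaleDown section of the autoscaler status once a minute
watch -n 60 "kubectl -n kube-system get configmap cluster-autoscaler-status -o jsonpath='{.data.status}' | grep -A1 ScaleDown"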

You will see

ScaleDown:   NoCandidates (candidates=0)

When I delete the cleanup-operator, it takes less than a minute to get

ScaleDown:   CandidatesPresent (candidates=1)

Then, 10 minutes later, the node is drained and deleted.

I tried adding the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "true" to the cleanup-operator (under spec.template.metadata.annotations), but without success.
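
For completeness, this is a sketch of where I put the annotation in the operator's Deployment (name, labels, and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cleanup-operator
spec:
  selector:
    matchLabels:
      app: cleanup-operator
  template:
    metadata:
      labels:
        app: cleanup-operator
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: cleanup-operator
        image: kube-cleanup-operator:latest   # illustrative; use the actual operator image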

Any idea why the cleanup-operator would block scale-down, and how to fix it?

lwolf commented 5 years ago

Hi, thanks for trying the cleanup-operator. How many nodes do you have? Does the operator run on a node that should be removed?

I've never tried it on GKE, but I can't think of any reason it could block the autoscaler.

tomlev commented 5 years ago

Sorry for the long delay.

This issue is not occurring anymore. I think it was caused by CronJobs on my cluster preventing the cluster from scaling down. Now I add the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "true" to all Jobs that can be safely evicted, so they don't block scale-down when CronJob pods are scheduled on a node.
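
For reference, a sketch of how I set this on a CronJob's pod template (name, schedule, and image are illustrative; on newer clusters the API version is batch/v1):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: busybox
            command: ["sh", "-c", "echo done"]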

This issue was not related to kube-cleanup-operator, so I'm closing it ;)