pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

idle nodes on gcs cluster #769

Open rabernat opened 4 years ago

rabernat commented 4 years ago

I randomly logged into the Google Cloud console to monitor our cluster tonight. I found that the cluster was scaled up to 8 nodes / 34 vCPUs / 170 GB memory.

[screenshot: Google Cloud console showing the cluster at 8 nodes / 34 vCPUs / 170 GB memory]

However, as far as I can tell there are only two Jupyter users logged in:

[screenshot: JupyterHub admin page showing two logged-in users]

I poked around the nodepools, and the nodes seemed to be heavily undersubscribed.

[screenshot: node pool details showing heavily undersubscribed nodes]

This is as far as my debugging skills go. I don't know how to figure out which pods are running on those nodes. I wish the elastic node pools would scale down. Maybe there are some permanent services whose pods got stuck on those nodes, so the nodes can't be scaled down?
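For reference, a generic way to see everything scheduled on one of those nodes is a field selector on the pod's node name (the node name below is a placeholder):

```
# List every pod, in all namespaces, scheduled on a specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```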

This is important because it costs a lot of money to have these VMs constantly running.

TomAugspurger commented 4 years ago

Seems like Dask Gateway and some JupyterHub pods are occupying these nodes.

$ kubectl get pod -o wide -n prod | grep highmem 
api-gcp-uscentral1b-prod-dask-gateway-55dffbc7d4-dpcwn           1/1     Running   0          4d18h   10.36.33.179    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip   <none>           <none>
continuous-image-puller-bgfgp                                    1/1     Running   0          11d     10.36.33.185    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip   <none>           <none>
continuous-image-puller-jkk8q                                    1/1     Running   0          11d     10.37.170.127   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
continuous-image-puller-smrd8                                    1/1     Running   0          11d     10.36.24.14     gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-mzj9   <none>           <none>
continuous-image-puller-xgp8w                                    1/1     Running   0          11d     10.36.29.15     gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-1frw   <none>           <none>

$ kubectl get pod -o wide -n prod | grep standard
continuous-image-puller-hh8kk                                    1/1     Running   0          11d     10.36.248.192   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
continuous-image-puller-n4779                                    1/1     Running   0          11d     10.37.142.146   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
controller-gcp-uscentral1b-prod-dask-gateway-5f77b8d797-s84ph    1/1     Running   0          16d     10.36.248.64    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
gcp-uscentral1b-prod-grafana-7fdb568f65-zxj9c                    2/2     Running   0          16d     10.36.248.62    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
gcp-uscentral1b-prod-ingress-nginx-controller-79cd5cd96c-z8crw   1/1     Running   1          16d     10.37.142.217   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
gcp-uscentral1b-prod-kube-state-metrics-58d7c65fd7-kxhwk         1/1     Running   3          16d     10.37.142.218   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
gcp-uscentral1b-prod-prome-operator-6b5b49dccb-w4ph8             2/2     Running   0          16d     10.36.248.63    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>
prometheus-gcp-uscentral1b-prod-prome-prometheus-0               3/3     Running   4          16d     10.37.142.219   gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955   <none>           <none>
traefik-gcp-uscentral1b-prod-dask-gateway-55d7854bf7-xgdhc       1/1     Running   1          16d     10.36.248.61    gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c   <none>           <none>

The dask-gateway pods should likely be in the core node pool, along with JupyterHub. There's no reason to keep them separate, I think. I can take care of that.

I'm not sure about the continuous-image-puller. I gather that it's a JupyterHub thing, but I'm not sure what the impact of disabling it would be. It seems like it shouldn't be the sole thing keeping a node from scaling down (and maybe once we fix the dask-gateway pods, the nodes will scale down).

rabernat commented 4 years ago

Thanks for looking into this Tom!

Do we need some sort of cron job that checks whether these services are running on non-core nodes?
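A minimal sketch of such a check, using the core-pool label referenced in the nodeAffinity patch later in this thread (hub.jupyter.org/node-purpose=core); the grep pattern of service names is just an assumption:

```
# Flag core-service pods (dask-gateway, grafana, prometheus, nginx) running on
# nodes outside the core pool.
for node in $(kubectl get nodes -l 'hub.jupyter.org/node-purpose!=core' -o name); do
  kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName="${node#node/}" \
    | grep -E 'dask-gateway|grafana|prometheus|nginx' \
    && echo "^^ core service pods found on ${node}"
done
```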

TomAugspurger commented 4 years ago

With https://github.com/dask/dask-gateway/pull/325 and https://github.com/dask/dask-gateway/pull/324 we'll be able to set things up so that these pods don't run on non-core nodes in the first place. That'll need to wait for the next dask-gateway release.

In the meantime, we can patch around it:

# file: patch.yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values:
                - core
$ kubectl -n staging patch deployment traefik-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/traefik-gcp-uscentral1b-staging-dask-gateway patched

$ kubectl -n staging patch deployment api-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/api-gcp-uscentral1b-staging-dask-gateway patched

$ kubectl -n staging patch deployment controller-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/controller-gcp-uscentral1b-staging-dask-gateway patched

I've confirmed that those were moved to the default pool for staging at least, and things seem to still work. A few follow-up items are still to do.

I'll get to those later.

TomAugspurger commented 4 years ago

I might have broken some Prometheus / Grafana things (the hub should be fine):

```
Error: UPGRADE FAILED: cannot patch "gcp-uscentral1b-staging-grafana" with kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://gcp-uscentral1b-prod-ingress-nginx-controller-admission.prod.svc:443/extensions/v1beta1/ingresses?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-ingress-nginx-controller-admission" &&
cannot patch "gcp-uscentral1b-staging-pr-alertmanager.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-etcd" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-k8s.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver-availability.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver-slos" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-apiserver.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-prometheus-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-prometheus-node-recording.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-scheduler.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kube-state-metrics" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubelet.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-apps" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-resources" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-storage" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-apiserver" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-controller-manager" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-kubelet" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system-scheduler" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-kubernetes-system" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-node-network" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-prometheus-operator" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator" &&
cannot patch "gcp-uscentral1b-staging-pr-prometheus" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post https://gcp-uscentral1b-prod-prome-operator.prod.svc:443/admission-prometheusrules/mutate?timeout=30s: no endpoints available for service "gcp-uscentral1b-prod-prome-operator"
```

I need to figure out what pods are actually needed per namespace for prometheus-operator to function.
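A quick sanity check is whether the admission-webhook services named in those errors have any endpoints backing them; a sketch:

```
# Do the webhook services referenced by the failures have ready endpoints?
kubectl -n prod get endpoints gcp-uscentral1b-prod-prome-operator
kubectl -n prod get endpoints gcp-uscentral1b-prod-ingress-nginx-controller-admission

# Which webhook configurations point at them?
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
```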

TomAugspurger commented 4 years ago

@consideRatio the GCP cluster has a node with just system pods and two continuous-image-puller pods (one each for prod and staging):

$ kubectl get pod -o wide --all-namespaces | grep gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd
kube-system   fluentd-gke-kv5n6                                                 2/2     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   gke-metadata-server-p8wpm                                         1/1     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   gke-metrics-agent-px4vg                                           1/1     Running     0          49d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   kube-dns-7c976ddbdb-kqglx                                         4/4     Running     2          49d    10.37.170.162   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   kube-proxy-gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd    1/1     Running     0          68d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   netd-nbvfh                                                        1/1     Running     0          69d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
kube-system   prometheus-to-sd-9fsqg                                            1/1     Running     0          69d    10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
prod          continuous-image-puller-52hgv                                     1/1     Running     0          10h    10.37.170.239   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>
staging       continuous-image-puller-7fvmg                                     1/1     Running     0          10h    10.37.170.229   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>           <none>

That node is in an auto-provisioned node pool set to auto-scale down all the way to zero. I wouldn't expect the continuous-image-puller pods to keep a node from auto-scaling down, though perhaps that's incorrect. Does that look strange to you?

TomAugspurger commented 4 years ago

From https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html:

It is important to realize that the continuous-image-puller together with a Cluster Autoscaler (CA) won't guarantee a reduced wait time for users. It only helps if the CA scales up before real users arrive, but the CA will generally fail to do so. This is because it will only add a node if one or more pods won't fit on the current nodes but would fit if a node is added, but at that point users are already waiting. To scale up nodes ahead of time we can use user-placeholders.

This suggests that the continuous-image-puller isn't all that useful on its own, and we aren't using user-placeholders, so perhaps we should just remove the continuous-image-puller.
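If we do remove it, the relevant z2jh chart setting should be the prePuller block; a sketch of the values change (where this lands in our deployment config isn't shown here):

```yaml
# Disable the continuous image puller in the z2jh helm values
prePuller:
  continuous:
    enabled: false
```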

consideRatio commented 4 years ago

Hmmm, I guess if you only pull a single image and don't have user placeholders, then it's just a pod requesting no resources that can be evicted by other pods if needed.

It is very harmless in the latest z2jh release, and it won't block scale-down. I would inspect all pods on the nodes individually with kubectl describe nodes to see what pods ran on them, and I would check what the cluster-autoscaler status ConfigMap in the kube-system namespace is saying.
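For the second part, on GKE the autoscaler typically publishes its state in a ConfigMap named cluster-autoscaler-status; assuming that name:

```
# Per-node pod list plus resource requests/limits
kubectl describe nodes

# The autoscaler's own view of scale-down candidates (ConfigMap name assumed)
kubectl -n kube-system describe configmap cluster-autoscaler-status
```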

TomAugspurger commented 4 years ago

Thanks, kubectl describe nodes is helpful.

Edit: now that I've disabled the continuous-image-puller, these unused nodes have gained the taints:

Taints:             ToBeDeletedByClusterAutoscaler=1602078148:NoSchedule
                    DeletionCandidateOfClusterAutoscaler=1602077543:PreferNoSchedule

And now the node has been autoscaled down, so I think this is the behavior we want.
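A quick way to watch for those autoscaler taints across the node pools (just a sketch):

```
# Show every node together with the keys of any taints applied to it
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key
```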

TomAugspurger commented 4 years ago

There are a few more stray pods that I'll pin to the core pool.

TomAugspurger commented 3 years ago

Leaving a note here for future debugging. I noticed that the node gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-88l4 wasn't scaling down, despite having just kube-system pods and the prometheus-node-exporter DaemonSet. https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility suggests viewing the logs at https://console.cloud.google.com/logs/query;query=logName%3D%22projects%2Fpangeo-181919%2Flogs%2Fcontainer.googleapis.com%252Fcluster-autoscaler-visibility%22?authuser=1&angularJsUrl=%2Flogs%2Fviewer%3Fsupportedpurview%3Dproject%26authuser%3D1&project=pangeo-181919&supportedpurview=project&query=%0A, with this filter:

logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
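The same query should also work from the CLI with gcloud logging read, e.g.:

```
# Pull recent cluster-autoscaler visibility events for this project
gcloud logging read \
  'logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
  --project=pangeo-181919 --limit=20 --format=json
```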

I see a NoDecisionStatus, and in the logs:

reason: {
  parameters: [
    0: "metrics-server-v0.3.6-5cf765ff9-9pvxn"
  ]
  messageId: "no.scale.down.node.pod.kube.system.unmovable"
}

So there's a system pod that was scheduled onto the high-memory pool. Ideally those would be in the core pool. I'll see if I can add an annotation to it.
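Presumably that would be the autoscaler's safe-to-evict annotation on the metrics-server deployment; a sketch (the deployment name is inferred from the pod name above, and GKE's addon manager may well revert this change):

```
# Mark metrics-server as safe for the cluster autoscaler to evict
kubectl -n kube-system patch deployment metrics-server-v0.3.6 --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}}}}}'
```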

TomAugspurger commented 3 years ago

Hmm, according to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods

Metrics Server is best left alone, as restarting it causes the loss of metrics for >1 minute, as well as metrics in dashboard from the last 15 minutes. Metrics Server downtime also means effective HPA downtime as it relies on metrics. Add PDB for it only if you're sure you don't mind.

We're probably OK with that. I wonder if defining a PDB is better than (somehow?) setting the nodeAffinity so that it ends up in the core pool in the first place? We would want the affinity regardless so that it doesn't bounce between non-core nodes.
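For the PDB option, something along these lines should be enough for the CA to consider the pod movable; the label selector is assumed from the GKE-managed metrics-server deployment, and policy/v1beta1 matches clusters of that era:

```yaml
# PodDisruptionBudget letting the cluster autoscaler evict metrics-server
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server
```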