rabernat opened this issue 4 years ago
Seems like Dask Gateway and some JupyterHub pods are occupying these nodes.
$ kubectl get pod -o wide -n prod | grep highmem
api-gcp-uscentral1b-prod-dask-gateway-55dffbc7d4-dpcwn 1/1 Running 0 4d18h 10.36.33.179 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip <none> <none>
continuous-image-puller-bgfgp 1/1 Running 0 11d 10.36.33.185 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip <none> <none>
continuous-image-puller-jkk8q 1/1 Running 0 11d 10.37.170.127 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
continuous-image-puller-smrd8 1/1 Running 0 11d 10.36.24.14 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-mzj9 <none> <none>
continuous-image-puller-xgp8w 1/1 Running 0 11d 10.36.29.15 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-1frw <none> <none>
$ kubectl get pod -o wide -n prod | grep standard
continuous-image-puller-hh8kk 1/1 Running 0 11d 10.36.248.192 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c <none> <none>
continuous-image-puller-n4779 1/1 Running 0 11d 10.37.142.146 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955 <none> <none>
controller-gcp-uscentral1b-prod-dask-gateway-5f77b8d797-s84ph 1/1 Running 0 16d 10.36.248.64 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c <none> <none>
gcp-uscentral1b-prod-grafana-7fdb568f65-zxj9c 2/2 Running 0 16d 10.36.248.62 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c <none> <none>
gcp-uscentral1b-prod-ingress-nginx-controller-79cd5cd96c-z8crw 1/1 Running 1 16d 10.37.142.217 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955 <none> <none>
gcp-uscentral1b-prod-kube-state-metrics-58d7c65fd7-kxhwk 1/1 Running 3 16d 10.37.142.218 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955 <none> <none>
gcp-uscentral1b-prod-prome-operator-6b5b49dccb-w4ph8 2/2 Running 0 16d 10.36.248.63 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c <none> <none>
prometheus-gcp-uscentral1b-prod-prome-prometheus-0 3/3 Running 4 16d 10.37.142.219 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-4955 <none> <none>
traefik-gcp-uscentral1b-prod-dask-gateway-55d7854bf7-xgdhc 1/1 Running 1 16d 10.36.248.61 gke-pangeo-uscentral-nap-n1-standard--a4dc6106-p50c <none> <none>
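As a quick cross-check, the GKE node-pool label shows which pool each of these nodes belongs to (a small sketch using the label GKE sets on its nodes):
# Show each node together with the node pool it belongs to
$ kubectl get nodes -L cloud.google.com/gke-nodepool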
The dask-gateway pods should likely be in the core node pool, along with JupyterHub. There's no reason to keep them separate, I think. I can take care of that.
I'm not sure about the continuous-image-puller. I gather that it's a JupyterHub thing, but I'm not sure what the impact of disabling it would be. It seems to me like it shouldn't be the sole thing keeping a node from scaling down (and maybe once we fix the dask-gateway pods, the node will scale down).
Thanks for looking into this, Tom!
Do we need some sort of cron job that checks whether these services are running on non-core nodes?
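If we do want an automated check, something along these lines could run on a schedule and flag misplaced pods (a rough sketch, assuming the core nodes carry the hub.jupyter.org/node-purpose=core label that z2jh applies):
#!/usr/bin/env bash
# Flag pods in prod/staging that are scheduled on nodes without the core label (sketch)
for ns in prod staging; do
  kubectl get pods -n "$ns" -o wide --no-headers | while read -r name _ _ _ _ _ node _; do
    [ "$node" = "<none>" ] && continue  # skip pending pods
    purpose=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.hub\.jupyter\.org/node-purpose}')
    [ "$purpose" = "core" ] || echo "$ns/$name is running on non-core node $node"
  done
done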
With https://github.com/dask/dask-gateway/pull/325 and https://github.com/dask/dask-gateway/pull/324 we'll be able to set things up so that these pods don't run on non-core nodes in the first place. That'll need to wait for the next dask-gateway release.
In the meantime, we can patch around it:
# file: patch.yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hub.jupyter.org/node-purpose
                    operator: In
                    values:
                      - core
$ kubectl -n staging patch deployment traefik-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/traefik-gcp-uscentral1b-staging-dask-gateway patched
$ kubectl -n staging patch deployment api-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/api-gcp-uscentral1b-staging-dask-gateway patched
$ kubectl -n staging patch deployment controller-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/controller-gcp-uscentral1b-staging-dask-gateway patched
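A quick way to double-check where the patched deployments landed (a sketch; output trimmed here):
$ kubectl -n staging get pods -o wide | grep dask-gateway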
I've confirmed that those were moved to the default pool for staging at least, and things seem to still work. There are still a few items to do; I'll get to those later.
I might have broken some prometheus / grafana things (the hub should be fine).
I need to figure out what pods are actually needed per namespace for prometheus-operator to function.
@consideRatio the GCP cluster has a node with just system pods and two continuous-image-puller pods (one each for prod and staging):
$ kubectl get pod -o wide --all-namespaces | grep gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd
kube-system fluentd-gke-kv5n6 2/2 Running 0 49d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system gke-metadata-server-p8wpm 1/1 Running 0 49d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system gke-metrics-agent-px4vg 1/1 Running 0 49d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system kube-dns-7c976ddbdb-kqglx 4/4 Running 2 49d 10.37.170.162 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system kube-proxy-gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd 1/1 Running 0 68d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system netd-nbvfh 1/1 Running 0 69d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
kube-system prometheus-to-sd-9fsqg 1/1 Running 0 69d 10.128.0.108 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
prod continuous-image-puller-52hgv 1/1 Running 0 10h 10.37.170.239 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
staging continuous-image-puller-7fvmg 1/1 Running 0 10h 10.37.170.229 gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd <none> <none>
That node is in an auto-provisioned node pool set to auto-scale down all the way to zero. I wouldn't expect the continuous-image-puller pods to keep a node from auto-scaling down, though perhaps that's incorrect. Does that look strange to you?
From https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html:
It is important to realize that the continuous-image-puller together with a Cluster Autoscaler (CA) won't guarantee a reduced wait time for users. It only helps if the CA scales up before real users arrive, but the CA will generally fail to do so. This is because it will only add a node if one or more pods won't fit on the current nodes but would fit more if a node is added, but at that point users are already waiting. To scale up nodes ahead of time we can use user-placeholders.
This suggests that the continuous-image-puller isn't all that useful on its own, and we aren't using user-placeholders, so perhaps we just remove the continuous-image-puller.
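If we do drop it, the z2jh chart has a toggle for that; the config should be roughly the following (hedged, worth checking against the chart version we're actually running):
# z2jh helm values snippet: disable the continuous image puller
prePuller:
  continuous:
    enabled: false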
Hmmm, I guess if you only pull a single image and don't have user placeholders, then it's just a pod requesting no resources that can be evicted by other pods if needed.
It is very harmless in the latest z2jh release, and it won't block scale-down. I would inspect all pods on the nodes individually with kubectl describe nodes to see what pods ran on them, and I would inspect what the cluster-autoscaler status ConfigMap in the kube-system namespace is saying.
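Concretely, those two checks look something like this (a sketch; the ConfigMap name is the upstream cluster-autoscaler default and is an assumption here):
# What is scheduled on each node, plus taints and conditions
$ kubectl describe nodes
# Cluster-autoscaler status ConfigMap, which reports scale-down blockers per node
$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml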
Thanks, kubectl describe nodes is helpful.
Edit: Now that I've disabled the continuous-image-puller, these unused nodes have gained the taints:
Taints: ToBeDeletedByClusterAutoscaler=1602078148:NoSchedule
DeletionCandidateOfClusterAutoscaler=1602077543:PreferNoSchedule
And now it's been autoscaled down. So I think this is the behavior we want.
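For future reference, a quick way to spot those autoscaler taints across all nodes (a sketch):
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints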
There are a few more stray pods that I'll pin to the core pool.
Leaving a note here for future debugging. I noticed that the node gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-88l4 wasn't scaling down, despite having just kube-system pods and the prometheus-node-exporter DaemonSet. https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility suggests viewing the logs at https://console.cloud.google.com/logs/query;query=logName%3D%22projects%2Fpangeo-181919%2Flogs%2Fcontainer.googleapis.com%252Fcluster-autoscaler-visibility%22?authuser=1&angularJsUrl=%2Flogs%2Fviewer%3Fsupportedpurview%3Dproject%26authuser%3D1&project=pangeo-181919&supportedpurview=project&query=%0A, with this filter:
logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"
I see a NoDecisionStatus, and in the logs:
reason: {
  parameters: [
    0: "metrics-server-v0.3.6-5cf765ff9-9pvxn"
  ]
  messageId: "no.scale.down.node.pod.kube.system.unmovable"
}
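As an aside, the same logs can be pulled from the CLI rather than the console (a sketch; project ID as in the console URL above):
$ gcloud logging read \
    'logName="projects/pangeo-181919/logs/container.googleapis.com%2Fcluster-autoscaler-visibility"' \
    --project=pangeo-181919 --limit=20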
So there's a system pod that was added to the high-memory pool. Ideally those would be in the core pool. I'll see if I can add an annotation to it.
Hmm, according to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods:
Metrics Server is best left alone, as restarting it causes the loss of metrics for >1 minute, as well as metrics in dashboard from the last 15 minutes. Metrics Server downtime also means effective HPA downtime as it relies on metrics. Add PDB for it only if you're sure you don't mind.
We're probably OK with that. I wonder if defining a PDB is better than (somehow?) setting the nodeAffinity so that it ends up in the core pool in the first place? We would want the affinity regardless so that it doesn't bounce between non-core nodes.
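If we go the PDB route, it would look roughly like this (a sketch; the label selector and API version are assumptions to verify against the actual metrics-server pods):
# file: metrics-server-pdb.yaml (sketch)
apiVersion: policy/v1beta1   # policy/v1 on clusters >= 1.21
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server   # assumed label; check with kubectl -n kube-system get pods --show-labels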
I randomly logged into the Google Cloud console to monitor our cluster tonight. I found that the cluster was scaled up to 8 nodes / 34 vCPUs / 170 GB memory.
However, afaict there are only two Jupyter users logged in.
I poked around the node pools, and the nodes seemed to be heavily undersubscribed.
This is as far as my debugging skills go. I don't know how to figure out what pods are running on those nodes. I wish the elastic node pools would scale down. Maybe there are some permanent services whose pods got stuck on those nodes and now they can't be scaled down?
This is important because it costs a lot of money to have these VMs constantly running.
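For the record, one way to see exactly what is running on a given node (a sketch; the node name is just an example taken from the listing further up):
$ kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName=gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-snip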