TomAugspurger opened 4 years ago
Can we provide any feedback on what we're waiting on in the gateway.new_cluster() and cluster.scale(...)?
This is an open question. This could definitely be handled, but I'm not sure where/how to display this information to the user (if you have thoughts, I'd be interested to hear them). Presumably other Cluster
implementations have the same issue, so this isn't specific to dask-gateway.
Stepping back a bit here, we'd like:
Given the above requirements (which I may have assumed incorrectly), perhaps dask schedulers and notebooks could share the notebook nodes? To aid in packing and speedy scale up, we might also enable some of the JupyterHub scheduling/placeholder options here: https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html.
cc @consideRatio, who likely has some thoughts on this.
Thanks. Your summary is accurate. So I think we can approach it on two fronts
We discussed putting the schedulers in the same node-pool as the jupyter notebooks. @jhamman raised some issues at https://github.com/pangeo-data/pangeo-cloud-federation/pull/567#issuecomment-601784900. In the end we decided to go with a separate node pool (but that could be revisited).
@jcrist I setup a dask-gateway deployment and opted to make the scheduler run alongside the user pods. Related config:
gateway:
  backend:
    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig:
        tolerations:
          - key: "hub.jupyter.org_dedicated"
            operator: "Equal"
            value: "user"
            effect: "NoSchedule"
        nodeSelector:
          hub.jupyter.org/node-purpose: user
        schedulerName: myjhubhelmreleasename-user-scheduler
I think a key question in deciding whether to place the dask schedulers on the same nodes as users or on different nodes is the relative resource requests for users and schedulers. If you can fit 4 users per node perfectly, and a scheduler takes up a tenth of a user's request so that you can suddenly only fit three, that's not great. I have no experience with how much the scheduler generally needs to request, though, so I find it complicated to figure out what makes the most sense here.
A safe default is 1 core and 2 GiB memory for the scheduler, but they can get by with far less.
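To make the packing concern concrete, here is a toy calculation. The numbers below are illustrative assumptions, not measured values:

```python
# Illustrative packing arithmetic for one user node (assumed numbers).
node_cpu = 4.0           # allocatable cores on a user node
user_request = 1.0       # cores requested per user pod
scheduler_request = 1.0  # the "safe default" scheduler request above

# Users alone pack the node perfectly:
users_only = int(node_cpu // user_request)  # 4 users per node

# Placing one dask scheduler on the node displaces a whole user slot:
users_with_scheduler = int((node_cpu - scheduler_request) // user_request)  # 3 users
```

So with these requests, every node that hosts a scheduler loses a full user slot, which is why the scheduler's actual (often much smaller) needs matter for the placement decision.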
I'm more interested in your recommendations on enabling the user scheduler/placeholder pods with respect to the pangeo deployment. My understanding is that:
@jcrist your understanding matches mine. The placeholder pods run a pause container but have the same resource requests as real user pods. They have a lower pod priority, though, so they can be evicted to make room for real users; that is what lets them hold a real user pod's place. They are only evicted when required to make room for a real user, and when that happens they probably can't fit anywhere else, so they end up pending. Pending pods will trigger a scale-up event by a cluster-autoscaler, which adds nodes to the k8s cluster.
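A sketch of what such a placeholder pod looks like, expressed as a Python manifest dict. The image tag, priority class name, and resource requests here are illustrative assumptions, not the exact values the zero-to-jupyterhub chart renders:

```python
# Sketch of a user-placeholder pod (field values are illustrative).
placeholder_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "user-placeholder-0"},
    "spec": {
        # Runs nothing, but still reserves the capacity of one user pod:
        "containers": [{
            "name": "pause",
            "image": "registry.k8s.io/pause:3.9",
            "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
        }],
        # Lower priority than real user pods: the kube-scheduler evicts
        # placeholders first, the evicted placeholder goes Pending, and
        # the pending pod triggers the cluster-autoscaler to add a node.
        "priorityClassName": "user-placeholder",
    },
}
```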
The custom k8s scheduler does indeed pack pods, by running a now outdated kube-scheduler (k8s 1.13.x) binary with custom configuration.
I think a general practice that makes sense:
If the JupyterHub's custom scheduler is scheduling pods to run on a set of nodes, all pods scheduled on these nodes should be scheduled by JupyterHub's custom scheduler, or be of the kind that at least won't block a scale down - such as daemonset pods.
So, if the dask schedulers created by users should live on pangeo's user nodes, then I think they should be packed by JupyterHub's custom kube-scheduler, which packs pods tight; otherwise it will be the worst of both worlds: tight packing without smooth scale-down, because individual dask schedulers would end up blocking scale-downs while lots of JupyterHub users are packed on other nodes.
@jcrist and @consideRatio - I opened a similar issue related to dask worker scheduling on dask-kubernetes a while back which I think is informative https://github.com/dask/dask-kubernetes/issues/233
Make changes to inform the user what's happening in the background by exposing the node-pool / pod status.
@TomAugspurger I think this would be really great. At the very least, a pod-triggered scale-up message, such as when a jupyterhub session is starting, signals that things are happening and that the kernel doesn't need restarting.
I can see how to support this technically, but I'm not sure where we should expose this information to the user. When creating a cluster programmatically, or calling `scale`/`adapt`, how would you expect this information to be available?
I'd say simply printing a message to stdout would be sufficient for these two cases, per @TomAugspurger's comment above:
1) cluster = gateway.new_cluster()
--> dask scheduler requested, may take several minutes to become active as cluster scales machines...
2) cluster.scale(...)
--> dask workers requested, may take several minutes to become active as cluster scales machines...
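As a sketch of what that could look like on the client side: a small wrapper that prints the hint before running a slow cluster operation. `with_scale_hint` and its arguments are hypothetical names for illustration, not part of the dask-gateway API:

```python
# Hypothetical sketch of the proposed stdout feedback (not dask-gateway API).
def with_scale_hint(operation, resource):
    """Print a hint, then run a potentially slow cluster operation."""
    print(f"dask {resource} requested, may take several minutes "
          f"to become active as cluster scales machines...")
    return operation()

# e.g. wrapping the two calls above (commented out; needs a gateway):
# cluster = with_scale_hint(gateway.new_cluster, "scheduler")
# with_scale_hint(lambda: cluster.scale(4), "workers")
```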
Even snazzier would be directly exposing the kubernetes events, perhaps piping to a dropdown ipywidget so as to not clutter the notebook output (see https://github.com/dask/dask-kubernetes/pull/142):
(scheduler)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39s default-scheduler Successfully assigned prod/dask-gateway-928c4eeee28546cdae0cb920da20aa87 to ip-192-168-156-187.us-west-2.compute.internal
Normal Pulling 38s kubelet, ip-192-168-156-187.us-west-2.compute.internal Pulling image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
Normal Pulled 8s kubelet, ip-192-168-156-187.us-west-2.compute.internal Successfully pulled image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
Normal Created 2s kubelet, ip-192-168-156-187.us-west-2.compute.internal Created container dask-scheduler
Normal Started 2s kubelet, ip-192-168-156-187.us-west-2.compute.internal Started container dask-scheduler
(worker)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m26s default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate.
Normal TriggeredScaleUp 2m22s cluster-autoscaler pod triggered scale-up: [{eksctl-pangeo-binder-nodegroup-worker-spot-NodeGroup-P9W9ZIJ11UZ 0->1 (max: 10)}]
Warning FailedScheduling 23s (x3 over 73s) default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had taints that the pod didn't tolerate.
Normal Scheduled 21s default-scheduler Successfully assigned prod/dask-gateway-dkkwf to ip-192-168-139-158.us-west-2.compute.internal
Normal Pulling 19s kubelet, ip-192-168-139-158.us-west-2.compute.internal Pulling image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
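If the raw events were exposed to the client, a first step toward the dropdown-widget idea would be parsing each `kubectl describe`-style event line into a record the widget can render. This sketch only handles the simple case where the From column is a single token (e.g. `cluster-autoscaler` or `default-scheduler`); the two-token `kubelet, ip-...` form would need extra handling:

```python
# Sketch: parse one event line (like those above) into a dict.
def parse_event_line(line):
    # Split on whitespace at most 4 times; the remainder is the message.
    etype, reason, age, source, message = line.split(None, 4)
    return {"type": etype, "reason": reason, "age": age,
            "from": source, "message": message}

# The nodegroup name below is a made-up stand-in, not a real resource.
evt = parse_event_line(
    "Normal TriggeredScaleUp 2m22s cluster-autoscaler "
    "pod triggered scale-up: [{example-nodegroup 0->1 (max: 10)}]"
)
```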
We recently made a dedicated node pool for schedulers (in addition to our node pool for dask workers). This makes things a bit slower to get a Dask cluster when the pangeo cluster has been idle for a while. A user comes along and does two things: `gateway.new_cluster()` and `cluster.scale(...)` (cc @jcrist). This seems hard since it's backend-specific, and logs from the backend may contain sensitive information. This might also be better solved somewhere like the lab-extension.