pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

Poor UX when waiting for scheduler node pool to scale up #587

Open TomAugspurger opened 4 years ago

TomAugspurger commented 4 years ago

We recently made a dedicated node pool for schedulers (in addition to our node pool for dask workers).

This makes it a bit slower to get a Dask cluster when the pangeo cluster has been idle for a while. A user comes along and does

gateway = Gateway()
cluster = gateway.new_cluster()  # wait for scheduler node-pool to scale up, pod to be scheduled
cluster.scale(...)  # second wait for worker node-pool to scale up, pods to be scheduled

Two things:

  1. Can we "ping" the scheduler node pool when we start a user pod, triggering a scale-up for the pool? (One possible approach is sketched below.)
  2. Can we provide any feedback on what we're waiting on in the gateway.new_cluster() and cluster.scale(...)? (cc @jcrist). This seems hard since it's backend-specific, and logs from the backend may contain sensitive information. This might also be better solved somewhere like the lab-extension.
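
On point 1, one way to "ping" the pool would be to create a short-lived pod pinned to the scheduler node pool as soon as a user server starts, so the autoscaler begins adding a scheduler node before the user asks for a cluster. A minimal sketch with the Kubernetes Python client; the pool label, taint, namespace, and request sizes below are placeholder guesses, not the actual pangeo config:

from kubernetes import client, config

# Hypothetical "warm-up" pod: a pause container carrying scheduler-sized
# resource requests, pinned to the (assumed) dedicated scheduler pool so that
# it goes Pending there and triggers a cluster-autoscaler scale-up.
config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

warmup_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(generate_name="scheduler-pool-warmup-"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        active_deadline_seconds=300,  # let the pod expire after the warm-up window
        # Assumed label/taint on the dedicated scheduler node pool.
        node_selector={"k8s.dask.org/node-purpose": "scheduler"},
        tolerations=[client.V1Toleration(
            key="k8s.dask.org/dedicated",
            operator="Equal",
            value="scheduler",
            effect="NoSchedule",
        )],
        containers=[client.V1Container(
            name="pause",
            image="k8s.gcr.io/pause:3.1",
            resources=client.V1ResourceRequirements(
                requests={"cpu": "1", "memory": "2Gi"},  # roughly scheduler-sized
            ),
        )],
    ),
)
v1.create_namespaced_pod(namespace="prod", body=warmup_pod)
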
jcrist commented 4 years ago

Can we provide any feedback on what we're waiting on in the gateway.new_cluster() and cluster.scale(...)?

This is an open question. This could definitely be handled, but I'm not sure where/how to display this information to the user (if you have thoughts, I'd be interested to hear them). Presumably other Cluster implementations have the same issue, so this isn't specific to dask-gateway.


Stepping back a bit here, we'd like:

Given the above requirements (which I may have assumed incorrectly), perhaps dask schedulers and notebooks could share the notebook nodes? To aid in packing and speedy scale-up, we might also enable some of the JupyterHub scheduling/placeholder options described here: https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html.

cc @consideRatio, who likely has some thoughts on this.

TomAugspurger commented 4 years ago

Thanks. Your summary is accurate, so I think we can approach it on two fronts:

  1. Make changes to get a user from cold-start to running Dask cluster as quickly as possible.
  2. Make changes to inform the user what's happening in the background by exposing the node-pool / pod status.

We discussed putting the schedulers in the same node-pool as the jupyter notebooks. @jhamman raised some issues at https://github.com/pangeo-data/pangeo-cloud-federation/pull/567#issuecomment-601784900. In the end we decided to go with a separate node pool (but that could be revisited).

consideRatio commented 4 years ago

@jcrist I set up a dask-gateway deployment and opted to have the dask schedulers run alongside the user pods. Related config:

gateway:
  backend:
    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig:
        tolerations:
          - key: "hub.jupyter.org_dedicated"
            operator: "Equal"
            value: "user"
            effect: "NoSchedule"
        nodeSelector:
          hub.jupyter.org/node-purpose: user
        schedulerName: myjhubhelmreleasename-user-scheduler

I think a key question about putting the dask schedulers on the same nodes as the users or on separate nodes comes down to the resource requests for users and schedulers respectively. If you can fit exactly 4 users per node and a scheduler requests a tenth of what a user does, so that suddenly only three users fit, that's not great. I don't have much experience with how large the scheduler's resource requests need to be in general, so I find it hard to say what makes the most sense here.

jcrist commented 4 years ago

A safe default is 1 core and 2 GiB memory for the scheduler, but they can get by with far less.
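
If the schedulers do end up sharing user nodes, those requests could be dialed down so a scheduler fits into the per-node headroom rather than displacing a user. A rough sketch in the gateway's Python config, using the KubeClusterConfig traits that the chart comments above refer to (values are illustrative, and the exact trait names should be checked against the dask-gateway version in use):

# Shrink the scheduler's requests so it packs into leftover room on a user
# node, while keeping higher limits so it can burst when needed.
c.KubeClusterConfig.scheduler_cores = 0.2
c.KubeClusterConfig.scheduler_cores_limit = 1
c.KubeClusterConfig.scheduler_memory = "512 M"
c.KubeClusterConfig.scheduler_memory_limit = "2 G"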

I'm more interested in your recommendations on enabling the user scheduler/placeholder pods with respect to the pangeo deployment. My understanding is that:

consideRatio commented 4 years ago

@jcrist your understanding matches mine. The placeholder pods run a pause container but have the same resource requests as real user pods. They have a lower pod priority, though, so they can be evicted to make room for real users, and that is what lets them hold a real user pod's place. They are only evicted when it's required to make room for a real user, and when that happens they typically can't fit anywhere else, so they end up pending. Pending pods are what trigger a scale-up event from the cluster-autoscaler, which adds nodes to the k8s cluster.
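
For concreteness, a sketch of that placeholder pattern with the Kubernetes Python client (z2jh already ships this machinery for user pods; the priority value, names, namespace, and request sizes here are illustrative only):

from kubernetes import client, config

config.load_kube_config()

# 1. A priority class below the default of 0, so placeholder pods are the
#    first thing evicted when a real pod needs the room.
client.SchedulingV1Api().create_priority_class(client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="placeholder"),
    value=-10,
    global_default=False,
    description="evict-me-first placeholder pods",
))

# 2. A Deployment of pause containers requesting as much as a real pod would;
#    once evicted they sit Pending, and Pending pods are what make the
#    cluster-autoscaler add a node.
client.AppsV1Api().create_namespaced_deployment(
    namespace="prod",
    body=client.V1Deployment(
        metadata=client.V1ObjectMeta(name="user-placeholder"),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels={"app": "user-placeholder"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "user-placeholder"}),
                spec=client.V1PodSpec(
                    priority_class_name="placeholder",
                    containers=[client.V1Container(
                        name="pause",
                        image="k8s.gcr.io/pause:3.1",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "8Gi"},  # sized like a real user pod
                        ),
                    )],
                ),
            ),
        ),
    ),
)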

The custom k8s scheduler does indeed pack pods; it runs a now-outdated kube-scheduler binary (k8s 1.13.x) with custom configuration.

I think a general practice that makes sense is:

If the JupyterHub's custom scheduler is scheduling pods to run on a set of nodes, all pods scheduled on these nodes should be scheduled by JupyterHub's custom scheduler, or be of the kind that at least won't block a scale down - such as daemonset pods.

So, if the dask schedulers created by users are to live on pangeo's user nodes, then I think they should be placed by JupyterHub's custom kube-scheduler that packs pods tight; otherwise we get the worst of both worlds: tight packing without smooth scale-down, because individual dask schedulers would end up blocking scale-downs while lots of JupyterHub users are packed onto other nodes.

scottyhq commented 4 years ago

@jcrist and @consideRatio - I opened a similar issue about dask worker scheduling on dask-kubernetes a while back, which I think is informative: https://github.com/dask/dask-kubernetes/issues/233

Make changes to inform the user what's happening in the background by exposing the node-pool / pod status.

@TomAugspurger I think this would be really great. At the very least, a "pod triggered scale-up" message like the one shown when a jupyterhub session is starting would signal that things are happening and that the kernel doesn't need restarting.

jcrist commented 4 years ago

I think this would be really great. At the very least, a "pod triggered scale-up" message like the one shown when a jupyterhub session is starting would signal that things are happening and that the kernel doesn't need restarting.

I can see how to support this technically, but I'm not sure where we should expose this information to the user. When creating a cluster programmatically, or calling scale/adapt, how would you expect this information to be available?

scottyhq commented 4 years ago

I'd say simply printing a message to stdout would be sufficient for these two cases, per @TomAugspurger's comment above:

1) cluster = gateway.new_cluster() --> dask scheduler requested, may take several minutes to become active as the cluster scales up machines...

2) cluster.scale(...) --> dask workers requested, may take several minutes to become active as the cluster scales up machines...
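
A minimal sketch of that idea, as a thin wrapper around the gateway client that prints the hint before the blocking calls (the wrapper functions are hypothetical, not part of dask-gateway's API; the same messages could of course be printed by the client itself if support lands upstream):

from dask_gateway import Gateway

def new_cluster_with_hint(gateway, **options):
    # Print a hint before the blocking call so the user knows why it hangs.
    print("dask scheduler requested, may take several minutes to become "
          "active as the cluster scales up machines...")
    return gateway.new_cluster(**options)

def scale_with_hint(cluster, n):
    print(f"{n} dask workers requested, may take several minutes to become "
          "active as the cluster scales up machines...")
    cluster.scale(n)

gateway = Gateway()
cluster = new_cluster_with_hint(gateway)
scale_with_hint(cluster, 20)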

Even snazzier would be directly exposing the kubernetes events, perhaps piping to a dropdown ipywidget so as to not clutter the notebook output (see https://github.com/dask/dask-kubernetes/pull/142):

(scheduler)
Events:
  Type    Reason     Age   From                                                    Message
  ----    ------     ----  ----                                                    -------
  Normal  Scheduled  39s   default-scheduler                                       Successfully assigned prod/dask-gateway-928c4eeee28546cdae0cb920da20aa87 to ip-192-168-156-187.us-west-2.compute.internal
  Normal  Pulling    38s   kubelet, ip-192-168-156-187.us-west-2.compute.internal  Pulling image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
  Normal  Pulled     8s    kubelet, ip-192-168-156-187.us-west-2.compute.internal  Successfully pulled image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
  Normal  Created    2s    kubelet, ip-192-168-156-187.us-west-2.compute.internal  Created container dask-scheduler
  Normal  Started    2s    kubelet, ip-192-168-156-187.us-west-2.compute.internal  Started container dask-scheduler

(worker)
Events:
  Type     Reason            Age                From                                                    Message
  ----     ------            ----               ----                                                    -------
  Warning  FailedScheduling  2m26s              default-scheduler                                       0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate.
  Normal   TriggeredScaleUp  2m22s              cluster-autoscaler                                      pod triggered scale-up: [{eksctl-pangeo-binder-nodegroup-worker-spot-NodeGroup-P9W9ZIJ11UZ 0->1 (max: 10)}]
  Warning  FailedScheduling  23s (x3 over 73s)  default-scheduler                                       0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled         21s                default-scheduler                                       Successfully assigned prod/dask-gateway-dkkwf to ip-192-168-139-158.us-west-2.compute.internal
  Normal   Pulling           19s                kubelet, ip-192-168-139-158.us-west-2.compute.internal  Pulling image "pangeoaccess/binder-scottyhq-2dpangeodev-2dbinder-a75d9b:f40cf3ba877e0c396a72a262df0e209d95002987"
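
For reference, a rough sketch of that dropdown-widget idea: fetch the kubernetes events for a pod with the Kubernetes Python client and tuck them into a collapsed ipywidgets Accordion (how the scheduler pod's name and namespace reach the client is assumed here and would need support from the gateway backend):

import ipywidgets as widgets
from IPython.display import display
from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def show_pod_events(namespace, pod_name):
    # List events whose involved object is the given pod.
    events = v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name}",
    )
    lines = [
        f"{e.type or '':8} {e.reason or '':18} {e.message or ''}"
        for e in events.items
    ]
    text = widgets.Textarea(
        value="\n".join(lines) or "No events yet",
        layout=widgets.Layout(width="100%", height="150px"),
    )
    acc = widgets.Accordion(children=[text])
    acc.set_title(0, f"Kubernetes events for {pod_name}")
    acc.selected_index = None  # start collapsed so the notebook stays uncluttered
    display(acc)

# e.g. show_pod_events("prod", "dask-gateway-928c4eeee28546cdae0cb920da20aa87")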