pangeo-data / pangeo-stacks

Curated Docker images for use with Jupyter and Pangeo
https://pangeo-data.github.io/pangeo-stacks/
BSD 3-Clause "New" or "Revised" License

dask workers can be scheduled on hub nodes with default config #59

Open scottyhq opened 5 years ago

scottyhq commented 5 years ago

Our current setup allows for dask pods on hub nodes: https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/dask_config.yaml

This seems to be because dask-kubernetes uses 'prefer' rather than 'require' when setting the worker node affinity: https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

which results in the following affinity on worker pods:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker
        weight: 100

Not sure how we'd modify the config file to get the stricter 'require' condition, like we have for notebook pods:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

@jhamman, @TomAugspurger

jhamman commented 5 years ago

If you want to keep non-core pods off your core (hub) pool, you need to add a taint that only core pods can tolerate. I tend to just size the core pool to the smallest size that fits the hub pods; if there's no spare capacity, nothing else will try to schedule there. You can also tighten the node-purpose scheduling requirements for dask pods, but in my experience this is unnecessary.

For posterity, I should also link to this blog post that describes all of this in more detail: https://medium.com/pangeo/pangeo-cloud-cluster-design-9d58a1bf1ad3
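For concreteness, here is a minimal sketch of that pattern. The taint key follows the hub.jupyter.org/dedicated=core convention from the zero-to-jupyterhub docs, which I believe the chart's core pods already tolerate (double-check against your chart version); node and pod names are illustrative.

# Taint as it would appear on a core-pool node; in practice it is usually
# applied to the node pool itself (via kubectl taint or the cloud provider's
# node-pool settings).
apiVersion: v1
kind: Node
metadata:
  name: core-node-example              # illustrative node name
spec:
  taints:
  - key: hub.jupyter.org/dedicated     # example key following the z2jh convention
    value: core
    effect: NoSchedule
---
# Matching toleration a core pod needs in order to schedule on those nodes.
# Pods without it (e.g. dask workers) are kept off the core pool.
apiVersion: v1
kind: Pod
metadata:
  name: core-pod-example               # illustrative pod name
spec:
  tolerations:
  - key: hub.jupyter.org/dedicated
    operator: Equal
    value: core
    effect: NoSchedule
  containers:
  - name: main
    image: k8s.gcr.io/pause:3.1        # placeholder image for the sketch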

scottyhq commented 5 years ago

@jhamman - I'm thinking we might want the core pool to autoscale eventually if we try to consolidate multiple hubs on a single EKS cluster. If we add a taint to the core pool, it seems like pods in the kube-system namespace (for example aws-node, tiller-deploy, cluster-autoscaler) might have trouble scheduling.

Another approach is to expose match_node_purpose="require" in https://github.com/dask/dask-kubernetes/blob/ec4666a4af5acad03c24b84aca4fcf8ccd791b4f/dask_kubernetes/objects.py#L177

TomAugspurger commented 5 years ago

@jhamman is there a downside to the hard affinity (at least optionally)? It couldn't be the default, but it seems useful as an option.

TomAugspurger commented 5 years ago

FYI, rather than exposing it as a config option / parameter in KubeCluster, we could just document how to achieve it with a custom worker pod template:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '2', --no-bokeh, --memory-limit, 6GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "2"
        memory: 6G
      requests:
        cpu: "2"
        memory: 6G
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k8s.dask.org/node-purpose
            operator: In
            values:
            - worker

On master, that'll result in both the preferred and required affinity types being applied.

>>> a.pod_template.spec.affinity.node_affinity
{'preferred_during_scheduling_ignored_during_execution': [{'preference': {'match_expressions': [{'key': 'k8s.dask.org/node-purpose',
                                                                                                 'operator': 'In',
                                                                                                 'values': ['worker']}],
                                                                          'match_fields': None},
                                                           'weight': 100}],
 'required_during_scheduling_ignored_during_execution': {'node_selector_terms': [{'match_expressions': None,
                                                                                  'match_fields': None}]}}

I'm not sure how Kubernetes will handle that (presumably it's fine, just not the cleanest). Right now my preference would be to add a config option / argument to KubeCluster that's passed through to clean_pod_template, but I may be missing some context.
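(For anyone wanting to try the template above now: one way to wire it in is via dask's config, assuming the kubernetes.worker-template-path key — please double-check the key name against your dask-kubernetes version; the file paths below are just examples.)

# e.g. in ~/.config/dask/kubernetes.yaml, or the image's dask_config.yaml
kubernetes:
  worker-template-path: /path/to/worker-template.yaml   # the Pod spec above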

jhamman commented 5 years ago

@jhamman is there a downside to the hard affinity (at least optionally)?

Not really. I think this is a fine approach. Of course, there is no way to enforce that users follow this pattern, so dask workers may still end up in your core pool.

jhamman commented 5 years ago

In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

scottyhq commented 5 years ago

In thinking about this a little more, it may be easier for some to simply add a taint to the core pool that the hub and ingress pods can tolerate.

@jhamman are you doing this now on the google clusters?

jhamman commented 5 years ago

No. Not yet, but we could.

bgroenks96 commented 4 years ago

If you don't feel like modifying all of the JupyterHub services' configurations to include the toleration, this can also be accomplished by 1) adding a taint to the worker pools to keep core services from scheduling there, with corresponding tolerations added to the worker pods, and 2) adding a node selector to the worker pods with corresponding labels on the worker nodes. This pretty much guarantees that everything ends up on the right nodes without having to taint/tolerate the core services. A sketch of what that could look like is below.
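A minimal sketch of the worker-pod side of this (the k8s.dask.org/dedicated=worker taint key is illustrative, following the same k8s.dask.org convention as the node-purpose label already used in this thread; swap in whatever your node pools actually use):

# Worker pod template fragment: tolerate the worker-pool taint and
# pin workers onto nodes labelled as worker nodes.
kind: Pod
spec:
  tolerations:
  - key: k8s.dask.org/dedicated          # taint added to the worker pool (illustrative)
    operator: Equal
    value: worker
    effect: NoSchedule
  nodeSelector:
    k8s.dask.org/node-purpose: worker    # label applied to the worker nodes
  containers:
  - name: dask
    image: daskdev/dask:latest
    args: [dask-worker, --nthreads, '2', --memory-limit, 6GB, --death-timeout, '60']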