pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

dask worker pods and nodes not removed by autoscaler #408

Closed: scottyhq closed this issue 4 years ago

scottyhq commented 5 years ago

dask-worker nodes were left running on the AWS clusters even after the user-notebook pods and nodes shut down. Here are some relevant kubectl outputs (one dask worker node has been running for 9 days!):

kubectl get pods --all-namespaces -o wide

esip-prod dask-rsignell-usgs-0b2bf9f0-1kc48v 0/1 ContainerCreating 0 9d ip-192-168-38-94.us-west-2.compute.internal

kubectl describe pod dask-rsignell-usgs-0b2bf9f0-1kc48v -n esip-prod

State: Waiting Reason: ContainerCreating

kubectl get nodes -o wide

ip-192-168-38-94.us-west-2.compute.internal Ready worker 9d v1.13.7-eks-c57ff8 192.168.38.94 34.223.66.115 Amazon Linux 2 4.14.128-112.105.amzn2.x86_64 docker://18.6.1

kubectl logs cluster-autoscaler-6575548656-74955 | grep dask

I0904 23:59:46.094549 1 cluster.go:104] Fast evaluation: node ip-192-168-38-94.us-west-2.compute.internal cannot be removed: esip-prod/dask-rsignell-usgs-0b2bf9f0-1kc48v is not replicated

and finally,

import dask
import dask.config
dask.config.config
{'distributed': {'logging': {'bokeh': 'critical'},
  'dashboard': {'link': '/user/{JUPYTERHUB_USER}/proxy/{port}/status'},
  'admin': {'tick': {'limit': '5s'}}},
 'kubernetes': {'name': 'dask-{JUPYTERHUB_USER}-{uuid}',
  'worker-template': {'spec': {'restartPolicy': 'Never',
    'containers': [{'name': 'dask-${JUPYTERHUB_USER}',
      'image': '${JUPYTER_IMAGE_SPEC}',
      'args': ['dask-worker',
       '--nthreads',
       '2',
       '--no-dashboard',
       '--memory-limit',
       '7GB',
       '--death-timeout',
       '60'],
      'resources': {'limits': {'cpu': '1.75', 'memory': '7G'},
       'requests': {'cpu': 1, 'memory': '7G'}}}]}}},
 'labextension': {'factory': {'module': 'dask_kubernetes',
   'class': 'KubeCluster',
   'args': [],
   'kwargs': {}}},
 'temporary-directory': None,
 'array': {'svg': {'size': 120}}}

Originally posted by @scottyhq in https://github.com/pangeo-data/pangeo/issues/712#issuecomment-528165345

scottyhq commented 5 years ago

Moving a comment from the other issue tracker:

Okay, I can reproduce the dask pods and nodes in limbo by logging onto esip.pangeo.io and launching a dask cluster. If no worker nodes are available, scale-up seems to take too long and there are network errors (scroll to the bottom of the output below):

kubectl describe pod dask-scottyhq-527e3b25-ag48fq -n esip-prod

Name:         dask-scottyhq-527e3b25-ag48fq
Namespace:    esip-prod
Priority:     0
Node:         ip-192-168-78-224.us-west-2.compute.internal/192.168.78.224
Start Time:   Wed, 04 Sep 2019 19:55:16 -0700
Labels:       app=dask
              component=dask-worker
              dask.org/cluster-name=dask-scottyhq-527e3b25-a
              user=jovyan
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
Containers:
  dask-scottyhq:
    Container ID:  
    Image:         783380859522.dkr.ecr.us-west-2.amazonaws.com/pangeo-esip:e6ec0b9
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      dask-worker
      --nthreads
      2
      --no-dashboard
      --memory-limit
      7GB
      --death-timeout
      60
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1750m
      memory:  7G
    Requests:
      cpu:     1
      memory:  7G
    Environment:
      DASK_SCHEDULER_ADDRESS:  tcp://192.168.4.168:35189
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l8cgz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-l8cgz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l8cgz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     k8s.dask.org/dedicated=worker:NoSchedule
                 k8s.dask.org_dedicated=worker:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                                                   Message
  ----     ------                  ----                   ----                                                   -------
  Warning  FailedScheduling        7m42s (x2 over 7m42s)  default-scheduler                                      0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
  Normal   TriggeredScaleUp        7m32s                  cluster-autoscaler                                     pod triggered scale-up: [{eksctl-pangeo-esip-nodegroup-dask-worker-v2-NodeGroup-UKEXL45VXVA7 0->1 (max: 100)}]
  Warning  FailedScheduling        6m12s (x3 over 6m14s)  default-scheduler                                      0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate, 2 Insufficient memory.
  Normal   Scheduled               6m4s                   default-scheduler                                      Successfully assigned esip-prod/dask-scottyhq-527e3b25-ag48fq to ip-192-168-78-224.us-west-2.compute.internal
  Warning  FailedCreatePodSandBox  6m2s                   kubelet, ip-192-168-78-224.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to set up pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to teardown pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

Then it becomes a Kubernetes autoscaler problem (see my first comment), where "the node cannot be removed because the pod is not replicated". kubernetes/autoscaler#351 seems relevant.
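For reference, a rough, untested sketch of the workaround discussed in that issue: annotating dask worker pods as safe to evict, so the autoscaler may drain their nodes even though the pods are not managed by a replicating controller. The annotation key is the standard cluster-autoscaler one; using dask.config here is only to show where it would sit in the worker-template from the first comment, not what this deployment actually does.

import dask

# Illustrative only: mark dask worker pods as evictable by the cluster-autoscaler.
# In a real deployment this would live in the dask config that produces the
# kubernetes.worker-template shown above.
dask.config.set({
    "kubernetes.worker-template.metadata.annotations": {
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
    }
})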

scottyhq commented 5 years ago

I looked into this a bit more today since it came up again. The issue seems to be related to aws-cni bugs that should be fixed by upgrading the Kubernetes version on the nodegroups (>1.13.8): https://github.com/aws/amazon-vpc-cni-k8s/issues/282#issuecomment-518518687

Confirmed that re-creating the nodegroups installed version 1.13.10, and we no longer have lingering dask nodes:

kubectl get node

NAME                                           STATUS                     ROLES    AGE     VERSION
ip-192-168-1-96.us-west-2.compute.internal     Ready                      user     11m     v1.13.10-eks-d6460e
ip-192-168-10-159.us-west-2.compute.internal   Ready                      worker   3m55s   v1.13.10-eks-d6460e

cc @rsignell-usgs

scottyhq commented 4 years ago

This has come up again: we have 5 worker nodes still running more than 24 hours after a user logged out:

ip-X.us-west-2.compute.internal    Ready    worker   27h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    core     17d   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    user     15m   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5

pods:

NAMESPACE         NAME                                  READY   STATUS    RESTARTS   AGE
icesat2-prod      autohttps-59654c74c4-mnf9k            2/2     Running   0          14d
icesat2-prod      dask-rsignell-usgs-64057291-e7kzhs    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-e7qc7j    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ehqqgt    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ergjkp    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-evnqc7    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-exzv8p    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ezv8lw    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ezztnn    1/1     Running   0          26h
icesat2-prod      hub-dc99c9488-2czd8                   1/1     Running   0          14d
icesat2-prod      jupyter-robfatland                    1/1     Running   0          20m
icesat2-prod      proxy-8758bbc4c-g8lbj                 1/1     Running   0          14d
icesat2-prod      user-scheduler-9545b9788-fmgs8        1/1     Running   0          14d
icesat2-prod      user-scheduler-9545b9788-hp4kb        1/1     Running   0          14d

The autoscaler won't remove these nodes, according to these log messages:

I1116 17:36:40.750318 1 cluster.go:90] Fast evaluation: ip-X.us-west-2.compute.internal for removal
I1116 17:36:40.750331 1 cluster.go:104] Fast evaluation: node ip-X.us-west-2.compute.internal cannot be removed: icesat2-prod/dask-rsignell-usgs-64057291-ezv8lw is not replicated

I suspect this is a dask issue, though, since these pods probably should not still be listed as Running. Looking at the logs of a dask pod, I see errors such as:

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fe21e4b8748>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:860> exception=OSError("Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 869, in heartbeat
    metrics=await self.get_metrics(),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 745, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 879, in connect
    connection_args=self.connection_args,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time

Pinging @jacobtomlinson @TomAugspurger and @jhamman for help sorting this out!

The workaround for now is to delete the pods; the autoscaler then removes the nodes a few minutes later:

kubectl get pods -n icesat2-prod --no-headers=true | cut -d ' ' -f1 | grep dask --color=never | xargs kubectl delete pod -n icesat2-prod

TomAugspurger commented 4 years ago

I suspect this is a dask issue though, since these pods probably should not still be listed as Running

Do you think the Python process is just hanging there, and hasn't actually exited?

jacobtomlinson commented 4 years ago

That would be my guess. The workers should time out, but if they've hung they may not be able to run the timeout code.

scottyhq commented 4 years ago

Thanks @TomAugspurger and @jacobtomlinson. Is there a bit of logic we could add (in distributed? dask-kubernetes?) so that repeated occurrences of OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time kill the process / transition the pod from Running to Error or Completed?

TomAugspurger commented 4 years ago

Unfortunately, I'm not sure. It's possible that the event loop is blocked, so the thread that should kill the process isn't able to run? Again, just speculation. If you have any context on what was going on when the worker timed out trying to reach the scheduler, that would be great, but I'm guessing you don't have access to that info.

cc @mrocklin if you have any guesses.

A potential solution is to include --lifetime="1 hour" in the dask-worker command. This would be a bit of a hammer, though: all workers would be shut down after an hour, regardless of whether they've been active recently, and replaced with fresh workers. If the scheduler really has gone away, presumably the replacement worker would fail to connect and the pod would be cleaned up appropriately by Kubernetes. See https://docs.dask.org/en/latest/setup/cli.html#cmdoption-dask-worker-lifetime
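For concreteness, a minimal sketch of what that could look like applied to the worker-template from the first comment. The one-hour value (and the optional --lifetime-stagger, which spreads out worker retirement) are illustrative, not a recommendation for these deployments:

import dask

# Illustrative only: the same worker args as the config dump above, plus a
# lifetime so each worker retires itself even if it can no longer reach the
# scheduler.
dask.config.set({
    "kubernetes.worker-template.spec.containers": [{
        "name": "dask-${JUPYTERHUB_USER}",
        "image": "${JUPYTER_IMAGE_SPEC}",
        "args": [
            "dask-worker",
            "--nthreads", "2",
            "--no-dashboard",
            "--memory-limit", "7GB",
            "--death-timeout", "60",
            "--lifetime", "1 hour",            # retire the worker after an hour
            "--lifetime-stagger", "5 minutes", # avoid all workers retiring at once
        ],
        "resources": {
            "limits": {"cpu": "1.75", "memory": "7G"},
            "requests": {"cpu": 1, "memory": "7G"},
        },
    }]
})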

tjcrone commented 4 years ago

I am having a similar problem on the Azure Pangeo. I noticed that many worker pods were hanging around for a long time after users logged off. To test this, I logged in and created a Dask cluster, then logged off immediately. The worker pods have been up all night:

(base) tjc@lulu:~$ kubectl get pods -o wide --all-namespaces --sort-by="{.spec.nodeName}" | grep ooi-prod | grep dask-tj
ooi-prod      dask-tjcrone-c82d1516-0smzlk           1/1     Running   0          8h      10.0.130.51    aks-nodepool1-41768535-vmss00001h   <none>           <none>
ooi-prod      dask-tjcrone-c82d1516-0z8hp8           1/1     Running   0          8h      10.0.128.75    aks-nodepool1-41768535-vmss000001   <none>           <none>

The pods are throwing errors when trying to connect to the scheduler, but I think this is what they should be doing:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.0.128.75:36773'
distributed.diskutils - INFO - Found stale lock file and directory '/home/jovyan/worker-plklaxl4', purging
distributed.worker - INFO -       Start worker at:    tcp://10.0.128.75:45553
distributed.worker - INFO -          Listening to:    tcp://10.0.128.75:45553
distributed.worker - INFO - Waiting to connect to:   tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    1.25 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-4vez2fnb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa7247260f0>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused")>)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
    metrics=await self.get_metrics(),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
    connection_args=self.connection_args,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused

Or maybe this isn't supposed to be happening? @mrocklin, any ideas here? Thank you!

tjcrone commented 4 years ago

The OOI configuration is here: https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/ooi

tjcrone commented 4 years ago

It looks like this issue has been fixed here: https://github.com/dask/distributed/issues/2880. We will need to move to a dev version of distributed, I think.

TomAugspurger commented 4 years ago

Ideally https://github.com/dask/distributed/pull/3250 fixed this, but if not LMK.

I see you've already deployed with distributed dev, but I believe that the fix was included in distributed 2.8.1 released on Nov 22.

scottyhq commented 4 years ago

@tjcrone I'd recommend changing https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ooi/image/binder/Dockerfile#L2 to the 2019.11.25 tag, which includes dask and distributed 2.8.1 (https://github.com/pangeo-data/pangeo-stacks).

tjcrone commented 4 years ago

Ah! Great suggestion, @scottyhq. Thank you. This is one of the updates I often forget to do when I move our image forward. I wonder if I can change it to LATEST or something like that so that I don't forget this in the future.

tjcrone commented 4 years ago

@TomAugspurger, it looks like dask/distributed#3250 did fix this issue! Thanks. One thing I noticed is that after a worker loses contact with the scheduler, the pod goes into a "Completed" state rather than terminating as it would if the dask cluster were explicitly shut down. A pod in the Completed state still seems to retain an IP address, and it is not clear whether the kubernetes cluster will evict that pod and scale down. Any thoughts on whether it would be better to explicitly delete worker pods when they lose contact with the scheduler?

scottyhq commented 4 years ago

Just want to link this discussion regarding @tjcrone's point: @jacobtomlinson has floated the idea of refactoring dask workers as Jobs instead of Pods in https://github.com/dask/dask-kubernetes/issues/53#issuecomment-512030629

Also, I want to note that we are still faced with binderhub-run environments using old dask versions whose pods stick around: https://github.com/NCAR/cesm-lens-aws/pull/27

binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08f9zd                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08tvtq                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0f9vrf                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0fcpdn                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0splxm                1/1     Running   0          4d14h

So I'm wondering how to enforce removal of long-running pods within the pangeo helm chart configuration (https://github.com/pangeo-data/pangeo-cloud-federation/pull/477). Perhaps this is yet another reason to incorporate dask-gateway? Thoughts, @jhamman?

jhamman commented 4 years ago

@scottyhq, a few things you may consider doing:

  1. Purge all old images in your container registry. This will force rebuilds, even for repos that don't have new changes.
  2. Help us get dask-gateway working on binder (WIP PR here: https://github.com/pangeo-data/pangeo-binder/pull/87). There will be an interim period where we'll have to let both work, and a PR campaign to educate people on the new way of creating and managing dask clusters (a rough sketch of that workflow follows this list).
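For what it's worth, a rough sketch of what the "new way" could look like from a user's notebook once dask-gateway is in place, assuming the hub provides the gateway address and defaults in its dask config; the cluster options below are illustrative only:

from dask_gateway import Gateway

# Connect using the gateway address/auth supplied by the hub's dask config.
gateway = Gateway()

# The gateway creates and tears down worker pods server-side, so idle or
# orphaned clusters can be reaped centrally instead of lingering on nodes.
cluster = gateway.new_cluster()
cluster.adapt(minimum=0, maximum=10)  # scale between 0 and 10 workers

client = cluster.get_client()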
scottyhq commented 4 years ago

Closing: as of https://github.com/pangeo-data/pangeo-cloud-federation/pull/577 we are now using dask-gateway to manage dask pods.