Moving a comment from the other issue tracker:
Okay, I can reproduce the dask pods and nodes stuck in limbo by logging onto esip.pangeo.io and launching a dask cluster. If no worker nodes are available, scale-up seems to take too long and there are network errors (scroll to the bottom of the output below):
kubectl describe pod dask-scottyhq-527e3b25-ag48fq -n esip-prod
Name: dask-scottyhq-527e3b25-ag48fq
Namespace: esip-prod
Priority: 0
Node: ip-192-168-78-224.us-west-2.compute.internal/192.168.78.224
Start Time: Wed, 04 Sep 2019 19:55:16 -0700
Labels: app=dask
component=dask-worker
dask.org/cluster-name=dask-scottyhq-527e3b25-a
user=jovyan
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Containers:
dask-scottyhq:
Container ID:
Image: 783380859522.dkr.ecr.us-west-2.amazonaws.com/pangeo-esip:e6ec0b9
Image ID:
Port: <none>
Host Port: <none>
Args:
dask-worker
--nthreads
2
--no-dashboard
--memory-limit
7GB
--death-timeout
60
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 1750m
memory: 7G
Requests:
cpu: 1
memory: 7G
Environment:
DASK_SCHEDULER_ADDRESS: tcp://192.168.4.168:35189
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-l8cgz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-l8cgz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-l8cgz
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: k8s.dask.org/dedicated=worker:NoSchedule
k8s.dask.org_dedicated=worker:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7m42s (x2 over 7m42s) default-scheduler 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal TriggeredScaleUp 7m32s cluster-autoscaler pod triggered scale-up: [{eksctl-pangeo-esip-nodegroup-dask-worker-v2-NodeGroup-UKEXL45VXVA7 0->1 (max: 100)}]
Warning FailedScheduling 6m12s (x3 over 6m14s) default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate, 2 Insufficient memory.
Normal Scheduled 6m4s default-scheduler Successfully assigned esip-prod/dask-scottyhq-527e3b25-ag48fq to ip-192-168-78-224.us-west-2.compute.internal
Warning FailedCreatePodSandBox 6m2s kubelet, ip-192-168-78-224.us-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to set up pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to teardown pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Then it becomes a kubernetes autoscaler problem (see my first comment) where "the node cannot be removed because the pod is not replicated". This seems relevant: kubernetes/autoscaler#351
I looked into this a bit more today since it came up again, and the issue seems to be related to aws-cni bugs that should be fixed by upgrading the nodegroup versions (kubernetes >1.13.8): https://github.com/aws/amazon-vpc-cni-k8s/issues/282#issuecomment-518518687
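For reference, a quick way to check which aws-vpc-cni version a cluster is actually running (just a sketch, assuming the default EKS setup where the CNI ships as the aws-node daemonset in kube-system):
# Show the amazon-k8s-cni image, and therefore the plugin version, deployed on the cluster
kubectl describe daemonset aws-node -n kube-system | grep Image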
Confirmed that re-creating the nodegroups installed version 1.13.10, and we no longer have lingering dask nodes:
kubectl get node
NAME STATUS ROLES AGE VERSION
ip-192-168-1-96.us-west-2.compute.internal Ready user 11m v1.13.10-eks-d6460e
ip-192-168-10-159.us-west-2.compute.internal Ready worker 3m55s v1.13.10-eks-d6460e
cc @rsignell-usgs
This has come up again: we have 5 worker nodes that are still running 24+ hours after the user logged out:
ip-X.us-west-2.compute.internal Ready worker 27h v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready worker 26h v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready core 17d v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready worker 26h v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready user 15m v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready worker 26h v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal Ready worker 26h v1.14.7-eks-1861c5
pods:
NAMESPACE NAME READY STATUS RESTARTS AGE
icesat2-prod autohttps-59654c74c4-mnf9k 2/2 Running 0 14d
icesat2-prod dask-rsignell-usgs-64057291-e7kzhs 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-e7qc7j 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-ehqqgt 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-ergjkp 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-evnqc7 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-exzv8p 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-ezv8lw 1/1 Running 0 26h
icesat2-prod dask-rsignell-usgs-64057291-ezztnn 1/1 Running 0 26h
icesat2-prod hub-dc99c9488-2czd8 1/1 Running 0 14d
icesat2-prod jupyter-robfatland 1/1 Running 0 20m
icesat2-prod proxy-8758bbc4c-g8lbj 1/1 Running 0 14d
icesat2-prod user-scheduler-9545b9788-fmgs8 1/1 Running 0 14d
icesat2-prod user-scheduler-9545b9788-hp4kb 1/1 Running 0 14d
The autoscaler won't remove these nodes, according to this log message:
I1116 17:36:40.750318 1 cluster.go:90] Fast evaluation: ip-X.us-west-2.compute.internal for removal
I1116 17:36:40.750331 1 cluster.go:104] Fast evaluation: node ip-X.us-west-2.compute.internal cannot be removed: icesat2-prod/dask-rsignell-usgs-64057291-ezv8lw is not replicated
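For context, the cluster-autoscaler will not evict pods that aren't managed by a controller unless they carry the safe-to-evict annotation, so one possible stopgap (just a sketch using the worker labels from the describe output above, not something tested here) would be to annotate the stranded workers so the autoscaler is allowed to drain their nodes:
# Mark dask worker pods as evictable so the autoscaler can remove their nodes
kubectl annotate pods -n icesat2-prod -l app=dask,component=dask-worker \
    "cluster-autoscaler.kubernetes.io/safe-to-evict=true"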
I suspect this is a dask issue though, since these pods probably should not still be listed as Running. Looking at the logs of a dask pod, I see errors such as:
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fe21e4b8748>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:860> exception=OSError("Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
_raise(error)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 869, in heartbeat
metrics=await self.get_metrics(),
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 745, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 879, in connect
connection_args=self.connection_args,
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time
Pinging @jacobtomlinson @TomAugspurger and @jhamman for help sorting this out!
The workaround for now is to delete the pods; the autoscaler then removes the nodes a few minutes later:
kubectl get pods -n icesat2-prod --no-headers=true | cut -d ' ' -f1 | grep dask --color=never | xargs kubectl delete pod -n icesat2-prod
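A slightly tidier variant (a sketch, assuming all worker pods carry the app=dask and component=dask-worker labels shown in the describe output above) is to delete by label selector instead of grepping pod names:
# Delete all dask worker pods in the namespace by label selector
kubectl delete pods -n icesat2-prod -l app=dask,component=dask-worker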
> I suspect this is a dask issue though, since these pods probably should not still be listed as Running
Do you think the Python process is just hanging there, and hasn't actually exited?
That would be my guess. The workers should time out, but if they've hung they may not be able to run the timeout code.
Thanks @TomAugspurger and @jacobtomlinson. Is there a bit of logic to add (distributed? dask-kubernetes?) such that many occurrences of OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time kill the process / transition the pod from Running to Error or Completed?
Unfortunately, I'm not sure. It's possible that the event loop is blocked, so the thread that should kill the process isn't able to run? Again, just speculation. If you have any context on what was going on when the worker timed out trying to reach the scheduler, that'd be great, but I'm guessing you don't have access to that info.
cc @mrocklin if you have any guesses.
A potential solution is to include --lifetime="1 hour" in the dask-worker command. This would be a bit of a hammer though: all workers will be shut down after an hour, regardless of whether they've been active recently, and replaced with a fresh worker. If the scheduler really has gone away, presumably this worker would fail to connect and the pod would be cleaned up appropriately by kubernetes. See https://docs.dask.org/en/latest/setup/cli.html#cmdoption-dask-worker-lifetime
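For concreteness, here is a sketch of the worker command from the pod spec above with that flag added (the scheduler address still comes from the DASK_SCHEDULER_ADDRESS environment variable; the one-hour value is just the suggestion above, not a recommendation):
# Same dask-worker invocation as in the pod spec, plus a lifetime limit
dask-worker \
    --nthreads 2 \
    --no-dashboard \
    --memory-limit 7GB \
    --death-timeout 60 \
    --lifetime "1 hour"   # worker shuts itself down cleanly after an hour (see the --lifetime docs linked above)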
I am having a similar problem on the Azure Pangeo. I noticed that many worker pods were hanging around a long time after users logged off. To test this, I logged in and created a Dask cluster, then logged off immediately. The worker pods have been up all night:
(base) tjc@lulu:~$ kubectl get pods -o wide --all-namespaces --sort-by="{.spec.nodeName}" | grep ooi-prod | grep dask-tj
ooi-prod dask-tjcrone-c82d1516-0smzlk 1/1 Running 0 8h 10.0.130.51 aks-nodepool1-41768535-vmss00001h <none> <none>
ooi-prod dask-tjcrone-c82d1516-0z8hp8 1/1 Running 0 8h 10.0.128.75 aks-nodepool1-41768535-vmss000001 <none> <none>
The pods are throwing errors when trying to connect to the scheduler, but I think this is what they should be doing:
distributed.nanny - INFO - Start Nanny at: 'tcp://10.0.128.75:36773'
distributed.diskutils - INFO - Found stale lock file and directory '/home/jovyan/worker-plklaxl4', purging
distributed.worker - INFO - Start worker at: tcp://10.0.128.75:45553
distributed.worker - INFO - Listening to: tcp://10.0.128.75:45553
distributed.worker - INFO - Waiting to connect to: tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.25 GB
distributed.worker - INFO - Local Directory: /home/jovyan/worker-4vez2fnb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa7247260f0>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused")>)
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
_raise(error)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
metrics=await self.get_metrics(),
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
connection_args=self.connection_args,
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused
Or maybe this isn't supposed to be happening? @mrocklin, any ideas here? Thank you!
The OOI configuration is here: https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/ooi
Looks like this issue has been fixed here: https://github.com/dask/distributed/issues/2880. Will need to go onto a dev version of distributed I think.
Ideally https://github.com/dask/distributed/pull/3250 fixed this, but if not LMK.
I see you've already deployed with distributed dev, but I believe that the fix was included in distributed 2.8.1 released on Nov 22.
@tjcrone I'd recommend changing https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ooi/image/binder/Dockerfile#L2 to the 2019.11.25 tag, which includes dask and distributed 2.8.1 (https://github.com/pangeo-data/pangeo-stacks).
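As a quick check of which version actually ends up in a deployed image (a sketch; the pod name here is a placeholder for any running notebook pod in the namespace):
# Print the distributed version inside a running user pod
kubectl exec -n ooi-prod jupyter-<username> -- python -c "import distributed; print(distributed.__version__)"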
Ah! Great suggestion @scottyhq. Thank you. This is one of the updates I often forget to do when I move our image forward. I wonder if I can change it to LATEST or something like that so that I do not forget this in the future.
@TomAugspurger, it looks like dask/distributed#3250 did fix this issue! Thanks. One thing I noticed is that after a worker loses contact with the scheduler, the pod goes into a "Completed" state rather than terminating as it would if the dask cluster were explicitly shut down. A pod in a completed state still seems to retain an IP address, and it is not clear whether the kubernetes cluster will kick that pod out and scale down. Any thoughts on whether it would be better to explicitly delete worker pods when they lose contact with the scheduler?
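One low-tech option in the meantime (a sketch, assuming the finished workers sit in the Succeeded phase that kubectl displays as Completed) would be a periodic cleanup of finished dask pods:
# Remove dask worker pods that have finished ("Completed" = phase Succeeded)
kubectl delete pods -n ooi-prod -l app=dask,component=dask-worker \
    --field-selector=status.phase=Succeeded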
Just want to link this discussion regarding @tjcrone's point: @jacobtomlinson has floated the idea of refactoring dask workers as Jobs instead of Pods: https://github.com/dask/dask-kubernetes/issues/53#issuecomment-512030629
Also want to note that we are still faced with worker pods from binderhub environments running old dask versions that stick around: https://github.com/NCAR/cesm-lens-aws/pull/27
binder-staging dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08f9zd 1/1 Running 0 4d14h
binder-staging dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08tvtq 1/1 Running 0 4d14h
binder-staging dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0f9vrf 1/1 Running 0 4d14h
binder-staging dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0fcpdn 1/1 Running 0 4d14h
binder-staging dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0splxm 1/1 Running 0 4d14h
So I'm wondering how to enforce removal of long-running dask pods within the pangeo helm chart configuration (https://github.com/pangeo-data/pangeo-cloud-federation/pull/477). Perhaps this is yet another reason to incorporate dask-gateway? Thoughts, @jhamman?
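In the meantime, a simple way to spot the offenders (again a sketch, assuming the standard dask worker labels) is to list dask pods across namespaces sorted by age:
# List dask worker pods in all namespaces, oldest first
kubectl get pods --all-namespaces -l app=dask,component=dask-worker \
    --sort-by=.metadata.creationTimestamp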
@scottyhq - A few things you may consider doing:
Closing, since as of https://github.com/pangeo-data/pangeo-cloud-federation/pull/577 we are now using dask-gateway to manage dask pods.
dask-worker nodes were left running on the aws clusters even after user-notebook pods and nodes shut down. Here are some relevant kubectl outputs (one dask worker node had been running for 9 days!):
kubectl get pods --all-namespaces -o wide
kubectl describe pod dask-rsignell-usgs-0b2bf9f0-1kc48v -n esip-prod
kubectl get nodes -o wide
kubectl logs cluster-autoscaler-6575548656-74955 | grep dask
and finally,
Originally posted by @scottyhq in https://github.com/pangeo-data/pangeo/issues/712#issuecomment-528165345