malariagen / datalab

Repo for files and issues related to cloud deployment of JupyterHub.
MIT License

Dask workers lose connection to scheduler #69

Open · slejdops opened 4 years ago

slejdops commented 4 years ago

Dask workers lose their connection to the Dask scheduler and do not exit when the user pod is terminated.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.6/site-packages/distributed/worker.py", line 875, in heartbeat
    metrics=await self.get_metrics(),
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 747, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/lib/python3.6/site-packages/distributed/core.py", line 874, in connect
    connection_args=self.connection_args,
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/core.py", line 227, in connect
    _raise(error)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.32.1.154:39889' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f9edeb742e8>: OSError: [Errno 113] No route to host
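A possible mitigation, sketched below (not verified here; both knobs are standard distributed settings, but the specific values are guesses): raise the connect timeout above the default 10 s that appears in the traceback, and give workers a death timeout so a worker that can no longer reach the scheduler shuts itself down instead of retrying forever.

    # Sketch: two standard distributed knobs (values are guesses for illustration).
    import dask

    # 1. Allow connect attempts more than the default 10 s seen in the
    #    traceback above.
    dask.config.set({"distributed.comm.timeouts.connect": "30s"})

    # 2. Start workers with a death timeout so a worker that cannot reach the
    #    scheduler for 60 s exits instead of retrying forever, e.g. in the
    #    worker pod's container command:
    #
    #        dask-worker tcp://<scheduler-address>:8786 --death-timeout 60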
slejdops commented 4 years ago

https://github.com/dask/distributed/issues/2880

athornton commented 4 years ago

Interested to hear whether you've figured anything out. We're seeing the same thing at LSST in what I think is a similar setup: Dask with Kubespawner in a JupyterHub + user-containers + Kubernetes environment, whenever we try any operation on data that doesn't fit into memory (which is kind of the whole point of using Dask); there's a sketch of the kind of computation that triggers it below.

Any ideas?
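For reference, a minimal sketch of the kind of larger-than-memory operation that triggers it for us (the scheduler address is a placeholder; the array is sized to comfortably exceed a single worker's memory):

    # Sketch: an out-of-core reduction of the kind that triggers the dropouts.
    import dask.array as da
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    # ~80 GB of random doubles in ~200 MB chunks, far more than one worker
    # holds in RAM, so the reduction spills and shuffles across workers.
    x = da.random.random((100_000, 100_000), chunks=(5_000, 5_000))
    result = x.mean(axis=0).compute()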