malariagen / datalab

Repo for files and issues related to cloud deployment of JupyterHub.
MIT License

ERROR - 'tcp://10.33.3.2:42745' when using dask #71

Open hardingnj opened 4 years ago

hardingnj commented 4 years ago

I can't see an existing report - apologies if this is known, but I get these errors when using dask:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.33.15.2:35089'
distributed.utils - ERROR - 'tcp://10.33.9.3:42121'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.33.9.3:42121'
distributed.utils - ERROR - 'tcp://10.32.254.2:41423'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.32.254.2:41423'
distributed.utils - ERROR - 'tcp://10.33.3.2:42745'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.33.3.2:42745'
distributed.utils - ERROR - 'tcp://10.33.15.3:43249'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.33.15.3:43249'
distributed.utils - ERROR - 'tcp://10.33.11.3:37789'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 666, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 1499, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/opt/conda/lib/python3.6/site-packages/distributed/scheduler.py", line 2408, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: 'tcp://10.33.11.3:37789'
distributed.utils - ERROR - 'tcp://10.32.254.4:40659'
Traceback (most recent call last):

The errors don't stop the notebook kernel, but they are still concerning. Is this a known issue, or something in our configuration?

Thanks!

slejdops commented 4 years ago

I'll look into it

hardingnj commented 4 years ago

I'm also seeing an error very similar to this one: https://github.com/dask/dask-jobqueue/issues/222

Are we using a version that predates the fix implemented there?
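If it helps, the versions on the hub image could be checked with something like this (just a generic version check, nothing specific to our deployment):

```python
import dask
import distributed

print("dask", dask.__version__)
print("distributed", distributed.__version__)
```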

hardingnj commented 4 years ago

Removing cluster.adapt() seems to fix this issue.
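Concretely, the change on my side was just swapping the adaptive call for a fixed worker count (a minimal sketch; the numbers are placeholders, not what we actually run):

```python
# Before: adaptive scaling, which coincided with the scheduler errors above.
# cluster.adapt(minimum=0, maximum=20)

# After: a fixed pool of workers.
cluster.scale(10)
```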

slejdops commented 4 years ago

We're using dask-kubernetes, not dask-jobqueue. dask-jobqueue is for deploying Dask on SGE, SLURM, etc.: https://jobqueue.dask.org/en/latest/
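For reference, a dask-kubernetes cluster is typically created along these lines (a minimal sketch assuming the KubeCluster API; `worker-spec.yml` and the worker counts are placeholders, not our actual deployment config):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Each Dask worker runs as a Kubernetes pod built from a pod template.
cluster = KubeCluster.from_yaml("worker-spec.yml")

cluster.scale(10)                       # fixed number of worker pods
# cluster.adapt(minimum=0, maximum=20)  # or adaptive scaling on demand

client = Client(cluster)
```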

hardingnj commented 4 years ago

Ah, thanks - I guess the adaptive scaling logic is similar though?

slejdops commented 4 years ago

Yes, I think the errors you're getting might be related to this:

distributed.nanny - INFO - Closing Nanny at 'tcp://10.32.60.2:35879'
distributed.worker - INFO - Stopping worker at tcp://10.32.60.2:41549
distributed.worker - INFO - Closed worker has not yet started: None

That's an error from one of your dask workers.