**Closed** — TomAugspurger closed this issue 4 years ago.
Looks like this run is going to finish @jhamman.
I spoke too soon apparently.
```
$ kubectl -n staging logs -l prefect.io/flow_run_id=332dc822-c407-43fe-a4d6-21c30f4e3ecd
[2020-09-15 04:07:52] INFO - prefect.CloudTaskRunner | Task 'download': Starting task run...
[2020-09-15 04:07:52] INFO - prefect.CloudTaskRunner | Task 'download': finished task run for task with final state: 'Mapped'
[2020-09-15 04:07:53] INFO - prefect.CloudTaskRunner | Task 'download[0]': Starting task run...
distributed.worker - WARNING - Heartbeat to scheduler failed
distributed.client - ERROR - Failed to reconnect to scheduler after 3.00 seconds, closing client
ERROR:asyncio:_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError
[2020-09-15 04:08:04] INFO - prefect.CloudTaskRunner | Task 'download[0]': finished task run for task with final state: 'Success'
INFO:prefect.CloudTaskRunner:Task 'download[0]': finished task run for task with final state: 'Success'
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 286, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 286, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/tcpserver.py", line 327, in <lambda>
    gen.convert_yielded(future), lambda f: f.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 457, in _handle_stream
    await self.on_connection(comm)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 232, in on_connection
    comm.local_info, comm.remote_info
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 145, in handshake_configuration
    "Your Dask versions may not be in sync. "
ValueError: Your Dask versions may not be in sync. Please ensure that you have the same version of dask and distributed on your client, scheduler, and worker machines
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://10.37.142.194:37097 remote=tcp://10.36.248.128:44652>
```
Not sure what to make of that error message about mismatched versions. They're all running `image: pangeoforge/terraclimate:latest` (I re-pushed to `latest`, so the node may have had a stale version, but I wouldn't expect the Dask version to have changed).
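For what it's worth, the handshake behind that `ValueError` is essentially comparing the `dask`/`distributed` versions each side reports. A minimal stdlib-only sketch of that comparison (a hypothetical helper, not `distributed`'s actual implementation):

```python
# Hypothetical sketch of the check behind "Your Dask versions may not be in
# sync" -- not distributed's real code, just the idea: every role must report
# the same version of each package.
def versions_in_sync(reported):
    """reported maps role -> {'dask': version, 'distributed': version}."""
    for package in ("dask", "distributed"):
        versions = {info[package] for info in reported.values()}
        if len(versions) > 1:  # more than one distinct version seen
            return False
    return True

# A stale worker image would show up like this:
reported = {
    "client":    {"dask": "2.25.0", "distributed": "2.25.0"},
    "scheduler": {"dask": "2.25.0", "distributed": "2.25.0"},
    "worker":    {"dask": "2.24.0", "distributed": "2.25.0"},  # stale pod
}
```

In practice, `distributed.Client.get_versions(check=True)` reports the versions seen on the client, scheduler, and workers, which would confirm whether a node really was running a stale image.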
They also didn't clean up completely: the errored pods and some `Running` pods were still around this morning.
The flow runs were timing out (during scheduling?) in part because we had ~1,000+ tasks. This switches to `Task.map`, which generates them on the fly.
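As a rough illustration of why mapping helps (a hypothetical sketch, not Prefect's internals): rather than registering ~1,000 task objects up front when the flow is built, a mapped task's children are generated one at a time as the run proceeds:

```python
# Hypothetical sketch of mapped-task expansion (not Prefect's internals).
# Children named like the log lines above ('download[0]', 'download[1]', ...)
# are produced lazily at run time instead of all being constructed when the
# flow is defined.
def run_mapped(task_fn, inputs):
    for i, item in enumerate(inputs):
        # each child task gets an indexed name and runs on one input
        yield f"{task_fn.__name__}[{i}]", task_fn(item)

def download(url):
    # stand-in for the real download task
    return f"fetched {url}"

results = dict(run_mapped(download, ["a", "b"]))
```

The scheduler only ever sees the parent task plus whichever children are currently materialized, instead of a flow graph with every task baked in.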
I need to fix up the permission issue on the bucket, then we should be good.