pangeo-forge / terraclimate-feedstock-archive

A pangeo-smithy repository for the terraclimate dataset.
Apache License 2.0

Have fewer static tasks #4

Closed: TomAugspurger closed this issue 4 years ago

TomAugspurger commented 4 years ago

The flow runs were timing out (during scheduling?), in part because we had more than 1,000 tasks. This switches to Task.map, which generates the mapped tasks on the fly.
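For reference, a minimal sketch of that pattern under the Prefect 0.x API (the task bodies, example URLs, and year range here are hypothetical placeholders, not the feedstock's actual recipe):

from prefect import Flow, task

@task
def download(source_url):
    # placeholder: fetch one source file and return a local path
    return source_url

@task
def combine_and_write(paths):
    # placeholder: open the downloaded files and write the target store
    return len(paths)

# hypothetical inputs; the real recipe builds these from the TerraClimate catalog
source_urls = [f"https://example.com/terraclimate/{year}.nc" for year in range(1958, 2020)]

with Flow("terraclimate") as flow:
    # one mapped task: child task runs are generated at runtime,
    # instead of ~1,000 tasks registered statically in the flow
    paths = download.map(source_urls)
    combine_and_write(paths)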

I need to fix up the permission issue on the bucket, then we should be good.

TomAugspurger commented 4 years ago

Looks like this run is going to finish, @jhamman.

TomAugspurger commented 4 years ago

I spoke too soon apparently.

$ kubectl -n staging logs -l prefect.io/flow_run_id=332dc822-c407-43fe-a4d6-21c30f4e3ecd

[2020-09-15 04:07:52] INFO - prefect.CloudTaskRunner | Task 'download': Starting task run...
[2020-09-15 04:07:52] INFO - prefect.CloudTaskRunner | Task 'download': finished task run for task with final state: 'Mapped'
[2020-09-15 04:07:53] INFO - prefect.CloudTaskRunner | Task 'download[0]': Starting task run...
distributed.worker - WARNING - Heartbeat to scheduler failed
distributed.client - ERROR - Failed to reconnect to scheduler after 3.00 seconds, closing client
ERROR:asyncio:_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError
[2020-09-15 04:08:04] INFO - prefect.CloudTaskRunner | Task 'download[0]': finished task run for task with final state: 'Success'
INFO:prefect.CloudTaskRunner:Task 'download[0]': finished task run for task with final state: 'Success'
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 286, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds
    loop.run_sync(run)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 532, in run_sync
    return future_cell[0].result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 426, in run
    await asyncio.gather(*nannies)
  File "/srv/conda/envs/notebook/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 286, in _
    type(self).__name__, timeout
concurrent.futures._base.TimeoutError: Nanny failed to start in 60 seconds
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/tcpserver.py", line 327, in <lambda>
    gen.convert_yielded(future), lambda f: f.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 457, in _handle_stream
    await self.on_connection(comm)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 232, in on_connection
    comm.local_info, comm.remote_info
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 145, in handshake_configuration
    "Your Dask versions may not be in sync. "
ValueError: Your Dask versions may not be in sync. Please ensure that you have the same version of dask and distributed on your client, scheduler, and worker machines
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP  local=tcp://10.37.142.194:37097 remote=tcp://10.36.248.128:44652>

Not sure what to make of that error message about mismatched versions. They're all running image: pangeoforge/terraclimate:latest (I re-pushed to latest, so a node may have a stale image, but I wouldn't expect the dask version to have changed).
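One way to confirm whether the client, scheduler, and workers actually disagree is distributed's built-in version check; a quick sketch (the scheduler address is hypothetical):

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical address

# with check=True this raises if dask/distributed versions differ
# across client, scheduler, and workers
versions = client.get_versions(check=True)
print(versions["scheduler"]["packages"]["distributed"])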

TomAugspurger commented 4 years ago

They also didn't clean up completely. The errored pods and some Running pods were still around this morning.