ratt-ru / pfb-imaging

Preconditioned forward/backward clean algorithm
MIT License

When using the distributed scheduler, pfb-clean should wait for the requested number of workers to start #60

Closed: sjperkins closed this issue 1 year ago

sjperkins commented 1 year ago

Describe the problem that the feature should address

Distributed dask clusters can scale elastically, but it is not clear whether the dask scheduler can currently handle pfb-clean graphs appropriately in this context. Additionally, the AutoScaler plugin works by assuming a fixed number of workers to which tasks are pinned.

Describe the solution you'd like

When a distributed scheduler address is provided, the following code should be executed:

```python
from distributed import Client

with Client(address) as client:
    # Block until nworkers workers have registered with the scheduler
    client.wait_for_workers(nworkers)
```
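
A slightly fuller sketch of what this could look like with a timeout, so the job fails loudly if the cluster never scales up. The `connect_and_wait` helper and the 600 s default are illustrative, not pfb-clean's actual API:

```python
import asyncio
from distributed import Client

def connect_and_wait(address, nworkers, timeout=600):
    # Hypothetical helper, not part of pfb-clean.
    client = Client(address)
    try:
        # Block until the scheduler reports nworkers connected workers;
        # raises a timeout error if they do not arrive in time.
        client.wait_for_workers(n_workers=nworkers, timeout=timeout)
    except (TimeoutError, asyncio.TimeoutError):
        n_present = len(client.scheduler_info().get("workers", {}))
        client.close()
        raise RuntimeError(
            f"only {n_present} of {nworkers} workers "
            f"joined within {timeout}s"
        )
    return client
```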

Describe alternatives you've considered

Additional context

We'll be using the dask Kubernetes operator to launch DaskJobs on an EKS cluster. Scheduler and Worker pods are launched alongside a Job pod which will run pfb-clean. The EKS cluster will scale up the number of instances to support all these pods, but this doesn't happen immediately. Therefore, it is necessary to wait for the workers.
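
For illustration, a minimal job-pod entrypoint along these lines might look as follows. The `DASK_SCHEDULER_ADDRESS` environment variable name, the worker count of 8, and the ten-minute timeout are assumptions about this particular deployment, not something pfb-clean or the operator prescribes:

```python
import os
from distributed import Client

# Assumption: the operator exposes the scheduler address to the job pod
# via an environment variable; the name used here is illustrative.
address = os.environ["DASK_SCHEDULER_ADDRESS"]

with Client(address) as client:
    # Wait for EKS to provision nodes and for all worker pods to start;
    # 8 workers and a 600 s timeout are placeholder values.
    client.wait_for_workers(n_workers=8, timeout=600)
    # ... run the pfb-clean graph on the fully populated cluster ...
```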

landmanbester commented 1 year ago

Completed in https://github.com/ratt-ru/pfb-clean/pull/61