ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License

When using the distributed scheduler, Quartical should wait for the requested number of workers to start #197

Closed sjperkins closed 2 years ago

sjperkins commented 2 years ago

Describe the problem that the feature should address

Distributed dask clusters can scale elastically, but it is not clear whether the dask scheduler can currently handle Quartical graphs appropriately in this context. Additionally, the AutoScaler plugin works by assuming a fixed number of workers to which tasks are pinned.

Describe the solution you'd like

When a distributed scheduler address is provided, the following code should be executed:

from distributed import Client

with Client(address) as client:
  client.wait_for_workers(nworkers)
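
A slightly fuller sketch of the same idea, in case it helps the discussion. It is not QuartiCal's actual implementation: the function name `connect_and_wait` and the `client_cls` and `timeout` parameters are hypothetical, introduced only for illustration. The one library call it relies on is `Client.wait_for_workers(n_workers, timeout=None)` from `distributed`, which blocks until the requested number of workers has connected. Unlike the `with` block above, this version returns the client so that it stays open for the rest of the run.

```python
def connect_and_wait(address, nworkers, timeout=None, client_cls=None):
    """Connect to a running scheduler and block until nworkers are available.

    Hypothetical helper: the name and the client_cls/timeout parameters are
    assumptions made for this sketch, not part of QuartiCal.
    """
    if client_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in
        # client class when distributed is not installed.
        from distributed import Client as client_cls

    client = client_cls(address)
    # Blocks until at least nworkers workers have registered with the
    # scheduler. With timeout=None it waits indefinitely; with a number of
    # seconds, distributed raises a timeout error if the workers do not
    # appear in time.
    client.wait_for_workers(n_workers=nworkers, timeout=timeout)
    return client
```

Passing a finite `timeout` may be preferable on a cluster that scales up slowly, since a job that never receives its workers would otherwise hang rather than fail.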

Describe alternatives you've considered

Additional context

We'll be using the dask Kubernetes operator to launch DaskJobs on an EKS cluster. Scheduler and Worker pods are launched alongside a Job pod which will run Quartical. The EKS cluster will scale up the number of instances to support all these pods, but this doesn't happen immediately, so the Job pod may start before the workers are available. Therefore, it is necessary to wait for the workers.