pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Add support for dask.distributed #82

Closed pnuu closed 4 years ago

pnuu commented 4 years ago

Feature Request

Is your feature request related to a problem? Please describe. Trollflow2 should eventually support dask.distributed for parallelism. Both local and external clusters should be supported, and the current default scheduler should also remain an option.

Describe the solution you'd like When suitable configuration is given, use dask.distributed.Client() to either connect to an external scheduler or create a local cluster of workers. If no client config is given, fall back to the current behaviour of using the default scheduler.

The configuration could be something like this:

client:
  # If scheduler address is given, use that to access the workers
  scheduler_address: <ip>:8786
  # Otherwise, or if the scheduler can't be reached, a local cluster is used with the following params.
  # Defaults are used for any missing arguments.
  local_cluster:
    nprocs: 4
    nthreads: 1
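A minimal sketch of how trollflow2 could act on such a config section. The helper name `dask_client_settings` is made up for illustration and is not part of trollflow2; it only encodes the decision logic described above:

```python
def dask_client_settings(config):
    """Decide how dask should be set up from a trollflow2-style config dict.

    Returns one of:
      ("default", None)      - no client section; keep dask's default scheduler
      ("distributed", addr)  - connect a Client to an external scheduler
      ("local", kwargs)      - start a local cluster with the given parameters

    Hypothetical sketch only; trollflow2 does not ship this function.
    """
    client_cfg = config.get("client")
    if not client_cfg:
        # No client config given: current default-scheduler behaviour.
        return ("default", None)
    address = client_cfg.get("scheduler_address")
    if address:
        # External scheduler takes precedence when an address is given.
        return ("distributed", address)
    # Fall back to a local cluster; missing arguments use defaults.
    return ("local", client_cfg.get("local_cluster", {}))
```

The caller would then do something like `Client(address)` or `Client(LocalCluster(**kwargs))` from dask.distributed. Note that `LocalCluster` takes `n_workers` and `threads_per_worker`, while `nprocs`/`nthreads` above mirror the `dask-worker` CLI flags, so the keys would need mapping either way.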

Describe any changes to existing user workflow Ideally this shouldn't affect the default usage.

Additional context Some things that might affect the implementation and/or use:

djhoese commented 4 years ago

Would it be possible to use dask's builtin configuration files for this? I think with those you can set the default scheduler for the running system.
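For reference, Dask's config files can select which scheduler kind is used by default, along the lines of this fragment (location and key as described in Dask's configuration docs; the value is illustrative):

```yaml
# Hypothetical fragment for ~/.config/dask/dask.yaml: picks the default
# scheduler kind ("threads", "processes", or "synchronous"); it does not
# point at a specific running scheduler address.
scheduler: threads
```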

pnuu commented 4 years ago

For the user it would be more straightforward to only have the one configuration file to edit.

I don't see any config option in the Dask documentation for setting the scheduler address in the configuration file. I tested by starting a scheduler ($ dask-scheduler) and a worker ($ dask-worker) in other terminals, then inspecting dask.config.config before and after creating a Client() connected to the scheduler. The scheduler was set to dask.distributed, but there was no information about the scheduler the client connected to.

So based on this, I think we need the client to be created - or not created - based on values set in the trollflow2 configuration file.

djhoese commented 4 years ago

Yeah you might be right on the address. Here's the template that dask.distributed uses (there's one for regular dask too):

https://github.com/dask/distributed/blob/master/distributed/distributed.yaml

Lots of options but maybe not a "this exact scheduler".

As for the options for this, it'd be nice if you allowed the user to specify the name of the cluster class: either as a YAML object path to the dask class, like we do for Satpy readers/composites, or as a name that gets imported directly from dask.distributed (less flexible but probably works in most cases). That way you could let users create their own Slurm or Kubernetes or other cluster. There's also Dask Gateway that might be used in the long run by some organizations. Just brainstorming.
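The "YAML object path" idea could be as small as a dotted-path resolver. A sketch, assuming a hypothetical helper name; the cluster-class paths in the docstring are examples of what a user might configure, not something trollflow2 depends on:

```python
from importlib import import_module

def resolve_cluster_class(dotted_path):
    """Import a cluster class from a dotted path found in the YAML config.

    Paths a user might give (assuming those packages are installed):
    "dask.distributed.LocalCluster", "dask_jobqueue.SLURMCluster",
    "dask_kubernetes.KubeCluster".
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_name), class_name)

# Demonstration with stdlib classes, so the sketch runs without dask:
cls = resolve_cluster_class("collections.OrderedDict")
```

Trollflow2 could then instantiate whatever the config names, e.g. `resolve_cluster_class(cfg["cluster_class"])(**cfg.get("cluster_kwargs", {}))` (both keys hypothetical), letting users plug in Slurm, Kubernetes, or Gateway-backed clusters without trollflow2 importing them directly.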