reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io

RFC supporting Dask workflows #823

Open tiborsimko opened 3 weeks ago

tiborsimko commented 3 weeks ago

About

We are starting a new sprint to support running Dask workflows in REANA. Please let us know what you think, and share any other desiderata you may have. Thanks for your input!

Goals

The main goals are captured by the use cases below.

Use cases

  1. As a researcher, I would like to bring up a Dask cluster for my workflow runs, so that I can use Dask task graphs in my analyses (see the sketch after this list).

  2. As a researcher, I would like to use a particular Dask version and Dask worker image, so that I can ensure my analysis can be reinterpreted correctly even several years later.

  3. As a researcher, I would like to configure the amount of necessary Dask resources such as CPU and RAM, so that my analysis can be run efficiently.

  4. As a researcher, I would like to mount my REANA secrets alongside Dask jobs, so that I can benefit from the usual Kerberos or VOMS authentication to access remote resources.

  5. As a researcher, I would like to see the logs of my Dask jobs in the regular REANA logging system, so that I can be informed about the workflow progress or errors in the usual manner.

  6. As a researcher, I would like to list all my workflows using Dask clusters and their statuses, so that I can make sure that I have not left behind anything unnecessary.

  7. As a cluster administrator, I would like to specify the list of vetted (allowed and recommended) images to be used for Dask workflows, so that the cluster stays safe from running possibly vulnerable images.

  8. As a cluster administrator, I would like to inspect who is using which Dask cluster, so that I can quickly get in touch with researchers in case of problems.

  9. As a cluster administrator, I would like to configure various Dask resource limits for users, so that workflows requesting excessive resources can be rejected early.

  10. As a cluster administrator, I would like to benefit from the auto-scaling features during user workflow execution, so that my cluster uses resources only when really necessary.
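To make the first use case concrete, here is a minimal sketch of the kind of Dask task graph an analysis might build and run; the functions inc and add are purely illustrative and not part of the proposal.

import dask

@dask.delayed
def inc(x):
    # Illustrative task: increment a value.
    return x + 1

@dask.delayed
def add(x, y):
    # Illustrative task: sum two values.
    return x + y

# Building the graph is lazy; nothing runs yet.
a = inc(1)
b = inc(2)
total = add(a, b)

# Trigger execution on whatever scheduler is configured
# (inside REANA this would be the provisioned Dask cluster).
print(total.compute())  # prints 5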

Discussion


User configuration

If a single Dask cluster for the entire analysis is sufficient, the reana.yaml could look like:

inputs:
  files:
    - myanalysis.py
workflow:
  type: serial
  resources:
    dask:
      version: 2023.1.0   # Dask version to provision for this workflow
      cores: 16           # total number of cores for the Dask cluster
      memory: "96 GB"     # total amount of memory for the Dask cluster
  specification:
    steps:
      - name: mystep
        environment: docker.io/mygroup/myenvironment:2.13.4
        commands:
        - python myanalysis.py  # uses DASK_SCHEDULER_URI=tcp://10.96.4.51:8786
outputs:
  files:
    - myplot.png
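For illustration, myanalysis.py could then connect to the provisioned cluster through the DASK_SCHEDULER_URI environment variable shown in the comment above. This is only a sketch, assuming the scheduler address is injected under exactly that name:

# myanalysis.py -- sketch only; assumes REANA exposes the scheduler
# address via the DASK_SCHEDULER_URI environment variable.
import os

from dask.distributed import Client

# Connect to the Dask scheduler provisioned for this workflow.
client = Client(os.environ["DASK_SCHEDULER_URI"])

# Run a trivial computation on the cluster as a smoke test.
futures = client.map(lambda x: x ** 2, range(10))
print(sum(client.gather(futures)))  # 285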
