reanahub / reana-client

REANA command-line client
http://reana-client.readthedocs.io/
MIT License

workspace: expose workspace choice to users #545

Open tiborsimko opened 2 years ago

tiborsimko commented 2 years ago

Now that we can use several different POSIX workspaces to run workflows in, users should be able to configure where they would like to run a given workflow, e.g. one workflow in the default place, another workflow in their EOS home, etc.

This configuration should be done in reana.yaml.

Option 1: introduce new top-level section

We can introduce a new section in reana.yaml to express the concept of workspace. Pros: instead of just writing the POSIX path, we could store more information there, should we need it in the future. Also, the concept of workspace will stand out clearly. Cons: we would need to amend the parsing and the REST API protocols to accommodate the new section.

An example of how this could look:

version: 0.6.0
inputs:
  files:
    - code/gendata.C
    - code/fitdata.C
  parameters:
    events: 20000
    data: results/data.root
    plot: results/plot.png
workflow:
  type: serial
  specification:
    steps:
      - name: gendata
        environment: 'reanahub/reana-env-root6:6.18.04'
        commands:
        - mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'
      - name: fitdata
        environment: 'reanahub/reana-env-root6:6.18.04'
        commands:
        - root -b -q 'code/fitdata.C("${data}","${plot}")'
workspace:
  type: posix
  workspace_root_dir: /eos/home-s/simko/myworkflows
outputs:
  files:
    - results/plot.png

A future option could be:

workspace:
  type: s3
  workspace_root_dir: s3://mybucket/myworkflows
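
A minimal sketch of how a client could validate such a section, assuming reana.yaml has already been loaded into a Python dict. The function name read_workspace_section, the KNOWN_TYPES set, and the fallback to posix when type is missing are all hypothetical and not part of reana-client:

```python
from typing import Optional

# Hypothetical list of supported types; only "posix" and "s3" appear
# in the Option 1 examples.
KNOWN_TYPES = {"posix", "s3"}


def read_workspace_section(spec: dict) -> Optional[dict]:
    """Return the validated `workspace` section, or None if it is absent."""
    workspace = spec.get("workspace")
    if workspace is None:
        return None
    if not isinstance(workspace, dict):
        raise ValueError("workspace must be a mapping")
    # Assumption: a missing type means a plain POSIX path.
    ws_type = workspace.get("type", "posix")
    if ws_type not in KNOWN_TYPES:
        raise ValueError(f"unknown workspace type: {ws_type!r}")
    root = workspace.get("workspace_root_dir")
    if not root:
        raise ValueError("workspace_root_dir is required")
    return {"type": ws_type, "workspace_root_dir": root}
```

Keeping the section as a mapping is what leaves room for extra keys later, which is the main advantage claimed for this option.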

Option 2: use existing options clause

We have the option of not changing the structure of reana.yaml and simply using existing clauses, such as parameters or options. Parameters, such as temperature=20c and mass=10g, influence the research results, whilst options, such as cache=off, keep the physics results and only influence how the workflow is orchestrated. From this point of view, a choice of workspace is more an option than a parameter, since a good reproducible analysis should not depend on where it is run. Hence we could choose options. Pros: we only add a new option, and the REST API could use the existing vehicle. Cons: conceptually, the notion of workspace would not stand out so clearly; the workspace configuration would be "hidden" amongst other options. Also, options can be set via CLI options (e.g. reana-client start -o foo=bar), but this cannot be done for the workspace, since it must be initialised before the workflow is started.

Example:

version: 0.6.0
inputs:
  files:
    - code/gendata.C
    - code/fitdata.C
  parameters:
    events: 20000
    data: results/data.root
    plot: results/plot.png
  options:
    workspace_root_prefix: /eos/home-s/simko/myworkflows
workflow:
  type: serial
  specification:
    steps:
      - name: gendata
        environment: 'reanahub/reana-env-root6:6.18.04'
        commands:
        - mkdir -p results && root -b -q 'code/gendata.C(${events},"${data}")'
      - name: fitdata
        environment: 'reanahub/reana-env-root6:6.18.04'
        commands:
        - root -b -q 'code/fitdata.C("${data}","${plot}")'
outputs:
  files:
    - results/plot.png

A future option could be:

  options:
    workspace_root_prefix: s3://mybucket/myworkflows

(The type is inferred from the beginning of the value. Or, if need be, more strings would be added, such as workspace_type: s3. This is basically "flattened" option 1 expressed via options clause.)
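
Inferring the type from the beginning of the value could be sketched as follows. The function name infer_workspace_type and the exact prefix rules are assumptions made for illustration only:

```python
def infer_workspace_type(root_prefix: str) -> str:
    """Guess the workspace type from the start of the configured value."""
    if root_prefix.startswith("s3://"):
        return "s3"
    # Assumption: any absolute path is treated as a POSIX workspace.
    if root_prefix.startswith("/"):
        return "posix"
    raise ValueError(f"cannot infer workspace type from {root_prefix!r}")
```

With such a rule, /eos/home-s/simko/myworkflows would be treated as posix and s3://mybucket/myworkflows as s3. An explicit workspace_type key would only be needed for values where the prefix is ambiguous.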

Notes

Regardless of which option we shall choose, there is a certain default that should be used in case the user does not set anything. This default will be set by the cluster administrator, but this will be part of another issue.
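
The fallback to a default could look roughly like this for either option. It is only a sketch: the function name effective_workspace_root and the admin_default argument are hypothetical, and the spec is assumed to be reana.yaml already loaded into a dict:

```python
def effective_workspace_root(spec: dict, admin_default: str) -> str:
    """Pick the user's workspace root if set, else the cluster default."""
    # Option 1: dedicated top-level `workspace` section.
    workspace = spec.get("workspace") or {}
    # Option 2: value stored under `inputs.options`.
    options = (spec.get("inputs") or {}).get("options") or {}
    return (
        workspace.get("workspace_root_dir")
        or options.get("workspace_root_prefix")
        or admin_default  # set by the cluster administrator
    )
```

How the administrator's default reaches the client or the server is left open here, since that belongs to the other issue.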

tiborsimko commented 2 years ago

P.S. In the above, we might read workspace_root_path, or whatever name we end up selecting :wink:, in all the places.

mvidalgarcia commented 2 years ago

IMO option 1 looks cleaner. I think the workspace is relevant enough to have its own section.

Currently, we support some inputs.options, but those are closely tied to certain workflow languages, whereas the workspace would be universal. OTOH, it's true that the CACHE option is directly related to the storage, but still, I think it'd be harder for the final user to set it as an option.