reanahub / reana

REANA: Reusable research data analysis platform
https://docs.reana.io
MIT License

reana.yaml: parameter array read from file #305

Closed alintulu closed 4 years ago

alintulu commented 4 years ago

Both CWL and Yadage provide a “scatter-gather” paradigm. The workflow takes the input as an array and runs the specified steps on each element of the array as if it were a single input (Yadage also lets you specify a batch size).

The array can be declared in reana.yaml under inputs: parameters: like in the example from the Awesome Workshop.

Currently the parameter array has to be declared explicitly by writing each element of the array down as a new line in the reana.yaml.

inputs:
  parameters:
    cross_sections:
      - 19.6
      - 1.55
     [...]

This is okay when you have 2-10 entries, but it is not realistic to enter 1500 entries by hand, as may be the case (for example, names of dataset files).

To be added:

Allow specifying a file from which the entries are read. Each line in the file would be taken as one entry of the array.

inputs:
  parameters:
    cross_sections:
      - index.txt

Instead of adding 1500 lines to the reana.yaml those lines could be read from index.txt. The parameter array cross_sections would then be provided to CWL or Yadage which would use it as an input for their “scatter-gather” paradigm.
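
The proposed behaviour could be sketched roughly as follows. This is only an illustration of "one line, one array entry"; the function name read_parameter_array is hypothetical and nothing like it exists in REANA today. Entries are kept as strings, so numeric values such as cross sections would still need converting.

```python
from pathlib import Path


def read_parameter_array(path):
    """Return one array entry per non-empty line of the file at `path`.

    Hypothetical helper, not part of REANA: it only illustrates the
    proposal that each line of index.txt becomes one array element.
    """
    entries = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:  # ignore blank lines, e.g. a trailing newline
            entries.append(line)
    return entries
```

With this, read_parameter_array("index.txt") would yield the list that is then handed to CWL or Yadage as the cross_sections parameter.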

tiborsimko commented 4 years ago

Both CWL and Yadage can have inputs specified as separate files. Example for CWL:

$ cat reana.yaml
inputs:
  parameters:
    input: workflow/input.yml
workflow:
  type: cwl
  file: workflow/workflow.cwl

$ cat workflow/input.yml
library:
  class: File
  path: src/PhysicsObjectsHistos.cc
build_file:
  class: File
  path: BuildFile.xml
validation_script:
  class: File
  path: demoanalyzer_cfg.py

So you could use this technique, create a big input.yml that would list all the cross section values or all the dataset ROOT files etc, and this should work.
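
For a long list, such a big input file would probably be generated rather than typed. A minimal sketch, assuming a flat list of simple values (numbers or plain file names that need no YAML quoting); write_input_yaml is a hypothetical helper name, not an existing tool:

```python
def write_input_yaml(path, name, values):
    """Write `name` as a YAML block sequence, one item per value.

    Hypothetical helper for illustration. Values are written verbatim,
    so strings containing YAML special characters (':', '#', ...) are
    NOT quoted or escaped here.
    """
    lines = [f"{name}:"]
    lines.extend(f"  - {value}" for value in values)
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling write_input_yaml("input.yml", "cross_sections", [19.6, 1.55]) produces a file with a cross_sections: key followed by one "- value" line per entry.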

Can you try to create a vanilla cwltool or yadage-run example using such an input file? Once you have an example ready, we can see how best to convert it to reana.yaml.

P.S. See e.g. reana-demo-worldpopulation CWL example that has 4-5 parameters.

alintulu commented 4 years ago

Simple example of Yadage containing

can be found here. Workflow runs with

yadage-run workdir workflow.yaml input.yaml

where the input is read from input.yaml. The next step is to figure out how best to pass the input parameters from input.yaml when the file is declared in reana.yaml. As mentioned, this already works for CWL :)

alintulu commented 4 years ago

In yadage it seems like initdata is a json with key-value pairs of 'parameter name'-'parameter value'. It is set in two ways, by

In REANA, initdata is set to workflow_parameters in reana_workflow_engine_yadage/cli.py and reana_workflow_controller/workflow_run_manager.py, which in turn is set from parameters in reana_db/models.py.

The parameters are read from the inputs: parameters: field of reana.yaml.

inputs:
    parameters:

That is, the initdata passed to Yadage can currently be set only by defining the parameters in reana.yaml.

It also seems like initfiles cannot be directly passed to yadage since only initdata is specified in steering_ctx.

Hence we cannot just create an input.yml file and hand it to Yadage as initfiles. Instead, we have to create a method in REANA that sets initdata as follows:

  1. Given an input.yml file, create a JSON object with the key-value pairs specified in the YAML file.
  2. Append that JSON object to parameters, which in turn sets initdata.
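
The two steps above could look roughly like this. It is a sketch under assumptions, not REANA code: build_initdata is a hypothetical name, and letting the parameters from reana.yaml override values from the file is my assumption about the desired precedence. To stay dependency-free the default parser is json.load, which only handles an input file written in JSON-compatible syntax; a real implementation would presumably pass a YAML parser such as PyYAML's yaml.safe_load as parse.

```python
import json


def build_initdata(input_file, reana_parameters, parse=json.load):
    """Merge parameters from an input file with those from reana.yaml.

    Hypothetical sketch. `parse` is any callable that turns an open
    file into a dict (json.load here; yaml.safe_load for real YAML).
    """
    # Step 1: read the key-value pairs declared in the input file.
    with open(input_file) as handle:
        file_parameters = parse(handle)
    if not isinstance(file_parameters, dict):
        raise ValueError("input file must contain a mapping of parameters")

    # Step 2: combine with the reana.yaml parameters. Assumption:
    # explicitly declared parameters win over values from the file.
    initdata = dict(file_parameters)
    initdata.update(reana_parameters)
    return initdata
```

The returned dictionary is what would be stored as parameters and eventually passed to Yadage as initdata.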

tiborsimko commented 4 years ago

Regarding user interface, we should introduce a new option initfiles that people can use in their reana.yaml, similarly to the recently-added options initdir and toplevel. In this way the analysis will have explicitly documented its input files and/or parameters.

Regarding implementation, r-w-e-yadage (reana-workflow-engine-yadage) would have to do something like the following to merge the input file parameters and command-line parameters:

from yadage.utils import getinit_data
initdata = getinit_data(initfiles, parameter)

in order to pass the resulting merged initdata to the yadage steering. (See Yadage sources.)