reana.yaml: multi-line instructions

tiborsimko commented 6 years ago

Currently we have in reana.yaml long instructions like:

workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-jupyter'
        commands:
          - mkdir -p results && papermill ${notebook} /dev/null -p input_file ${input_file} -p output_file ${output_file} -p region ${region} -p year_min ${year_min} -p year_max ${year_max}

located in one single line.

It would be useful to accept multi-line formats such as:

workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-jupyter'
        commands:
          - mkdir -p results && 
            papermill ${notebook} /dev/null 
                 -p input_file ${input_file} 
                 -p output_file ${output_file} 
                 -p region ${region} 
                 -p year_min ${year_min} 
                 -p year_max ${year_max}

for better readability.

A quick experiment with YAML's standard '>' technique to allow for newlines did not work; see https://github.com/reanahub/reana-demo-worldpopulation/pull/22#discussion_r213982149.

Investigate this.

diegodelemos commented 5 years ago

After investigating the YAML standard regarding multi-line strings, I have found three ways in which we can support multi-line commands:

1. Using the `>` syntax (block scalar, folded style):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'python:2.7'
        commands:
        - >
          echo "Running ${helloworld}." &&
          python "${helloworld}"
          --sleeptime ${sleeptime}
          --inputfile "${inputfile}"
          --outputfile "${outputfile}"

Available to try at https://github.com/reanahub/reana-demo-helloworld/pull/32/commits/07c8fdb8d1d328563a4abf7dcf424591ead6420d.

Potential source of errors with this approach: it took me a while to realise why the `>` syntax was not working. As the standard states, line folding (which allows long lines to be broken for readability) is not applied to lines that are indented more deeply than the rest of the multi-line string, so the next example would not work (more info here):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'python:2.7'
        commands:
         - >
           echo "Running ${helloworld}." &&
           python "${helloworld}"
-          --sleeptime ${sleeptime}
-          --inputfile "${inputfile}"
-          --outputfile "${outputfile}"
+                 --sleeptime ${sleeptime}
+                 --inputfile "${inputfile}"
+                 --outputfile "${outputfile}"

This is what the command looks like in the container:

$ kubectl get -o yaml pod bc381d0e-7ca6-43dd-87cb-2a02d0758a45-4dgp9
...
  - command:
    - bash
    - -c
    - "cd /reana/users/00000000-0000-0000-0000-000000000000/workflows/03d7521d-e606-4d48-b9d7-4a9e42ad0e15
      ; echo \"Running code/helloworld.py.\" && python \"code/helloworld.py\" --sleeptime
      2 --inputfile \"inputs/names.txt\" --outputfile \"outputs/greetings.txt\"\n "
...
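
The indentation pitfall described above can be reproduced without REANA. The following is a minimal sketch, assuming that a folded scalar with more deeply indented continuation lines reaches `bash -c` with those line breaks intact; `echo start` is a hypothetical stand-in for the real `python` call, and the variable names are made up for illustration.

```shell
# Sketch only: `echo start` stands in for the real command (assumption).

# Consistent indentation: the folded scalar arrives as a single line.
folded='echo start --sleeptime 2'

# Deeper-indented continuation: the line break is kept, so the shell
# sees a second command whose name is "--sleeptime".
unfolded='echo start
       --sleeptime 2'

out_folded=$(bash -c "$folded")

status_unfolded=0
out_unfolded=$(bash -c "$unfolded" 2>/dev/null) || status_unfolded=$?

echo "folded:   $out_folded"
echo "unfolded: $out_unfolded (exit status $status_unfolded)"
```

In the second case only `start` should be printed, and the run should end with a "command not found" status, because the option is no longer attached to its command.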

2. Using the `|` syntax (block scalar, literal style):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'python:2.7'
        commands:
        - |
          echo "Running ${helloworld}."
          python "${helloworld}" --sleeptime ${sleeptime} \
                                 --inputfile "${inputfile}" \
                                 --outputfile "${outputfile}"

Available to try at https://github.com/reanahub/reana-demo-helloworld/pull/32/commits/3fdcc4717d709d5feff1dc1bdac96c6362fb3946.

This approach is closer to the command syntax used in Dockerfiles.

This is how it ends up looking inside the container:

$ kubectl get -o yaml pod 3a774fc3-5a83-4305-be68-edf93382e78d-wv579
...
  - command:
    - bash
    - -c
    - "cd /reana/users/00000000-0000-0000-0000-000000000000/workflows/6fb6fc46-a8d9-46b2-9bb6-6875e9537833
      ; echo \"Running code/helloworld.py.\"\npython \"code/helloworld.py\" --sleeptime
      2 \\\n                       --inputfile \"inputs/names.txt\" \\\n                       --outputfile
      \"outputs/greetings.txt\"\n "
...
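
The literal style works because the shell itself interprets the preserved line breaks: a backslash immediately followed by a newline is a line continuation, while a bare newline separates two commands. A small sketch of that behaviour, with `printf` as a hypothetical stand-in for the `python` call:

```shell
# Sketch only: printf stands in for the real python command (assumption).
# Single quotes keep the backslashes and newlines exactly as a literal
# block scalar would deliver them to `bash -c`.
literal='echo "Running demo."
printf "%s|" one \
             two \
             three'

out_literal=$(bash -c "$literal")
echo "$out_literal"
```

The three arguments end up on a single `printf` invocation, while `echo` runs as a separate command before it.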

3. Using no indicator (flow scalar, plain style):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'python:2.7'
        commands:
        - echo "Running ${helloworld}." &&
          python "${helloworld}" --sleeptime ${sleeptime}
                                 --inputfile "${inputfile}"
                                 --outputfile "${outputfile}"

Available to try at https://github.com/reanahub/reana-demo-helloworld/pull/32/commits/9bce4936f94e7f706d87fddb92eec2c2694b34f9.

This approach is the least powerful, since it has many limitations: to avoid ambiguity, many characters are forbidden. It is also possible to enclose the whole string in single or double quotes and escape all forbidden characters inside it (more info here).

$ kubectl get -o yaml pod a8f3a74c-8e1b-4266-8311-9b64e0f31120-4mdbl
...
  - command:
    - bash
    - -c
    - 'cd /reana/users/00000000-0000-0000-0000-000000000000/workflows/b953b28f-b7f2-44fa-a60c-8464fd65ad45
      ; echo "Running code/helloworld.py." && python "code/helloworld.py" --sleeptime
      2 --inputfile "inputs/names.txt" --outputfile "outputs/greetings.txt" '
...

In conclusion, I think we should definitely go for a block scalar, because option 3 will potentially end up being messy with escaped characters. Among block scalars, I would choose the literal style (option 2), since the indentation pitfall of the folded style (option 1) is very likely to cause problems for users. Moreover, the standard recommends the literal style for code blocks.

cc'ing @reanahub/developers since this directly affects users.

tiborsimko commented 5 years ago

@diegodelemos Nice summary; I also prefer option 2, where the use of backslashes seems rather intuitive. (E.g. Travis CI does the same for multi-line conditions: https://docs.travis-ci.com/user/conditions-v1#line-continuation-multiline-conditions.)

However dunno about the "visual non-splitting" of the echo and python commands in your second example; e.g. see its JSON representation:

$ yaml2json reana.yaml | jq -S '.workflow.specification.steps'
[
  {
    "commands": [
      "echo \"Running ${helloworld}.\"\npython \"${helloworld}\" --sleeptime ${sleeptime} \\\n                       --inputfile \"${inputfile}\" \\\n                       --outputfile \"${outputfile}\"\n"
    ],
    "environment": "python:2.7"
  }
]

The notion that the commands are multiple is lost there. Would be nice if commands were a list.

Seeing

 - command1 arg11 arg12
   command2 arg21 arg22 arg23 \
            arg24 arg25

people might treat it as:

 - command1 arg11 arg12 && \
   command2 arg21 arg22 arg23 \
            arg24 arg25
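
The difference between these two readings is not only visual. As a sketch, with `false` as a hypothetical stand-in for a failing command1: when the commands are separated by a newline, the second one still runs; when they are chained with `&&`, it does not.

```shell
# Sketch only: `false` stands in for a failing first command (assumption).

# Newline-separated, as a literal block scalar would deliver it.
newline_separated='false
echo second'

# What a reader might assume was meant.
and_chained='false && \
echo second'

status_newline=0
out_newline=$(bash -c "$newline_separated") || status_newline=$?

status_and=0
out_and=$(bash -c "$and_chained") || status_and=$?

echo "newline: '$out_newline' (exit status $status_newline)"
echo "and:     '$out_and' (exit status $status_and)"
```

The newline-separated form also reports success overall, since only the status of the last command is returned.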

Consider something longer, such as:

 - command1 arg11 arg12
 - command2 arg21 arg22 arg23 \
            arg24 arg25
 - command3 arg31 arg32
 - command4 arg41 arg42 arg43 \
            arg44 arg45
 - command5 arg51

...
