Reduce overhead introduced by hpcflow in jobscripts

aplowman commented 5 months ago

We can speed up execution in a few ways.

[ ] Invoke hpcflow fewer times
[ ] Reuse command files (e.g. js_0_act_0.sh) and source files (e.g. python scripts) across elements

Invoke hpcflow fewer times

In jobscripts, we currently call four hpcflow commands for each action run:

1. get-ear-skipped: skip=`wkflow_app internal workflow "$WK_PATH_ARG" get-ear-skipped $EAR_ID 2>> "$app_stream_file"`
2. write-commands: wkflow_app internal workflow "$WK_PATH_ARG" write-commands $SUB_IDX $JS_IDX $JS_act_idx $EAR_ID >> "$app_stream_file" 2>&1
3. set-ear-start: wkflow_app internal workflow "$WK_PATH_ARG" set-ear-start $EAR_ID >> "$app_stream_file" 2>&1
4. set-ear-end: wkflow_app internal workflow "$WK_PATH_ARG" set-ear-end $JS_IDX $JS_act_idx $EAR_ID "--" "$exit_code" >> "$app_stream_file" 2>&1

These should be combined into two commands, which will reduce the overhead of starting (and unpacking if using the built executable) hpcflow:

pre-run: combines get-ear-skipped, write-commands, and set-ear-start (1., 2., and 3. from above)
post-run: replaces set-ear-end (4. from above)
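
As a rough illustration of the pre-run idea, here is a minimal Python sketch that does the work of the three commands in a single process. Everything in it is hypothetical: `FakeStore` is an in-memory stand-in for the workflow's persistent store, and `pre_run`, its arguments, and the `ear_<id>.sh` file name are made up for the example and are not hpcflow's actual API.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the workflow's persistent store."""

    skipped: set = field(default_factory=set)
    started: set = field(default_factory=set)

    def get_ear_skipped(self, ear_id):
        return ear_id in self.skipped

    def set_ear_start(self, ear_id):
        self.started.add(ear_id)


def pre_run(store, ear_id, commands, cmd_dir):
    """Check skip status, write the command file, and record the run start.

    Returns the command file path, or None if the run is to be skipped.
    """
    if store.get_ear_skipped(ear_id):
        return None
    # Hypothetical file name; the real command files are named differently.
    cmd_path = Path(cmd_dir) / f"ear_{ear_id}.sh"
    cmd_path.write_text(commands)
    store.set_ear_start(ear_id)
    return cmd_path
```

The point of the sketch is only that the application start-up cost is paid once for the three steps rather than three times; the jobscript would then branch on whether a path was returned.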

Reuse command and source files

It should be possible to write source files once (in all cases?)
They could be placed in the artifacts directory
For command files, I think we'd need to return the path to the file as part of the pre-run command above.
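
A minimal sketch of the "write once" idea for source files, assuming a shared artifacts directory that all elements can read. The function name `ensure_source_file` and its arguments are hypothetical and only for illustration.

```python
import os
import tempfile
from pathlib import Path


def ensure_source_file(artifacts_dir, name, contents):
    """Write `contents` to artifacts_dir/name unless the file already exists.

    Returns the path to the (possibly pre-existing) file.
    """
    target = Path(artifacts_dir) / name
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place, so other runs should not observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(contents)
    os.replace(tmp_name, target)
    return target
```

If two runs race, both may write, but they would write the same content, so the result is the same. Whether this holds "in all cases" depends on the source file really being identical across elements.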

aplowman commented 5 months ago

We could potentially get it down to a single hpcflow invocation, if that command also executes the action commands via subprocess.run. We would wait for the subprocess to finish and then do post-run steps. Would need to check that the environment is inherited correctly on all three supported OSes.

(There is a way to avoid creating a sub-process by replacing the current process with os.exec*, but in that case we would then need another invocation to do post-run steps, so we wouldn't gain anything.)
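
A minimal sketch of the single-invocation approach, assuming the pre-run and post-run steps are available as plain callables (`pre_run` and `post_run` here are hypothetical placeholders, not hpcflow functions). `subprocess.run` with `env=None`, which is the default, lets the child inherit the current process environment.

```python
import subprocess


def run_action(commands, pre_run, post_run):
    """Run pre-run steps, the action commands in a child process, then post-run steps."""
    pre_run()
    # env=None (the default) means the child inherits this process's environment.
    completed = subprocess.run(commands, env=None)
    # subprocess.run blocks until the child exits, so the exit code is known here.
    post_run(completed.returncode)
    return completed.returncode
```

For example, `run_action([sys.executable, "-c", "import sys; sys.exit(3)"], pre, post)` would call `pre()`, run the child, and then call `post(3)`. This does not by itself confirm that the environment behaves as needed on every OS; that would still need testing.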

aplowman commented 5 months ago

For initial reference, I've timed parts of the jobscript for (single-action) array jobs of different sizes below, using the workflow defined at the bottom, which runs a simple Python script (taking a single parameter and outputting a single parameter). "action loop time" is the total time within the action loop (a loop with one iteration in this case, action 0); "element time" is the total run time of the jobscript.

N=10:

action 0: cut time:                                N=10         mean=0.0        std=0.00
action 0: get EAR skipped time:                    N=10         mean=11.8       std=9.92
action 0: write commands time:                     N=10         mean=4.4        std=4.32
action 0: set EAR start time:                      N=10         mean=3.7        std=3.41
action 0: commands execution time:                 N=10         mean=6.0        std=3.44
action 0: set EAR end time:                        N=10         mean=4.1        std=4.23
action loop time:                                  N=10         mean=30.1       std=21.92
element time:                                      N=10         mean=30.2       std=22.12

N=1000:

action 0: cut time:                                N=1000       mean=0.0        std=0.04
action 0: get EAR skipped time:                    N=1000       mean=2.9        std=4.06
action 0: write commands time:                     N=1000       mean=2.2        std=1.24
action 0: set EAR start time:                      N=1000       mean=2.1        std=1.10
action 0: commands execution time:                 N=1000       mean=3.7        std=3.55
action 0: set EAR end time:                        N=1000       mean=2.2        std=1.58
action loop time:                                  N=1000       mean=13.1       std=9.52
element time:                                      N=1000       mean=13.1       std=9.55

N=5000:

action 0: cut time:                                N=5000       mean=0.0        std=0.06
action 0: get EAR skipped time:                    N=5000       mean=4.1        std=9.70
action 0: write commands time:                     N=5000       mean=2.5        std=2.07
action 0: set EAR start time:                      N=5000       mean=2.2        std=1.79
action 0: commands execution time:                 N=5000       mean=1.8        std=1.47
action 0: set EAR end time:                        N=5000       mean=2.3        std=1.60
action loop time:                                  N=5000       mean=13.0       std=14.98
element time:                                      N=5000       mean=13.0       std=15.24

Results:

Workflow size (number of elements) does not significantly affect jobscript execution time, although it most likely would for a looped workflow (see https://github.com/hpcflow/hpcflow-new/issues/667). EDIT: it may affect get-ear-skipped.
Lots of variance in timings. I guess this is due to variation in the load on the Lustre filesystem.
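
To put the numbers above in context, this short script sums the mean times of the four hpcflow commands and expresses them as a fraction of the mean action loop time. The values are copied from the tables above; the dictionary keys are just labels chosen for this example.

```python
# Mean times in seconds, copied from the tables above.
timings = {
    10: {"get_ear_skipped": 11.8, "write_commands": 4.4, "set_ear_start": 3.7,
         "commands_execution": 6.0, "set_ear_end": 4.1, "action_loop": 30.1},
    1000: {"get_ear_skipped": 2.9, "write_commands": 2.2, "set_ear_start": 2.1,
           "commands_execution": 3.7, "set_ear_end": 2.2, "action_loop": 13.1},
    5000: {"get_ear_skipped": 4.1, "write_commands": 2.5, "set_ear_start": 2.2,
           "commands_execution": 1.8, "set_ear_end": 2.3, "action_loop": 13.0},
}

# The four steps that invoke hpcflow, as opposed to running the action commands.
HPCFLOW_STEPS = ("get_ear_skipped", "write_commands", "set_ear_start", "set_ear_end")


def overhead_fraction(row):
    """Fraction of the mean action loop time spent in the four hpcflow commands."""
    overhead = sum(row[step] for step in HPCFLOW_STEPS)
    return overhead / row["action_loop"]


for n, row in timings.items():
    print(f"N={n}: hpcflow commands take {overhead_fraction(row):.0%} of the action loop time")
```

With these means, the four hpcflow commands account for roughly 80%, 72%, and 85% of the action loop time for N=10, N=1000, and N=5000 respectively. Given the large standard deviations, these fractions are indicative only.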

Methodology:

Run on CSD3 using icelake nodes
Times are in seconds and come from differences in bash's $SECONDS variable recorded at various points in the jobscript, introduced temporarily in this branch: https://github.com/hpcflow/hpcflow-new/commits/fix/large-workflow/

Test workflow template:

doc: |
  A workflow for benchmarking the overhead introduced by hpcflow in running a Python
  script `N` times.

template_components:
  task_schemas:
    - objective: run_script
      inputs:
        - parameter: p1
      outputs:
        - parameter: p2
      actions:
        - environments:
            - scope:
                type: any
              environment: python_env
          script: <<script:/absolute/path/to/main_script_test_direct_in_direct_out.py>>
          script_exe: python_script
          script_data_in: direct
          script_data_out: direct

resources:
  any:
    scheduler_args:
      options:
        --time: 00:10:00
        --partition: <<var:partition[default=icelake]>>

tasks:
  - schema: run_script
    inputs:
      p1: 101
    repeats: <<var:N[default=1]>>

Script main_script_test_direct_in_direct_out.py:

def main_script_test_direct_in_direct_out(p1):
    # process
    p2 = p1 + 100

    # return outputs
    return {"p2": p2}

Reduce overhead introduced by hpcflow in jobscripts #670