payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Add `payu sync` cmd for syncing archive to a remote directory #360

Closed jo-basevi closed 8 months ago

jo-basevi commented 10 months ago

Add a new payu command to sync output to a remote directory. I've extended `expt.postprocess()`, which runs either after collation or at the end of archive if not collating, to also call `payu sync` if syncing is enabled.

An example config currently is:

```yaml
sync:
    enable: true
    # PBS specific
    # queue: copyq # Default
    # mem: 2GB # Default
    # ncpus: 1 # Default

    # rsync specific
    path: /g/data/{Project}/{user}/{model}/{experiment name}/archive  # Absolute path to sync data to
    # rsync_flags: -vrltoD --safe-links # Default
    exclude:
        - 'iceh.????-??-??.nc'
        - '*-DEPRECATED'
        - '*-DELETE'
        - '*-IN-PROGRESS'

    # exclude_uncollated: false # Default is true if collation is enabled.

    # Extra paths to sync to remote archive
    extra_paths:
        - metadata.yaml
        - run_summary*.csv

    # Sync permanently saved restarts
    # restarts: false # Default

    # remove_local_files: false # Default
    # remove_local_dirs: false # Default

    # Add a bare repo for the runlog
    # runlog: true # Default

userscripts:
    # User-defined scripts/commands called before syncing the model archive - useful for any post-processing
    sync: postprocess_archive.sh
```

COSIMA's sync_data.sh (https://github.com/COSIMA/1deg_jra55_iaf/blob/master/sync_data.sh) does some extra processing, e.g. it runs a run_summary Python script and concatenates ice daily files.

For 1deg_jra55_iaf, an example postprocess_archive.sh could look like:

```bash
#!/bin/bash

# Load/use required environment modules
if [ -z "$MODULESHOME" ]; then
    echo "warning: No Environment Modules found; skipping $command call."
    exit 1
fi
modulecmd="$MODULESHOME/bin/modulecmd bash"

# Modules for ice concatenation
eval "$($modulecmd load nco)"

# Modules for run summary
eval "$($modulecmd use /g/data/hh5/public/modules)"
eval "$($modulecmd load conda/analysis3)"
eval "$($modulecmd load python3-as-python)"

# Concatenate ice daily files
for d in archive/output*/ice/OUTPUT; do
    for f in $d/iceh.????-??-01.nc; do
        if [[ ! -f ${f/-01.nc/-IN-PROGRESS} ]] && [[ ! -f ${f/-01.nc/-daily.nc} ]]; then
            touch ${f/-01.nc/-IN-PROGRESS}
            echo "doing ncrcat -O -L 5 -7 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc}"
            ncrcat -O -L 5 -7 ${f/-01.nc/-??.nc} ${f/-01.nc/-daily.nc} \
                && chmod g+r ${f/-01.nc/-daily.nc} \
                && rm ${f/-01.nc/-IN-PROGRESS}
            if [[ ! -f ${f/-01.nc/-IN-PROGRESS} ]] && [[ -f ${f/-01.nc/-daily.nc} ]]; then
                for daily in ${f/-01.nc/-??.nc}; do
                    # mv $daily $daily-DELETE # rename individual daily files - user to delete
                    rm $daily
                done
            else
                rm ${f/-01.nc/-IN-PROGRESS}
            fi
        fi
    done
done

sourcepath="$PWD"
cd archive || exit 1

# Delete any cice log files that only have a 105-character header and nothing else
find output* -size 105c -iname "ice.log.task_*" -delete

# Update and commit run summary - do this last in case it doesn't work
cd $sourcepath
./run_summary.py --no_header
git add run_summary*.csv
git commit -m "update run summary"

echo "$0 completed successfully"
```
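The script above leans heavily on bash's `${var/pattern/replacement}` substitution to derive related filenames from each monthly file. A quick illustration of the idiom (the filename is a made-up example):

```shell
# Derive the daily-concatenated and in-progress marker names from a monthly
# file, as the concatenation loop does
f="iceh.1900-01-01.nc"
echo "${f/-01.nc/-daily.nc}"      # iceh.1900-01-daily.nc
echo "${f/-01.nc/-IN-PROGRESS}"   # iceh.1900-01-IN-PROGRESS
```

The substitution replaces the first (leftmost) match of the pattern, so only the trailing `-01.nc` day suffix is rewritten, not the month field.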

Edited: 2/12/23

coveralls commented 10 months ago

Coverage Status

coverage: 47.069% (+1.5%) from 45.539% when pulling 9ac075a31bc13b00c1ef0a067151d208e9125b3f on jo-basevi:200-restoring-remote-archive into a336968a44fc4eddd847a5c7750a40bd00e0d9b0 on payu-org:master.

aidanheerdegen commented 10 months ago

Closes #200

pep8speaks commented 8 months ago

Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2023-11-02 00:36:33 UTC
jo-basevi commented 8 months ago

Extra Notes/Questions

Syncing and postscript post-processing

Currently, if post-script is configured AND automatic syncing is enabled, the latest output (and latest restarts, if syncing restarts is enabled) will not be synced. To manually sync the latest outputs after the post-script job has completed, run: `payu sync`

Instead of post-script, to ensure post-processing automatically happens before syncing, there is a sync userscript hook. This runs at the start of the sync PBS job, before any rsync commands. The PBS resources/queue for this job can be modified under the sync namespace in config.yaml.

If post-processing needs to run on a different queue than the sync job (e.g. syncing to a remote machine needs network access while post-processing needs more than one CPU), or it requires significant resources, an option could be to call payu sync at the end of a post-script and disable payu calling sync automatically.

If the above won't work, an option could be to add to payu a post-processing command to submit a post-processing job which runs any shell/python post-processing userscripts and then either runs payu.sync() after completion or submits a separate copyq job to run payu.sync().

Local delete options

There is a config option to remove local files once synced. As this leaves behind empty directories and files excluded from rsync, there is another config option to remove local restart/output directories. This also happens after the directory has been successfully synced.

The latest output (and last "permanently-saved" restart and subsequent restarts) are protected from both local delete options. Any additional configured paths to sync in config.yaml are also protected. The last "permanently-saved" restart is protected because it is used as a checkpoint for date-based restart pruning.

A procedure to remove all local files and directories - including last outputs/restarts could be:

  1. Run payu sync to sync all outputs, or payu sync --sync-restarts to sync all restarts, as well.
  2. Check files have been successfully synced by checking logs for errors and inspect remote archive.
  3. Then run payu sweep --hard to delete everything in the local archive.

Default path for the remote archive?

~~If not syncing to a remote machine, I've automated the destination path for the remote archive to be /g/data/${PROJECT}/${USER}/${MODEL}/${EXPERIMENT_NAME}/archive.~~ I'm unsure if this is the best idea, as it could risk outputs being overwritten if the experiment name is not unique. I am thinking it might be best to remove this until experiments have a unique ID. Edit: This has been removed

Syncing to remote machine

I've kept most of the logic in the pre-existing remote_archive() for rsyncing to a remote machine, which seems to be in line with NCI's guide for setting up restricted rsync calls with ssh keys.

I've attempted to go through the above guide and rsync small files from gadi to my local desktop, but haven't had any success. So I've left the remote sync options (e.g. rsync_protocol, user and url) out of the documentation for now, as I haven't been able to test them.

Uncollated files

Currently, `--exclude *.nc.*` is always added to the rsync commands to ignore uncollated files. Would syncing uncollated files be an option? If so, I could add an ignore_uncollated flag to config.yaml, with True being the default value.
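For reference, uncollated outputs carry a per-rank numeric suffix, which is what the `*.nc.*` glob catches while leaving collated files alone. A small sketch (filenames are made up):

```shell
# Collated file vs uncollated per-rank pieces
mkdir -p glob_demo && cd glob_demo
touch ocean.nc ocean.nc.0000 ocean.nc.0001

ls *.nc.*    # matches only the uncollated pieces: ocean.nc.0000 ocean.nc.0001
```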

aekiss commented 8 months ago

Just checking - the current sync script rsyncs all outputs (not just the latest) every time it's called, to handle asynchronous collation. Does payu sync have the same behaviour?

jo-basevi commented 8 months ago

Just checking - the current sync script rsyncs all outputs (not just the latest) every time it's called, to handle asynchronous collation. Does payu sync have the same behaviour?

Yes, it rsyncs all the output directories every time it's called. I had looked into rsyncing only the current output (or current/prior restart, if collation is enabled), but realised there wasn't much point, as rsync can quickly pass over already-synced directories in the archive anyway.

jo-basevi commented 8 months ago

Thanks @aidanheerdegen for the review! I've just added in the requested changes. I've tested it locally but will have to wait until gadi is back up and running to run through a couple of actual model output tests.

jo-basevi commented 8 months ago

I've tested that it all works on gadi; I just added one line in the tests to disable attempting to sync the runlog. I also tidied up the commit history.