This sounds good to me. `work` would be on `scratch`. `archive` could be specified either on `scratch` (while setting up auto-syncing to a permanent data store for `archive`) or with the rest, presumably at `/g/data/--project--/--user--/`? Sounds like you just need a new `default_gdata_path` to set this second directory.
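As a purely illustrative sketch (the paths and environment variables here are assumptions, not payu defaults), that split would put the transient and permanent directories on different filesystems:

```bash
# Transient run directory on the time-limited scratch filesystem
ls /scratch/$PROJECT/$USER/mymodel/work

# Permanent archive on /g/data, located via a new default (e.g. default_gdata_path)
ls /g/data/$PROJECT/$USER/mymodel/archive
```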
If someone uses `scratch` for their `archive` and doesn't set up an auto-sync, can we get a warning set up about the time limit (or will NCI be providing some kind of warning for when files will be deleted)?
There is a function in `Experiment` named `remote_archive` which was once used to transfer model output over to MDSS, back when storage limits were so severe that model runs would often go over quota. As storage became less of a problem, this function became deprecated and is now basically a zombie function which cannot be called. But it would not be much work to reinstate something like this.

Which is to say that we've been here before, albeit under different constraints (space vs time), and it should not be a problem to either append this function to the end of `archive()` or integrate it into the `payu archive` command.
Ok, I think this is well overdue to be implemented.

The COSIMA issue linked above refers to adding syncing of restarts to their existing sync script:

https://github.com/COSIMA/01deg_jra55_iaf/blob/master/sync_data.sh

This script is called via the `postscript` hook in the `payu` config:

https://github.com/COSIMA/01deg_jra55_iaf/blob/master/config.yaml#L77

Not everything in that script is appropriate for inclusion in `payu`, but it gives a good overview of what syncing capability is required.
It is also important to note the `rsync` options that are used. It isn't feasible to use `-a` with `rsync`, as it automatically changes the group (project, in NCI speak) to that of the origin. In general `/g/data` directories have the setgid bit set, so that folders and files copied there get the same project code as the enclosing folder. It is important to keep this behaviour, as file and folder accounting is (mostly) done by the group (project) of the file/folder. Using `-a` with `rsync` undoes this behaviour.
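As a rough sketch (the exact flag choice here is an assumption, not necessarily what the COSIMA script uses), an invocation that copies recursively and preserves symlinks and timestamps, without `-a`'s group and permission preservation, could look like:

```bash
# Sketch only: -r recursive, -l copy symlinks as symlinks, -t preserve times.
# Omitting -g and -p (both implied by -a) lets the destination's setgid bit
# assign the enclosing folder's project group. Paths are illustrative.
rsync -rlt --safe-links \
    /scratch/$PROJECT/$USER/mymodel/archive/ \
    /g/data/$PROJECT/$USER/mymodel/archive/
```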
I've created issue #358 and cross-referenced it here. Syncing restarts that will later be deleted is an issue.

It could be that the logic for restarts is different than for outputs: the `rsync` machinery could be used to delete restarts at the destination once they're deleted/pruned at the source. This is problematic, though, if restarts are deleted by time-based purging of scratch, as the sync would then also delete them at the destination.
Another possibility is to have an option to not sync restarts, tidy them in a separate step, then turn on restart syncing and call something like `payu sync`.

If `payu` did support date-based restart frequencies then it would potentially have the information it needs to make sure it doesn't sync restarts that will later be pruned.
It would probably require inspection of restart files (so retaining a dependency on the `netCDF4` python module, or adding a dependency on `xarray`), and it might not work for all models, so it might need to be driver dependent.
Just documenting some notes here:

In the COSIMA issue above, it was also noted that payu doesn't automatically collate the most recent restart. If rsync were set to exclude uncollated files, then the most recent restart wouldn't be synced. So `payu collate -d archive/restart<num>` may need to be another step before syncing restarts.
One `payu sync` command could potentially collate the latest restart if required, run a user script before any syncing (to tidy up any restarts), and then finally rsync the restarts. Otherwise, auto-syncing restarts could be set up to only sync the restarts kept under the integer restart frequency (or under a date-based restart frequency).
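As a rough sketch of the manual equivalent of those `payu sync` steps (the restart number, tidy-script name, flags and paths are all illustrative assumptions):

```bash
# 1. Collate the most recent (uncollated) restart so it isn't excluded from the sync
payu collate -d archive/restart123        # restart number is illustrative

# 2. User-defined tidy step, e.g. pruning restarts that won't be kept
./tidy_archive.sh                         # hypothetical script name

# 3. Sync the archive to permanent storage (flags/paths as in the earlier sketch)
rsync -rlt --safe-links \
    /scratch/$PROJECT/$USER/mymodel/archive/ \
    /g/data/$PROJECT/$USER/mymodel/archive/
```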
To automatically sync outputs, the sync command could run where the postscript hook is run, which is at the end of archive if not collating, otherwise after collation.
The sync config could look something like:
```yaml
sync:
    enable: false            # default false
    # PBS specific:
    queue: copyq             # default copyq
    walltime: 10:00:00       # e.g. 10:00:00
    mem: 2GB                 # default 2GB
    ncpus: 1                 # default 1
    # rsync specific:
    directory: <destination dir to copy data to>
    rsync_flags: <string of any additional flags for rsync>
    # For exclusions, could add a string to rsync_flags
    # (e.g. "--exclude *.nc.* --exclude iceh.????-??-??.nc --exclude *-DEPRECATED --exclude *-DELETE --exclude *-IN-PROGRESS")
    # or have a parameter with a list of strings, e.g.:
    exclude:
        - '*.nc.*'
        - 'iceh.????-??-??.nc'
        - etc
    # For restarts
    restarts:
        enable: false          # default false
        collate_latest: false  # default false; collate latest restart prior to rsync

userscripts:
    # User-defined scripts/commands to be called before remote syncing of the model archive,
    # e.g. in access-om2: a script to concatenate cice daily files and delete cice log files
    # with only the 105-char header
    sync: tidy_archive.sh
```
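For context, here is a rough sketch of the standalone PBS copy job that such a config would effectively generate (the resource values mirror the defaults suggested above; flags, excludes and paths are illustrative assumptions):

```bash
#!/bin/bash
#PBS -q copyq
#PBS -l walltime=10:00:00
#PBS -l mem=2GB
#PBS -l ncpus=1
# Illustrative only: sync the archive to the configured destination directory,
# applying the configured excludes and any extra rsync flags.
rsync -rlt --safe-links \
    --exclude '*.nc.*' --exclude 'iceh.????-??-??.nc' \
    /scratch/$PROJECT/$USER/mymodel/archive/ \
    /g/data/$PROJECT/$USER/mymodel/archive/
```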
NCI is upgrading its HPC and at the same time replacing the `short` filesystem with `scratch`, which is time limited.

`payu` is so well written (props @marshallward) that there is only one mention of the actual path `/short` in the entire codebase:

https://github.com/payu-org/payu/blob/f567db9dad9fdd219ea632b76d7d715e8e65e457/payu/laboratory.py#L57

(we'll ignore hard coded paths in profiler modules)

So at the very minimum, to support the new machine `default_short_path` should be changed to `scratch`.

However, as `scratch` is time limited it is no longer a good fit for the current `payu` pattern, where `bin`, `input` and `codebase` are stored in the same `laboratory` as `work` and `archive`:

https://github.com/payu-org/payu/blob/f567db9dad9fdd219ea632b76d7d715e8e65e457/payu/laboratory.py#L45-L49

With strict time-limited deletion of files on `scratch`, the only directory that is a clear fit for this pattern is `work`. The `archive` directory could live on `scratch`, with some syncing to a permanent data store, but I think `payu` should also support `archive` not being physically co-located with `work`.

Thoughts?