payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Input file and executable tracking #90

Closed aidanheerdegen closed 5 years ago

aidanheerdegen commented 6 years ago

Goal: Reproducible runs

Method: Input file and executable tracking

Rationale: by knowing exactly which input files and executables were used for a run, it should be possible to reproduce that run.

Use cases:

Methods:

aidanheerdegen commented 6 years ago

I'm looking at the input and restart file logic in payu. If the copy_restarts and/or copy_inputs flags are set, payu copies input and restart files rather than symlinking them:

https://github.com/marshallward/payu/blob/master/payu/models/model.py#L175 https://github.com/marshallward/payu/blob/master/payu/models/model.py#L188

Currently oasis and cice5 are the only models to set these flags, and both are set to True.

I would like all models to only use symlinks, and I'll tell you why ...

My current manifest files look like this (just a small sample):

work/INPUT/topog.nc:
  fullpath: /short/v45/aph502/mom/input/mom10/mosaic/topog.nc
  hashes:
    md5: ed5a7bc7481d62da566f28ab87a9a72c
    nchash: 00069ef31d3a27cb65f007ddc42896c7
work/INPUT/u_10.0001.nc:
  fullpath: /short/v45/fbd581/mom/input/CNYF_v2/u_10.0001.nc
  hashes:
    md5: 8b5b0c55c0a9e455be6fb8ee8d11c78a
    nchash: e2e875ed603a086c4e29d11edf8b437d
work/INPUT/v_10.0001.nc:
  fullpath: /short/v45/fbd581/mom/input/CNYF_v2/v_10.0001.nc
  hashes:
    md5: 6f4c414b7d7259bfe599b25b3b8089c0
    nchash: d1ff14f07cb593000850e0afef37d00c

yamanifest is run by listing the local filepaths to check (in this case relative to the work directory). The code calls os.path.realpath() on each file and stores the resolved path to the actual file in fullpath.
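As a rough illustration of the record layout shown above, here is a minimal sketch that resolves each local path with os.path.realpath() and stores an MD5 digest alongside it. This is not yamanifest's actual code: the function names md5_of, manifest_entry and build_manifest are hypothetical, and the nchash value is left out because its algorithm is not described in this thread.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(local_path):
    """Build one manifest record (hypothetical helper, not the yamanifest API)."""
    # realpath() follows symlinks, so fullpath names the real file on disk.
    fullpath = os.path.realpath(local_path)
    return {"fullpath": fullpath, "hashes": {"md5": md5_of(fullpath)}}


def build_manifest(local_paths):
    """Map each local path to its manifest record."""
    return {path: manifest_entry(path) for path in local_paths}
```

Calling build_manifest(["work/INPUT/topog.nc"]) on a symlinked input would give a dictionary with the same shape as the sample entries, minus nchash.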

This is useful for a couple of reasons:

  1. If there is a manifest file in the payu control directory, we can use it to populate the work directory with symlinks to input files, and possibly restart files if we are re-running an experiment. This is an improvement over the current mode of operation, as the manifest file can be edited to include only the files used in the run. Currently all files in a specified input directory are symlinked to the work input directory even if they are not used. In many cases there is cruft, old versions, scripts, etc. in the input directory that is simply not required. An edited set of inputs in the manifest file means we know exactly what was required and used in the run.

  2. It gives us the opportunity to try to find existing copies of files if we clone an experiment, or if the original input file directories are moved, renamed, or exist in someone else's short space. When cloning other people's runs, we don't need to make any changes to the run config. As long as I have read access to the other person's input directories where the manifest file points, I can run.
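The first point above (populating the work directory from a manifest) could be sketched roughly as follows. This is an illustration under assumptions, not payu's implementation: link_inputs is a hypothetical name, the manifest is assumed to be an already-loaded dictionary in the format shown earlier, and its keys are assumed to be paths relative to a base directory.

```python
import os


def link_inputs(manifest, base_dir="."):
    """Symlink every manifest entry into place (hypothetical helper).

    Returns the local paths whose recorded fullpath no longer exists, so
    the caller can decide whether to search elsewhere or abort.
    """
    missing = []
    for local_path, entry in manifest.items():
        target = entry["fullpath"]
        if not os.path.exists(target):
            missing.append(local_path)
            continue
        link = os.path.join(base_dir, local_path)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        # Replace a stale link or file left over from a previous run.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)
    return missing
```

Only files named in the manifest are linked, so unused files in the source input directories never reach the work directory. The returned list of missing paths is where a search for relocated copies, as in the second point, could start.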

Once there are actual files copied into the work directory the manifest model I describe above breaks down. I can hack around it by forcing the fullpath to be where the file was copied from, but at that point the manifest is tracking the file listed in fullpath which is not guaranteed identical to the one it was copied to. It's a good bet, but not guaranteed.

I propose moving the copy_restarts and/or copy_inputs logic to the archive function. If we have issues with recursive links can this be solved by copying the respective files to the archive directory once the run is over? Would that work?

ping @nicjhan @marshallward

marshallward commented 6 years ago

I agree that symlinks should be used as much as possible for data files. I believe the copy_restarts and copy_inputs flags were added to OASIS and CICE5 in order to fix some particular design problems, but I had hoped that they would get fixed someday and we could return to symlinks.

And just for clarification, I do think that config (text) files should be copied, not linked, since they're small, easily tracked via git, and subject to change, but I think we agree on this.

Moving to archive is probably ok, though generally I have avoided copying the actual files in order to save space. So ideally, maybe we should never be copying anything?

aidanheerdegen commented 6 years ago

I'm trying to understand why the flags were required, so I can tell whether it is sufficient to do the copy after the run has finished, presumably to a restart directory. @nicjhan can you comment?

aidanheerdegen commented 6 years ago

Ok, crowd sourcing some more opinions on restart manifest usage.

Currently I have a separate manifest for input files and restart files (I will have one for executables as well, but that is trivial and not covered by the same usage logic).

The manifest for input files is, or should be, relatively static. Once generated it will not change much, unless an input file is altered in some way. The restart file manifest is a different case. It should be generated anew for each run, as the restarts should naturally change from run to run.

However, there is at least one scenario where this is not the case. If we wish to be able to reproduce a run, it will require using the restart manifest from that run and checking that all restart file hashes are correct.

I have a couple of ideas how to implement this, but am interested in opinions on which seems best, or suggestions for other approaches.

Restart file strategies:

  1. Delete the restart file manifest at the end of a successful run (after it has been added to the git repo and the repo committed). That way, the presence of a restart manifest signals that it should be used and that the file checksums must match; otherwise payu will flag an error and abort.

  2. Retain the restart file manifest at the end of each run and overwrite for a new run unless a flag is set in config.yaml, say reproduce: true, in which case it will do as above, and reuse the restart file manifest, and ensure all checksums are correct.

Option 2 has the drawback that the reproduce flag needs to be changed after the first successful re-run. This could be done automatically I suppose, but would this set a precedent? Are there any other occasions when payu changes/edits config.yaml?

Option 1 is relatively clean, but requires the correct restart manifest to be checked out from the git repo. This could be incorporated into a

payu run --reproduce -i <run_num>

or even

payu run --reproduce --id <githash>

option.

Option 1 has the drawback that it leaves the git repo in a strange state (with a deleted file), and the restart manifest file is not visible to users to inspect, existing in the control directory for only as long as the run lasts.

Comments @marshallward @nicjhan ?

aidanheerdegen commented 6 years ago

Actually, I think that was a long-winded way of me answering my own question. Use a command line flag instead of one in config.yaml. Duh.

marshallward commented 6 years ago

It's very early and I'm very sleepy but I would think the restart manifest should be kept in the restart directory, as a "receipt". Shouldn't that work?

Like I said, I'm very tired and might be missing some subtlety (or maybe it was already explained why this is not an option).

marshallward commented 6 years ago

No, I guess it would no longer be part of the repo (presumably what we want).

I reckon a single file that grows over time would be ok. No precedent for editing config, but no reason not to allow a manifest file to be changed

aidanheerdegen commented 6 years ago

Sorry, as I tried to allude to above, I answered my own question. The restart manifest will be recreated from scratch every time. I have some ideas to reuse manifest files from the directory of origin to speed things up, but essentially the manifest will be overwritten for every run EXCEPT when there is a command line option to tell payu to reproduce a run. In that case the logic for restarts will be the same as for inputs, which is to use the manifest to populate the work directory with symlinks to original files.
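The behaviour described here (regenerate the restart manifest on every run unless a reproduce option is given, in which case the existing manifest is reused and its checksums must match) might look something like the sketch below. It is not the code that was merged into payu: check_manifest, prepare_restart_manifest and the regenerate callback are hypothetical names, and only MD5 is checked.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest):
    """Return the local paths whose current MD5 differs from the recorded one."""
    return [
        local_path
        for local_path, entry in manifest.items()
        if file_md5(entry["fullpath"]) != entry["hashes"]["md5"]
    ]


def prepare_restart_manifest(existing, reproduce, regenerate):
    """Choose the restart manifest for this run (hypothetical helper).

    Normal runs call regenerate() and discard the old manifest. Reproduce
    runs keep the existing manifest and abort if any checksum has changed.
    """
    if not reproduce:
        return regenerate()
    changed = check_manifest(existing)
    if changed:
        raise RuntimeError("restart files differ from manifest: " + ", ".join(changed))
    return existing
```

Because the decision is passed in as the reproduce argument, it can come from a command line option, which is the approach settled on above, and config.yaml never has to be edited by payu.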

I think that will work well. Sorry for spamming you ... I'm sort of thinking out loud (if loud is writing it on the internet).

aidanheerdegen commented 5 years ago

Support added in https://github.com/marshallward/payu/pull/146