payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Proposal: explicitly support branches #330

Closed aidanheerdegen closed 6 months ago

aidanheerdegen commented 2 years ago

Currently payu doesn't prevent the use of branches in the git repo of the model experiment directory, but it has no explicit knowledge or support for it.

I propose changing the way payu names the archive and work directories, by appending the branch name to the model name used for both directories.

This has the advantage that a single experiment control directory can be used for perturbations/tests/modifications and they can happily co-exist as fully-formed experiments. Simply changing experiment with git branch will automatically switch between archive directories.

This would require changing the symbolic link to the archive directory when git branch is called. This could be done using git hooks.

aidanheerdegen commented 2 years ago

The idea of a unique experiment ID also suggested appending the experiment ID to uniquely identify work and archive directories

https://github.com/payu-org/payu/issues/191#issuecomment-623206691

So if creating a new branch also triggered a new unique experiment ID, that would cover this use case.

aidanheerdegen commented 2 years ago

This was discussed at the COSIMA II technical workshop today. It was suggested that if branch name were used as suggested, that this not be done for the default case of main or master branch, to make it backwards compatible.
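A minimal sketch of that backwards-compatible rule, assuming a hyphen separator. The function name `archive_name` and the separator are hypothetical, not existing payu code.

```python
# Branches that keep the legacy (un-suffixed) directory name.
DEFAULT_BRANCHES = {"main", "master"}


def archive_name(experiment, branch, sep="-"):
    """Return the name used for the work and archive directories.

    `sep` is an assumption; the thread does not settle on a separator.
    """
    if branch in DEFAULT_BRANCHES:
        return experiment  # unchanged for existing experiments
    return f"{experiment}{sep}{branch}"


print(archive_name("1deg_jra55do_ryf", "master"))
print(archive_name("1deg_jra55do_ryf", "perturb"))
```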

aidanheerdegen commented 2 years ago

An alternative to using git hooks to update the archive link would be to create a new payu branch command, which wraps git branch and does whatever housekeeping is necessary. That might mean generating a new UUID if this is a new branch, and updating the archive symlink.
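The symlink housekeeping could look something like the sketch below. The function `update_archive_link` and the link name `archive` are assumptions for illustration. The same function could be called from a git hook or from a wrapper command.

```python
import tempfile
from pathlib import Path


def update_archive_link(control_dir, archive_dir, link_name="archive"):
    """Point control_dir/link_name at archive_dir, replacing any old link."""
    link = Path(control_dir) / link_name
    if link.is_symlink():
        link.unlink()  # drop the link left by the previous branch
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symbolic link")
    # Only create the link once the branch actually has an archive.
    if Path(archive_dir).is_dir():
        link.symlink_to(archive_dir, target_is_directory=True)
    return link


# Demonstration in a throwaway directory; all names are made up.
with tempfile.TemporaryDirectory() as tmp:
    control = Path(tmp, "control")
    control.mkdir()
    for branch in ("ryf", "perturb"):
        archive = Path(tmp, "archive", f"exp-{branch}")
        archive.mkdir(parents=True)
        link = update_archive_link(control, archive)
        print(branch, "->", link.resolve().name)
```

Refusing to touch a real directory called archive is deliberate: only a symlink is ever removed, so run output cannot be deleted by switching branches.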

jo-basevi commented 8 months ago

Below are some ideas for experiment uuids and payu git branch support:

payu commands

payu clone [--keep-id] [--branch BRANCH_NAME] [--new-branch NEW_BRANCH_NAME] REPOSITORY [DIRECTORY]

  1. Clone control repo or a specific branch in control repo, i.e. --branch BRANCH_NAME.
  2. Unless specified not to, i.e. --keep-id, create a new experiment uuid, add to config.yaml and commit changes for new uuid
  3. If --new-branch NEW_BRANCH_NAME, cd into DIRECTORY and call payu checkout -b NEW_BRANCH_NAME

payu checkout [-b] BRANCH_NAME [COMMIT_HASH] [--restart RESTART_PATH]

  1. If -b, create new branch and create a new uuid. Check experiment archive path doesn't already exist. If COMMIT_HASH is specified, set this new branch to start from this commit.
  2. If --restart RESTART_PATH, set restart in config.yaml
  3. Change the symbolic links to archive (if it exists)
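To make the proposed checkout interface concrete, here is a rough argparse sketch. It only mirrors the usage line above. How payu actually wires up its subcommands is not shown in this thread, so the parser and the destination names are assumptions.

```python
import argparse


def build_checkout_parser():
    """Parser for: payu checkout [-b] BRANCH_NAME [COMMIT_HASH] [--restart RESTART_PATH]"""
    parser = argparse.ArgumentParser(prog="payu checkout")
    parser.add_argument("-b", dest="new_branch", action="store_true",
                        help="create BRANCH_NAME and a new experiment uuid")
    parser.add_argument("branch_name", metavar="BRANCH_NAME")
    parser.add_argument("commit_hash", metavar="COMMIT_HASH", nargs="?",
                        help="start the new branch from this commit")
    parser.add_argument("--restart", dest="restart_path", metavar="RESTART_PATH",
                        help="restart path to record in config.yaml")
    return parser


args = build_checkout_parser().parse_args(
    ["-b", "1deg_jra55do_ryf_specific_experiment"]
)
print(args.new_branch, args.branch_name, args.commit_hash, args.restart_path)
```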

payu branch

List branches and their associated experiment IDs (if they exist).

Example usage

$ payu clone --branch 1deg_jra55do_ryf FORKED_GIT_URL 1deg_jra55do_ryf_experiment
$ cd 1deg_jra55do_ryf_experiment
$ payu checkout -b 1deg_jra55do_ryf_specific_experiment

# or 

$ payu clone --branch 1deg_jra55do_ryf --new-branch 1deg_jra55do_ryf_specific_experiment FORKED_GIT_URL 1deg_jra55do_ryf_experiment

To list branches and uuids:

$ payu branch
* 1deg_jra55do_ryf_specific_experiment - uuid: 97dd2
1deg_jra55do_ryf - uuid: ae065

Note at this point, the git log could look like this:

commit <commit_id2> (HEAD -> 1deg_jra55do_ryf_specific_experiment)
    Add new experiment ID 97dd2 for new branch: 1deg_jra55do_ryf_specific_experiment

commit <commit_id1> (1deg_jra55do_ryf)
    Add new experiment ID ae065 (was 77190) for branch clone of 1deg_jra55do_ryf

commit <commit_id000> (origin/1deg_jra55do_ryf)
    Last commit of 1deg_jra55do_ryf
...

To avoid new uuid being created for 1deg_jra55do_ryf, use --keep-id in payu clone

Size of uuid (TODO): Check how small a uuid can be for uniqueness - can it go as small as 5 characters? An option is to use a longer uuid in config.yaml - but shorter version in experiment name used in archive.

Experiment Archive Name: If there's going to be multiple branches for 1deg_jra55do_ryf_experiment, and if the branch name is going to be descriptive, should the experiment name used for archive be the branch name e.g. {BRANCH_NAME}-{SHORT_UUID} or if on main/master branch {CONTROL_DIR}-{SHORT_UUID}?

Backwards compatibility: The idea is to create an ID for every experiment? So in payu setup, add and commit a uuid if it doesn't exist?

What should not change for old experiments is the experiment name used for archival (i.e use the default control directory name or whatever it has been set to in config.yaml)

For new experiments cloned/created with payu clone/branch, the commands could automatically set the experiment name in config.yaml, or have a new flag to include a short uuid in experiment name used for archival.

aidanheerdegen commented 8 months ago

cd into DIRECTORY and call payu checkout -b NEW_BRANCH_NAME

Note it is possible to run git commands from outside of the git directory with the -C option

https://git-scm.com/docs/git#Documentation/git.txt--Cltpathgt
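For example, a thin wrapper along these lines would avoid changing the working directory at all. The helper names `git_command` and `run_git` are hypothetical; `-C` is the documented git option linked above.

```python
import subprocess
from pathlib import Path


def git_command(repo_dir, *args):
    """Build a git command line that runs as if started in repo_dir."""
    return ["git", "-C", str(Path(repo_dir)), *args]


def run_git(repo_dir, *args):
    """Run git in repo_dir without changing the caller's working directory."""
    result = subprocess.run(
        git_command(repo_dir, *args),
        check=True,           # raise CalledProcessError if git fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


print(git_command("1deg_jra55do_ryf_experiment", "checkout", "-b", "NEW_BRANCH_NAME"))
```

Calling `run_git("1deg_jra55do_ryf_experiment", "checkout", "-b", "NEW_BRANCH_NAME")` would then create the branch in the freshly cloned directory.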

Check how small a uuid can be for uniqueness - can it go as small as 5 characters? An option is to use a longer uuid in config.yaml - but shorter version in experiment name used in archive.

Apologies if I was unclear, but we definitely want to use the full hash in config.yaml and metadata.yaml, as well as embed it in outputs etc. Use of the short hash would be for use with naming archive and work directories, listing experiments for the user and also when cloning.

In an ideal world it should be possible to payu checkout <hash> and use the minimal hash length to be unique within a repo. However this places an unreasonable burden on payu, so as discussed I think it will be sufficient to define the shortened length of the experiment uuid and then make sure it is unique for a particular repository. It still isn't bullet-proof: users could have the same experiment name in different local clones (or their own forks) and then attempt to sync to the same shared directory in /g/data. So we should probably err on the side of caution in terms of using shortened versions.

There are a number of packages/posts/code snippets which use an approach of converting the UUID to a number and then re-encoding using a larger base encoding to represent the number in a shorter string, e.g.

https://pypi.org/project/shortuuid/

https://github.com/Devskiller/friendly-id

(Effectively BASE64 but with dropping some of the less safe characters)

So typically this reduces a 36 character uuid4 string to 22 characters. It also means a truncated version carries more entropy per character than a truncated hexadecimal string.
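A standard-library sketch of the re-encoding idea, for illustration only. The 57-character alphabet and the `encode`/`decode` helpers are assumptions here, not the API of either package linked above. With 57 symbols each character carries about 5.8 bits rather than the 4 bits of a hex digit, which is why 22 characters are enough for a 128-bit UUID.

```python
import uuid

# Digits and letters without the easily confused 0, 1, I, O and l.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 57


def encode(u, length=22):
    """Re-encode the 128-bit integer value of a UUID in base 57."""
    number = u.int
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    # Most significant digit first, padded to a fixed width.
    return "".join(reversed(digits)).rjust(length, ALPHABET[0])


def decode(text):
    """Invert encode(): rebuild the UUID from its base-57 string."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return uuid.UUID(int=number)


u = uuid.uuid4()
short = encode(u)
print(len(str(u)), len(short), decode(short) == u)
```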

Maybe not worth the hassle, and not being an acknowledged standard, but thought it was interesting.

jo-basevi commented 8 months ago

I agree with having a longer uuid in config.yaml and a truncated version in experiment name for archive or work.

Yeah, I saw that shortuuid was recommended for having few collisions, generally being more readable and good to use in urls. It has a small time cost but that'll be insignificant in payu's use case. It also has significantly fewer collisions for truncated ids. Results from a quick test comparing truncated shortuuid with the built-in uuid.uuid4():

For 1000000 short uuid trials: 
Total uuid collisions: 0
Truncated uuids:
 length   collisions  
 4       : 56816       
 5       : 980         
 6       : 22          
 7       : 1           
 8       : 0           
 9       : 0           
 10      : 0           
For 1000000 uuid4 trials: 
Total uuid4 collisions: 0
Truncated uuids:
 length   collisions  
 4       : 934464      
 5       : 355694      
 6       : 29225       
 7       : 1902        
 8       : 141         
 9       : 141         
 10      : 5     
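The script used for the numbers above isn't included in the thread. The sketch below shows one way the uuid4 half could be reproduced, assuming a collision is counted as a truncated id that has already been seen (total ids minus distinct prefixes). That definition is consistent with the length 4 row: there are only 16^4 = 65536 possible four-character hex prefixes, and 1000000 - 65536 = 934464. It also assumes truncation was applied to the hyphenated string form, where the ninth character is always a hyphen, which would explain lengths 8 and 9 giving the same count.

```python
import uuid


def count_collisions(ids, length):
    """Count ids whose first `length` characters repeat an earlier prefix."""
    prefixes = [identifier[:length] for identifier in ids]
    return len(prefixes) - len(set(prefixes))


trials = 10_000  # far fewer than the 1000000 used above, to keep this quick
ids = [str(uuid.uuid4()) for _ in range(trials)]

print(f"For {trials} uuid4 trials:")
print(" length   collisions")
for length in range(4, 11):
    print(f" {length:<8}: {count_collisions(ids, length)}")
```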

To avoid over-writing output in remote directory in payu sync, an option could be to check for matching uuid prior to syncing files.

aidanheerdegen commented 8 months ago

Experiment Archive Name: If there's going to be multiple branches for 1deg_jra55do_ryf_experiment, and if the branch name is going to be descriptive, should the experiment name used for archive be the branch name e.g. {BRANCH_NAME}-{SHORT_UUID} or if on main/master branch {CONTROL_DIR}-{SHORT_UUID}?

Good question.

So .. digression

Common user workflow

Up to this point the most common (and encouraged) user workflow was to git clone an experiment, e.g. 1deg_jra55do_ryf, to a different target directory. So

git clone git@github.com:COSIMA/1deg_jra55_ryf.git 1deg_jra55do_ryf_experiment

The experiment name wasn't set in config.yaml, so this means the user would automatically have created a new experiment that was unique for them (uniqueness being constrained by the directory their experiment was cloned into, and the project code it was being run under).

I'll refer to this later as the legacy workflow.

ACCESS-OM3 Organisation

COSIMA (@aekiss and @micaeljtoliveira) are organising their closely related experiments in single repos, with branches for the different combinations of resolution and atmospheric forcing:

https://github.com/COSIMA/MOM6-CICE6

This is a good idea from a maintenance point of view: it reduces the number of repos and makes it simpler to alter related configurations by, for example, rebasing from a common shared ancestor.

However it (slightly) alters the currently used workflow: users will need to either do an additional step of checking out a specific branch for the experiment configuration they want, or include the branch name in the git clone with the --branch argument. The MOM6-CICE6 instructions suggest users clone into a directory named for their new experiment name, which is faithful to the old workflow. However they also suggest creating a new, uniquely named branch, so it is obvious this is a separate experiment.

Workflow Proposal: Utilising Branches

How can explicit support for branching work with the legacy workflow, but work well with the new COSIMA repo organisation?

From a user perspective the COSIMA repo organisation doesn't change much about how they work. Apart from being asked to work from a fork, they are still encouraged to clone into a local directory that is named for their proposed experiment name.

However the stated purpose of this issue was to allow users to have a single repository for their related experiments, and use branches for each unique experiment. So locally users were doing something similar to the COSIMA organisation, but whereas COSIMA has branches for very different configurations, users would be branching from a single model configuration.

If we think about this from a namespace point of view, from a user's perspective the cloned experiment represents a namespace from which perturbation experiments can be run. So it makes sense to utilise this idea to reduce the length and complexity of branch names by automatically utilising the experiment directory name. Branch names must be unique within a repo, so we could default to assuming the experiment

So a proposed workflow could be:

  1. Fork COSIMA repository
  2. payu clone --branch 1deg_jra55_ryf --new-branch perturb git@github.com:COSIMA/1deg_jra55_ryf.git 1deg_jra55_ryf

and alter the payu logic which generates the experiment name to include the branch name: {CONTROL_DIR}-{BRANCH_NAME}. So in this case the experiment name would be 1deg_jra55_ryf-perturb (or 1deg_jra55_ryf_perturb if using underscore to join them).

The issue proposing unique experiment ids suggested adding a shortened ID to the archive and work directories to allow for multiple experiments from a single repo. With this proposed workflow it would not be necessary to use the ID in this way as we have a 1:1 mapping from branches to experiment IDs. I can't think of a situation where the same branch would (a) have different experiment IDs and (b) want to retain a related archive directory.

However, this has the downside of potential namespace conflicts between researchers when copying to a shared space. This risk already exists, but could be mitigated by using experiment IDs in the naming of the archive.

So a belt and braces approach would be to use {CONTROL_DIR}-{BRANCH_NAME}-{UUID_SHORT}.
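As a sketch, the three-part name could be assembled like this. The function name, the example UUID and the short length of 5 are placeholders. The truncation length in particular is still an open question in this thread.

```python
import uuid
from pathlib import Path


def experiment_name(control_dir, branch, experiment_uuid, short_length=5, sep="-"):
    """Build {CONTROL_DIR}-{BRANCH_NAME}-{UUID_SHORT}.

    short_length and sep are assumptions, not agreed values.
    """
    control_name = Path(control_dir).name  # last component of the control path
    uuid_short = str(experiment_uuid)[:short_length]
    return sep.join([control_name, branch, uuid_short])


# Made-up UUID, used only to show the shape of the result.
example_uuid = uuid.UUID("97dd2f3a-1b2c-4d5e-8f90-123456789abc")
print(experiment_name("/path/to/1deg_jra55_ryf", "perturb", example_uuid))
```

The full UUID would still be what is recorded in config.yaml and metadata.yaml; only the directory name uses the truncated form.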

Backwards compatibility: The idea is to create an ID for every experiment? So in payu setup, add and commit a uuid if it doesn't exist?

Yes. I agree setup is the right place to do that.

What should not change for old experiments is the experiment name used for archival (i.e use the default control directory name or whatever it has been set to in config.yaml)

Yes, checking if running from main or master is one way to do that. There might be some legacy experiments which might use a run branch (or similar) which we might have trouble automatically detecting as legacy.

We could make a config.yaml option to specify if we're using this branching/experiment-id approach, but perhaps we can support legacy experiments by using an older version of payu.

We could add a payu version flag/requirement in config.yaml to check and use that as a flag for behaviour.

For new experiments cloned/created with payu clone/branch, the commands could automatically set the experiment name in config.yaml, or have a new flag to include a short uuid in experiment name used for archival.

See above.