payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
19 stars 26 forks source link

Adding experiment uuid, metadata and branch support #384

Closed jo-basevi closed 9 months ago

jo-basevi commented 10 months ago

As discussed in this issue (#330), this adds in support for branches in a control repository and storing metadata (experiment name, uuid, contact/email) in a metadata.yaml file. This should close #330 and #191

The proposed workflow is explained by @Aidan here

Experiment names

Experiment name: Name used for work and archive directories in laboratory (usually in /scratch/) Legacy experiment name: Control directory name or experiment value in config.yaml Proposed new experiment names:

Note: {CONTROL_DIR} would be replaced with the experiment value in config.yaml if it's defined.

Metadata file

Metadata file is checked or created/updated at experiment initialisation as the experiment name is used for archive/work paths in the laboratory directory. The experiment name is added to/read from metadata file: this will either be in the legacy or new name format.

A simple metadata.yaml could look like this:

uuid: MtniSbiaJgxoT2aCZVq3wu
experiment: ControlDirName-branchName-MtniS
contact: Example Name
email: example@email.com

Note: If user name and email are set in git config, these will also be added to the metadata file. It won't replace existing name and email unless they are filler values like "Your Name Here" etc - This is so they will work with existing metadata.yaml templates.

Setting up metadata uuid and experiment name

payu setup will only commit the metadata file if runlog is set to true. This is to support experiments that aren't git repositories, or development branches. If using payu checkout, payu clone or payu uuid commands, they will commit changes to the metadata file. This is to keep track of when uuid has changed.

New Workflow: Create new experiments from either of the following:

New Workflow: Continuation of pre-existing experiments - keep uuid unchanged:

Legacy Workflow with git clone/branch instead of payu clone/branch:

Command breakdown

`payu checkout --help` ``` usage: payu checkout [-h] [--model MODEL_TYPE] [--config CONFIG_PATH] [--laboratory LAB_PATH] [-b] [--start-from-restart RESTART_PATH] branch_name [start_point] A wrapper around git checkout. Create a new branch (if specified), checkout branch, setup experiment metadataand create/switch archive/work symlinks positional arguments: branch_name The name of the git branch to create/checkout start_point The new branch head will point to this commit optional arguments: -h, --help show this help message and exit --model MODEL_TYPE, -m MODEL_TYPE Model type --config CONFIG_PATH, -c CONFIG_PATH Configuration file path --laboratory LAB_PATH, --lab LAB_PATH, -l LAB_PATH The laboratory, this will over-ride the value given in config.yaml -b Create new branch --restart RESTART_PATH, -r RESTART_PATH The absolute restart path from which to start the model run ```
`payu clone --help` ``` usage: payu clone [-h] [--model MODEL_TYPE] [--config CONFIG_PATH] [--laboratory LAB_PATH] [-k] [--branch BRANCH] [--new-branch NEW_BRANCH_NAME] [--start-from-restart RESTART_PATH] repository local_directory A wrapper around git clone. Clones a control repository and setup new experiment metadata positional arguments: repository The repository to clone from. This can be either a local path or git url local_directory The directory to clone into optional arguments: -h, --help show this help message and exit --model MODEL_TYPE, -m MODEL_TYPE Model type --config CONFIG_PATH, -c CONFIG_PATH Configuration file path --laboratory LAB_PATH, --lab LAB_PATH, -l LAB_PATH The laboratory, this will over-ride the value given in config.yaml -k, --keep-uuid If an experiment uuid exists, leave it unchanged --branch BRANCH, -B BRANCH Clone and checkout this branch --new-branch NEW_BRANCH_NAME, -b NEW_BRANCH_NAME The name of the git branch to create and checkout --restart RESTART_PATH, -r RESTART_PATH The absolute restart path from which to start the model run ```
`payu branch --help` ``` usage: payu branch [-h] [--config CONFIG_PATH] [--verbose] List git branches and corresponding metadata optional arguments: -h, --help show this help message and exit --config CONFIG_PATH, -c CONFIG_PATH Configuration file path --verbose, -v Display all contents of metadata file ```

Third party libs

In the code I've added dependencies on a couple third-party libraries:

Short(ish) Summary:

Currently the code uses the experiment name that is stored in metadata file, for creating archive and work paths. The alternatives to having an experiment name in metadata and instead to auto-generate them, could be to add a flag in config to use, or not use, branch-uuid aware names. A specific version of payu could also be used with the new version of payu only using branch-uuid aware experiment names but easy access to older versions of payu isn't currently set up.

TODO:

pep8speaks commented 10 months ago

Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 39:80: E501 line too long (93 > 79 characters)

Comment last updated at 2023-12-22 01:00:15 UTC
coveralls commented 10 months ago

Coverage Status

coverage: 52.591% (+5.2%) from 47.415% when pulling db2d0f1d42c9ec0f2531d766e0defb07a3668978 on jo-basevi:191-experiment-uuids into d738c9fd2891f6a839324026229ce36d1b3dc33f on payu-org:master.

aidanheerdegen commented 10 months ago

Metadata file

I like the idea of using metadata.yaml as the "source of truth" for UUIDs. It makes the logic of when to generate a new UUID a lot clearer/cleaner. Nice!

If using payu checkoutpayu clone or payu uuid commands, they will commit changes to the metadata file. This is to keep track of when uuid has changed.

Makes sense.

Create branch with payu checkout -b NEW_BRANCH_NAME, where -b flag is for create a new branch

Do we need a -b flag? I think just assume user wants to create a new branch (and therefore experiment) and let them know? Only downsides I can think of are spurious creation of directories etc with a typo. Not a big issue if we provide easy ways to clean up. Could have branch name as an optional argument to sweep?

payu sweep branch

To delete all the lab stuff:

payu sweep branch --hard

So probably need a nuclear option:

payu sweep --all --hard

with a serious warning/confirmation dialog.

Continuation of pre-existing experiments

  • If wishing to run branch from a pre-existing local control directory and start from a restart: payu checkout BRANCH --start-from-restart RESTART_PATH.

This is interesting, a "crossing of the rubicon" as I don't believe we've tried to write to config.yaml before. I can see why you need ruayml, to round-trip config.yaml.

Do we want to go here? Could we use the restart manifest for this?

Small point, could --start-from-restart just be --restart to save a bit of typing?

If experiment branch is in forked remote repository and a local archive does exist:
payu clone --keep-uuid --branch BRANCH_NAME FORKED_REPO CONTROL_DIR.

I don't have a problem with this, but it does require synchronisation of multiple bits of information so CONTROL_DIR, BRANCH_NAME and shortpath (or $PROJECT) match the existing archive. A lot could go wrong!

The payu clone could just have FORKED_REPO and CONTROL_DIR as the arguments and infer BRANCH_NAME from the CONTROL_DIR? Would that help with making sure CONTROL_DIR and BRANCH_NAME are consistent?

It might make sense to just check if the corresponding CONTROL_DIR exists. I guess it will throw an exception if it is wrong, but in what state will that leave the experiment repository?

If experiment branch is in forked repo and local archive does not exist:
payu clone --keep-uuid --branch BRANCH_NAME --start-from-restart RESTART_PATH FORKED_REPO CONTROL_DIR
Note: --keep-uuid in clone command will leave metadata file unchanged if uuid exists. If the
name of the control directory has changed, this will raise an error on checks on the experiment name.

Again could --start-from-restart just be --restart? Similar to above there is a lot going on here, though I recognise you've specified this will check for errors.

  • To start a new experiment (create a new uuid) but opt of branch-id-aware experiment names,
    run payu uuid --legacy.

We already have payu init. Should that be used instead, and add a --legacy option? Do we need to be able to call payu uuid at any other time? If a user wants a new UUID generated I guess they can always just remove the existing UUID from the metadata.yaml.

Future features could include listing branch names with corresponding uuid.

I think this is an important feature to include in the first iteration of the branching model. Maybe that was always the intention, and if so I would agree.

Third party libs

shortuuid

In your test It was clear that short uuid was a lot better for avoiding collisions with truncated hashes.

In terms of what is acceptable, I think 1 in 20 is probably too high (shortuuid length 4), but 1 in 1000 seems fine to me (shortuuid 5) and 1 in 500000 for shortuuid 6 is probably overkill. Using the same criteria then uuid4 would need to be at least length 7 (1 in 500).

So the decision to use shortuuid comes down to 5 v 7 characters in the shortened URL, adding another dependency, does it have to be supported? (not sure it matters), and maybe legibility? Will it help users if it is easier to remember? Not sure which is easier TBH: 5 mixed case letters and numbers, or 7 numbers.

gitpython

Sounds like a good idea. Makes the code much simpler and more robust.

ruamel.yaml

Only really necessary if we need to modify config.yaml and round-trip comments etc, but I have noticed it before and it does seem a nice implementation.

So .. I haven't had a chance to have a good look at the code yet, this is mostly just commenting on your comprehensive and well structured summary of the problem and your solution.

jo-basevi commented 10 months ago

Thanks @aidanheerdegen for those comments!

Do we need a -b flag? I think just assume user wants to create a new branch (and therefore experiment) and let them know? Only downsides I can think of are spurious creation of directories etc with a typo.

I've added in extra error messages if the branch already exists, to remove the -b flag if intending to checkout the existing branch. Conversely, if branch name does not exist locally or in remote repository, a message to suggest adding the new branch -b flag to create a new branch. I personally prefer -b as it's explicit about what the command will do, its similar to git checkout and nothing will be changed if it fails. I also feel the error will be a lot more visible to the user rather than logging out that a new branch was created if it doesn't already exist. It also catches the case if the user wants to create a new branch that already exists in the remote repository.

Let me know if you still think its better to leave -b out, and I can remove it.

payu sweep branch

Would that just remove a local branch from git?

payu sweep branch --hard

This may not be needed as currently if there's a branch-uuid aware experiment name in metadata, payu sweep --hard will remove what is in the lab archive under that experiment name.

Is the intention of this command so it can be run from another branch?

Do we want to go here? Could we use the restart manifest for this?

Thanks, that's good to know - I'll have a look into that.

I don't have a problem with this, but it does require synchronisation of multiple bits of information so CONTROL_DIR, BRANCH_NAME and shortpath (or $PROJECT) match the existing archive. A lot could go wrong!

Yeah I agree, a lot would go wrong. I somewhat hope that --keep-uuid wouldn't be needed often (if at all).. Currently I don't have any good ideas on how to improve it though I will try keep thinking.

The payu clone could just have FORKED_REPO and CONTROL_DIR as the arguments and infer BRANCH_NAME from the CONTROL_DIR? Would that help with making sure CONTROL_DIR and BRANCH_NAME are consistent?

I am not sure on that as I think the default branch should be the main branch of the forked repo. Also having branch name equal control directory, would make the experiment name start with ${CONTROL-DIR}-${CONTROL-DIR}.

Again could --start-from-restart just be --restart?

Yup, I can change that.

We already have payu init. Should that be used instead, and add a --legacy option? Do we need to be able to call payu uuid at any other time? If a user wants a new UUID generated I guess they can always just remove the existing UUID from the metadata.yaml.

payu init currently initialises the laboratory directories but not the specific experiment work and archive directories. So I previously thought creating metadata there was a little out of place. Though it could be added on there as metadata class only requires Laboratory and the configuration path.

Ideally payu checkout/clone should be used for creating new experiments (by extension new uuids) vs using payu uuid. Though removing it directly from metadata is also a way to achieve a new uuid. So yeah, maybe it'll be good to remove.

aidanheerdegen commented 10 months ago

Thanks @aidanheerdegen for those comments!

Do we need a -b flag? I think just assume user wants to create a new branch (and therefore experiment) and let them know? Only downsides I can think of are spurious creation of directories etc with a typo.

I've added in extra error messages if the branch already exists, to remove the -b flag if intending to checkout the existing branch. Conversely, if branch name does not exist locally or in remote repository, a message to suggest adding the new branch -b flag to create a new branch. I personally prefer -b as it's explicit about what the command will do, it's similar to git checkout and nothing will be changed if it fails. I also feel the error will be a lot more visible to the user rather than logging out that a new branch was created if it doesn't already exist. It also catches the case if the user wants to create a new branch that already exists in the remote repository.

This all sounds very sensible and helps the user a lot, so agree 100%

Let me know if you still think its better to leave -b out, and I can remove it.

You've convinced me. Keep it.

payu sweep branch

Would that just remove a local branch from git?

I was trying to think of how to undo remove a branch and clean up artefacts, but this doesn't seem like a very well thought through suggestion.

payu sweep branch --hard

This may not be needed as currently if there's a branch-uuid aware experiment name in metadata, payu sweep --hard will remove what is in the lab archive under that experiment name.

Good point.

Is the intention of this command so it can be run from another branch?

My thinking was that there was a high likelihood of creating a bunch of branches, with corresponding work and archive directories, and we should provide some commands to make cleaning them up easy. So yeah I think we might need a way to sweep --hard other branches, or all branches (with warnings!).

Do we want to go here? Could we use the restart manifest for this?

Thanks, that's good to know - I'll have a look into that.

Could we repurpose the reproduce framework for this?

Another option is to take restart out of config.yaml. Seems extreme though. Hrm.

I don't have a problem with this, but it does require synchronisation of multiple bits of information so CONTROL_DIR, BRANCH_NAME and shortpath (or $PROJECT) match the existing archive. A lot could go wrong!

Yeah I agree, a lot would go wrong. I somewhat hope that --keep-uuid wouldn't be needed often (if at all).. Currently I don't have any good ideas on how to improve it though I will try keep thinking.

Thanks. You make a good point about it being a pretty unusual case.

The payu clone could just have FORKED_REPO and CONTROL_DIR as the arguments and infer BRANCH_NAME from the CONTROL_DIR? Would that help with making sure CONTROL_DIR and BRANCH_NAME are consistent?

I am not sure on that as I think the default branch should be the main branch of the forked repo. Also having branch name equal control directory, would make the experiment name start with ${CONTROL-DIR}-${CONTROL-DIR}.

Sorry, when I said "infer BRANCH_NAME from the CONTROL_DIR" I should have said "infer BRANCH_NAME from information in the CONTROL_DIR, such as the metadata.yaml?

Again could --start-from-restart just be --restart?

Yup, I can change that.

Tah. If you don't agree though feel free to push back.

We already have payu init. Should that be used instead, and add a --legacy option? Do we need to be able to call payu uuid at any other time? If a user wants a new UUID generated I guess they can always just remove the existing UUID from the metadata.yaml.

payu init currently initialises the laboratory directories but not the specific experiment work and archive directories. So I previously thought creating metadata there was a little out of place. Though it could be added on there as metadata class only requires Laboratory and the configuration path.

Good point. TBH I've always thought it was an omissions for payu init not to create the archive and symlinks. There is a lot of information contained in the those symlinks for the user, and they don't get to see them until they payu run (or maybe payu sweep?).

If you think it is reasonable to add that to payu init then I'm all for it. Happy to hear counter-arguments though.

Ideally payu checkout/clone should be used for creating new experiments (by extension new uuids) vs using payu uuid. Though removing it directly from metadata is also a way to achieve a new uuid. So yeah, maybe it'll be good to remove.

I've already piled so much other stuff into this, this is my concession to trying to simplify it! :)

But I do think if we can not make another new command it is probably worth (not) doing so.

Perhaps if a user wants a new UUID, running payu init in an existing experiment could do that? With a warning and confirmation if -f/--force isn't used.

jo-basevi commented 10 months ago

Re using restart manifest and reproduce framework for setting restart path.

Ok I’ve had a good look into using restart manifests and the reproduce framework. If I was able to get a restart manifest created in payu checkout, I think it would the require reproduce: restart: True in config.yaml to use the files in the restart manifest and add them to work/INPUT in the next payu setup/run. But I think after the initial run, that configuration would need to be removed so it uses the latest restart files in archive?

Currently prior_restart_path is set using existing restarts in archive or restart in config.yaml. A bunch of model drivers use prior_restart_path in their respective setups (e.g. UM, mitgcm, CICE, mom6). If reproduce: restart: true, prior_restart_path will be None (as there’s no restarts in archive if new branch and no restart in config), does everything still work fine? I guess this more of a question on how reproduce with restarts works in general…

Anyway, I think no matter how I do this (using restart config or using manifests), still would require setting something in config.yaml. An alternative to modifying config.yaml could be creating a symlink in archive to the restart path.

Sorry, when I said "infer BRANCH_NAME from the CONTROL_DIR" I should have said "infer BRANCH_NAME from information in the CONTROL_DIR, such as the metadata.yaml?

I’m not sure how that would work as the metadata.yaml would be branch specific e.g. would have to check out the branch first before reading the metadata.

My thinking was that there was a high likelihood of creating a bunch of branches, with corresponding work and archive directories, and we should provide some commands to make cleaning them up easy. So yeah I think we might need a way to sweep --hard other branches, or all branches (with warnings!).

Ok, yeah I am thinking that option might need a few user prompts of Are you sure you want to remove this archive directory? as it could delete quite a bit depending on how many local branches there are.

Good point. TBH I've always thought it was an omissions for payu init not to create the archive and symlinks. There is a lot of information contained in the those symlinks for the user, and they don't get to see them until they payu run (or maybe payu sweep?). If you think it is reasonable to add that to payu init then I'm all for it

Yeah, I can have a look into that and I'll remove that extra payu uuid command to try keep things more simple.

aidanheerdegen commented 10 months ago

Re using restart manifest and reproduce framework for setting restart path.

Ok I’ve had a good look into using restart manifests and the reproduce framework. If I was able to get a restart manifest created in payu checkout, I think it would the require reproduce: restart: True in config.yaml to use the files in the restart manifest and add them to work/INPUT in the next payu setup/run. But I think after the initial run, that configuration would need to be removed so it uses the latest restart files in archive?

As it is currently implemented that is true. I was thinking about changing that ... but the logic of how to do that does my head in.

Currently prior_restart_path is set using existing restarts in archive or restart in config.yaml. A bunch of model drivers use prior_restart_path in their respective setups (e.g. UM, mitgcm, CICE, mom6). If reproduce: restart: true, prior_restart_path will be None (as there’s no restarts in archive if new branch and no restart in config), does everything still work fine? I guess this more of a question on how reproduce with restarts works in general…

Oh dear. I didn't remember (or realise) it was being abused quite so much.

Anyway, I think no matter how I do this (using restart config or using manifests), still would require setting something in config.yaml. An alternative to modifying config.yaml could be creating a symlink in archive to the restart path.

Symlinks is how people have done it in the past, and probably still do so.

Ok. It seems too difficult/convoluted/potentially dangerous to do anything other than set restart: in config.yaml, so keep going with that I think. It is clear what has been done, and can be un-done by git if required.

Sorry to add noise/confusion.

Sorry, when I said "infer BRANCH_NAME from the CONTROL_DIR" I should have said "infer BRANCH_NAME from information in the CONTROL_DIR, such as the metadata.yaml?

I’m not sure how that would work as the metadata.yaml would be branch specific e.g. would have to check out the branch first before reading the metadata.

Good point. It is possible to interrogate a file in another branch without checking out the whole branch (can checkout to stdout and read into a buffer for example). I'm not saying that is a good idea, just that it is possible.

But your original point is a good one. This sounds difficult so should be fine to just warn if there are inconsistency and leave it at that.

My thinking was that there was a high likelihood of creating a bunch of branches, with corresponding work and archive directories, and we should provide some commands to make cleaning them up easy. So yeah I think we might need a way to sweep --hard other branches, or all branches (with warnings!).

Ok, yeah I am thinking that option might need a few user prompts of Are you sure you want to remove this archive directory? as it could delete quite a bit depending on how many local branches there are.

👍

Yeah, I can have a look into that and I'll remove that extra payu uuid command to try keep things more simple.

If you think that sounds do-able and a good idea then sounds good to me.

jo-basevi commented 10 months ago

Thanks again for all the feedback! As discussed, I will park the work for the feature to add in options to sweep all branches in a local repo. The adding metadata to payu init isn't urgent either I think as metadata setup is run in Experiment initialisation in payu setup/run/sweep commands. If on an existing branch with an existing archive and have not moved onto the branch using payu checkout, you can use payu checkout CURRENT_BRANCH_NAME and that would setup the symlinks.

I think the code is ready for review though I do intend to add documentation and more tests for the metadata class.

aidanheerdegen commented 10 months ago

The access-nri-intake-catalog has a schema for experiments

https://github.com/ACCESS-NRI/schema/blob/main/experiment_asset.json

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The name of the experiment (string)"
        },
        "experiment_uuid": {
            "type": "string",
            "format": "uuid",
            "description": "Unique uuid for the experiment (string)"
        },
        "description": {
            "type": "string",
            "description": "Short description of the experiment (string, < 150 char)"
        },
        "long_description": {
            "type": "string",
            "description": "Long description of the experiment (string)"
        },
        "model": {
            "type": "array",
            "items": {"type": ["string", "null"]},
            "description": "The name(s) of the model(s) used in the experiment (string)"
        },
        "realm": {
            "type": "array",
            "items": {
                "oneOf": [
                    {"type": "null"},
                    {
                        "type": "string",
                        "enum": [
                            "aerosol",
                            "atmos",
                            "atmosChem",
                            "land",
                            "landIce",
                            "none",
                            "ocean",
                            "ocnBgchem",
                            "seaIce",
                            "unknown"
                        ]
                    }
                ]
            },
            "description": "The realm(s) included in the experiment (string)"
        },
        "frequency": {
            "type": "array",
            "items": {
                "oneOf": [
                    {"type": "null"},
                    {
                        "type": "string",
                        "oneOf": [
                            {
                                "pattern": "^fx$"
                            },
                            {
                                "pattern": "^subhr$"
                            },
                            {
                                "pattern": "^\\d+hr$"
                            },
                            {
                                "pattern": "^\\d+day$"
                            },
                            {
                                "pattern": "^\\d+mon$"
                            },
                            {
                                "pattern": "^\\d+yr$"
                            },
                            {
                                "pattern": "^\\d+dec$"
                            }
                       ]
                    }
                ]
            },
            "description": "The frequency(/ies) included in the experiment (string)"
        },
        "variable": {
            "type": "array",
            "items": {
                "type": ["string", "null"]
            },
            "description": "The variable(s) included in the experiment (string)"
        },
        "nominal_resolution": {
            "type": "array",
            "items": {"type": ["string", "null"]},
            "description": "The nominal resolution(s) of model(s) used in the experiment (string)"
        },
        "version": {
            "type": ["number", "string", "null"],
            "description": "The version of the experiment (number, string)"
        },
        "contact": {
            "type": ["string", "null"],
            "description": "Contact name for the experiment (string)"
        },
        "email": {
            "type": ["string", "null"],
            "description": "Email address of the contact for the experiment (string)"
        },
        "created": {
            "type": ["string", "null"],
            "description": "Initial creation date of experiment (string)"
        },
        "reference": {
            "type": ["string", "null"],
            "description": "Citation or reference information (string)"
        },
        "license": {
            "type": ["string", "null"],
            "description": "License of the experiment (string)"
        },
        "url": {
            "type": ["string", "null"],
            "description": "Relevant url, e.g. github repo for experiment configuration (string)"
        },
        "parent_experiment": {
            "type": ["string", "null"],
            "description": "experiment_uuid for parent experiment if appropriate (string)"
        },
        "related_experiments": {
            "type": "array",
            "items": {
                "type": ["string", "null"]
            },
            "description": "experiment_uuids for any related experiment(s) (string)"
        },
        "notes": {
            "type": ["string", "null"],
            "description": "Additional notes (string)"
        },
        "keywords": {
            "type": "array",
            "items": {
                "type": ["string", "null"]
            },
            "description": "Keywords to associated with experiment (string)"
        }
    },
    "required": [
        "name",
        "experiment_uuid",
        "description",
        "long_description"
    ],
    "additionalProperties": false
}

We should try and match that as much as possible. So that would mean experiment_uuid and parent_uuid. Sorry for the late notice.

jo-basevi commented 10 months ago

Ok that is really good to know! I'll need to do a few name updates. uuid -> experiment_uuid, previous_uuid -> parent_experiment. Looks like uuid.uuid4() format is used for UUID's so I'll change that too rather than using the short_uuid lib. So if experiment_uuid: 658c95cc-c299-450c-82a1-b2b2308f7c6e, could use the first part 658c95cc in experiment names?

jo-basevi commented 10 months ago

Hi @dougiesquire, I have a few questions on the intake catalogue fields, is that's ok? Firstly, are there any fields that are auto-generated from metadata.yaml in outputs?

In this branch, we are intending adding a metadata.yaml file with a experiment_uuid to payu control directories. For pre-existing experiments in payu setup/run, if there's not an existing UUID field in the metadata, it would add a new experiment_uuid. Would that work ok for pre-existing experiments?

Is name field usually the control directory name and if so, would this ever be different from the name of directory used for archived outputs? For new experiments, we were thinking of adding the branch name and an short part of the uuid to the archive directory name - so payu can work with branches for related experiments inside one control repo.

In this branch, if email and contact, are not in metadata.yaml, payu will check if there's a name and email in git config and add it to the metadata file. In future, there's a few other fields payu could possible generate if they aren't pre-existing:

I could also add in the metadata.yaml template (access-nri-intake-catalog/metadata.yaml), or at least the template values for required fields, if payu ends up creating a new metadata.yaml file.

Should metadata.yaml be added to the base archived output directory or/and added to each output subdirectory, e.g. output000?

Sorry this has turned into a lot of questions! Just want to make sure I'm not going to screw up the metadata.yaml files.

aidanheerdegen commented 10 months ago

Looks like uuid.uuid4() format is used for UUID's so I'll change that too rather than using the short_uuid lib. So if experiment_uuid: 658c95cc-c299-450c-82a1-b2b2308f7c6e, could use the first part 658c95cc in experiment names?

Yeah that sounds fine I think. It is certainly more standard and 8 characters gives enough randomness I think.

dougiesquire commented 10 months ago

Hi @jo-basevi. Thanks for checking in. I'll do my best to answer your questions below.

Firstly, are there any fields that are auto-generated from metadata.yaml in outputs?

Yes. Basically if any required metadata is not available within a given datastore, the catalog will look for it in the metadata.yaml file. This is used to populate name, description and model (in some cases - i.e. if the model name(s) is not available from the datastore).

In this branch, we are intending adding a metadata.yaml file with a experiment_uuid to payu control directories. For pre-existing experiments in payu setup/run, if there's not an existing UUID field in the metadata, it would add a new experiment_uuid. Would that work ok for pre-existing experiments?

Yes. We'll (obviously) just need to be sure that the metadata.yaml is kept with output data, e.g. if/when the data are moved to a different location.

Is name field usually the control directory name and if so, would this ever be different from the name of directory used for archived outputs? For new experiments, we were thinking of adding the branch name and an short part of the uuid to the archive directory name - so payu can work with branches for related experiments inside one control repo.

In the past, the name field has just been whatever the user decides to call the experiment (usually not the archive directory name). But I really like the idea of having a default built from the directory name, branch name and short uuid. From a cataloguing perspective, these just need to be unique (we don't want two products with the same name). A user can always change the name if they want to (ideally prior to adding to the catalogue).

In this branch, if email and contact, are not in metadata.yaml, payu will check if there's a name and email in git config and add it to the metadata file. In future, there's a few other fields payu could possible generate if they aren't pre-existing:

  • model - would that be the model and submodels payu driver's model_type? e.g. access-om2, mom, yatm etc.
  • created - if new experiment was created on a new branch via payu
  • url - if new experiment was cloned via payu

Great! The model name is not enforced by any sort of controlled vocab, but it would be good to be consistent where possible for good search functionality. There are already ACCESS-OM2 and ACCESS-ESM1-5 experiments in the catalog with the model names "ACCESS-OM2" and "ACCESS-ESM1-5" respectively (these are their CMIP names). So using the driver's model_type will result in inconsistencies.

I could also add in the metadata.yaml template (access-nri-intake-catalog/metadata.yaml), or at least the template values for required fields, if payu ends up creating a new metadata.yaml file.

Sounds like a good idea. Then people might fill in the fields that aren't auto-populated. It might be best to do this directly from the schema, rather than from access-nri-intake-catalog/metadata.yaml?

Should metadata.yaml be added to the base archived output directory or/and added to each output subdirectory, e.g. output000?

Just the base output directory.

Let me know if you have any further questions.

jo-basevi commented 10 months ago

Thank you @dougiesquire for answering those questions!

Yes. We'll (obviously) just need to be sure that the metadata.yaml is kept with output data, e.g. if/when the data are moved to a different location.

I've added that the metadata file will be copied over to base archive directory.

In the past, the name field has just been whatever the user decides to call the experiment (usually not the archive directory name). But I really like the idea of having a default built from the directory name, branch name and short uuid. From a cataloguing perspective, these just need to be unique (we don't want two products with the same name). A user can always change the name if they want to (ideally prior to adding to the catalogue).

Ok, I'll add in a default name for the metadata.

Great! The model name is not enforced by any sort of controlled vocab, but it would be good to be consistent where possible for good search functionality. There are already ACCESS-OM2 and ACCESS-ESM1-5 experiments in the catalog with the model names "ACCESS-OM2" and "ACCESS-ESM1-5" respectively (these are their CMIP names). So using the driver's model_type will result in inconsistencies.

That is good to know! I won't use model_type there.

Sounds like a good idea. Then people might fill in the fields that aren't auto-populated. It might be best to do this directly from the schema, rather than from access-nri-intake-catalog/metadata.yaml?

Thanks, I'll look into using the schema instead.

Thanks again for answering those questions! I'll be sure to ask more when I have more questions

jo-basevi commented 10 months ago

@aidanheerdegen since the last review I've added a few changes, with the main ones generating an experiment name on metadata.setup rather than using a set value in metadata.yaml and adding in additional metadata fields.

Generating Experiment names

For new branches or clones created using payu, it will generate a new UUID and branch-uuid experiment name. Then it will copy the metadata file over to a new archive directory.

On payu setup/run/sweep etc, when metadata is setup on Experiment initialisation, it will check for an existing metadata file and existing branch-uuid and legacy experiment names in the archive. If there are no existing archives, it will generate a new UUID and branch-uuid experiment name and create a new archive. This means it will still work for legacy experiments with existing archives. It will also work well with control directories with no legacy archive names if git checkout -b/git clone is used instead to create new branches as it will automatically generate a new UUID (as there's no existing archive).

The experiment value in config.yaml is now an override for experiment names. It can be set to ensure the experiment uses a set experiment name in the archive directory.

For development branches or branches that will be merged, could use --keep-uuid flag (e.g. payu clone/checkout --keep-uuid), it will still generate a new UUID if there isn't a pre-existing UUID, but if there's a pre-existing UUID it will leave it unchanged and not add in new auto-generated fields. It will still have an archive per branch but it'll reduce the number of commits to changes to the metadata file. An alternative could be to add a config option to disable generating and committing metadata files and an option to use $CONTROL_DIR-$BRANCH experiment names (without UUID) so can still work with branches with separate archives in one control directory.

Generating additional metadata fields

If a new UUID has been generated, and if there was a pre-existing UUID in metadata, it'll record the UUID of the parent experiment and also the current commit hash as suggested in this issue #387.

If a new UUID has been generated and if it's a branch-uuid aware experiment, it will add the created date, experiment name, and git URL of the origin remote if it exists. This will run when payu checkout -b, payu clone --new-branch, or if no pre-existing archive exists and a new UUID is generated in payu setup/run/sweep. This was not done for all new UUIDs as UUID could be generated from a pre-existing legacy experiment that has been running for a while and don't want to overwrite the existing created date, name, and URL fields.

If new UUID was generated and it called from a payu checkout/clone, it will add in commented fields of properties obtained from experiment_asset.json schema that are not filled. This is not done for new UUIDs created in payu setup/run/sweep as there may not be internet access. For example, a new branch created using payu checkout -b, would look like the following:

experiment_uuid: ef175f2f-6e32-476a-8ad6-427a35c1b8af
created: '2023-12-07'
name: test_metadata-test_branch-ef175f2f
url: git@github.com:jo-basevi/mom6_double_gyre.git
description:  # Short description of the experiment (string, < 150 char)
long_description: # Long description of the experiment (string)
model: # The name(s) of the model(s) used in the experiment (string)
realm: # The realm(s) included in the experiment (string)
frequency: # The frequency(/ies) included in the experiment (string)
variable: # The variable(s) included in the experiment (string)
nominal_resolution: # The nominal resolution(s) of model(s) used in the experiment (string)
version: # The version of the experiment (number, string)
contact: # Contact name for the experiment (string)
email: # Email address of the contact for the experiment (string)
reference: # Citation or reference information (string)
license: # License of the experiment (string)
parent_experiment: 2f5d8e69-c020-4e10-bb9e-6678822f5009 # experiment_uuid for parent experiment if appropriate (string)
related_experiments: # experiment_uuids for any related experiment(s) (string)
notes: # Additional notes (string)
keywords: # Keywords to associated with experiment (string)
parent_experiment_commit: 71496b02ab18e098ce4b42bce9401cf99141ab89

Happy to remove that feature or improve it if there are suggestions!

TODO:

jo-basevi commented 10 months ago

I've added parent_experiment_commit field to the generated metadata as suggested in this issue #387. @dougiesquire would I need update the experiment_asset.json schema as it doesn't allow additional properties? If so, would that be as straightforward as just opening a PR on that repo?

Thinking of adding the following or similar to properties:

    "parent_experiment_commit": {
            "type": ["string", "null"],
            "description": "Last Git commit hash of parent experiment if appropriate (string)"
    },
dougiesquire commented 9 months ago

@dougiesquire would I need update the experiment_asset.json schema as it doesn't allow additional properties? If so, would that be as straightforward as just opening a PR on that repo?

I think we might have multiple definitions of "parent experiment". I could be wrong, but I think you are defining the parent experiment as the experiment configuration from which the current experiment is branched using git. I was thinking of parent experiment as the model run that provides the restart(s) for the current experiment (it's common to run a long "control" experiment and then kick off "perturbation" experiments from this run - confusingly, this is also called "branching"). Perhaps we need fields for both of these definitions?

But to answer your question, yes a PR is the way to go. Feel free to suggest any other edits that you think would make the schema more generally usable.

jo-basevi commented 9 months ago

I think we might have multiple definitions of "parent experiment". I could be wrong, but I think you are defining the parent experiment as the experiment configuration from which the current experiment is branched using git. I was thinking of parent experiment as the model run that provides the restart(s) for the current experiment (it's common to run a long "control" experiment and then kick off "perturbation" experiments from this run - confusingly, this is also called "branching"). Perhaps we need fields for both of these definitions?

That is good point, thanks for raising that. I’ll bring that up with @aidanheerdegen when he comes into the office tomorrow.

aidanheerdegen commented 9 months ago

The model name is not enforced by any sort of controlled vocab, but it would be good to be consistent where possible for good search functionality. There are already ACCESS-OM2 and ACCESS-ESM1-5 experiments in the catalog with the model names "ACCESS-OM2" and "ACCESS-ESM1-5" respectively (these are their CMIP names). So using the driver's model_type will result in inconsistencies.

I like the idea of providing this sort of information by default to remove a (boring) burden from users.

Any suggestions for how to provide this model name : intake catalogue model name mapping?

aidanheerdegen commented 9 months ago

I could be wrong, but I think you are defining the parent experiment as the experiment configuration from which the current experiment is branched using git. I was thinking of parent experiment as the model run that provides the restart(s) for the current experiment (it's common to run a long "control" experiment and then kick off "perturbation" experiments from this run - confusingly, this is also called "branching"). Perhaps we need fields for both of these definitions?

@jo-basevi and I had a chat and agree that your characterisation is correct, and that yes a parent experiment suggests using restarts from the parent experiment.

Jo pointed out we could add logic to only add a parent experiment in the case where a restart directory/experiment is specified.

parent_experiment_commit only makes sense if there is a parent_experiment.

The normal git commit log is sufficient for tracking the relationships between related configurations, i.e. not experiments.

dougiesquire commented 9 months ago

I like the idea of providing this sort of information by default to remove a (boring) burden from users.

Me too. Is the Access driver actually used for anything other than ACCESS-ESM1-5? That's the only problematic one. For everything else, I think capitalizing the model_name would work

aidanheerdegen commented 9 months ago

I like the idea of providing this sort of information by default to remove a (boring) burden from users.

Me too. Is the Access driver actually used for anything other than ACCESS-ESM1-5?

It used to be used for some ACCESS-CM2 stuff back in the day.

That's the only problematic one. For everything else, I think capitalizing the model_name would work

We could add a specific model for ACCESS-ESM1-5 and just make it an alias for ACCESS.

Or (ad probably better), add a specific model_name option to config.yaml to use for things like the metadata.yaml.

So default to capitalised model, but use capitalised model_name if it is available.

jo-basevi commented 9 months ago

Changes since last update:

The logic just looks if there's a metadata file and UUID defined in the parent directory containing the restart.

When --restart flag is used, added checks for if there isn't already pre-existing restarts in archive, as the restart in config.yaml has no effect in this case. In this case, it raises a warning and skips adding it to config.yaml and passing that path to metadata.

E.g.

metadata:
    enable: false
    model: ACCESS-ESM1-5
dougiesquire commented 9 months ago

Added metadata config for model names to use when generating metadata. Otherwise it defaults to the model name

Defaults to the model name, or the capitalised model name? Can we do the latter please :)

jo-basevi commented 9 months ago

Added metadata config for model names to use when generating metadata. Otherwise it defaults to the model name

Defaults to the model name, or the capitalised model name? Can we do the latter please :)

Sorry that's bad description on my part - It currently capitalises the model name before adding it to the metadata

jo-basevi commented 9 months ago

@aidanheerdegen I've added a test for fsops.list_archive_dirs. I've also squashed a bunch of commits to clean up the git history. Thanks for the reviews!

aidanheerdegen commented 9 months ago

I forced merge. There were unresolved conversations, but the commits no longer existed as the commit history had been cleared up, so I couldn't mark them as resolved.