payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Embed experiment identifiers in model outputs #510

Open aidanheerdegen opened 2 months ago

aidanheerdegen commented 2 months ago

Embedding experiment ID and run commit hashes into model output diagnostics is essential for experiment provenance: it establishes a link between the outputs of an experiment and all the provenance data of the experiment. It means consumers of the data, regardless of where they find it, have the possibility of finding this essential information.

These identifying hashes can then become persistent identifiers (PIDs) once there is a service to resolve them and expose the related metadata to users. Such a service doesn't exist ... yet. But embedding this information is a necessary precursor.

Proposal

  1. Use experimentID+git commit hash as a unique identifier (exptrunID?) for each run of a model, where an experiment constitutes a number of such consecutive runs.
  2. Embed exptrunID as a metadata field in all model output diagnostics, e.g. as a global netCDF attribute.
  3. Where possible, add exptrunID as a configuration input to the model so the metadata is added when the diagnostic is written. If this isn't possible, add the metadata after the run has completed.

Implementation

exptrunID = experimentID.gitcommithash
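
A minimal sketch of that format in Python (payu's language). The function name `make_exptrun_id` and the validation rules are illustrative assumptions, not existing payu code; the example hash is the payu commit linked later in this thread.

```python
import re


def make_exptrun_id(experiment_id: str, commit_hash: str) -> str:
    """Join an experiment ID and a git commit hash as experimentID.gitcommithash."""
    if not experiment_id:
        raise ValueError("experiment_id must not be empty")
    # Assumption: the hash is lowercase hex, abbreviated (7+ chars) or full
    # length (40 chars for SHA-1, 64 for SHA-256 repositories).
    if not re.fullmatch(r"[0-9a-f]{7,64}", commit_hash):
        raise ValueError(f"not a git commit hash: {commit_hash!r}")
    return f"{experiment_id}.{commit_hash}"


exptrun_id = make_exptrun_id(
    "my-experiment", "68d8482e5307af62603431fe95f1426a28056948"
)
# A hex hash never contains ".", so splitting on the last "." recovers both parts.
experiment_id, _, commit_hash = exptrun_id.rpartition(".")
```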

Where possible the exptrunID should be added as a model configuration input option and written directly into the model outputs. This has two benefits:

  1. It is model output format agnostic: as long as the model can write the metadata into a header in the output diagnostic file it doesn't matter what the format is
  2. It reduces the number of post-processing steps. From a provenance point of view, every transformation is a step that should be captured in the provenance chain, so post-processing adds unnecessary complexity and ambiguity. This may require code changes in the models themselves. That doesn't have to happen immediately: in the first instance, post-processing could be used until the model code supports direct metadata injection. This would be tricky to manage, as it would be model-version dependent.

Each model should take care of adding this metadata to the model diagnostic outputs. This means the model class should have a stub method add_output_metadata that is either not implemented, or has some useful default like adding a global attribute to netCDF files.

add_output_metadata should be called at setup and archive stages so that exptrunID can be added either before a run, or after it has completed. The method needs to have logic to decide if it runs at setup or archive. If there isn't a better way, like some call-graph inspection, then the stage should be passed to the method.
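
A rough sketch of the "pass the stage to the method" option. Everything here is hypothetical: the `Stage` enum, the `metadata_stage` attribute, the `_write_metadata` hook and the simplified `Model` class are stand-ins, not payu's actual driver API. The netCDF default assumes the netCDF4 Python package is available.

```python
from enum import Enum
from pathlib import Path


class Stage(Enum):
    SETUP = "setup"
    ARCHIVE = "archive"


class Model:
    """Simplified stand-in for a model driver base class (hypothetical)."""

    # Stage at which this model can inject metadata; None means "not implemented".
    metadata_stage = None

    def add_output_metadata(self, stage, metadata, output_dir):
        """Called at both setup and archive; acts only at the supported stage."""
        if stage is not self.metadata_stage:
            return False
        self._write_metadata(metadata, Path(output_dir))
        return True

    def _write_metadata(self, metadata, output_dir):
        raise NotImplementedError


class NetcdfPostprocessModel(Model):
    """Possible default: stamp global attributes on netCDF files after the run."""

    metadata_stage = Stage.ARCHIVE

    def _write_metadata(self, metadata, output_dir):
        # Imported lazily so drivers that never post-process do not need netCDF4.
        from netCDF4 import Dataset

        for path in sorted(output_dir.glob("*.nc")):
            with Dataset(str(path), "a") as ds:
                ds.setncatts(metadata)
```

A model that accepts the ID as a configuration input would instead set `metadata_stage = Stage.SETUP` and write the value into its own config files in `_write_metadata`.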

Notes

  1. FMS (the GFDL coupler infrastructure used by MOM models) has an mpp_write_meta routine for MOM5

For MOM6, global attributes can be written by calling register_global_attribute. Scalar and 1D real and integer (32- and 64-bit) values and scalar string values are supported.

call register_global_attribute(fileobj, "global_attribute_name", value)

This interface can be used with any FMS2_io fileobj, but open_file must be called before using it.

  2. CICE5 should be straightforward

netCDF: https://github.com/COSIMA/cice5/blob/edcfa6f9c76ed05b63196ce4b5355fa5a8f4fe3a/io_netcdf/ice_history_write.F90#L922-L978

pio: https://github.com/COSIMA/cice5/blob/edcfa6f9c76ed05b63196ce4b5355fa5a8f4fe3a/io_pio/ice_history_write.F90#L877-L934

  3. UM? Not sure

jo-basevi commented 2 months ago

Would the commit hash for the exptrunID be the runlog commit right before the model is run? E.g. https://github.com/payu-org/payu/blob/68d8482e5307af62603431fe95f1426a28056948/payu/experiment.py#L653-L654

If so, the model method add_output_metadata adding the ID to configuration files might need to be run then rather than at setup? And then, would this feature be enabled only if runlog is enabled? It might not make as much sense to use just the experimentId unless it was experimentId.runNumber, but then there could be clashes between run numbers. A small initial payu PR could be a metadata method that generates an exptRunId after the runlog commit? This can then be passed to model driver methods.

aidanheerdegen commented 2 months ago

Would the commit hash for the exptrunID be the runlog commit right before the model is run

Yes.

If so, the model method add_output_metadata adding the ID to configuration files might need to be run then rather than at setup?

Good point. And yes. I kinda thought I'd get it wrong and need you to say where we should put it.

And then, would this feature be enabled only if runlog is enabled?

Yes. It doesn't really make sense otherwise.

A small initial payu PR could be a metadata method that generates an exptRunId after the runlog commit? This can then be passed to model driver methods.

I like it.
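
Something along these lines, as a sketch only: the function names and the `control_dir` argument are made up for illustration, and it assumes the runlog commit has just been made, so that HEAD of the control directory is that commit.

```python
import subprocess


def runlog_commit_hash(control_dir="."):
    """Return the full hash of HEAD in control_dir.

    Assumption: this is called straight after the runlog commit, so HEAD
    is the commit that records the configuration for this run.
    """
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=control_dir,
        check=True,  # raise CalledProcessError if this is not a git repository
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def generate_expt_run_id(experiment_id, control_dir="."):
    """Build experimentID.gitcommithash to hand to the model driver methods."""
    return f"{experiment_id}.{runlog_commit_hash(control_dir)}"
```

Calling `generate_expt_run_id` before the runlog commit would return the hash of the previous run's commit, which is why the ordering matters.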


aidanheerdegen commented 2 months ago

I think I might have changed my mind about concatenating the IDs together. The motivation was to make it simpler, just embed a single metadata item. But it makes everything else more complicated. Also the experiment ID will be used widely, in intake catalogues etc, so I think it makes sense to have that as a separate, unambiguous, easy to access metadata attribute.

jo-basevi commented 2 months ago

Ok, so are you saying there should be two fields added to outputs? An experiment_uuid and an experiment_run_id which is just the runlog commit hash?

aidanheerdegen commented 2 months ago

Ok, so are you saying there should be two fields added to outputs? An experiment_uuid and an experiment_run_id which is just the runlog commit hash?

Yep.
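
For illustration, the two attributes could be assembled like this. It is only a sketch: `build_output_metadata` is a made-up name, and treating the experiment ID as a UUID is an assumption based on the field name experiment_uuid.

```python
import uuid


def build_output_metadata(experiment_uuid, runlog_commit):
    """Return the two separate attributes to embed in model outputs."""
    return {
        # uuid.UUID raises ValueError on malformed input and str() gives the
        # canonical lowercase hyphenated form.
        "experiment_uuid": str(uuid.UUID(str(experiment_uuid))),
        # Assumption: the runlog commit hash is stored unmodified.
        "experiment_run_id": runlog_commit,
    }


attrs = build_output_metadata(
    "12345678-1234-5678-1234-567812345678",
    "68d8482e5307af62603431fe95f1426a28056948",
)
```

Keeping the two values as separate keys means a catalogue can index on experiment_uuid without parsing a combined string.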