Open anton-seaice opened 1 month ago
As it is used in a user-script (rather than payu project code), it might make more sense to add access-nri-intake-catalog directly to the payu conda environment? Similar to what was done for the post-processing dependencies `mule` and `um2nc` for ESM1.5 configs: https://github.com/ACCESS-NRI/payu-condaenv/blob/main/env.yml
I'm suggesting it gets used in payu project code ... instead of a userscript, it's added to the payu archive or payu sync step in the payu code.
For new model runs, providing a uniform interface for interacting with the output data through Intake-ESM datastores improves the shareability and portability of analysis. It also means that analysis workflows are ready for sharing in working groups.
Yeah this is a great idea, and definitely something we want to support for all ACCESS-NRI models.
I think I agree that we should incorporate this in the `payu` codebase itself. Given there is a fairly consistent layout of outputs, this could/should be implemented as a general function called from model drivers, passing information like the builder required.
We'll run into issues with any model output that has a post-processing conversion step like the UM. One option would be to pull the post-processing into the `payu` codebase; another would be to utilise `payu` in a userscript to generate the catalogues after post-processing.
My assumption is that the catalogue would live at the top level of `archive` and be updated with every new `outputXXX` directory.
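A minimal sketch of what that general function might look like, assuming the access-nri-intake-catalog builders keep an ecgtools-style `build()`/`save()` interface. The function name, the exact `save()` arguments and the choice to write the files at the top of `archive` are assumptions here, not existing payu or catalog API:

```python
from pathlib import Path


def generate_datastore(archive_path, builder_class, experiment_name):
    """Build (or rebuild) the Intake-ESM datastore at the top level of archive.

    builder_class is supplied by the model driver, e.g. AccessOm2Builder from
    access_nri_intake.source.builders (assumed interface).
    """
    archive_path = Path(archive_path)

    # Scan every outputXXX directory under archive and assemble the catalogue.
    # Since the builders can't modify existing datastores, this is a full rebuild.
    builder = builder_class(path=str(archive_path))
    builder.build()

    # Write intake_esm_ds.json / intake_esm_ds.csv.gz at the top of archive
    builder.save(
        name="intake_esm_ds",
        description=f"Intake-ESM datastore for {experiment_name}",
        directory=str(archive_path),
    )
```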
> We'll run into issues with any model output that has a post-processing conversion step like the UM. One option would be to pull the post-processing into the `payu` codebase; another would be to utilise `payu` in a userscript to generate the catalogues after post-processing.
Possibly this is an argument to do it at the `payu sync` step? It may be impossible to get every case, given that userscripts can start new PBS jobs.
> My assumption is that the catalogue would live at the top level of `archive` and be updated with every new `outputXXX` directory.
Yes, sounds good. There are two small files: `intake_esm_ds.csv.gz` and `intake_esm_ds.json`.
> Possibly this is an argument to do it at the `payu sync` step? It may be impossible to get every case, given that userscripts can start new PBS jobs.
Actually, this reminds me: the datastore in the sync location should reference the files in the sync location. We can make datastores that reference the original data in `/scratch`, but once that's deleted the datastore will be a bit useless.
So do we generate the datastore in `archive`, sync it over and update the paths? The local datastore would probably have more data, because not everything gets synced. So that is an argument to re-generate the datastore in the sync location as part of sync.
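If we did go the "sync it over and update paths" route rather than rebuilding, a rough sketch of the update step. The `path` column name and the simple prefix replacement are assumptions about how the datastore's csv is laid out:

```python
import pandas as pd


def relocate_datastore_csv(csv_file, old_prefix, new_prefix):
    """Point an existing Intake-ESM datastore at the synced copy of the data.

    Rewrites the absolute file paths stored in the csv.gz in place; assumes the
    datastore keeps its file locations in a 'path' column.
    """
    df = pd.read_csv(csv_file)
    df["path"] = df["path"].str.replace(old_prefix, new_prefix, regex=False)
    df.to_csv(csv_file, index=False)  # compression inferred from the .gz suffix
```

Entries for files that never made it to the sync location would also need dropping, which is part of the argument for just re-generating the datastore on the sync side.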
I don't think we could, or should, do this automatically, but at the end of an experiment a user could add the datastore files to their repo (referencing the final sync location), which makes a nice connection between the control repo and the output data.
Oh, good point. I guess I will say we should generate it after both the `archive` and `sync` stages then.
Far from an original thought, but no one has made an issue yet :)
For current shared model output, we are making some of it accessible to users through the ACCESS-NRI Intake Catalog. This is a catalog of Intake-ESM datastores, where (mostly) one model run is one datastore.
For new model runs, providing a uniform interface for interacting with the output data through Intake-ESM datastores improves the shareability and portability of analysis. It also means that analysis workflows are ready for sharing in working groups.
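For context, this is roughly what that uniform interface looks like from the analysis side (the file path and the search fields here are illustrative):

```python
import intake

# Open the experiment's datastore (path/filename illustrative)
datastore = intake.open_esm_datastore("archive/intake_esm_ds.json")

# Search on the datastore's metadata columns, then load the matching
# files lazily as xarray Datasets
monthly_temp = datastore.search(variable="temp", frequency="1mon")
dataset_dict = monthly_temp.to_dataset_dict()
```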
To generate these datastores, a builder is required. These builders:

- are defined (currently) in https://github.com/ACCESS-NRI/access-nri-intake-catalog (there should be a builder for MOM6/SIS2, i.e. GFDL-OM4 / COSIMA Panantarctic, soon too)
- are parallelised, but don't use dask
- have no capability to modify existing datastores
For ACCESS-OM3, we've implemented this as an `archive` userscript. I believe this uses all the cores available within the run job. At the current resolutions and short run times this works OK, but it would probably be better to qsub this into a different job and make an intentional decision about the resources needed.
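A rough sketch of the "qsub it into a different job" option; the script name and the default resource request below are placeholders, and picking the right resources is exactly the intentional decision mentioned above:

```python
import subprocess
from textwrap import dedent


def submit_datastore_build(archive_path, ncpus=4, mem="16GB", walltime="00:30:00"):
    """Submit the datastore build as its own PBS job rather than using the run job's cores.

    build_datastore.py is a placeholder for whatever script actually calls the builder.
    """
    job_script = dedent(f"""\
        #!/bin/bash
        #PBS -N build_datastore
        #PBS -l ncpus={ncpus}
        #PBS -l mem={mem}
        #PBS -l walltime={walltime}
        python3 build_datastore.py {archive_path}
        """)
    subprocess.run(["qsub"], input=job_script, text=True, check=True)
```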
Currently, most users load the intake catalog through hh5, which is not available to the CI worker for model configs. To implement this in payu, we would need to add access-nri-intake-catalog as a dependency in `pyproject.toml`. I then suggest we try to add it as a general step during `payu archive` for the models that have builders.
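And a sketch of what that general step could look like. The hook name and the mapping keys are illustrative, and the builder classes come from access-nri-intake-catalog as I understand them, so treat the import and the `save()` call as indicative rather than exact:

```python
from access_nri_intake.source.builders import (
    AccessCm2Builder,
    AccessEsm15Builder,
    AccessOm2Builder,
)

# Illustrative mapping from payu model type to catalog builder
MODEL_BUILDERS = {
    "access-om2": AccessOm2Builder,
    "access-esm1.5": AccessEsm15Builder,
    "access-cm2": AccessCm2Builder,
}


def build_datastore(model_type, archive_path):
    """Hypothetical hook run at the end of payu archive for models with a builder."""
    builder_class = MODEL_BUILDERS.get(model_type)
    if builder_class is None:
        return  # no builder for this model yet, skip quietly

    builder = builder_class(path=archive_path)
    builder.build()
    builder.save(name="intake_esm_ds", description=f"Datastore for {archive_path}")
```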