payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Building intake-esm datastores #521

Open anton-seaice opened 7 hours ago

anton-seaice commented 7 hours ago

Far from an original thought, but no one has made an issue yet :)

For existing shared model output, we are making some of it accessible to users through the ACCESS-NRI Intake Catalog. This is a catalog of Intake-ESM datastores, where in most cases one model run corresponds to one datastore.

For new model runs, providing a uniform interface for interacting with the output data through Intake-ESM datastores improves the shareability and portability of analysis, and means that analysis workflows are ready for sharing in working groups.

To generate these datastores, a builder is required. These builders are (currently) defined in […]. There should be a builder for MOM6/SIS2 (i.e. GFDL-OM4 / COSIMA Panantarctic) soon too. The builders are parallelised but don't use dask, and there is no capability to modify existing datastores.

For ACCESS-OM3, we've implemented this as an archive userscript. I believe this uses all the cores available within the run job. At the current resolutions and short run times this works OK. It would probably be better to qsub this as a separate job and make an intentional decision about the resources it needs.
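As a rough illustration of "qsub this into a different job", the sketch below only assembles a PBS `qsub` command line with explicit resource requests; it does not submit anything. The resource values, the queue name and the wrapped command are placeholders chosen for illustration, and a real submission on NCI would also need site-specific options (project, storage) that are not shown here.

```python
import shlex


def build_qsub_command(command, ncpus=4, mem="16GB",
                       walltime="01:00:00", queue="normal"):
    """Assemble (but do not run) a qsub command for a datastore build.

    All defaults are illustrative placeholders, not recommended values.
    """
    # Request resources explicitly instead of inheriting the run job's cores.
    resources = f"ncpus={ncpus},mem={mem},walltime={walltime}"
    # Everything after "--" is treated by qsub as the executable and its arguments.
    return ["qsub", "-q", queue, "-l", resources, "--", *shlex.split(command)]


if __name__ == "__main__":
    # "build_datastore.py" is a hypothetical script name.
    cmd = build_qsub_command("python build_datastore.py archive", ncpus=2)
    print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later, once the resource choices have been agreed on.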

Currently, most users load the intake catalog through hh5, which is not available to the CI worker for model configs. To implement this in payu, we would need to add "access-nri-intake-catalog" as a dependency in pyproject.toml. I then suggest we try to add it as a general step during payu archive for the models with builders.
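Until the dependency is guaranteed in every environment (for example on a CI worker), a datastore step could check whether the package is importable before trying to use it. This is a minimal sketch using only the standard library; the import name `access_nri_intake` is an assumption about how the "access-nri-intake-catalog" distribution is exposed, and the function name is hypothetical.

```python
from importlib.util import find_spec


def datastore_support_available(module_name="access_nri_intake"):
    """Return True if the (assumed) builder package can be imported.

    The default module name is an assumption, not taken from the thread.
    """
    # find_spec returns None for a top-level module that is not installed.
    return find_spec(module_name) is not None


if __name__ == "__main__":
    if datastore_support_available():
        print("datastore builders available")
    else:
        print("skipping datastore generation: package not installed")
```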

jo-basevi commented 4 hours ago

As it is used in a userscript (rather than payu project code), it might make more sense to add access-nri-intake-catalog directly to the payu conda environment? This would be similar to what was done for the post-processing dependencies mule and um2nc for ESM1.5 configs.

anton-seaice commented 4 hours ago

I'm suggesting it gets used in payu project code: instead of a userscript, it's added to the payu archive or payu sync step in payu code.

aidanheerdegen commented 3 hours ago

> For new model runs, providing a uniform interface for interacting with the output data through Intake-ESM datastores improves the shareability and portability of analysis, and means that analysis workflows are ready for sharing in working groups.

Yeah this is a great idea, and definitely something we want to support for all ACCESS-NRI models.

I think I agree that we should incorporate this in the payu codebase itself. Given there is a fairly consistent layout of outputs this could/should be implemented as a general function called from model drivers, passing information like the builder required.
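One way to picture a general function that model drivers call with "the builder required" is a small registry keyed by model name. The sketch below is purely illustrative: `BUILDERS`, `register_builder`, `build_datastore_for` and the "toy-model" entry are hypothetical names, not existing payu or access-nri-intake-catalog APIs, and the toy builder only lists output directories rather than parsing any files.

```python
from pathlib import Path

# Hypothetical registry: model name -> callable taking the archive path.
BUILDERS = {}


def register_builder(model_name):
    """Decorator that records a builder function for a model name."""
    def decorator(func):
        BUILDERS[model_name] = func
        return func
    return decorator


def build_datastore_for(model_name, archive_path):
    """Run the registered builder, or return None if the model has none."""
    builder = BUILDERS.get(model_name)
    if builder is None:
        return None  # models without a builder are skipped
    return builder(Path(archive_path))


@register_builder("toy-model")
def _toy_builder(archive_path):
    # Stand-in for a real builder: just report the output directories found.
    return sorted(p.name for p in archive_path.glob("output*") if p.is_dir())
```

A driver would then only need to know its own model name, and models without a registered builder fall through without error.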

We'll run into issues with any model output that has a post-processing conversion step like the UM. One option would be to pull the post-processing into the payu codebase, another would be to utilise payu in a userscript to generate the catalogues after post-processing.

My assumption is that the catalogue would live at the top level of archive and be updated with every new outputXXX directory.

anton-seaice commented 2 hours ago

> We'll run into issues with any model output that has a post-processing conversion step like the UM. One option would be to pull the post-processing into the payu codebase, another would be to utilise payu in a userscript to generate the catalogues after post-processing.

Possibly this is an argument to do it at the payu sync step? It may be impossible to cover every case, given that userscripts can start new PBS jobs.

> My assumption is that the catalogue would live at the top level of archive and be updated with every new outputXXX directory.

Yes, sounds good. There are two small files: intake_esm_ds.csv.gz and intake_esm_ds.json.
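To make the two-file layout concrete, here is a toy sketch that scans the `output*` directories under an archive and writes a gzipped CSV of file paths plus a small JSON description next to it. It is not how the real builders work (they also parse variables, frequencies, dates and so on); the function name and the `output_dir` column are hypothetical, and the JSON fields are only loosely modelled on the ESM catalog specification and have not been validated against intake-esm.

```python
import csv
import gzip
import json
from pathlib import Path


def write_minimal_datastore(archive_dir, name="intake_esm_ds"):
    """Write <name>.csv.gz and <name>.json at the top level of archive_dir."""
    archive = Path(archive_dir)
    # One row per NetCDF file found in any output* directory.
    rows = [
        {"path": str(nc), "output_dir": nc.relative_to(archive).parts[0]}
        for nc in sorted(archive.glob("output*/**/*.nc"))
    ]

    csv_path = archive / f"{name}.csv.gz"
    with gzip.open(csv_path, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "output_dir"])
        writer.writeheader()
        writer.writerows(rows)

    # Simplified description; field names are an assumption (see lead-in).
    description = {
        "esmcat_version": "0.1.0",
        "id": name,
        "description": "Toy datastore for illustration only",
        "catalog_file": csv_path.name,
        "attributes": [{"column_name": "output_dir"}],
        "assets": {"column_name": "path", "format": "netcdf"},
    }
    json_path = archive / f"{name}.json"
    json_path.write_text(json.dumps(description, indent=2))
    return csv_path, json_path
```

Because the whole CSV is rewritten on each call, re-running this after every new outputXXX directory appears is a crude way to keep the top-level files current without needing to modify an existing datastore in place.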

aidanheerdegen commented 2 hours ago

> Possibly this is an argument to do it at the payu sync step? It may be impossible to cover every case, given that userscripts can start new PBS jobs.

Actually this reminds me, the datastore in the sync location should reference the files in the sync location. We can make datastores that reference the original data in /scratch, but once that data is deleted the datastore will be a bit useless.

So do we generate the datastore in archive, sync it over and update paths? The local datastore would probably have more data because not everything gets synced. So that is an argument to re-generate the datastore on the sync location as part of sync.
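The "sync it over and update paths" option could look roughly like the sketch below, which copies the gzipped CSV while swapping the archive prefix for the sync prefix. It assumes the datastore CSV has a column named `path` holding absolute file paths; the function name and parameters are hypothetical. Rows outside the old prefix are dropped, and `only_existing=True` additionally drops rows whose rewritten file is missing, which is one way to handle the fact that not everything gets synced.

```python
import csv
import gzip
import os


def rewrite_datastore_paths(src_csv_gz, dst_csv_gz, old_prefix, new_prefix,
                            path_column="path", only_existing=False):
    """Copy a datastore CSV, replacing old_prefix with new_prefix in paths.

    Returns the number of rows written to the new file.
    """
    written = 0
    with gzip.open(src_csv_gz, "rt", newline="") as src, \
            gzip.open(dst_csv_gz, "wt", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            path = row[path_column]
            if not path.startswith(old_prefix):
                continue  # not under the archive: leave it out
            new_path = new_prefix + path[len(old_prefix):]
            if only_existing and not os.path.exists(new_path):
                continue  # file was not synced
            row[path_column] = new_path
            writer.writerow(row)
            written += 1
    return written
```

Re-generating the datastore at the sync location avoids this rewriting step entirely, at the cost of running the builder twice.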

I don't think we could, or should, do this automatically, but at the end of an experiment a user could add the datastore files to their repo (referencing the final sync location), which makes a nice connection between the control repo and the output data.

anton-seaice commented 2 hours ago

Oh, good point. I guess I will say we should generate it after both the archive and sync stages then.