pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipes for CESM2 Superparameterization Emulator #100

Open mspritch opened 2 years ago

mspritch commented 2 years ago

Source Dataset

Several years of high-frequency (15-min) GCM-timestep-level output from the Superparameterized CESM2, isolating state variables "before" and "after" a key code region containing the computationally intensive superparameterization calculations. For use in making a superparameterization emulator of explicitly resolved clouds, their radiative influence, and turbulence, which can be used in a real-geography CESM2 framework to sidestep the usual computational cost of SP. Similar in spirit to the proofs of concept in Rasp, Pritchard & Gentine (2018) and Mooers et al. (2021), but with new refinements by Mike Pritchard and Tom Beucler toward compatibility with operational, real-geography CESM2 (critically, isolating only tendencies up to surface coupling and including outputs relevant to the CLM land model's expectations; see the concluding discussion of Mooers et al.).

Transformation / Alignment / Merging

Apologies in advance if this is TMI. A starter core-dump from Mike Pritchard on a busy afternoon:

There are multiple pre-processing steps. The raw model output contains many more variables than one would want to analyze, so there is trimming. But users may want to experiment with different inputs and outputs, so this trimming may be user-specific. I can provide guidance on specific variable names for the inputs/outputs of published emulators worth competing with, on request.
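
To make the trimming step concrete, here is a minimal xarray sketch; the filename and variable names are placeholders, not the actual history-file contents:

```python
import xarray as xr

# Open one raw history file and keep only the variables of interest.
# All names below are hypothetical; the real input/output lists would
# come from the guidance on published emulators mentioned above.
ds = xr.open_dataset("cesm2_sp_history.nc")

input_vars = ["T", "Q", "PS", "SOLIN"]      # placeholder "before" state variables
output_vars = ["HEATING", "MOISTENING"]     # placeholder "after" tendencies

ds[input_vars + output_vars].to_netcdf("trimmed.nc")
```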

Important: a key subset of variables (surface fluxes) that probably everyone would want in their input vector will need to be time-shifted backward by one time step to avoid information leaks, owing to the phase of the integration cycle at which these fluxes were saved on the history file vs. the emulated regions. Some users may want to make emulators that include memory of state variables from previous time steps in the input vector (e.g., as in Han et al., JAMES, 2020), in which case the same backwards time-shifting preprocessing applies and should be made flexible to additional variables (physical caveat: likely no more than a few hours, i.e. <= 10 temporal samples at most, so there is never any reason to include contiguous temporal adjacency beyond that limit).
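
The backward time shift could look something like the sketch below; which variables to shift, and the sign of the shift, are assumptions to be checked against the integration cycle:

```python
import xarray as xr

ds = xr.open_dataset("trimmed.nc")  # hypothetical output of the trimming step

# Shift the flux variables one step along time so the value paired with
# the state at time t is the one written one history step earlier.
flux_vars = ["SHFLX", "LHFLX"]  # hypothetical surface-flux names
for v in flux_vars:
    ds[v] = ds[v].shift(time=1)  # step 0 becomes NaN after the shift

# Drop the first time step, whose shifted fluxes are now undefined.
ds = ds.isel(time=slice(1, None))
```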

Many users may want to subsample lon, lat, and time first, to reduce data volume and promote independence of samples given the spatial and temporal autocorrelations riddled throughout the data. Other users may prefer to keep all of these samples as fuel for ambitious architectures that only find good fits in very data-rich limits. This sub-sampling is user-specific.
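
As an illustration of user-specific subsampling with strided selection (the strides here are arbitrary):

```python
import xarray as xr

ds = xr.open_dataset("trimmed.nc")  # hypothetical file from the steps above

# Keep every 2nd longitude, every 2nd latitude, and every 4th time step
# (hourly samples from 15-min output) to cut volume and weaken
# spatial/temporal autocorrelation.
ds_sub = ds.isel(
    lon=slice(None, None, 2),
    lat=slice(None, None, 2),
    time=slice(None, None, 4),
)
```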

Many users wanting to make "one-size-fits-all" emulators (i.e., the same NN for all grid cells) will want to flatten lon, lat, and time into a generic "sample" dimension (retaining level variability), shuffle it for ML, and split it into training/validation/test sets. Such users would also want to pre-normalize by means and ranges/stds defined independently for each vertical level but computed over the flattened lon/lat/time samples lumped together. Advanced users may want to train regional- or regime-specific emulators, which might then use regionally aware normalizations, so flexibility here would help.
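
A rough sketch of the flatten/normalize/split workflow, assuming per-level statistics over the flattened samples (the split fractions are arbitrary):

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("trimmed.nc")  # hypothetical file from the steps above

# Flatten (time, lat, lon) into one generic "sample" dimension, keeping lev.
stacked = ds.stack(sample=("time", "lat", "lon"))

# Per-level normalization: statistics are taken over the flattened sample
# dimension separately for each vertical level.
mean = stacked.mean(dim="sample")
std = stacked.std(dim="sample")
normalized = (stacked - mean) / std

# Shuffle samples and split 80/10/10 into train/validation/test.
n = normalized.sizes["sample"]
idx = np.random.default_rng(seed=0).permutation(n)
train = normalized.isel(sample=idx[: int(0.8 * n)])
val = normalized.isel(sample=idx[int(0.8 * n): int(0.9 * n)])
test = normalized.isel(sample=idx[int(0.9 * n):])
```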

Some users may want to convert the specific humidity and temperature input state variables into an equivalent relative humidity, an alternate input that is less prone to out-of-sample extrapolation when the emulator is tested prognostically. The RH conversion should use a fixed set of assumptions consistent with an f90 module, for identical testing online; I can provide Python and f90 code when the time comes.
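
For orientation only, a standalone RH conversion might look like the sketch below, using Bolton's (1980) saturation vapor pressure formula; the production version must instead use the fixed assumptions of the matching f90 module mentioned above:

```python
import numpy as np

def specific_humidity_to_rh(q, T, p):
    """Approximate relative humidity (0-1) from specific humidity.

    q: specific humidity [kg/kg], T: temperature [K], p: pressure [Pa].
    Uses Bolton (1980) for saturation vapor pressure; RH is approximated
    as the ratio of mixing ratio to saturation mixing ratio.
    """
    e_s = 611.2 * np.exp(17.67 * (T - 273.15) / (T - 29.65))  # [Pa]
    w = q / (1.0 - q)                # mixing ratio
    w_s = 0.622 * e_s / (p - e_s)    # saturation mixing ratio
    return w / w_s
```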

The surface pressure is vital information for making the vertical discretization physically relevant per CESM2's hybrid vertical eta coordinate, so it should always be made available. The pressure midpoints and pressure thickness of each vertical level can be derived from this field but vary with lon, lat, and time. Mass-weighting the outputs of vertically resolved variables like diabatic heating using the derived pressure thickness could be helpful to users wishing to prioritize the column influence of different samples, as in Beucler et al. (2021, PRL).
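
Reconstructing pressure from PS follows the standard CESM hybrid-coordinate recipe; a sketch (the filename is hypothetical, while the coefficient names are the standard ones carried in CESM history files):

```python
import xarray as xr

ds = xr.open_dataset("cesm2_sp_history.nc")  # hypothetical filename

# Pressure at level midpoints and interfaces from the hybrid coefficients
# (hyam/hybm on midpoints, hyai/hybi on interfaces), the reference
# pressure P0, and the (lon, lat, time)-varying surface pressure PS.
p_mid = ds["hyam"] * ds["P0"] + ds["hybm"] * ds["PS"]
p_int = ds["hyai"] * ds["P0"] + ds["hybi"] * ds["PS"]

# Pressure thickness of each model level (one value per lev), usable as
# a mass weight for vertically resolved outputs like diabatic heating.
dp = p_int.diff("ilev")
```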

Output Dataset

I am not qualified to assess the trade-offs of the options listed here, but I am interested in learning.

rabernat commented 2 years ago

Mike thanks for getting this started!

In order to move forward here, we would need a concrete set of files to ingest, and the files would need to be accessible over the network somehow (http, ftp, scp, globus, etc.)

If you could post a link to one of the raw output files via any of these protocols, that would allow us to start experimenting with ingesting them.

mspritch commented 2 years ago

Hi Ryan,

Here is a link to a google drive folder with a handful of files:

https://drive.google.com/drive/folders/1P68TfWcAc-KsCdvu1R9uGIaA9iAd8bku?usp=sharing

The full data are housed on an internal UCI server that can only be accessed with the UCI VPN. Normally I would just fill in an external-collaborator permission form to give outside collaborators access to it. Will that work here? If not, I will see if our admins can help with the Globus route, as I do believe we have a Globus endpoint that should connect to it.

Mike.

rabernat commented 2 years ago

Normally I would just fill in an external collaborator permission form to give outside collaborators access to it. Will that work here?

No. I don't think we will be able to pull files over the VPN (at least not without complicated workarounds). We need to get to the point where a machine (rather than a human) can fetch the files. Globus is probably the best bet here.

Another option would be for us to create some sort of ingestion upload point, basically just a temporary bucket that can accept uploads, and then stage the recipe from there. So someone from your team would have to directly upload the files (push instead of pull). This might be a good solution for the many scenarios in which we just can't get access to the machine where the data live. @cisaacstern - what do you think about that idea?
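
One way such an upload point could work, sketched here with GCS v4 signed URLs (the bucket and object names are hypothetical):

```python
from datetime import timedelta

from google.cloud import storage

# Generate a time-limited signed URL that lets the data provider PUT a
# file into a staging bucket without holding any GCP credentials.
client = storage.Client()
blob = client.bucket("pangeo-forge-staging").blob("cesm2-sp/raw/file_0001.nc")

url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(hours=24),
    method="PUT",
    content_type="application/octet-stream",
)
print(url)
# The uploader then pushes the file with plain HTTP, e.g.:
#   curl -X PUT -H "Content-Type: application/octet-stream" \
#        --upload-file file_0001.nc "$URL"
```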

mspritch commented 2 years ago

Hi Ryan,

Thanks for this! I have a query out to our sysadmins to see if Globus will work here, and if not, I'm totally happy to upload to your ingestion point.

Mike.

cisaacstern commented 2 years ago

This might be a good solution for the many scenarios in which we just can't get access to the machine where the data live.

This is certainly a recurring scenario and this solution will definitely work from a technical perspective. In the interest of reproducibility and transparency, I suppose we'd just want to consider how to document provenance of pushed data.

rabernat commented 2 years ago

Via email we have been investigating the possibility of using Globus. Repeating some of that conversation here:

Ryan: A lot of the details are in this GitHub thread: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222#issuecomment-1065504550. Here is the relevant piece:

When using the latest Globus Connect version 5 endpoints, data access is via what we call "collections". Each collection has an associated DNS name, such that you can refer to collections directly. By default collections are assigned DNS names in the data.globus.org subdomain, but they can also be customized by deployments.

v5 also supports HTTP/S access to data, while enforcing a common security policy across the access mechanism. For example, here is a publicly accessible file: https://a4969.36fe.dn.glob.us/public/read-only/logo.png. For non-public data, users must authenticate (and be authorized to access the data) before it can be downloaded via HTTPS.

So the ideal solution from our point of view would be for you to have GCv5 endpoint and create a public collection for the associated data. Then we can pull it directly over HTTPS. Would this be possible?

Nate: I believe that Globus Collections functionality requires an upgraded site license. Our campus couldn't justify the pretty significant cost, and I don't think the effort to get a UC-wide license went anywhere.

Falling back to an SSH-based method might be best. Is the client system on a single IP that we can allow through our firewall?

rabernat commented 2 years ago

Following up on this, over in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222#issuecomment-1080830165, I managed to put together a proof of concept of reading from a Globus collection over HTTPS. So from the Pangeo Forge POV, using Globus collections is definitely the easiest and quickest option here.
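
Since a GCv5 collection exposes plain HTTPS, reading from it needs nothing Globus-specific. For example, using the public test file linked above (and assuming fsspec with its HTTP backend is installed):

```python
import fsspec

# Fetch the public example file from the Globus collection over HTTPS.
url = "https://a4969.36fe.dn.glob.us/public/read-only/logo.png"
with fsspec.open(url) as f:
    data = f.read()
print(len(data), "bytes")
```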

Alternatively, we could move the data to another system (maybe Cheyenne) that has the Globus collections feature enabled.

Falling back to an SSH-based method might be best. Is the client system on a single IP that we can allow through our firewall?

I'm not sure we can know a priori which IPs the requests will be coming from. That's a bit beyond our DevOps capability right now. I suppose "somewhere in Google Cloud" is too vague? And would there be a VPN involved?