CM2.6 is a high-resolution global climate model run by GFDL. There are two scenarios: a preindustrial control and a 1%-per-year CO2 increase.
We already have some CM2.6 data in Google Cloud: https://catalog.pangeo.io/browse/master/ocean/GFDL_CM2_6/. I created that copy manually.
Format: netCDF4, one file per month, with files grouped into different variable classes (e.g. surface, interior). File names look like 01800101.ocean.nc.
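To make the naming convention concrete, here is a minimal sketch of splitting such a file name into a model date and a variable class. The function name is hypothetical, and the YYYYMMDD reading of the date stamp is an assumption based on the example above.

```python
import datetime

def parse_cm26_filename(name):
    """Split a name like '01800101.ocean.nc' into (model date, variable class).

    Assumes the leading stamp is YYYYMMDD for the first day of the month
    covered by the file; adjust if the actual convention differs.
    """
    stamp, var_class, _ext = name.split(".")
    date = datetime.datetime.strptime(stamp, "%Y%m%d")
    return date, var_class

print(parse_cm26_filename("01800101.ocean.nc"))
# -> (datetime.datetime(180, 1, 1, 0, 0), 'ocean')
```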
Access: The data are stored in two places:
- On the GFDL supercomputer (accessible to very few people, with strict security).
- On CyVerse, from which the files can be downloaded with iRODS. A download command looks like `igetwild /iplant/home/shared/iclimate/control field_u.nc e &`. Special access tokens must be configured first.
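For fetching several variables, a hedged sketch of scripting the transfer is below. It assumes igetwild is on the PATH and iRODS credentials are already configured; the file-name suffixes are purely illustrative, and the trailing "e" argument is copied verbatim from the example command above.

```python
import subprocess

# Illustrative per-variable file-name suffixes; the real names depend on
# how the CM2.6 output files are organized on CyVerse.
suffixes = ["field_u.nc", "field_v.nc", "field_temp.nc"]

for suffix in suffixes:
    # Mirrors the example command above, one variable at a time.
    subprocess.run(
        ["igetwild", "/iplant/home/shared/iclimate/control", suffix, "e"],
        check=True,
    )
```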
Transformation / Alignment / Merging
In general, we want to concatenate the files in time. However, different variables in different files have different time resolutions (monthly, 5-day, daily).
Getting the files to concatenate cleanly required some manual tweaks (dropping variables and overwriting coordinates). There are glitches and inconsistencies between files from the same output set. Some workflows are documented in this repo.
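To illustrate the kind of cleanup involved, here is a minimal xarray sketch, assuming the per-month files for one variable class have been downloaded locally. The dropped variable name and the overwritten coordinates (xt_ocean, yt_ocean) are placeholders for whatever a given file set actually needs.

```python
import glob

import xarray as xr

files = sorted(glob.glob("*.ocean.nc"))

# Use the first file as the reference for coordinates that should be
# identical across the whole series.
ref = xr.open_dataset(files[0])

datasets = []
for path in files:
    ds = xr.open_dataset(path, chunks={"time": 1})
    # Drop variables that differ between files and block concatenation
    # ("drop_me" is a placeholder, not an actual CM2.6 variable name).
    ds = ds.drop_vars(["drop_me"], errors="ignore")
    # Overwrite coordinates that are inconsistent between files with the
    # values from the reference file (coordinate names are assumed).
    ds = ds.assign_coords(xt_ocean=ref["xt_ocean"], yt_ocean=ref["yt_ocean"])
    datasets.append(ds)

# Concatenate everything along the time dimension.
ds_all = xr.concat(datasets, dim="time")
```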
Output Dataset
I think we would like one Zarr dataset for each set of variables that share the same grid and temporal resolution, chunked in time. For 3D data we also need to chunk in space; the vertical dimension probably makes the most sense.
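A sketch of what the final write could look like, continuing from the ds_all dataset built in the sketch above; the chunk sizes and the name st_ocean for the vertical dimension are assumptions that would need tuning against the real files.

```python
# Chunk all variables in time; 3D variables are additionally chunked
# along the vertical dimension (st_ocean is assumed here; rename if the
# files use a different depth coordinate). Chunk sizes are illustrative.
ds_chunked = ds_all.chunk({"time": 12, "st_ocean": 5})

# Write a single consolidated Zarr store; the path could equally be a
# gcsfs mapper pointing at a Google Cloud Storage bucket.
ds_chunked.to_zarr("cm26_control_ocean.zarr", mode="w", consolidated=True)
```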