pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipes for MOM6 NeverWorld2 data #141

Open NoraLoose opened 2 years ago

NoraLoose commented 2 years ago

Source Dataset

The NeverWorld2 dataset is output from idealized primitive-equation MOM6 simulations and is useful for studying ocean mesoscale turbulence across a hierarchy of grid resolutions: 1/4, 1/8, 1/16, and 1/32 degree. In total, we have 8 experiments, because each resolution was run with two choices of hmix, the parameter that determines the depth of the idealized top boundary layer. The two choices for hmix are 5 m and 20 m.

The NeverWorld2 dataset is described in detail in Marques et al. (2022), in review. The model has intermediate complexity, incorporating basin-scale geometry with idealized Atlantic and Southern oceans, and with non-uniform ocean depth to allow for mesoscale eddy interactions with topography. The model is perfectly adiabatic and spans the equator, and thus fills a gap between quasi-geostrophic models, which cannot span two hemispheres, and idealized general circulation models, which generally have diabatic processes and buoyancy forcing.

  1. averages_*.nc (holds 5-day averages); one file per 500 days for the resolutions 1/4, 1/8, 1/16; one file per 100 days for the resolution 1/32
  2. snapshots_*.nc (holds snapshots at 5-day frequency); one file per 500 days for the resolutions 1/4, 1/8, 1/16; one file per 100 days for the resolution 1/32
  3. longmean_*.nc (holds 100-day averages, but over a longer time period than averages_*.nc and snapshots_*.nc); one file per 2000 days for the resolutions 1/4, 1/8; one file per 1000 days for the resolution 1/16; one file per 200 days for the resolution 1/32.
  4. static.nc (holds the grid information); 1 file
  5. ocean.stats.nc (holds time series of domain-integrated metrics like APE, KE over full spin-up); 1 file
  6. one restart file (so users can extend the runs); 1 file

Transformation / Alignment / Merging

For items 1-3 above (averages, snapshots, and longmean), the files should be concatenated along the time dimension.
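As a rough sanity check before concatenating, the expected number of time records per file can be worked out from the file list above. The sketch below assumes each file holds exactly (days per file) / (output interval) records with no extra endpoint record; the names `DAYS_PER_FILE`, `OUTPUT_INTERVAL_DAYS`, and `records_per_file` are hypothetical helpers, not part of MOM6 or Pangeo Forge.

```python
# Days covered by one file, by file kind and resolution (taken from the list above).
DAYS_PER_FILE = {
    "averages": {"1/4": 500, "1/8": 500, "1/16": 500, "1/32": 100},
    "snapshots": {"1/4": 500, "1/8": 500, "1/16": 500, "1/32": 100},
    "longmean": {"1/4": 2000, "1/8": 2000, "1/16": 1000, "1/32": 200},
}

# Output interval in days for each file kind (taken from the list above).
OUTPUT_INTERVAL_DAYS = {"averages": 5, "snapshots": 5, "longmean": 100}


def records_per_file(kind: str, resolution: str) -> int:
    """Return the expected number of time records in one file.

    Assumption: a file contains exactly days / interval records.
    """
    days = DAYS_PER_FILE[kind][resolution]
    interval = OUTPUT_INTERVAL_DAYS[kind]
    if days % interval != 0:
        raise ValueError(f"{days} days is not a multiple of {interval} days")
    return days // interval


if __name__ == "__main__":
    for kind, by_resolution in DAYS_PER_FILE.items():
        for resolution in by_resolution:
            print(kind, resolution, records_per_file(kind, resolution))
```

Under that assumption, a 1/4 degree snapshots file would hold 100 records, which matches the nitems_per_file=100 used in the recipe later in this thread.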

Output Dataset

Zarr

Please edit and/or comment @gustavo-marques, @rabernat. The discussion started over here.

NoraLoose commented 2 years ago

I wonder if it would be better to store 5. and 6. (ocean.stats.nc and restart files) within the NeverWorld2 github repo, where we provide input files for interested users. These files are pretty small.

@gustavo-marques?

gustavo-marques commented 2 years ago

The restart files can be large. For the 1/32 degree configuration, we have 3 files for each restart time, which together are > 10 GB. I thought that storing large netCDF files on GitHub was not ideal, but perhaps that has changed?

gustavo-marques commented 2 years ago

> one restart file (so users can extend the runs); 1 file

Restart files (so users can extend the runs): one file for the 1/4, 1/8, and 1/16 deg. configurations; 3 files for the 1/32 deg configurations.

NoraLoose commented 2 years ago

The reason I suggested storing the restart files elsewhere is that the goal of pangeo-forge is to provide "analysis-ready datasets". No one will analyze the restart files. 😄 If they are ~10 GB, we could think about more "traditional" storage options such as figshare?

gustavo-marques commented 2 years ago

Traditional storage options sound good.

rabernat commented 2 years ago

I have made progress! I have created a Globus Guest Collection on this data. The files are now publicly available over HTTPS. For example: https://g-f83d26.7a577b.6fbd.data.globus.org/nw2_0.03125deg_N15_baseline_hmix20/available_diags.000000

Now I can move forward with the recipe.

rabernat commented 2 years ago

@NoraLoose and @gustavo-marques -- it appears that these are netCDF3 files, not netCDF4 files. Can you confirm that MOM6 writes netCDF3 classic format?

Unfortunately there are some challenges working with netCDF3 in the cloud, see e.g. https://github.com/pangeo-forge/pangeo-forge-recipes/issues/361

Some of these files are close to 700GB, so this could get really bad. However, thanks to https://github.com/fsspec/kerchunk/pull/131 we should now be able to use kerchunk on netCDF3 files.

gustavo-marques commented 2 years ago

Thanks, @rabernat! It's possible to write netCDF4 files with MOM, but we have unintentionally chosen netCDF3 (64-bit offset) instead.
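One way to confirm the on-disk format without opening the whole file is to look at its first bytes: netCDF3 files begin with the ASCII bytes `CDF` followed by a version byte (1 for classic, 2 for 64-bit offset, 5 for CDF-5), while netCDF4 files are HDF5 files and normally begin with the HDF5 signature. The following is a minimal sketch; the function names are hypothetical, and it only checks offset 0, so an HDF5 file with a user block would be reported as unknown.

```python
# Standard 8-byte signature at the start of an HDF5 (and hence netCDF4) file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Version byte that follows the b"CDF" magic in netCDF3-family files.
NETCDF3_VERSIONS = {
    1: "netCDF3 classic",
    2: "netCDF3 64-bit offset",
    5: "netCDF3 64-bit data (CDF-5)",
}


def detect_netcdf_format(header: bytes) -> str:
    """Classify a file from its leading bytes (at least 8 bytes recommended)."""
    if header.startswith(HDF5_SIGNATURE):
        # Any HDF5 file matches here, not only netCDF4.
        return "netCDF4 (HDF5)"
    if len(header) >= 4 and header[:3] == b"CDF":
        return NETCDF3_VERSIONS.get(header[3], "unknown")
    return "unknown"


def detect_file_format(path: str) -> str:
    """Read the first 8 bytes of a local file and classify them."""
    with open(path, "rb") as f:
        return detect_netcdf_format(f.read(8))
```

For a remote file, the same check could be applied to the first few bytes fetched with an HTTP range request, assuming the server supports one.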

rabernat commented 2 years ago

Ok no worries. We will find a way forward!

LaureZanna commented 2 years ago

@rabernat et al: we are meeting for the revisions of the NW2 paper. Do you need anything from us to help find a way forward to upload the data? Thanks so much!

rabernat commented 2 years ago

I'll give you an honest answer. If you could manually reformat the data from netCDF3 to netCDF4, e.g. using NCO, that would unblock this problem immediately.

Other than that, we are probably looking at a timescale of 1 month to address the issues upstream. (There is progress--see https://github.com/fsspec/kerchunk/pull/131. But it will take a while to propagate through to the point where we can run Pangeo Forge with those new features.)
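For the manual netCDF3-to-netCDF4 reformatting suggested above, one possible route is NCO's `ncks` with its `-4` flag, which requests netCDF4 output. The sketch below only wraps that command; the helper names and the file names in the usage note are hypothetical, and it assumes `ncks` is installed and on the PATH.

```python
import shutil
import subprocess


def build_ncks_command(src: str, dst: str) -> list[str]:
    """Build the ncks invocation that rewrites src as a netCDF4 file at dst."""
    # "-4" asks ncks to write the output in netCDF4 format.
    return ["ncks", "-4", str(src), str(dst)]


def convert_to_netcdf4(src: str, dst: str) -> None:
    """Run ncks to convert one file; raises if ncks is missing or fails."""
    if shutil.which("ncks") is None:
        raise RuntimeError("ncks (part of NCO) was not found on PATH")
    subprocess.run(build_ncks_command(src, dst), check=True)
```

A call such as `convert_to_netcdf4("snapshots_in.nc", "snapshots_out.nc")` would convert a single file. For the largest files mentioned in this thread, runtime and scratch disk space would need to be checked first.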

rabernat commented 2 years ago

Here's an alternative idea: we don't have to use Pangeo Forge at all right now. Could the data be deposited in NCAR RDA? If so, that would give us the desired citeable public data artifact, while also leaving the door open down the line for ingesting into Pangeo Forge.

gustavo-marques commented 2 years ago

Thanks, @rabernat. I will look into hosting the data on RDA and will report back here.

gustavo-marques commented 2 years ago

We got the permission to share the datasets via Geoscience Data Exchange.

rabernat commented 2 years ago

@bonnland check out https://github.com/pangeo-forge/staged-recipes/issues/141#issuecomment-1172411649 for an example of accessing data on Glade via Globus.

rabernat commented 2 years ago

Just noting that we are also moving forward on the Pangeo Forge side.

The following recipe works:

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, FileType
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
from pangeo_forge_recipes.storage import StorageConfig, FSSpecTarget, CacheFSSpecTarget, MetadataTarget

def make_snapshot_url(time):
    # The f-string already interpolates `time`, so no extra .format() call is needed.
    return (
        'https://g-f83d26.7a577b.6fbd.data.globus.org/'
        f'nw2_0.25deg_N15_baseline_hmix5/snapshots_{time:08d}.nc'
    )

time_concat_dim = ConcatDim(
    "time",
    [30005], # 30505, 31005, 31505],
    nitems_per_file=100
)

pattern = FilePattern(make_snapshot_url, time_concat_dim, file_type=FileType.netcdf3)

recipe = XarrayZarrRecipe(
    pattern,
    subset_inputs={'time': 20},
    xarray_open_kwargs = {"decode_times": False},
    open_input_with_kerchunk=True
)

This works with https://github.com/pangeo-forge/pangeo-forge-recipes/pull/383 plus the latest kerchunk release (similar situation to #140).
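To extend the recipe beyond a single file and a single experiment, the URL function could be generalized along the lines below. This is a sketch under stated assumptions: the directory naming is extrapolated from the two URLs that appear in this thread (`nw2_0.03125deg_N15_baseline_hmix20` and `nw2_0.25deg_N15_baseline_hmix5`) and has not been checked for the other six experiments, and the start times and 500-day stride are taken from the commented-out list in the recipe, which applies to the 1/4 degree snapshots only. All helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASE_URL = "https://g-f83d26.7a577b.6fbd.data.globus.org"
RESOLUTIONS = [Fraction(1, 4), Fraction(1, 8), Fraction(1, 16), Fraction(1, 32)]
HMIX_VALUES = [5, 20]  # metres


def experiment_name(resolution: Fraction, hmix: int) -> str:
    """Directory name for one experiment (naming pattern is an assumption)."""
    return f"nw2_{float(resolution)}deg_N15_baseline_hmix{hmix}"


def snapshot_url(resolution: Fraction, hmix: int, time: int) -> str:
    """URL of the snapshots file that starts at model day `time`."""
    return f"{BASE_URL}/{experiment_name(resolution, hmix)}/snapshots_{time:08d}.nc"


def start_times(first: int, n_files: int, stride_days: int) -> list[int]:
    """Start days of consecutive files, spaced stride_days apart."""
    return [first + i * stride_days for i in range(n_files)]


if __name__ == "__main__":
    # Four resolutions times two hmix values gives the 8 experiments.
    for resolution, hmix in product(RESOLUTIONS, HMIX_VALUES):
        print(experiment_name(resolution, hmix))
    # Start times matching the list in the recipe's ConcatDim.
    for t in start_times(30005, 4, 500):
        print(snapshot_url(Fraction(1, 4), 5, t))
```

The list returned by `start_times(30005, 4, 500)` could be passed to `ConcatDim` in place of the single-element list once the remaining files are confirmed to exist.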