NoraLoose opened this issue 2 years ago
I wonder if it would be better to store 5. and 6. (ocean.stats.nc and restart files) within the NeverWorld2 GitHub repo, where we provide input files for interested users. These files are pretty small.
@gustavo-marques?
The restart files can be large. For the 1/32 deg, we have 3 files for each restart time, which together are > 10 GB. I thought that storing large netCDF files on GitHub was not ideal, but perhaps that has changed?
Restart files (so users can extend the runs): one file for the 1/4, 1/8, and 1/16 deg configurations; 3 files for the 1/32 deg configuration.
The reason I suggested storing the restart files elsewhere is that the goal of pangeo-forge is to provide "analysis-ready datasets". No one will analyze the restart files. 😄 If they are ~10 GB, we could think about more "traditional" storage options such as figshare?
Traditional storage options sound good.
I have made progress! I have created a Globus Guest Collection on this data. The files are now publicly available over HTTPS. For example: https://g-f83d26.7a577b.6fbd.data.globus.org/nw2_0.03125deg_N15_baseline_hmix20/available_diags.000000
Now I can move forward with the recipe.
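As a quick check that anonymous HTTPS access works, a few lines of Python suffice (a hedged sketch, not part of the original thread; it simply fetches the example URL above):

import requests

url = (
    "https://g-f83d26.7a577b.6fbd.data.globus.org/"
    "nw2_0.03125deg_N15_baseline_hmix20/available_diags.000000"
)
resp = requests.get(url)
resp.raise_for_status()   # anonymous HTTPS access should succeed
print(resp.text[:500])    # available_diags.* is a plain-text diagnostics listing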
@NoraLoose and @gustavo-marques -- it appears that these are netCDF3 files, not netCDF4 files. Can you confirm that MOM6 writes netCDF3 classic format?
Unfortunately there are some challenges working with netCDF3 in the cloud, see e.g. https://github.com/pangeo-forge/pangeo-forge-recipes/issues/361
Some of these files are close to 700GB, so this could get really bad. However, thanks to https://github.com/fsspec/kerchunk/pull/131 we should now be able to use kerchunk on netCDF3 files.
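As an aside, whether a given file is netCDF3 or netCDF4 can be told from its first bytes. A minimal sketch (assuming range requests work against the Globus HTTPS endpoint; the file name here is illustrative):

import fsspec

url = (
    "https://g-f83d26.7a577b.6fbd.data.globus.org/"
    "nw2_0.25deg_N15_baseline_hmix5/snapshots_00030005.nc"
)
with fsspec.open(url, "rb") as f:
    magic = f.read(8)  # only the leading bytes are needed

if magic.startswith(b"CDF\x01"):
    print("netCDF3 classic")
elif magic.startswith(b"CDF\x02"):
    print("netCDF3 64-bit offset")
elif magic.startswith(b"\x89HDF"):
    print("netCDF4 (HDF5-based)")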
Thanks, @rabernat! It's possible to write netCDF4 files with MOM, but we have unintentionally chosen netCDF3 (64-bit offset) instead.
Ok no worries. We will find a way forward!
@rabernat et al: we are meeting for the revisions of the NW2 paper. Do you need anything from us to help find a way forward to upload the data? Thanks so much!
I'll give you an honest answer. If you could manually reformat the data from netcdf3 to netcdf4, e.g. using nco, that would unblock this problem immediately.
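To sketch what that conversion could look like (an assumption about the workflow, not a prescription; with nco the rough equivalent would be ncks -4 in.nc out.nc, and the file names below are illustrative):

import xarray as xr

# Open lazily with dask so a very large netCDF3 file is not read into memory at once.
ds = xr.open_dataset("snapshots_00030005.nc", decode_times=False, chunks={"time": 1})
# Write it back out in netCDF4/HDF5 format.
ds.to_netcdf("snapshots_00030005.nc4", format="NETCDF4", engine="netcdf4")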
Other than that, we are probably looking at a timescale of 1 month to address the issues upstream. (There is progress--see https://github.com/fsspec/kerchunk/pull/131. But it will take a while to propagate through to the point where we can run Pangeo Forge with those new features.)
Here's an alternative idea: we don't have to use Pangeo Forge at all right now. Could the data be deposited in NCAR RDA? If so, that would give us the desired citeable public data artifact, while also leaving the door open down the line for ingesting into Pangeo Forge.
Thanks, @rabernat. I will look into hosting the data on RDA and will report back here.
We got permission to share the datasets via the Geoscience Data Exchange.
@bonnland check out https://github.com/pangeo-forge/staged-recipes/issues/141#issuecomment-1172411649 for an example of accessing data on Glade via Globus.
Just noting that we are also moving forward on the Pangeo-Forge side.
The following recipe works:
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, FileType
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def make_snapshot_url(time):
    # Build the HTTPS URL for the snapshots file starting at model day `time`.
    return (
        'https://g-f83d26.7a577b.6fbd.data.globus.org/'
        f'nw2_0.25deg_N15_baseline_hmix5/snapshots_{time:08d}.nc'
    )

time_concat_dim = ConcatDim(
    "time",
    [30005],  # 30505, 31005, 31505
    nitems_per_file=100,
)

pattern = FilePattern(make_snapshot_url, time_concat_dim, file_type=FileType.netcdf3)

recipe = XarrayZarrRecipe(
    pattern,
    subset_inputs={"time": 20},
    xarray_open_kwargs={"decode_times": False},
    open_input_with_kerchunk=True,
)
This works with https://github.com/pangeo-forge/pangeo-forge-recipes/pull/383 plus the latest kerchunk release (a similar situation to #140).
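For reference, a hedged sketch of executing the recipe locally with the 0.x pangeo-forge-recipes API (assuming copy_pruned and to_function are available in the installed version):

# Prune to the first couple of inputs for a quick test, then compile the
# recipe to a plain Python function and run it in-process.
pruned = recipe.copy_pruned()
pruned.to_function()()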
Source Dataset
The NeverWorld2 dataset is output from idealized primitive equation MOM6 simulations, and is useful for studying ocean mesoscale turbulence over a hierarchy of grid resolutions: 1/4, 1/8, 1/16, and 1/32 degree. In total, we have 8 experiments, because at each resolution the simulations were run with two different choices of hmix, which determines the depth of the idealized top boundary layer. The two choices for hmix are 5 m and 20 m.

The NeverWorld2 dataset is described in detail in Marques et al. (2022), in review. The model has intermediate complexity, incorporating basin-scale geometry with idealized Atlantic and Southern oceans, and with non-uniform ocean depth to allow for mesoscale eddy interactions with topography. The model is perfectly adiabatic and spans the equator, and thus fills a gap between quasi-geostrophic models, which cannot span two hemispheres, and idealized general circulation models, which generally include diabatic processes and buoyancy forcing.
The data is currently stored at /glade/campaign/univ/unyu0004/NeverWorld2/ and can only be accessed by users with an account. The dataset consists of the following files:

1. averages_*.nc (holds 5-day averages); one file per 500 days for the resolutions 1/4, 1/8, 1/16; one file per 100 days for the resolution 1/32
2. snapshots_*.nc (holds snapshots at 5-day frequency); one file per 500 days for the resolutions 1/4, 1/8, 1/16; one file per 100 days for the resolution 1/32
3. longmean_*.nc (holds 100-day averages, but over a longer time period than averages_*.nc and snapshots_*.nc); one file per 2000 days for the resolutions 1/4, 1/8; one file per 1000 days for the resolution 1/16; one file per 200 days for the resolution 1/32
4. static.nc (holds the grid information); 1 file
5. ocean.stats.nc (holds time series of domain-integrated metrics like APE and KE over the full spin-up); 1 file
6. Restart files (so users can extend the runs); one file for the 1/4, 1/8, and 1/16 deg configurations, 3 files for the 1/32 deg configuration

Transformation / Alignment / Merging
For items 1.–3. described above, the files should be concatenated along the time dimension.
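To illustrate the intended alignment (a minimal sketch, not part of the original issue; the paths are illustrative and assume the files have been staged locally):

import xarray as xr

# Concatenate all 5-day-average files of one configuration along "time".
ds = xr.open_mfdataset(
    "averages_*.nc",
    combine="nested",
    concat_dim="time",
    decode_times=False,
)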
Output Dataset
Zarr
Please edit and/or comment @gustavo-marques, @rabernat. The discussion started over here.