pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipes for Antarctic ice sheet paleo PISM ensemble #90

Open jkingslake opened 2 years ago

jkingslake commented 2 years ago

Source Dataset

Simulations of the Antarctic ice sheet over the last 20 ka, performed by @talbrecht using the Parallel Ice Sheet Model (PISM).

Albrecht, Torsten (2019): PISM parameter ensemble analysis of Antarctic Ice Sheet glacial cycle simulations. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.909728

Transformation / Alignment / Merging

All ensemble members and time snapshots should be combined into one xarray Dataset with dimensions corresponding to x, y, time, and the four model parameters. All 'timeseries.nc' files (each corresponding to one ensemble member) should also be collated into a second single xarray Dataset.
This involves an unstack step to give each of the four parameters its own dimension in the xarray, as discussed here.
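
A minimal sketch of that collate-and-unstack step (the input names and parameter column names below are illustrative placeholders, not necessarily the actual PISM ensemble parameters):

import pandas as pd
import xarray as xr

# Assumed inputs (hypothetical): `snapshot_files` lists the snapshot netCDFs,
# one per ensemble member, and `params` is a DataFrame with one row per member
# and one column per varied model parameter.
ds = xr.concat([xr.open_dataset(f) for f in snapshot_files], dim="ensemble_member")

# Turn the flat ensemble_member dimension into a MultiIndex over the four
# parameters, then unstack so each parameter becomes its own dimension.
param_index = pd.MultiIndex.from_frame(params[["esia", "essa", "ppq", "visc"]])
ds = ds.assign_coords(ensemble_member=param_index)
ds = ds.unstack("ensemble_member")  # dims: x, y, time, esia, essa, ppq, visc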

Output Dataset

One zarr store for each of the xarrays described above (two in total).

Progress so far

Much of this work has been done using a larger version of the model output (with more timeslices: one every 1 kyr instead of one every 5 kyr):

- all the timeslices and ensemble members were collated and unstacked into the correctly shaped xarray, then uploaded to GCS: https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/main/pism_paleo_nc_to_zarr.ipynb (note that this was done on the University of Potsdam's HPC and did NOT start with the zip file linked to above).
- then we made an intake catalog, here: https://github.com/ldeo-glaciology/pangeo-pismpaleo/blob/48b16dca56d3b736b6f05acdb63ca83744c4f8d4/intake_catalog_setup.ipynb

As described here, these data are now accessible from a Google Cloud Storage bucket, e.g.

import intake

cat = intake.open_catalog('https://raw.githubusercontent.com/ldeo-glaciology/pangeo-pismpaleo/main/paleopism.yaml')
snapshots1ka = cat["snapshots1ka"].to_dask()
mask_score_time_series = cat["mask_score_time_series"].to_dask()

These two zarrs are the result of collating all the timeseries.nc files and the snapshots_*.nc files, respectively (as described above). Additionally, we have

vels5ka = cat["vels5ka"].to_dask()
present = cat["present"].to_dask()

which contain just the velocities at 5 kyr resolution and the present-day state (t = 0 kyr BP) of the model, respectively.

Here is a notebook showing how to access these data in pangeo.

Question for @talbrecht and @rabernat: should we make this recipe with the smaller dataset contained in the zip, or do we want to use the larger dataset? I like the larger dataset because it is large enough to start really needing clusters, and its higher time resolution makes it more useful for comparing to observational data. What do you think?

talbrecht commented 2 years ago

Thanks @jkingslake for formulating this recipe. Yes, let's do this with the larger dataset, as the community seems to be interested in different variables (e.g. velocities) and different periods than are available in the subset I published at PANGAEA, which only contains the data necessary for the plots in the related journal publications.

cisaacstern commented 2 years ago

@jkingslake, thanks for submitting this request. Assuming we do go with the larger dataset, where do the source files live for that? Apologies if I missed that in your initial comment; it looked to me as if the info you've provided under Source Dataset above applies to the smaller dataset only?

Also please note Pangeo Forge currently does not support unzipping of source files. Source files must be individually accessible over HTTP, FTP, etc.

Looking forward to supporting you in making this recipe a reality!

talbrecht commented 2 years ago

The (larger) source dataset has not been published yet; it is stored on a high-performance computer in Germany. As a temporary option, I could produce an FTP link with a password that could be used to convert individual netCDF data files to zarr?

cisaacstern commented 2 years ago

I could produce a FTP link with password

This would work. We've done something similar before.

Out of curiosity, how large is the (larger) Source dataset? In terms of number of files and number of (giga)bytes.

talbrecht commented 2 years ago

It will be on the order of 50 GB and about 500 individual files (an ensemble of 256 members, each with one timeseries file of spatially aggregated variables (t) and one output file containing the 2D variables over time (x, y, t))... Yes, I could prepare the FTP link...

cisaacstern commented 2 years ago

Great. This sounds quite manageable.

Please let me know when the FTP link is available and we can begin the recipe development.

jkingslake commented 2 years ago

@cisaacstern, thanks for the engagement in this.

@talbrecht, you mention that people have been interested in velocities at a higher time resolution than the 5 kyr we currently have. So, does this mean we should aim for 1 kyr output of thickness, bed elevation, etc. (as we had before), plus the two components of velocity?

talbrecht commented 2 years ago

Yes, in the README you will find that for the whole ensemble I have the following variables available every 1000 years: 'thk', 'mask', 'topg', 'usurf', 'velbar_mag', 'dbdt', 'bmelt', while the two velocity components 'u_ssa' and 'v_ssa' are only available every 5000 years. However, for the reference simulation (6165c) I reran the simulation with velocity output every 1000 years, so this could be a separate subset?

rabernat commented 2 years ago

Also please note Pangeo Forge currently does not support unzipping of source files. Source files must be individually accessible over HTTP, FTP, etc.

This is not actually true! Fsspec can see inside zip files!

import xarray as xr
from fsspec.implementations.zip import ZipFileSystem

url = "https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip"
fs = ZipFileSystem(url)
fs.ls("datapub")  # -> list the files

with fs.open('datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc') as fp:
    ds = xr.open_dataset(fp)
    ds.load()

ds.thk.plot()

[image: plot of ice thickness, ds.thk]

We just need to work out how to encode the compound URL: https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip + datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc into a single URL that fsspec understands. @martindurant will know how to do that.

martindurant commented 2 years ago

Should be

import fsspec
import xarray as xr

of = fsspec.open("zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip")
with of as f:
    ds = xr.open_dataset(f)
    ...

talbrecht commented 2 years ago

Hi, I have uploaded 35 GB of data (not zipped), which can be seen via this temporary link:

rsync rsync://rsync.pik-potsdam.de/paleo_ensemble/

or downloaded with

rsync -r rsync://rsync.pik-potsdam.de/paleo_ensemble model_data

The velocity snapshots are concatenated into one netCDF file for each ensemble member (5 ka), while all other data can be found in the extra files (1 ka). I added a simulation, 6165c, equivalent to the reference simulation (6165) but with velocity snapshots every 1 ka.

cisaacstern commented 2 years ago

Great! A description of the recipe development process is given here:

https://pangeo-forge.readthedocs.io/en/latest/intro_tutorial.html

(This documentation is still quite fresh, so we definitely welcome feedback on it!)

As you will see, the first step is forking this repo and creating a new subdirectory for your recipe within it. Once that happens, you don't have to wait until the recipe is complete before opening a PR: I encourage you to open a PR against this repo with an early draft, so that we can all provide feedback and support you throughout the development process.
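
For reference, a staged recipe typically ends up looking something like this (the recipe name here is just a placeholder):

staged-recipes/
└── recipes/
    └── paleo-pism/
        ├── meta.yaml   # recipe metadata: title, provenance, maintainers
        └── recipe.py   # FilePattern and recipe definition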

rabernat commented 2 years ago

Just noting that we do not necessarily need an unzipped copy of the data. As demonstrated above (https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932760875), we can open and download the data directly from a zip file over HTTP. It would be better to use the "official" source of the data (via PANGAEA) than to create a "temporary link", because the former is more likely to be persistent.

When creating the recipe, the FilePattern formatting function could return paths of the form zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip. This would eliminate the need for a temporary mirror of the unzipped data.
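
A sketch of such a formatting function using pangeo-forge-recipes' FilePattern (the snapshot times in the keys below are hypothetical and for illustration only; the real keys would enumerate all timeslices):

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

ZIP_URL = (
    "https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/"
    "parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip"
)

def make_url(time):
    # Compound fsspec URL pointing at one netCDF member of the remote zip
    return f"zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_{time}.000.nc::{ZIP_URL}"

time_dim = ConcatDim("time", keys=[-20000, -15000, -10000, -5000])
pattern = FilePattern(make_url, time_dim)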

cisaacstern commented 2 years ago

Based on https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932584463, it seems the high resolution dataset of interest is not yet published. It would be preferable from a provenance and recipe standpoint to build the Zarr store from published sources (zipped or otherwise). @talbrecht, how long do we expect before the data is available in published form?

jkingslake commented 2 years ago

@talbrecht, I am happy to start the process of making the recipe, but I will wait until we have finalized which dataset we use.

I am guessing that you weren't actually planning on publishing the higher-resolution version on PANGAEA.

jkingslake commented 2 years ago

OK, I forked the repo and put in the example meta.yaml and recipe.py files to get things started. https://github.com/ldeo-glaciology/staged-recipes/tree/paleo-pism

talbrecht commented 2 years ago

Thanks @jkingslake for starting the recipe process. Well, my experience is that publishing the data in PANGAEA takes a couple of weeks, and the dataset would then be limited to 15 GB. I can zip the data if this is preferred; as the data are already deflate-compressed, zipping would not help much, so I would try to split them into two publications or reduce precision. Yes, the rsync link is temporary. I thought we could convert it to zarr format and store (publish) it somewhere permanently (in the cloud)?

cisaacstern commented 2 years ago

we could convert it to zarr format and store (publish) it somewhere permanently (in the cloud)?

Yes, with the exception of the permanent part. While any zarr store we write to the cloud is likely to persist for some time, the current design of Pangeo Forge does not allow for it to serve as a permanently published version.

In cases where publishing is impractical, we can write zarr stores from temporary sources, but it's best if we can do so from published/permanent sources, so that if the zarr store were ever to disappear in the future, it could be rebuilt from the same source. In addition, working from a permanent source means that downstream data users have the option of rebuilding the same zarr store with different parametrization to suit their research objectives (e.g., to a different location, or with different chunking).

talbrecht commented 2 years ago

Yes, ok, makes sense. Then I will contact PANGAEA...

jkingslake commented 2 years ago

@talbrecht, how do we tell what parameter values are used for each ensemble member from the netCDFs in the zip file? I started trying to collate them (just to see if I could, not to make the final recipe), but I realized I didn't know how to tell which parameters correspond to each one. In your notebook, the parameter values come from some .csv files, not the netCDFs.

talbrecht commented 2 years ago

Yes, the csv file is located in the folder "aggregated_data", available via both the PANGAEA and rsync links (the latter of which, I just realized, seems not to be complete yet). The two tables are attached: pism1.0_paleo06_6000.csv, le_all06_16km.txt
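
For anyone following along, a minimal sketch of reading that parameter table with pandas (the path assumes the "aggregated_data" folder mentioned above; the column names will depend on the file itself):

import pandas as pd

# One row per ensemble member, one column per varied parameter
params = pd.read_csv("aggregated_data/pism1.0_paleo06_6000.csv")
print(params.head())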

rabernat commented 2 years ago

In cases where publishing is impractical, we can write zarr stores from temprorary sources, but it's best if we can do so from published/permanent sources, so that if the zarr store were ever to disappear in the future, it can be rebuilt from the same source.

I think we need to think through this scenario more carefully. Under what circumstances can we actually just publish the data? Not all contributors will have their data in an existing repository, and I think we should support this somehow. As @talbrecht said, most existing repositories have very small limits on the size of their archives; Pangeo Forge can help get around that limitation.

Despite what I said above in https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-938729722, at this [early, experimental] stage of development of Pangeo Forge, I don't think we should exclude data that is just stored temporarily on an FTP server, especially if its size exceeds what is possible in existing "official" repositories. Perhaps we should move ahead with the recipe outside of PANGAEA.

cisaacstern commented 2 years ago

Under what circumstances can we actually just publish the data? Not all contributors will have their data in an existing repository?

Should we start an Issue in https://github.com/pangeo-forge/roadmap/issues to discuss a generalized "policy" for this?

(That discussion doesn't have to block work on this particular recipe, of course.)

talbrecht commented 2 years ago

I have submitted a new dataset to Pangaea, including ice thickness, bed topography, basal melt rates (about 15GB) and ice flow velocity components (about 7GB). Unfortunately, they are "currently facing a high rate of data submissions ... and thus the editorial process and minting of DOI names might take up to 12 weeks." I'll keep you updated...

rabernat commented 2 years ago

Then let's go ahead and make the data submission via the FTP server.

talbrecht commented 2 years ago

OK, the rsync link mentioned above should now point to the two zipped datasets, which will hopefully be published in PANGAEA in a few weeks...

rsync rsync://rsync.pik-potsdam.de/paleo_ensemble/

or downloaded with

rsync -r rsync://rsync.pik-potsdam.de/paleo_ensemble model_data

The velocity snapshots are concatenated into one netCDF file for each ensemble member (5 ka), while all other data can be found in the extra files (1 ka). I added a simulation, 6165c, equivalent to the reference simulation (6165) but with velocity snapshots every 1 ka.

rabernat commented 2 years ago

Thanks a lot @talbrecht! I'm really excited about this.

I want to clarify that Pangeo Forge cannot use the rsync protocol. (We only support protocols that fsspec has implemented.) So we need to be able to access the data via http / [s]ftp, etc. Is there any other protocol available? @martindurant, do you have any insight on this?
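
For reference, a quick way to list the protocols the installed fsspec knows about (rsync is not among them):

from fsspec.registry import known_implementations

# fsspec's registry of known protocol implementations; "http", "ftp", and
# "sftp" are present, but "rsync" is not.
print(sorted(known_implementations))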

talbrecht commented 2 years ago

You could try this link from my personal website: http://www.pik-potsdam.de/~albrecht/pism_pangeo/ ?!

talbrecht commented 2 years ago

Did anyone try the http link?

cisaacstern commented 2 years ago

Did anyone try the http link?

Hi @talbrecht thanks for following up.

@jkingslake, is this recipe something you are interested in working on? Pangeo Forge is designed as a platform to support recipe contributions from the community, and we would love to have your participation. If so, the Introduction Tutorial is a great place to start, and I can happily respond to any questions you have. (If that Tutorial is unclear in any way, we can also use your feedback to improve it.)

jkingslake commented 2 years ago

Yes, this is something I am interested in working on.

I just won't be able to get to it for a while, unfortunately.

rabernat commented 2 years ago

I'm going to try to convince someone to work on this at the OSM tutorial tomorrow.

jkingslake commented 2 years ago

That would be great @rabernat !

Thanks to @talbrecht, the larger dataset discussed above is now online:

https://doi.org/10.1594/PANGAEA.940149

jkingslake commented 2 years ago

https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932815548 mentions a format for a compound URL that allows us to look inside the remote zip files.

I get a FileNotFoundError when trying to use it. The data are still there, as I can load them in the more convoluted way shown in https://github.com/pangeo-forge/staged-recipes/issues/90#issuecomment-932760875

Any ideas what could be causing this?

of = fs.open("zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip")
with of as f:
    ds = xr.open_dataset(f)

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [57], in <cell line: 1>()
----> 1 of = fs.open("zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip")
      2 with of as f:
      3     ds = xr.open_dataset(f)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/spec.py:1037, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1035 else:
   1036     ac = kwargs.pop("autocommit", not self._intrans)
-> 1037     f = self._open(
   1038         path,
   1039         mode=mode,
   1040         block_size=block_size,
   1041         autocommit=ac,
   1042         cache_options=cache_options,
   1043         **kwargs,
   1044     )
   1045     if compression is not None:
   1046         from fsspec.compression import compr

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/zip.py:97, in ZipFileSystem._open(self, path, mode, block_size, autocommit, cache_options, **kwargs)
     95 if mode != "rb":
     96     raise NotImplementedError
---> 97 info = self.info(path)
     98 out = self.zip.open(path, "r")
     99 out.size = info["size"]

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/archive.py:42, in AbstractArchiveFileSystem.info(self, path, **kwargs)
     40     return self.dir_cache[path + "/"]
     41 else:
---> 42     raise FileNotFoundError(path)

FileNotFoundError: datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip

cisaacstern commented 2 years ago

@martindurant, any idea why the compound URL given in the previous comment is not opening as expected?

martindurant commented 2 years ago

Did you mean fsspec.open()?

jkingslake commented 2 years ago

Yes! Good call! I am used to importing fsspec as fs, but hadn't this time. It's working as expected now. Thanks!
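
For completeness, the working version with the module-level function:

import fsspec
import xarray as xr

# fsspec.open (the module-level function), not ZipFileSystem.open
of = fsspec.open(
    "zip://datapub/model_data/pism1.0_paleo06_6255/snapshots_-10000.000.nc"
    "::https://hs.pangaea.de/model/PISM/Albrecht-etal_2019/parameter-ensemble/Part2_pism_paleo_ensemble_v2.zip"
)
with of as f:
    ds = xr.open_dataset(f)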