pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Example pipeline for SWOT-Xover #14

Open roxyboy opened 3 years ago

roxyboy commented 3 years ago

Source Dataset

SWOT-Xover is a subset of a few basin-scale model outputs at ~1/50° resolution, with hourly surface data and daily interior data. The subsets will cover the cross-over regions of the SWOT fast-sampling phase.

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.

Output Dataset

The zarrification of the data should be automated via the pangeo-forge pipeline, following a pangeo-forge recipe. To facilitate the automation, we would ask each modelling group to provide the outputs in netCDF4 format and make them available via an FTP server. A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30 GB. With the four regions, six months and five models, this would sum up to ~3.6 TB in total on the cloud storage. The chunks of the Zarr dataset will be on the order of {'time': 30, 'z': 5, 'y': 100, 'x': 100}. For the surface, a single monthly file of hourly-averaged data of SST, SSS, SSH, wind stress & buoyancy fluxes in one region is ~380 MB. With the four regions, six months and five models, this sums up to ~45 GB. The chunks of the Zarr dataset will be on the order of {'time': 100, 'y': 100, 'x': 100}.
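
As a sketch of what this automation could look like with pangeo-forge-recipes (the URL template, month keys and file layout below are hypothetical placeholders, not the actual servers), a recipe for one region of one model might be:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Hypothetical month keys; six months of daily-averaged interior data
months = ["2010-01", "2010-02", "2010-03", "2010-04", "2010-05", "2010-06"]

def make_url(time):
    # Placeholder FTP layout; each modelling group's real layout will differ
    return f"ftp://example-model-center.org/region01/interior-daily_{time}.nc"

# One monthly file holds ~30 daily time steps
pattern = FilePattern(make_url, ConcatDim("time", keys=months, nitems_per_file=30))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 30, "z": 5, "y": 100, "x": 100},  # chunking proposed above
)
```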

rabernat commented 3 years ago

Can you provide more details about the input files? How big are they? What URLs will we use to download them?

roxyboy commented 3 years ago

I was hoping that each modelling group could upload their zarrified data to the Wasabi cloud storage...

rabernat commented 3 years ago

I was hoping that each modelling group could upload their zarrified data to the Wasabi cloud storage...

Then this is not a Pangeo Forge pipeline. The point of Pangeo Forge is to automatically put together the Zarr in the cloud.

What you propose is fine--it's just not part of Pangeo Forge. Let's leave this open for now as we figure out the best path forward.

roxyboy commented 3 years ago

Then this is not a Pangeo Forge pipeline. The point of Pangeo Forge is to automatically put together the Zarr in the cloud.

What you propose is fine--it's just not part of Pangeo Forge. Let's leave this open for now as we figure out the best path forward.

I think we're going to try the Pangeo Forge pipeline for the eNATL60 data. Depending on how this goes, we may recommend that other modelling centers follow the pipeline.

rabernat commented 3 years ago

Great! To move forward, we need some more details about exactly where to find the data and how it is formatted. Please edit your original issue to conform to the template (https://github.com/pangeo-forge/staged-recipes/issues).

roxyboy commented 3 years ago

Yes, I'm still working on extracting the cross-over regions (which, surprisingly, takes time with these massive netCDF files), but I will update the details as soon as I get this hashed out.

rabernat commented 3 years ago

(which surprisingly takes time dealing with massive netCDF files)

If only there were a better format! 🤣 😉

roxyboy commented 3 years ago

This is getting a bit ahead of ourselves, but in the case where we ask the modelling groups to provide their data via ftp or OPeNDAP links for the pangeo-forge pipeline, would the "computation" costs of uploading them to the cloud come out of the payments we'll be making to 2i2c? I'm only asking because I think it would be best to reduce the amount of hassle each modelling group goes through. The idea I had in mind was to develop the pipeline on the SWOT-AdAC JupyterHub.

roxyboy commented 3 years ago

@rabernat I added a bit more detail in the output data section. Is this sufficient?

rabernat commented 3 years ago

Is this sufficient?

Can you provide an actual working FTP link to one of the datasets?

would the "computation" costs to upload them to the cloud come out from the payments we'll be making to 2i2c?

No, they will be supported by Pangeo Forge and our NSF grant. 2i2c is for the JupyterHub.

A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30 GB.

This will require https://github.com/pangeo-forge/pangeo-forge/issues/49, a feature that is not yet implemented. We are working on it.

roxyboy commented 3 years ago

Can you provide an actual working FTP link to one of the datasets?

Sorry for the lagged response. Here is a working link: https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/Interior/eNATL60/catalog.html

roxyboy commented 3 years ago

Here is another working ftp link for INALT60: https://data.geomar.de/downloads/20.500.12085/0e95d316-f1ba-47e3-b667-fc800afafe22/data/

rabernat commented 3 years ago

Ok thanks for these. Will have a look soon.

I talked with @lesommer, and we decided to try putting this data in OSN for now.

roxyboy commented 3 years ago

The eNATL60 regional outputs for regions 1-3 are now all available here: https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/catalog.html

rabernat commented 3 years ago

FYI, that server is giving SSL certificate errors.

```
$ curl -I https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html
```

Would it be possible to get this fixed?

roxyboy commented 3 years ago

FYI, that server is giving SSL certificate errors.

```
$ curl -I https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html
```

Would it be possible to get this fixed?

@auraoupa Do you know why this is happening...?

AurelieAlbert commented 3 years ago

Yes, it is a known issue with our OPeNDAP server (the certificate has expired). We work around it by adding --no-check-certificate to our wget commands; the equivalent for curl would be --insecure (I did not try it). But it may be more efficient (and cleaner) to have it fixed... I'll try to make it happen!

rabernat commented 3 years ago

We need to get the files via fsspec and unfortunately I don't (yet) know how to work around the certificate error...but there must be a way! I'll try to dig deeper on my end too.
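
One thing that might work (untested, so treat this as a guess at the right plumbing) is to pass aiohttp's ssl=False request option through fsspec, the moral equivalent of curl --insecure:

```python
import fsspec
import xarray as xr

url = (
    "https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/"
    "meomopendap/extract/SWOT-Adac/Surface/eNATL60/"
    "Region03-surface-hourly_2010-04.nc"
)

# Assumption: fsspec's HTTP filesystem forwards extra storage options to each
# aiohttp request, so ssl=False should skip certificate verification.
with fsspec.open(url, ssl=False) as fp:
    ds = xr.open_dataset(fp)
```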

rabernat commented 3 years ago

Now it looks like the server https://ige-meom-opendap.univ-grenoble-alpes.fr/ is down completely? This is making it hard to develop the recipe.

auraoupa commented 3 years ago

Sorry about that, it should be OK now. As for the certificate, the University says it should be fixed soon! I'll keep you posted.

auraoupa commented 3 years ago

The certificate is now valid, I hope it helps for the development of the recipe !

rabernat commented 3 years ago

Success!

```python
import fsspec
import xarray as xr

url = 'https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc'
with fsspec.open(url) as fp:
    ds = xr.open_dataset(fp)
    display(ds)
```
```
<xarray.Dataset>
Dimensions:        (time_counter: 720, x: 574, y: 675)
Coordinates:
    nav_lon        (y, x) float32 ...
    nav_lat        (y, x) float32 ...
    time_centered  (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
  * time_counter   (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
    depth          (y, x) float32 ...
    lat            (y, x) float32 ...
    lon            (y, x) float32 ...
    e1t            (y, x) float64 ...
    e2t            (y, x) float64 ...
    e1f            (y, x) float64 ...
    e2f            (y, x) float64 ...
    e1u            (y, x) float64 ...
    e2u            (y, x) float64 ...
    e1v            (y, x) float64 ...
    e2v            (y, x) float64 ...
Dimensions without coordinates: x, y
Data variables:
    sossheig       (time_counter, y, x) float32 ...
    sozocrtx       (time_counter, y, x) float32 ...
    somecrty       (time_counter, y, x) float32 ...
    sosstsst       (time_counter, y, x) float32 ...
    sosaline       (time_counter, y, x) float32 ...
    sozotaux       (time_counter, y, x) float32 ...
    sometauy       (time_counter, y, x) float32 ...
    qt_oce         (time_counter, y, x) float32 ...
    sowaflup       (time_counter, y, x) float32 ...
    tmask          (y, x) int8 ...
    umask          (y, x) int8 ...
    vmask          (y, x) int8 ...
    fmask          (y, x) int8 ...
```

roxyboy commented 3 years ago

Success!


Sorry, I missed this. This is great news! Could you let us know the status of the data storage on OSN, @rabernat?

rabernat commented 3 years ago

The status is that I'm still working on it. I hope to be able to start ingesting data soon (next week). I'm deeply sorry for the delays and I thank you for your patience.

lesommer commented 3 years ago

Thanks for all your work on this, @rabernat!

roxyboy commented 3 years ago

I started a PR #24 for the recipe.

roxyboy commented 3 years ago

@rabernat Could we prioritize pushing the surface data to the cloud for all available models (in #26, #27, #29) before the interior 3D data? Since we have a few different models ready to push, I think there are already a few inter-model analyses that could be done with just the surface data :)

roxyboy commented 3 years ago

@rabernat @cisaacstern I've started analyzing the SWOT-AdAC data (#24 #26 #29) on a Google Cloud-based JupyterHub, but does OSN also support storing analysis data?

rabernat commented 3 years ago

does OSN also support storing analysis data?

No, we cannot provide write access to OSN.

Can you explain more about the use case you have in mind? How much data do you imagine needing to write? Does it need to be shared across users?

For writing data, you have a few options:

cisaacstern commented 3 years ago

I believe all of the surface datasets are now on OSN. Returning to this main thread to provide a high-level "flyover" of how it's organized. Note that below, fs_osn and swot are always defined as:

```python
import s3fs

endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url})
swot = "Pangeo/pangeo-forge/swot_adac"
```

💺 Fasten your seatbelt, this will be a long one!

INALT60 #26

```python
fs_osn.ls(f"{swot}/INALT60")
```
```
['Pangeo/pangeo-forge/swot_adac/INALT60/grid.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_flux_1d.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_4h.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_5d.zarr']
```

We currently have a single Zarr store for each surface dataset. The time dimension of these data is non-contiguous, as seen in the recipe here. If it's useful, I can separate each of these surface datasets into separate seasonal stores, as demonstrated for the other recipes below.

GIGATL #27

```python
fs_osn.ls(f"{swot}/GIGATL")
```
```
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/surf_reg_01.zarr']
```

@roxyboy, unless you need it for something, I will delete surf_reg_01.zarr, which is missing the input for Jan 28, as you identified in https://github.com/pangeo-forge/staged-recipes/pull/27#issuecomment-853104775.

For each region's surface data, there are both aso (Aug, Sep, Oct) and fma (Feb, Mar, Apr) stores:


```python
fs_osn.ls(f"{swot}/GIGATL/Region01/surf")
```
```
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/fma.zarr']
```
```python
fs_osn.ls(f"{swot}/GIGATL/Region02/surf")
```
```
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/fma.zarr']
```

The fma stores should both contain the previously missing Jan 28 data. (h/t @rabernat for showing me how to amend and reuse the existing cache.)

HYCOM50 #29

```python
fs_osn.ls(f"{swot}/HYCOM50")
```
```
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_01.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_02.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_03.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_01.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_02.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_03.zarr']
```

For each region defined in the recipe, there are both aso and fma stores:

```python
fs_osn.ls(f"{swot}/HYCOM50/Region01_GS/surf")
```
```
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/fma.zarr']
```
```python
fs_osn.ls(f"{swot}/HYCOM50/Region02_GE/surf")
```
```
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/fma.zarr']
```
```python
fs_osn.ls(f"{swot}/HYCOM50/Region03_MD/surf")
```
```
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/fma.zarr']
```


@roxyboy, surf_01.zarr, surf_02.zarr, and surf_03.zarr are the earlier drafts where non-contiguous data is concatenated together. Do you have any use for them now that the seasonal stores are up? If not, I'll delete.

eNATL60 #24

```python
fs_osn.ls(f"{swot}/eNATL60")
```
```
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03']
```

For each of the regions, aso and fma stores are provided for the surface_hourly data:

```python
fs_osn.ls(f"{swot}/eNATL60/Region01/surface_hourly")
```
```
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/fma.zarr']
```
```python
fs_osn.ls(f"{swot}/eNATL60/Region02/surface_hourly")
```
```
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/fma.zarr']
```
```python
fs_osn.ls(f"{swot}/eNATL60/Region03/surface_hourly")
```
```
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/fma.zarr']
```
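
For completeness, any of these stores can be opened lazily with xarray; a minimal sketch (assuming consolidated metadata, which may not be present on every store):

```python
import s3fs
import xarray as xr

endpoint_url = "https://ncsa.osn.xsede.org"
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": endpoint_url})

# Open one seasonal store lazily; nothing is downloaded until values are accessed
store = fs_osn.get_mapper(
    "Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/aso.zarr"
)
ds = xr.open_zarr(store, consolidated=True)  # drop the flag if metadata isn't consolidated
```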


Next steps

@roxyboy, please let me know if you run into any issues with any of the above. Also, what should we work on next? Adding the interior data?

roxyboy commented 3 years ago

This is great! Thanks @cisaacstern .

INALT60 #26

We currently have a single Zarr store for each surface dataset. The time dimension of these data is non-contiguous, as seen in the recipe here. If it's useful, I can separate each of these surface datasets into separate seasonal stores, as demonstrated for the other recipes below.

The time metadata for INALT60 is in a Gregorian calendar, so I think it's fine to keep it as it currently is; the seasons are easy enough to parse out.

GIGATL #27

@roxyboy, unless you need it for something, I will delete surf_reg_01.zarr which is missing the input for Jan 28 as you identified in #27 (comment).

Yes, please delete surf_reg_01.

HYCOM50 #29

@roxyboy, surf_01.zarr, surf_02.zarr, and surf_03.zarr are the earlier drafts where non-contiguous data is concatenated together. Do you have any use for them now that the seasonal stores are up? If not, I'll delete.

Please feel free to delete surf_01.zarr, surf_02.zarr, and surf_03.zarr.

@roxyboy, please let me know if you run into any issues with any of the above. Also, what should we work on next? Adding the interior data?

Yes, fluxing the interior data to the cloud would be greatly appreciated :)

roxyboy commented 3 years ago

@cisaacstern Model outputs from FESOM are in the making, but can the recipes handle netCDF4 files archived with tar?

rabernat commented 3 years ago

but can the recipes handle netCDF4 files archived with tar?

It would be ideal if we could avoid tarring inputs. But if this is unavoidable, we will find a way to deal with it.

rabernat commented 3 years ago

@roxyboy - Today @cisaacstern and I met to discuss this. It will introduce significant complexity in Pangeo Forge to handle the tarred files. We are not sure this effort is worth it since there is an easy workaround--can we just ask the data provider to un-tar the files before putting them online? That is a reasonable request, no?
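
(For the providers, the untarring itself should be trivial; a minimal Python sketch, with a hypothetical archive name:)

```python
import tarfile

# Hypothetical archive name; extract the netCDF members before putting them online
with tarfile.open("fesom_region01_surface.tar.gz", "r:gz") as tf:
    tf.extractall("untarred")
```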

roxyboy commented 3 years ago

@roxyboy - Today @cisaacstern and I met to discuss this. It will introduce significant complexity in Pangeo Forge to handle the tarred files. We are not sure this effort is worth it since there is an easy workaround--can we just ask the data provider to un-tar the files before putting them online? That is a reasonable request, no?

Yes, I've asked them to untar it. Will be making a PR for FESOM soon.

roxyboy commented 3 years ago

We've (@lesommer and I) decided that hosting regional extracts from LLC4320 on OSN is probably better than pulling the data from the ECCO portal for analysis. Dimitris asked if he could push the data directly himself from where LLC4320 sits after the extraction; is this possible? Otherwise, I can ask him to (temporarily) put it on an ftp server.

cisaacstern commented 3 years ago

@roxyboy, as far as I'm aware, we're not able to provide write access to OSN. If you point me to the files on a temporary ftp server, however, I can write them to the swot_adac bucket for you. Will these files be netCDFs? How many of them are there and what is their total size?
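
Assuming the files do land on a temporary ftp server, the transfer could be a simple fsspec stream copy; a sketch with hypothetical paths (writing to the bucket would require credentials, hence anon is not set):

```python
import fsspec

# Hypothetical source and destination paths
src = "ftp://example.org/llc4320/Region01_2011-11.nc"
dst = "s3://Pangeo/pangeo-forge/swot_adac/llc4320/Region01_2011-11.nc"

with fsspec.open(src, "rb") as fin, fsspec.open(
    dst, "wb", client_kwargs={"endpoint_url": "https://ncsa.osn.xsede.org"}
) as fout:
    while chunk := fin.read(2**24):  # stream in 16 MiB chunks
        fout.write(chunk)
```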

cisaacstern commented 3 years ago

Brief progress report below. Simulations with no emojis mean that we haven't started a recipe yet.

| Name | Recipe | Surface | Interior |
| --- | --- | --- | --- |
| eNATL60 | | | |
| MEDWEST60 | | | |
| Mediterranean | | | |
| GIGATL | | | |
| HYCOM50 | | | |
| llc4320 | | | |
| INALT60 | | | |
| FESOM | | | |
| SM-telescope | | | |

Edit (July 2): As mentioned in https://github.com/pangeo-forge/staged-recipes/pull/29#issuecomment-873177433, updated table to reflect that HYCOM50 int data is online.

Edit (July 3): Updated table to reflect that GIGATL int data is online; xref https://github.com/pangeo-forge/staged-recipes/pull/27#issuecomment-873427139.

Edit (July 19): FESOM surface data added to project catalog: https://github.com/pangeo-data/swot_adac_ogcms/pull/2

Edit (July 20): eNATL60 surface data added to catalog: https://github.com/pangeo-data/swot_adac_ogcms/pull/3

roxyboy commented 2 years ago

@rabernat @cisaacstern What do the Pangeo folks think about hosting on OSN the global 1/25° HYCOM surface data of u, v, and SSH developed by Brian Arbic's group? The storage will likely be on the order of 8 TB. The idea is that it'll benefit the SWOT-AdAC community by having global access to both LLC4320 and HYCOM25. As an example, we/I can work on hosting a Jupyter notebook showing the transition scales in the Pangeo gallery. @lesommer can fill in the details of the discussion he had with Brian if necessary.

rabernat commented 2 years ago

YES to global HYCOM on OSN.

cisaacstern commented 2 years ago

Standing by to assist with the recipe once @roxyboy and/or @lesommer points us to the source files.

This will be a good test for our new Google Cloud Bakery once it comes online.