leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

Ocean Reanalysis System 5 [ORAS5 ECMWF] #49


sckw commented 1 year ago

Dataset Name

Ocean Reanalysis System 5

Dataset URL

https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-oras5?tab=form

Description

This is an ocean reanalysis dataset from ECMWF. It provides 3D, global, gridded (0.25° x 0.25°), monthly mean data from 1958 to the present.

Variables include: velocity (meridional and zonal), wind stress, mixed-layer depth, net heat flux, sea surface temperature, potential temperature, sea surface salinity, salinity, sea ice, sea surface height, and more.

The dataset is large and takes a while to download for individual users. It would be useful to have it downloaded once and stored in the LEAP Data Library, rather than each user downloading subsets and keeping them in personal storage.

Size

3D datasets (e.g., meridional velocity) are larger, roughly 10 GB or more; single-level datasets (e.g., SST) are under 200 MB.

License

https://cds.climate.copernicus.eu/api/v2/terms/static/licence-to-use-copernicus-products.pdf

Data Format

NetCDF

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

There is one file per month, i.e., one timestep per file. One year of data comprises 12 NetCDF files that can be concatenated along the time dimension (see the sketch after the example request below).

Files are downloaded via API request. Documentation on how to download is available here: https://cds.climate.copernicus.eu/api-how-to

An example API request for meridional velocity data for 2009-2011 (all months) is shown below.

Example URLs

import cdsapi

# Credentials are read from ~/.cdsapirc (see Authorization below).
c = cdsapi.Client()

c.retrieve(
    'reanalysis-oras5',
    {
        # Monthly NetCDF files are delivered bundled in a zip archive.
        'format': 'zip',
        # Full 3D fields on all depth levels, not just the surface.
        'vertical_resolution': 'all_levels',
        # 'consolidated' is the historical stream; recent years are 'operational'.
        'product_type': 'consolidated',
        'variable': 'meridional_velocity',
        'year': [
            '2009', '2010', '2011',
        ],
        'month': [
            '01', '02', '03',
            '04', '05', '06',
            '07', '08', '09',
            '10', '11', '12',
        ],
    },
    # Local target path for the downloaded archive.
    'example_meridional_download.zip')
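
For illustration, a minimal sketch of unzipping the archive and concatenating the monthly files with xarray, as mentioned under Source File Organization (the extraction directory and file glob are assumptions):

import zipfile

import xarray as xr

# Extract the monthly NetCDF files from the downloaded archive.
with zipfile.ZipFile("example_meridional_download.zip") as zf:
    zf.extractall("oras5_vo")

# Each file holds one monthly timestep; open them all lazily and
# concatenate along the time dimension.
ds = xr.open_mfdataset("oras5_vo/*.nc", combine="by_coords")
print(ds)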

Authorization

API Token
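
Per the api-how-to page linked above, the cdsapi client reads the token from a ~/.cdsapirc file in the user's home directory; a minimal example, with placeholder credentials:

url: https://cds.climate.copernicus.eu/api/v2
key: <UID>:<API-KEY>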

Transformation / Processing

Since each file holds a single timestep (one month), it would be useful to combine the files into at least yearly chunks, or ideally the whole 1958-present range.
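
As a sketch of that combination step, assuming the monthly files have been downloaded and extracted locally (the paths and chunk sizes are illustrative, not tested against the real files):

import xarray as xr

# Lazily open every monthly file and concatenate along time.
ds = xr.open_mfdataset("oras5_vo/*.nc", combine="by_coords")

# Rechunk so each chunk spans a year of timesteps rather than one month.
ds = ds.chunk({"time": 12})

# Write a single consolidated Zarr store for the full record.
ds.to_zarr("oras5_meridional_velocity.zarr", mode="w", consolidated=True)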

The files are also at 0.25° x 0.25° resolution, so regridding to 1° x 1° could be useful.
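
A rough sketch of that regridding via block averaging is below; note that it assumes a regular lat/lon grid and these dimension names, whereas ORAS5's native ORCA grid is curvilinear, for which a dedicated regridder such as xESMF would be more appropriate:

import xarray as xr

ds = xr.open_dataset("oras5_vo/vo_2009_01.nc")  # hypothetical file name

# 0.25 deg -> 1 deg: average each 4 x 4 block of grid cells.
# boundary="trim" drops incomplete blocks at the grid edges.
ds_1deg = ds.coarsen(lat=4, lon=4, boundary="trim").mean()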

Target Format

Zarr

Comments

No response

cisaacstern commented 1 year ago

Thanks for the request and detailed explanation, @sckw. Since the data is from ECMWF, I think we should probably use weather-dl to cache it before transforming to Zarr with Pangeo Forge.

@alxmrs may be able to advise. Alex, I see weather-dl is documented as a CLI, but is there an opportunity for using its objects directly in a Pangeo Forge pipeline; e.g., in very coarse pseudocode (with the weather-dl bits referenced from here):

import apache_beam as beam

from pangeo_forge_recipes.transforms import StoreToZarr
from weather_dl.fetcher import Fetcher

recipe = (
    beam.Create(...)
    # the weather-dl part
    | 'GroupBy Request Limits' >> beam.GroupBy(...)
    | 'Fetch Data' >> beam.ParDo(Fetcher(...))
    # some tbd adapter
    | SomeAdapterTransform()
    # the Pangeo Forge part
    | StoreToZarr()
)

?

alxmrs commented 1 year ago

I will take a deeper look at how PGF (Pangeo Forge) could use weather-tools (specifically weather-dl) as a library this Tuesday. I have a few ideas and words of caution.

I want to make sure that @mahrsee1997 has seen this, as he is now our weather-dl expert.

Some initial thoughts:

- Would you be open to a semi-Beam or non-Beam based solution? With weather-dl 1.5 or 2, we've found more stability in downloading and higher utilization of ECMWF licenses.

- I think this could integrate well into PGF in other ways; for example, by having Zarr conversion react to raw data appearing in a bucket (more in line with the streaming stuff we've been talking about).
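
To make that concrete, a very rough sketch of the reactive pattern follows; this is not an existing weather-tools or Pangeo Forge API, and the bucket paths, polling loop, and append logic are all assumptions:

import time

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()
seen = set()

while True:
    # Poll the raw bucket for newly arrived NetCDF files.
    for path in fs.glob("leap-raw-bucket/oras5/*.nc"):
        if path in seen:
            continue
        # h5netcdf can read from the file-like object gcsfs returns.
        ds = xr.open_dataset(fs.open(path), engine="h5netcdf")
        # Append the new timestep to a Zarr store that already exists
        # from an initial bulk write.
        ds.to_zarr("gs://leap-persistent/oras5.zarr", mode="a", append_dim="time")
        seen.add(path)
    time.sleep(60)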

cisaacstern commented 1 year ago

> I think this could integrate well into PGF in other ways; for example, by having Zarr conversion react to raw data appearing in a bucket (more in line with the streaming stuff we've been talking about).

I love this idea and discussed it with @rabernat, who agreed it's a great direction for us to take generally. Opened https://github.com/pangeo-forge/pangeo-forge-recipes/issues/598 to discuss details. Thanks for summarizing the pros/cons of using weather-dl transforms directly in Pangeo Forge. That seems sufficiently difficult, and the streaming alternative sufficiently elegant, that I think we should focus 100% on the streaming option. 😄