pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Proposed Recipes for ERA5 #92

Open jhamman opened 3 years ago

jhamman commented 3 years ago

There are currently a few subsets of the ERA5 dataset on cloud storage (example), but none are complete or updated regularly. It won't be a trivial recipe to implement with Pangeo-Forge, but it would be a good stretch goal to support such a dataset.

Source Dataset

Transformation / Alignment / Merging

Most likely, the best way to access and arrange the data is in 1-day chunks, concatenating along the time dimension. Given the large user pool for this dataset, I would suggest this recipe does as little data processing as possible.
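For a sense of scale, here is a back-of-envelope sketch of what a 1-day, hourly chunk would weigh. The 0.25° grid shape, pressure-level count, and dtype are my assumptions for illustration, not figures from this thread:

```python
# Rough size of a 1-day, hourly chunk of one ERA5 variable at 0.25 deg.
# Grid shape (721 x 1440), 37 pressure levels, and float32 are assumptions.

hours_per_day = 24
nlat, nlon = 721, 1440           # 0.25 deg global grid, poles included
nlev = 37                        # CDS pressure-level count (assumption)
bytes_per_value = 4              # float32

surface_chunk = hours_per_day * nlat * nlon * bytes_per_value
pressure_chunk = surface_chunk * nlev

print(f"1-day surface chunk: {surface_chunk / 2**20:.0f} MiB")
print(f"1-day 3D chunk (37 levels): {pressure_chunk / 2**30:.1f} GiB")
```

So a 1-day chunk of a single surface variable is on the order of 100 MiB, and a full 3D variable a few GiB, which is a reasonable per-chunk granularity for object storage.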

Output Dataset

One (or more?) Zarr stores. Hourly data for all available variables, all pressure levels, etc.

rabernat commented 2 years ago

I was just talking about this with @spencerahill. We also had a meeting with ECMWF about this last spring.

It is a really big job.

spencerahill commented 2 years ago

Yes. Context: a 3-yr NSF CLD grant starting hopefully within the next month or two, 6 mo / yr of my time, and hopefully a graduate student next year. We're doing wavenumber-frequency spectral analysis of energy transports in low latitudes using ERA5. So that requires 6-hourly or higher resolution, up to a dozen or so vertically defined variables. Many TBs.
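To put a rough number on "many TBs", here is an illustrative uncompressed estimate for that use case; the grid shape, level count, and dtype are assumptions on my part:

```python
# Rough volume estimate: 6-hourly, ~12 pressure-level variables at
# 0.25 deg, float32, uncompressed. All counts are illustrative assumptions.

timesteps_per_year = 365 * 4     # 6-hourly
nvar, nlev = 12, 37
nlat, nlon = 721, 1440
bytes_per_value = 4              # float32

per_year = (timesteps_per_year * nvar * nlev
            * nlat * nlon * bytes_per_value)
print(f"~{per_year / 1e12:.1f} TB per year of record")
```

At a few TB per year, several decades of record lands comfortably in the "many TBs" (approaching hundreds of TB) range.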

My default plan was to just use the CDS API to download it to our local cluster, but talking w/ Ryan, it sounds like this could plug into Pangeo efforts nicely! Which would be fun for me, having mostly been watching from the sidelines for a few years now :)

jhamman commented 2 years ago

It is a really big job.

It is! But this is the crew to do it!

We're in a similar position to you, @spencerahill. We are currently using the AWS public dataset, but we're about to outgrow the offerings there. We could pull data from the CDS API to our own cloud bucket, but that runs counter to the mission here.

We also had a meeting with ECMWF about this last spring.

@rabernat - Any takeaways you can share?

rabernat commented 2 years ago

Here are some notes from Baudouin Raoult after our meeting last April:

I would still like another virtual meeting with you, during which we present to you our data handling system, because I don’t think we have yet a common understanding of the work to be done, and discussing it in writing would be very inefficient. Below is a short list of things that come to mind immediately:

  • The MARS system does not handle files, but provides direct access to 2D fields using a hypercube-based index (https://confluence.ecmwf.int/download/attachments/45754389/server_architecture.pdf?api=v2, a paper from 1996, before “data cubes” were popular 😊). The whole archive can be seen as a single 12-dimensional dataset of 300 PB / 3e+12 fields. A 2D field is the smallest accessible object (think chunk), and users can access the archive along any dimensions (dates, parameters, levels, …). A lot of the data is on tape, so one cannot just scan a directory full of files. Any recipe will have to access MARS.
  • Most of the upper-air fields are in spherical harmonics, and the few upper-air grid-point fields are on a reduced Gaussian grid. All surface fields are also on the reduced Gaussian grid. These are not supported by CF/Xarray. The data needs to be interpolated to a regular lat/lon grid. This requires a non-trivial amount of time and resources.
  • We still have to address the issue of the two dimensions of time of forecast fields (although this is not so bad with ERA5). Same issue for all accumulations, fluxes and other maxima (e.g. the ‘cell-methods’). That discussion has been going on for a couple of decades (see https://www.ecmwf.int/sites/default/files/elibrary/2014/13706-grib-netcdf-setting-scene.pdf, some slides I did in 2011 that illustrate most of the points I am making here)
  • We need to decide how to organise the dataset, not all variables having the same time frequency (although I think this may be OK with ERA5), and how much metadata we are ready to provide.
  • The full ERA5 is around 5 PB using 16 bits per value. The size may explode if we are not careful with the packing/compression of the target. There will be a lot of trial and error at the beginning, and this is something we cannot afford to do much of, considering the volumes involved.

Having done similar exercises in the past (such as delivering previous reanalyses to NCAR and other institutes), I think the “recipe” will have to run for many months to extract, interpolate, reformat, transfer and store the data. This needs to be done in parallel, with some checkpoint/restart facility, as one cannot run a single process for months (planned system sessions, network issues, disk issues, etc.). Furthermore, the data transfers will certainly be interrupted when our data centre moves from the UK to Italy later this year.

So before we start writing some code, I would like us to have a clear definition of the scope of that project. We also need to decide who is going to be part of that activity and where the heavy lifting is happening.

That was sufficiently intimidating that we tabled the discussion for the time being. Now that Pangeo Forge is farther along, and we have people who are interested in working on it, I think we can pick it up again.
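As a quick illustration of the packing/compression point in the notes above, here is how the quoted ~5 PB / 16-bit figure scales with other packing choices. The compression ratio below is a made-up illustrative number, not a measurement:

```python
# Scale the quoted "5 PB at 16 bits per value" baseline to other
# packing schemes. The baseline is from the notes above; everything
# else here is illustrative.

BASELINE_PB = 5.0      # full ERA5 at 16 bits per value (quoted figure)
BASELINE_BITS = 16

def size_pb(bits_per_value, compression_ratio=1.0):
    """Estimate total size for a given packing and compression choice."""
    return BASELINE_PB * (bits_per_value / BASELINE_BITS) / compression_ratio

print(size_pb(32))                         # naive float32: doubles to 10 PB
print(size_pb(32, compression_ratio=2.0))  # float32 + 2x compression: back to 5 PB
```

This is why the choice of dtype and compressor in the target Zarr store matters so much: a careless float32 conversion alone doubles the footprint relative to the packed GRIB source.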

cisaacstern commented 2 years ago

Exciting! A few questions/comments:

rabernat commented 2 years ago

To clarify, what Baudouin was proposing was that we go around the CDS API and talk directly to MARS, their internal archival system.

DCSCHUS commented 2 years ago

As a heads up, there is a full copy of the Copernicus 0.25 degree version of the ERA-5 Reanalysis dataset maintained at NCAR under the following dataset collections, which are preserved as time-series NetCDF-4 files: https://doi.org/10.5065/YBW7-YG52 https://doi.org/10.5065/BH6N-5N20

These may be easier to access and stage to public cloud resources unless you need the raw spherical harmonics and reduced Gaussian grids at model-level resolution, which are only available through ECMWF MARS. You can also access the NCAR-maintained datasets by direct read from NCAR HPC systems as an NSF-funded researcher. https://www2.cisl.ucar.edu/user-support/allocations/university-allocations

alxmrs commented 2 years ago

Cross-reference: #22

spencerahill commented 2 years ago

Question (which may reveal just how little I've worked with and understand the cloud): would it be useful to this effort to have some nontrivial chunk of the ERA5 data downloaded to the cluster at Columbia (berimbau) we're using for our project, to subsequently be uploaded to the cloud? My big concern w/r/t tying my project's science tightly to this pangeo-ERA5 effort is our project's science potentially getting held up, maybe in a big way, if there end up being unforeseen delays etc. in getting the data onto the cloud. Whereas I already have a functional pipeline for downloading the ERA5 data I'll need directly to that cluster via the CDS API, as well as the computational power I'll need at least for the preliminary analyses.

So, in this scenario, I'd start downloading the data I need basically right away to our cluster, and then once on the pangeo side things are ready I could upload from our cluster to the cloud. The upside for pangeo of this direct transfer from us would be no waiting on the CDS system queue etc.

Thoughts?
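For context on what that per-chunk download looks like, a CDS API request for a single day of ERA5 pressure-level data has roughly this shape. The variable and level choices below are placeholders, and the actual `retrieve()` call is left commented out since it requires CDS credentials and network access:

```python
# Illustrative shape of a CDS API request for one day of ERA5
# pressure-level data. Variables and levels are placeholder choices.

request = {
    "product_type": "reanalysis",
    "format": "netcdf",
    "variable": ["temperature", "u_component_of_wind"],
    "pressure_level": ["500", "850"],
    "year": "1979",
    "month": "01",
    "day": "01",
    "time": [f"{h:02d}:00" for h in range(0, 24, 6)],  # 6-hourly
}

# Requires a ~/.cdsapirc with CDS credentials:
# import cdsapi
# cdsapi.Client().retrieve(
#     "reanalysis-era5-pressure-levels", request, "era5_19790101.nc"
# )
```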

rabernat commented 2 years ago

would it be useful to this effort to have some nontrivial chunk of the ERA5 data downloaded to the cluster at Columbia (berimbau) we're using for our project, to subsequently be uploaded to the cloud?

Short answer: no, it would not be particularly useful for you to manually download data and store it on your cluster. That is sort of the status quo that we are trying to escape with Pangeo Forge. The problems with that workflow are

The goal with Pangeo Forge is to develop fully automated, reproducible pipelines for building datasets.
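An automated pipeline at this scale also needs the checkpoint/restart behavior Baudouin described above, so a killed job can resume without redoing finished work. Here is a minimal sketch of the idea with hypothetical names; a real recipe would track per-chunk state in its orchestration layer rather than a local JSON file:

```python
# Minimal checkpoint/restart sketch: persist each completed unit of
# work (e.g. one day) so an interrupted run can resume where it left off.
# All names here are hypothetical, for illustration only.

import json
import os

CHECKPOINT = "done_days.json"

def load_done():
    """Read the set of already-completed days, if any."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def mark_done(done, day):
    """Record one more completed day on disk."""
    done.add(day)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def run(days, process_one_day):
    """Process each day exactly once across restarts."""
    done = load_done()
    for day in days:
        if day in done:
            continue             # already transferred; skip on restart
        process_one_day(day)     # extract / interpolate / upload one day
        mark_done(done, day)     # persist progress only after success
```

The key property is idempotence: rerunning `run()` after a crash skips every day that was checkpointed and only redoes the interrupted one.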

However, I recognize the tension here: you want to move forward with your science, and you need ERA5 data to do so. You can't wait a year for us to sort out all of these issues.

Here is an intermediate plan that I think might balance the two objectives. You should write a Pangeo Forge recipe to build the dataset you need on berimbau. The process of writing this recipe will be very useful for the broader effort. Note that this won't be possible until https://github.com/pangeo-forge/pangeo-forge-recipes/pull/245 is done, since that will be required to get data from the CDS API.

@alxmrs has also been working on an "ERA5 downloader" tool which could be very useful here. Alex, is that released yet?

So a possible order of operations would be:

spencerahill commented 2 years ago

However, I recognize the tension here: you want to move forward with your science, and you need ERA5 data to do so. You can't wait a year for us to sort out all of these issues.

Exactly.

Thanks @rabernat. That all makes sense. I'm subscribed to the relevant pangeo repos now and in particular will keep an eye on https://github.com/pangeo-forge/pangeo-forge-recipes/pull/245.

And going through the docs/tutorials sounds like a great task when I'm procrastinating on a review / revision / etc. in the near-ish term future.

cisaacstern commented 2 years ago

@spencerahill, once you've started on your recipe, please feel free to @ me in a comment here with any questions. The documentation is far from exhaustive, so don't be discouraged if there's something that doesn't make sense. I'll make sure any questions you have get answered, and we can use any insights we gain to improve the official docs.

spencerahill commented 2 years ago

Excellent, thanks! Also, IIRC you are at least sometimes in person at Lamont(?) If so, it would be fun to meet + chat in person too.

alxmrs commented 2 years ago

@alxmrs has also been working on a "ERA5 downloader" tool which could be very useful here.. Alex is that released yet?

Hey Ryan! The release is in progress – I have just submitted the codebase for internal approval. I'm not sure about the ETA, since we are in late December. Usually, this last part of the process takes ~1 week.

As soon as it's public, I will post about it here.

cisaacstern commented 2 years ago

at least sometimes in-person at Lamont(?)

Sadly I'm rarely there as I work out of a home office in California. Even if you sail through the recipe development process without any issues, I'd love to set aside some time to catch up over video either way. 😊

alxmrs commented 2 years ago

I'm happy to announce that the aforementioned tools to help download ERA5 data are public, as of today! Please check out weather-dl at https://github.com/google/weather-tools.

@spencerahill @jhamman I'm happy to answer any questions you have along the way.

CC: @rabernat