pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Example pipeline for GFS Archive #50

raybellwaves opened this issue 3 years ago

raybellwaves commented 3 years ago

Source Dataset

import xarray as xr

# Open the RDA THREDDS OPeNDAP endpoint directly
# (an RDA account/session may be required for access).
url = "https://rda.ucar.edu/thredds/dodsC/files/g/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
ds = xr.open_dataset(url)

or

import cfgrib
import requests

# EMAIL and PASSWD are placeholders for your RDA account credentials.
login_url = "https://rda.ucar.edu/cgi-bin/login"
ret = requests.post(
    login_url,
    data={"email": EMAIL, "passwd": PASSWD, "action": "login"},
)
file = "https://rda.ucar.edu/data/ds084.1/2020/20200201/gfs.0p25.2020020100.f000.grib2"
req = requests.get(file, cookies=ret.cookies, allow_redirects=True)
with open("gfs.0p25.2020020100.f000.grib2", "wb") as f:
    f.write(req.content)
# A single grib2 file can hold variables on several hypercubes,
# so cfgrib returns a list of datasets.
dss = cfgrib.open_datasets("gfs.0p25.2020020100.f000.grib2")

or

import cfgrib
import s3fs

# The NOAA GFS bucket on AWS is public, so anonymous access works.
fs = s3fs.S3FileSystem(anon=True)
fs.get(
    "s3://noaa-gfs-bdp-pds/gfs.20210914/12/atmos/gfs.t12z.pgrb2.0p25.f000",
    "gfs.0p25.2021091412.f000.grib2",
)
dss = cfgrib.open_datasets("gfs.0p25.2021091412.f000.grib2")

Transformation / Alignment / Merging

Concat along reftime (init time) and time

Output Dataset

zarr store.

I imagine one giant zarr store would be unwieldy, so the data could be stored as one store per init time, containing all forecast times. Ideally init time would be an expandable dimension so the stores can be concatenated later.
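A hedged sketch of the inputs this layout implies: for each init time, a set of per-lead-time grib2 files to be concatenated. The URL template is taken from the rda.ucar.edu example above; the helper name is hypothetical.

```python
from datetime import datetime

# Template matching the ds084.1 paths shown earlier.
URL_TEMPLATE = (
    "https://rda.ucar.edu/data/ds084.1/"
    "{init:%Y}/{init:%Y%m%d}/gfs.0p25.{init:%Y%m%d%H}.f{fhr:03d}.grib2"
)

def gfs_urls(init: datetime, forecast_hours):
    """Return the grib2 URLs for one init time and a list of lead hours."""
    return [URL_TEMPLATE.format(init=init, fhr=h) for h in forecast_hours]

urls = gfs_urls(datetime(2020, 2, 1, 0), [0, 3, 6])
```

Each list would become one zarr store; the stores share the same lead-time axis, so they can later be stacked along init time.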

cisaacstern commented 3 years ago

@raybellwaves thanks for opening this issue, and apologies for the delay in responding. Just tagging a few others who may be more familiar with grib-specific considerations.

Does anyone of @rabernat, @TomAugspurger, or @martindurant know if we can handle .grib2 inputs at this time?

martindurant commented 3 years ago

I don't see why not - xarray can load them, so long as they are cached on a local filesystem.


rabernat commented 3 years ago

Noting the similarity to #17 and #18.

We should have no problem with grib, as long as xarray can open the files (which the example code above already illustrates). For this to work, you will need to set copy_input_to_local_file=True in XarrayZarrRecipe.

cisaacstern commented 3 years ago

@raybellwaves, looks like we're good to go. :smile:

Are you interested in learning how to develop recipes yourself? If so, I'd be delighted, and will be happy to guide you through the process. Like conda-forge, the strength of this initiative will ultimately grow from the community of recipe developers who've learned to use these tools, and it would be great to have you onboard.

The first step would be for you to make a PR to this repo, which contains a new draft recipe under recipes/NCEP_GFS/ncep_gfs_recipe.py. This module will instantiate a pangeo_forge_recipes.recipes.XarrayZarrRecipe object, and can be your best guess of how to approach that based on the docs here. Once you've pushed a first commit, I can jump in and start making suggestions and/or commits to your PR.

adair-kovac commented 2 years ago

Is writing a separate zarr store for every init time a good idea? I've been struggling a lot with how to build timeseries from the hrrrzarr data, which was written that way. Opening thousands of hours in a loop can take hours and isn't nearly as simple or efficient to parallelize as it would be if time were just a dimension in the dataset from the get-go.

That said, that dataset isn't optimized for xarray: because of the way it's written as a zarr hierarchy, the .zmetadata isn't visible to xarray, so I can't use the consolidated option. But I noticed that both #17 and #18 are making init time a dimension rather than creating separate stores (IIUC). How would you decide between the two approaches?

rabernat commented 2 years ago

Is writing a separate zarr store for every init time a good idea?

I don't think so. I think we want init_time and lead_time both as dimensions. In order for this to work, we need to resolve https://github.com/pangeo-forge/pangeo-forge-recipes/issues/140.

adair-kovac commented 2 years ago

@rabernat Is there any concern with xarray's handling of the time dimension for continuously-updating data sets? I assume the GFS (like the HRRR and GEFS) produces new model runs frequently. Some of my colleagues have been avoiding creating a time dimension in these situations because of cases where it's been painful but it's not clear to me if any of those apply to situations like this. Does the .zmetadata get updated efficiently when you just append data?

Also do we actually need 140 for this one? Shouldn't you be able to just do it in stages, look at a single init_time and concat over lead_time, then concat the result over init_time? Or do recipes have to be 1 stage?
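The staged approach described above can be sketched with plain xarray, using synthetic datasets in place of decoded grib2 files (the fake_gfs helper is hypothetical): concat over lead_time within each init, then concat the results over init_time. xr.concat promotes the scalar coordinates into dimensions.

```python
import numpy as np
import pandas as pd
import xarray as xr

def fake_gfs(init, lead_hours):
    """Stand-in for a dataset decoded from one grib2 file."""
    return xr.Dataset(
        {"t2m": ("x", np.random.rand(4))},
        coords={
            "init_time": init,
            "lead_time": pd.Timedelta(hours=lead_hours),
        },
    )

inits = pd.to_datetime(["2020-02-01 00:00", "2020-02-01 06:00"])
leads = [0, 3, 6]

# Stage 1: for each init time, concat over lead_time.
per_init = [
    xr.concat([fake_gfs(i, h) for h in leads], dim="lead_time")
    for i in inits
]

# Stage 2: concat the per-init datasets over init_time.
ds = xr.concat(per_init, dim="init_time")
```

Whether a recipe can express this two-stage merge is exactly what the referenced issue 140 is about.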

raybellwaves commented 2 years ago

Good question. I imagine there are open questions regarding one giant zarr store versus smaller zarr stores which can be concatenated, and it may be use-case driven. There are probably lessons to be learnt from tabular (Parquet) data, which can also be stored as separate files or appended (row-wise, or as a new row-group partition, e.g. partitioned on reftime). A step beyond what people do with tabular data would be streaming data. I imagine once you get it out of grib and into a zarr store of some kind, you can iterate through these questions more quickly.

martindurant commented 2 years ago

Quick note that the "reference" views I have been working with could provide both, without having to copy or reorganise the data. It can be used to produce a single logical zarr over many zarr datasets.

adair-kovac commented 2 years ago

@martindurant Where would I get started if I wanted to try that out?