pangeo-forge / staged-recipes

A place to submit pangeo-forge recipes before they become fully fledged pangeo-forge feedstocks
https://pangeo-forge.readthedocs.io/en/latest/
Apache License 2.0

Example pipeline for the High Resolution Rapid Refresh (HRRR) model #18

Open · rsignell-usgs opened this issue 3 years ago

rsignell-usgs commented 3 years ago

See: https://github.com/blaylockbk/HRRR_archive_download/issues/2#issuecomment-763713889

@rabernat, I hope the info at the above URL suffices to describe the workflow.

@chiaral, note that this workflow starts with GRIB2 files as in #17.

After downloading the GRIB2 files, I convert them to NetCDF using wgrib2 (installed from conda-forge) before using rechunker. (I first tried cfgrib, but it was very slow, so I went looking for faster alternatives.)
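For anyone who wants to try this without opening the notebooks, here is a rough sketch of that download -> wgrib2 -> rechunker pipeline. The S3 key, variable name, dimension names, and chunk sizes below are placeholders, and it assumes wgrib2 from conda-forge is on the PATH; see the linked notebooks for the real workflow.

```python
import subprocess

import fsspec
import xarray as xr
from rechunker import rechunk

# Placeholder HRRR GRIB2 object on AWS (any surface forecast file would do).
url = "s3://noaa-hrrr-bdp-pds/hrrr.20210101/conus/hrrr.t00z.wrfsfcf01.grib2"

# Cache the GRIB2 file locally so wgrib2 can read it.
local_grib = fsspec.open_local(f"simplecache::{url}",
                               s3=dict(anon=True),
                               simplecache={"cache_storage": "/tmp"})

# Convert GRIB2 -> NetCDF with wgrib2 (much faster than cfgrib was for these files).
local_nc = "/tmp/hrrr.nc"
subprocess.run(["wgrib2", local_grib, "-netcdf", local_nc], check=True)

# Rechunk into a Zarr store with rechunker. Every variable and coordinate in the
# dataset needs an entry in target_chunks (None means leave that array as-is);
# the names and sizes here are purely illustrative.
ds = xr.open_dataset(local_nc, chunks={})
target_chunks = {
    "TMP_2maboveground": {"time": 1, "y": 512, "x": 512},  # hypothetical variable
    "time": None,
    "y": None,
    "x": None,
}
plan = rechunk(ds, target_chunks, max_mem="2GB",
               target_store="hrrr_target.zarr",
               temp_store="hrrr_temp.zarr")
plan.execute()
```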

chiaral commented 3 years ago

Hi @rsignell-usgs ,

Short answer: thanks so much for those notebooks with examples, they will help a lot.

Longer answer: your workflow is definitely what I was evaluating for my data as well. I have a few questions: is cfgrib able to read files directly from the bucket? I am not sure what the following line does:

tmp_file = fsspec.open_local(f'simplecache::s3://{flist[0]}', 
                              s3=dict(anon=True), simplecache={'cache_storage': '/tmp'})

However, I have had problems using cfgrib with some of my files, even when downloaded locally; maybe with some guidance I can push them through that engine. It would be good to test both engines on the GRIB files, since pynio can give a very different read of the files compared to cfgrib.
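As a concrete starting point for that comparison, a minimal sketch (the path below is a hypothetical local copy of one of my files, and it assumes both the cfgrib and PyNIO backends are installed):

```python
import xarray as xr

grib_path = "my_forecast.grib2"  # hypothetical local copy of one file

# Open the same file with both engines and compare what each one sees.
# cfgrib may refuse mixed files outright, which is part of the problem.
ds_pynio = xr.open_dataset(grib_path, engine="pynio")
ds_cfgrib = xr.open_dataset(grib_path, engine="cfgrib")

print(sorted(ds_pynio.data_vars))
print(sorted(ds_cfgrib.data_vars))
```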

In fact, many of my variables have ad-hoc time averaging/accumulation (see here), which pynio encodes by splitting the data into two time series of accumulated/averaged values across the two different interval lengths (3-hourly and 6-hourly); I think that is also the root cause of cfgrib not liking my data. These differences in the forecast time window make a workflow like download -> convert to NetCDF -> convert to zarr problematic in my case. When I simply download -> convert to NetCDF, wgrib2 just outputs the data as one time series. I tried to explore the GRIB file with wgrib2 and dig out the long name, raw variable name, etc., but the two sets of data are identical in those respects; the only difference is in the :0-3 hour acc fcst: field. pynio actually does a great job of figuring out that they are different. I would be curious to see how/if cfgrib can handle it in some way, after adding some backend_kwargs... any input?
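For what it's worth, one thing I want to try (just a sketch, not tested on my files): cfgrib's filter_by_keys backend option selects GRIB messages by ecCodes keys, so filtering on stepRange (or stepType) might load the 0-3 hour and 0-6 hour accumulations as two separate datasets. The path and key values here are illustrative only.

```python
import xarray as xr

grib_path = "my_forecast.grib2"  # hypothetical local file

# Select only the messages whose accumulation window is 0-3 hours ...
ds_3h = xr.open_dataset(
    grib_path, engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"stepRange": "0-3"}},
)

# ... and, separately, the 0-6 hour accumulations.
ds_6h = xr.open_dataset(
    grib_path, engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"stepRange": "0-6"}},
)
```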

Moreover, the differences in forecast intervals would be obvious in accumulated values - because you have large jumps in the values - but more subtle when you have an average (i.e. a 3-hourly average is not so different from a 6-hourly average). I guess this is a cautionary tale, for me as a beginner with GRIB files, to always use wgrib2 to explore them...

So I have to figure out what to do with my data - leave the two time series separated, or do some manipulation to make them all 3-hourly. The idea so far is to leave them as they are - separate time series - and just add attributes or time bounds to the zarr files.
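If I do go the "leave them separated" route, a rough sketch of what that bookkeeping might look like (the file names, variable layout, and the assumption that each timestamp marks the end of its accumulation window are all made up for illustration):

```python
import numpy as np
import xarray as xr

# Assume the two accumulation streams have already been split into two files.
ds_3h = xr.open_dataset("accum_3h.nc")   # hypothetical
ds_6h = xr.open_dataset("accum_6h.nc")   # hypothetical

for ds, hours, path in [(ds_3h, 3, "accum_3h.zarr"), (ds_6h, 6, "accum_6h.zarr")]:
    # Record the accumulation window as CF-style time bounds plus a global
    # attribute, assuming each timestamp marks the *end* of its window.
    start = ds["time"].values - np.timedelta64(hours, "h")
    bounds = np.stack([start, ds["time"].values], axis=1)
    ds = ds.assign_coords(time_bounds=(("time", "bnds"), bounds))
    ds["time"].attrs["bounds"] = "time_bounds"
    ds.attrs["accumulation_interval"] = f"{hours} hours"
    ds.to_zarr(path, mode="w")
```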