rsignell-usgs opened 3 years ago
Hi @rsignell-usgs ,
Short answer: thanks so much for those notebooks with examples, they will help a lot.
Longer answer:
Your workflow is definitely what I was evaluating for my data as well. I have a few questions: is `cfgrib` able to read files directly from the bucket? I am not sure what the following line does:

```python
tmp_file = fsspec.open_local(f'simplecache::s3://{flist[0]}',
                             s3=dict(anon=True), simplecache={'cache_storage': '/tmp'})
```
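For anyone else puzzling over that line: with the `simplecache::` prefix, `fsspec.open_local` pulls the remote object down into a local cache directory and returns a plain local path, which non-fsspec-aware readers (like cfgrib) can then open. A minimal sketch of the same mechanism, using an in-memory filesystem instead of S3 so it runs without credentials (the file name and payload are made up):

```python
import tempfile
import fsspec

# Stage a fake "remote" file in fsspec's in-memory filesystem.
with fsspec.open("memory://demo.grib2", "wb") as f:
    f.write(b"GRIB fake payload")

# simplecache:: copies the object into cache_storage and hands back
# a normal local path instead of a file-like object.
cache_dir = tempfile.mkdtemp()
local_path = fsspec.open_local(
    "simplecache::memory://demo.grib2",
    simplecache={"cache_storage": cache_dir},
)
print(local_path)                      # a real path inside cache_dir
print(open(local_path, "rb").read())
```

The S3 version in the notebook works the same way: `s3=dict(anon=True)` is forwarded to the `s3://` layer of the chained URL, and `simplecache={...}` to the caching layer.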
However, I have had problems using `cfgrib` with some of my files, even when downloaded locally; maybe with some guidance I can push them through that engine. It would be good to test both engines on the grib files, since `pynio` can give a very different read of the files compared to `cfgrib`.

In fact, many of my variables have an ad-hoc time averaging/accumulation (see here), which `pynio` tries to encode in some way (and I think that is the root cause of `cfgrib` not liking my data) by splitting the data into two time series of accumulated/averaged data across the two different interval lengths (3-hourly and 6-hourly).
These differences in the forecast time window make a workflow like download -> convert to NetCDF -> convert to zarr problematic in my case. When I simply download -> convert to NetCDF, `wgrib2` actually just outputs the data as one time series. I tried to explore the grib file with `wgrib2` and dig out the long name, raw variable name, etc., but the two sets of data are identical in that respect; the only difference is in the `:0-3 hour acc fcst:` field. `pynio` actually does a great job of figuring out that they are different. It would be interesting to see how/if `cfgrib` can handle it in some way after adding some `backend_kwargs`... any input?
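One possibility worth trying: cfgrib can select subsets of a heterogeneous file via `backend_kwargs["filter_by_keys"]`, so the two accumulation windows might be openable as two separate datasets. A hedged sketch, where the key names (`stepType`, `stepRange`), their values, and the file name are assumptions to be checked against the actual file (e.g. with `grib_ls` or `wgrib2`) first:

```python
def accum_filter(step_range: str) -> dict:
    """backend_kwargs selecting one accumulation window for cfgrib."""
    return {"filter_by_keys": {"stepType": "accum", "stepRange": step_range}}

# Assuming xarray + cfgrib are installed and myfile.grib2 exists:
# import xarray as xr
# ds_3h = xr.open_dataset("myfile.grib2", engine="cfgrib",
#                         backend_kwargs=accum_filter("0-3"))
# ds_6h = xr.open_dataset("myfile.grib2", engine="cfgrib",
#                         backend_kwargs=accum_filter("0-6"))
print(accum_filter("0-3"))
```

If the two series differ only in the `:0-3 hour acc fcst:` style step range, a filter along these lines is exactly the kind of split cfgrib is designed to express.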
Moreover, the differences in forecast intervals would be obvious in accumulated values, because you get large jumps in the values, but more subtle when you have an average (a 3-hourly avg is not so different from a 6-hourly avg). I guess this is a cautionary tale, for a beginner with grib files like me, to always use `wgrib2` to explore them first...
So I have to figure out what to do with my data: leave the two time series separate, or do some manipulation to make them all 3-hourly. The idea so far is to leave them as they are, as separate time series, and just add attributes or time bounds to the zarr files.
See: https://github.com/blaylockbk/HRRR_archive_download/issues/2#issuecomment-763713889
@rabernat , I hope the info at the above URL suffices to describe the workflow.
@chiaral, note that this workflow starts with GRIB2 files as in #17.
After downloading the GRIB2 files, I convert them to NetCDF using `wgrib2` (installed from conda-forge) before using rechunker. (I first tried using cfgrib, but it was very slow, so I went looking for faster solutions.)
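For reference, the conversion step can be sketched as a small wrapper around the `wgrib2 ... -netcdf out.nc` invocation. The file names below are hypothetical, and the command only runs if `wgrib2` is actually on the PATH:

```python
import shutil
import subprocess

def grib2_to_netcdf(grib_path: str, nc_path: str) -> list:
    """Build the wgrib2 conversion command; run it if wgrib2 is installed."""
    cmd = ["wgrib2", grib_path, "-netcdf", nc_path]
    if shutil.which("wgrib2"):          # skip execution when the tool is absent
        subprocess.run(cmd, check=True)
    return cmd

print(grib2_to_netcdf("hrrr.t00z.wrfsfcf01.grib2", "hrrr.t00z.wrfsfcf01.nc"))
```

The resulting NetCDF files then go through rechunker to produce zarr with the desired chunking.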