@amsnyder ,
I wrote the what-does-chunking-do tutorial notebook using that NCAR reference, and in the process found a few ways to economize on processing time (and probably memory). The end result is that re-chunking would likely run a little faster.
Check out the Cloud example (about 2/3 of the way down) in that chunking tutorial notebook to see the differences. https://github.com/USGS-python/hytest_notebook_tutorials/blob/dev/Syllabus/L2/xx_Chunking.ipynb
I just realized that the greatest economy came from changes in the way I selected features based on gage_id -- if you're not doing that sort of subsetting, then my code won't run much differently than the core process in that NCAR example.
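A minimal sketch of that subsetting step (not the tutorial's exact code): pick out the features matching a short list of gage IDs before any rechunking or computation, so downstream steps only touch a small slice of the store. The store path, the `gage_id` variable name, and the gage IDs themselves are illustrative assumptions.

```python
import fsspec
import numpy as np
import xarray as xr

# Open the CHRTOUT zarr store lazily over anonymous S3 (assumed path).
url = "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr"
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True)

wanted = ["01030350", "01031450"]                        # hypothetical gage IDs
gage_ids = ds["gage_id"].load().astype(str).str.strip()  # decode + trim padding
idx = np.flatnonzero(np.isin(gage_ids.values, wanted))
subset = ds.isel(feature_id=idx)   # everything downstream now sees a small dataset
```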
Variables of interest, extracted from https://ral.ucar.edu/sites/default/files/public/WRFHydroV5_OutputVariableMatrix_V5.pdf
bold == priority
Variable | Domain | Description |
---|---|---|
ACCET | LDASOUT | Accumulated total evapotranspiration |
SNEQV | LDASOUT | Snow water equivalent |
FSNO | LDASOUT | Fraction of surface covered by snow |
ACCPRCP | LDASOUT | Accumulated precipitation |
SOIL_M | LDASOUT | Volumetric soil moisture |
CANWAT | LDASOUT | Total canopy water (liquid + ice) |
CANICE | LDASOUT | Canopy ice water content |
depth | GWOUT | Groundwater bucket water level |
sfcheadrt | RTOUT | Surface head (from HYDRO) |
SFCRNOFF | LDASOUT | Surface runoff: accumulated |
UGDRNOFF | LDASOUT | Underground runoff: accumulated |
Data is accessed via the AWS Open Data registry. See https://registry.opendata.aws/nwm-archive/
Datasets are available in netCDF or Zarr format via S3 buckets. We prefer Zarr, so we will prioritize reading data from https://noaa-nwm-retrospective-2-1-zarr-pds.s3.amazonaws.com/index.html
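A quick sketch, assuming anonymous S3 access and consolidated metadata: open one of the retrospective zarr stores lazily and look at how it is chunked in the bucket before deciding on a rechunking plan (the `ldasout.zarr` path below is taken from the discussion later in this issue).

```python
import fsspec
import xarray as xr

url = "s3://noaa-nwm-retrospective-2-1-zarr-pds/ldasout.zarr"
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True)

print(ds)                                    # dimensions, coordinates, variables
print(ds["ACCET"].encoding.get("chunks"))    # chunk shape as stored in the bucket
```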
As I'm actually starting to look at this data, I'm seeing that rechunking into a single dataset doesn't seem feasible.
I believe the gw, lake, and streamflow datasets have data associated with features (but a different set of features for each). We could rechunk each of these datasets by feature id (like we did for streamflow), but they would still need to be 3 different datasets because the features are different.
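A rough sketch of that per-feature rechunk, following the rechunker-based approach from the NCAR notebook linked later in this thread. The gwout store path, the `depth` variable, the chunk sizes, and the output paths are all assumptions, and coordinate variables may need their own entries depending on the rechunker version.

```python
import fsspec
import xarray as xr
from rechunker import rechunk

# Source store opened lazily over anonymous S3 (assumed path).
src = xr.open_zarr(
    fsspec.get_mapper("s3://noaa-nwm-retrospective-2-1-zarr-pds/gwout.zarr", anon=True),
    consolidated=True,
)
n_time = src.sizes["time"]

plan = rechunk(
    src[["depth"]],
    target_chunks={
        "depth": {"time": n_time, "feature_id": 1000},   # full time series per chunk
        "time": {"time": n_time},
        "feature_id": {"feature_id": 1000},
    },
    max_mem="2GB",
    target_store="gwout_per_feature.zarr",
    temp_store="gwout_rechunk_temp.zarr",
)
plan.execute()  # runs the copy with dask, respecting the memory limit
```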
ldasout, precip, and rtout are gridded - these could be rechunked and combined into a single file (if we want to). I'd like some input from @rsignell-usgs or @pnorton-usgs on how to think about rechunking these data.
@sfoks and @pnorton-usgs (or anyone more familiar with NWM output) - is my understanding of the data correct? And if so, who can I talk to about optimal ways to rechunk the gridded data?
Aubrey Dugger at NCAR is someone who knows this best, but I tried to compile notes here.
WRFHydro output file | Description | Notes |
---|---|---|
CHRTOUT_DOMAIN | Streamflow output at all channel reaches/cells | NHD reaches |
CHANOBS_DOMAIN | Streamflow output at forecast points or gage reaches/cells | n=7994 gages for NWM v2.1 |
CHRTOUT_GRID | Streamflow on the 2D high-resolution routing grid | high-res grid is 250 m |
RTOUT_DOMAIN | Terrain routing variables on the 2D high-resolution routing grid | high-res grid is 250 m |
LAKEOUT_DOMAIN | Lake output variables | 1 km; I think a grid cell is coded as lake or not lake? Aubrey would know |
GWOUT_DOMAIN | Ground water output variables | 1 km, pretty sure this is gridded |
LDASOUT_DOMAIN | Land surface model output | 1 km |
We will ask Alicia Rhoades to rechunk the NWM data - starting with the 3 priority variables identified by @gzt5142 above, which come from the ldasout.zarr store on AWS Open Data Registry: https://noaa-nwm-retrospective-2-1-zarr-pds.s3.amazonaws.com/index.html.
@rsignell-usgs will prepare a rechunking tutorial for gridded datasets that he will contribute to either the hytest-org repo here: https://github.com/hytest-org/hytest/tree/main/dataset_preprocessing/tutorials or Project Pythia.
@rsignell-usgs will go over this tutorial in a rechunking demo meeting. I am checking with Alicia on her availability for the demo and to start work. I will schedule this demo and invite @amrhoades, @rsignell-usgs, @sfoks, @thodson-usgs, @kathymd (rechunking data on NHGF), @ellen-brown (rechunking data on NHGF)
Meeting scheduled for Thursday, Dec. 8 at 4pm ET. I invited all those mentioned above, plus a few more. Let me know if you didn't receive an invite and would like one.
An update on rechunking workflows here:
@thodson-usgs will test out the current zarr data to see whether creating a subset of rechunked data would be beneficial for the evaluation work
While reviewing this issue, we were unsure of its purpose. After discussing, we think the intent was perhaps to have an ARCO (analysis-ready, cloud-optimized) subset of NWM at gage locations (and daily averaged) so that we don't need to scan the entire dataset during our workflows.
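A hypothetical sketch of that ARCO subset: pull streamflow at gaged reaches only, average to daily values, and write a small analysis-ready zarr store so evaluation workflows never have to scan the full hourly archive. The store path, the `gage_id` and `streamflow` variable names, and the output location are assumptions.

```python
import fsspec
import xarray as xr

url = "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr"
ds = xr.open_zarr(fsspec.get_mapper(url, anon=True), consolidated=True)

# Keep only reaches with a non-blank gage identifier.
gage_ids = ds["gage_id"].load().astype(str).str.strip()
gaged = ds.isel(feature_id=(gage_ids != "").values.nonzero()[0])

daily = (
    gaged["streamflow"]
    .resample(time="1D")
    .mean()
    .chunk({"time": -1, "feature_id": 1})   # one gage per chunk, full record
)
daily.encoding.clear()  # let to_zarr use the new chunking, not the source's
daily.to_dataset(name="streamflow").to_zarr("nwm_v21_daily_streamflow_gages.zarr", mode="w")
```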
We want to rechunk all NWM v2.1 variables into a single zarr file. Chunks will be organized so that each chunk has 1 feature and the entire time series for that feature.
Rechunking will be based on this notebook: https://github.com/NCAR/rechunk_retro_nwm_v21/blob/main/notebooks/usage_example_rerechunk_chrtout.ipynb
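As a sanity check of that target layout, assuming a hypothetical local path for the rechunked store: with chunks of one feature by the entire time series, pulling a reach's full record should touch exactly one chunk per variable.

```python
import xarray as xr

ds = xr.open_zarr("nwm_v21_rechunked.zarr", consolidated=True)

one_reach = ds.sel(feature_id=12345)           # hypothetical feature_id
print(one_reach["streamflow"].data.chunks)     # expect a single chunk along time
series = one_reach["streamflow"].load()        # one small read from storage
```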