[ Hourly Stacking ] Checklist and Notes

EarthScientist commented 7 years ago

These are simply points that need to be examined before moving forward in a significant way. Refer to the checklist items by number in the comments so we can reach consensus and keep everyone up-to-date. I assigned the 3 of us who are involved in this effort to keep us all on the same page. This is not a call to action, though any input is greatly appreciated. :)

--> A current set of largely untested stacked data for the PCPT variable is located here: /workspace/Shared/Tech_Projects/wrf_data/project_data/wrf/hourly/pcpt

THE LIST:

[ ] 1.) what reference date to use in the 'time since ...' string inside the new NetCDF files. Options include: the first year/month/day of the entire series (1979-01-02), which would be the same for all yearly output NetCDF files, or the begin time of the series stored in each individual file (what we are doing now)
[ ] 2.) properly setting and storing the Proj4 CRS definition we are attempting to add to the series.
[ ] 3.) are the data being stored and displayed in a consistent way that is useable to end-users? This refers to the lon/lat being pulled from existing Monthly files and added to the stacked hourly files. Do these work as expected?
[ ] 4.) should we set the new CRS to EPSG:3338 to match our existing data sources? or keep the Polar Stereographic projection that is being used currently?
[ ] 5.) Define the Polar Stereographic projection that is being used currently as a well-defined Proj4 string.
[ ] 6.) can the newly generated hourly files by year be read in with xr.open_mfdataset( '*.nc' ) using dask chunking? If so, is the data stacked by the software in a proper way and is it useable?
[ ] 7.) file naming convention for these new outputs? Should we just follow the existing convention from the wrf data producers?
[ ] 8.) what variable naming should we follow? Our current SNAP variable naming convention does not match that of the WRF outputs. I think we were following something very similar to the PCMDI standard for the CMIP outputs, but we also need to examine this so that we can inform users how variables relate (or don't) to one another.
[ ] 9.) IMPORTANT NOTE: the data currently start at 1979-01-02, which is a bit odd, but could be explained by needing to drop a day on one end or the other to start the series. That said, I am not sure why this was done. This is definitely something to ask the data providers... It will only affect that single year for the daily vars and could minimally skew the monthly/weekly averages for that first week/month/year
[ ] 10.) Variable Units... Should we convert the units of variables to SNAP-standard units? or should we leave them raw (this is what we are doing now) and convert them on-the-fly for end users that need it?
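
For item 1, the two candidate references can be compared with plain date arithmetic. The following is only a minimal sketch using the standard library: the helper name `hours_since`, the choice of hours as the unit, and the assumption that the series starts at 00:00 on 1979-01-02 are all illustrative, not a description of what the stacking code currently does.

```python
from datetime import datetime

# assumption: the series begins at 00:00 on 1979-01-02 (see item 9)
SERIES_START = datetime(1979, 1, 2)


def hours_since(timestamp, reference):
    """Return the whole number of hours from `reference` to `timestamp`."""
    return int((timestamp - reference).total_seconds() // 3600)


# one example timestep that would live in the 1980 file
stamp = datetime(1980, 1, 1, 6)

# option A: a single reference shared by every yearly file
shared = hours_since(stamp, SERIES_START)

# option B: each file is referenced to its own first timestep
per_file = hours_since(stamp, datetime(1980, 1, 1))

print(f"hours since {SERIES_START:%Y-%m-%d}: {shared}")
print(f"hours since 1980-01-01: {per_file}")
```

Both options describe the same instant. A shared reference keeps the raw time values comparable across files, while a per-file reference keeps the stored numbers small and makes each file self-describing.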

EarthScientist commented 7 years ago

If using Python to examine the stacked files (which is HIGHLY recommended currently), here are a couple of snippets to show how to work with the outputs using xarray, dask, and toolz.

Make sure you have the right packages installed in your virtualenv:

pip install xarray dask toolz

import xarray as xr
import os

data_path = '/workspace/Shared/Tech_Projects/wrf_data/project_data/wrf/hourly/pcpt'
os.chdir( data_path )

# "open" all the chronological files as a single unit using open_mfdataset
ds = xr.open_mfdataset( '*.nc' ) # use a wildcard to grab all the files for this series

# OR just read in one...
ds1 = xr.open_dataset( 'PCPT_wrf_hour_1979.nc' )

# get to the variable data
var_data = ds1[ 'PCPT' ]

# same logic goes for time
time_data = ds1[ 'time' ]

# we can get attributes at different levels
# current attrs are mostly derived from the input files stacked, with a couple of additions
print( ds1.attrs ) # global attrs
print( ds1.PCPT.attrs ) # local variable attrs
print( ds1.time.attrs ) # time variable attrs
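
One quick sanity check on a stacked file is to compare its number of timesteps with the number of hours expected for that year. This is a minimal sketch, assuming one file per calendar year, hourly timesteps with no gaps, and a series that begins at 00:00 on 1979-01-02 (item 9). `expected_hours` is a hypothetical helper, not part of the existing code.

```python
from datetime import datetime

# assumption: first timestep of the whole series, per item 9
SERIES_START = datetime(1979, 1, 2)


def expected_hours(year, series_start=SERIES_START):
    """Number of hourly timesteps a complete file for `year` should hold."""
    # the first file starts late, every other file starts on January 1
    begin = max(datetime(year, 1, 1), series_start)
    end = datetime(year + 1, 1, 1)
    return int((end - begin).total_seconds() // 3600)


for year in (1979, 1980, 1981):
    print(year, expected_hours(year))

# with a file opened as in the snippet above, the comparison could look like:
#   assert ds1.time.size == expected_hours(1979)
```

A mismatch would point to missing or duplicated timesteps, or to a start time different from the one assumed here.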

EarthScientist commented 7 years ago

WRF data output
