pacificclimate / climate-explorer-data-prep


Form climo means of streamflows #5

Open rod-glover opened 7 years ago

rod-glover commented 7 years ago

Currently we can form climatological means from files containing variables defined over spatiotemporal grids, such as the outputs of GCMs, but not from streamflow output files.

Streamflow, however, is not defined on a grid. A streamflow for a given spatial location, called an outlet, is a time series at that location. The collection of outlets does not form a uniform grid -- instead the outlets are distributed essentially at random. Outlets are addressed by an outlet index, with several dependent variables defining the spatial location, name, and streamflow at each outlet.

We need to handle this case too.

rod-glover commented 7 years ago

It is possible that generate_climos (gc) can be tweaked to handle this case. Inspecting the code of generate_climos yields the following observations relevant to processing streamflow output files:

  1. gc selects the temporal subset with cdo seldate, which:

    1. drops dimension nc_chars

    2. renames dimension nv to bnds

    3. renames dimension outlet to x

    4. drops the attribute _ChunkSizes from all variables with dimension outlet/x, but retains it for time and streamflow with altered value for time (1024 -> 524288)

    5. drops variable outlet_name(outlet)

    6. retains all other variables in apparently correct state

  2. gc forms the means using cdo ymonmean (etc.), which:

    1. can tell the difference between variables dependent on time and those that are not, so takes the mean only of streamflow

    2. drops dimension bnds (formerly nv)

    3. drops variable time_bnds(time, bnds) -- which may be OK, since gc replaces that variable anyway to reflect the bounds for climatological means

    4. retains all other variables in apparently correct state

  3. Not sure what concatenating intervals (via cdo copy) would do. I think it would be OK. In any case, since we do not want 17-point chronologies any more, this is not relevant.

  4. Converting longitudes (--convert-longitudes) depends on cf.lon_var, which fails for streamflow files. This is presumably fixable.

  5. Units conversion for pr variables is irrelevant.

  6. gc splits out all dependent variables if the --split-variables option is set. This should be changed to split out only time-dependent variables. For streamflows this is unnecessary in practice: there is only one time-dependent variable, so we could get away with unsetting --split-variables. I'd prefer to do this right, which would call for a change to nchelpers to return a list of time-dependent variables. Whether we actually do that depends on how complicated it looks. I suspect not very, since nchelpers can already identify the time dimension (or rather the time variable, but that is very close).
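The two cdo steps in items 1 and 2 can be sketched as command lines. This is only an illustration, not the actual generate_climos code: the helper name, the file names, and the date range are hypothetical, and only the seldate and ymonmean operators named above are used.

```python
def climo_commands(infile, subset_file, climo_file, start, end):
    """Build the two cdo invocations described above as argument lists.

    Hypothetical helper for illustration; generate_climos may structure
    its cdo calls differently.
    """
    # Step 1: select the temporal subset from start to end (inclusive).
    select = ["cdo", f"seldate,{start},{end}", infile, subset_file]
    # Step 2: form multi-year monthly means over that subset.
    means = ["cdo", "ymonmean", subset_file, climo_file]
    return [select, means]


if __name__ == "__main__":
    # File names and dates are made up for the example.
    for command in climo_commands(
        "streamflow.nc", "subset.nc", "climo.nc", "1961-01-01", "1990-12-31"
    ):
        print(" ".join(command))
```

Each list could be handed to subprocess.run on a machine where cdo is installed; the sketch only builds the commands so that it can be read and checked without cdo.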

rod-glover commented 7 years ago

Proposed fixes to above issues:

Every cdo operator apparently does the following things:

  1. drops dimension nc_chars

  2. renames dimension nv to bnds

  3. renames dimension outlet to x

  4. drops the attribute _ChunkSizes from all variables with dimension outlet/x, but retains it for time and streamflow with altered value for time (1024 -> 524288)

  5. drops variable outlet_name(outlet)

    • apparently cdo has a bug that prevents it from recognizing this variable, because when it is specified in a cdo select command, it complains that it can't find a variable of that name. Weird. Direct inspection of the file using the Python netCDF4 package shows that variable is defined like all others; Panoply agrees. WTF?

Therefore corrections to these issues must be applied after all cdo operators have been applied. The corrections are:

  1. Rename dimension x to outlet (netCDF4.Dataset.renameDimension)

  2. nc_chars, outlet_name: Create new dimension nc_chars and variable outlet_name(outlet, nc_chars) and copy values from input file

  3. ? Copy attribute _ChunkSizes back onto all variables with dimension outlet

  4. Modify creation/updating of time bounds variable as needed to work in this context.
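Corrections 1 and 2 might look roughly like the sketch below. It assumes both arguments are open netCDF4.Dataset objects (the processed one writable), that the outlet count is unchanged by the cdo operators, and it uses a hypothetical function name. It copies only the values of outlet_name, not its attributes, and does not attempt corrections 3 and 4.

```python
def restore_outlet_metadata(original, processed,
                            cdo_dim="x", outlet_dim="outlet",
                            chars_dim="nc_chars", name_var="outlet_name"):
    """Undo cdo's changes on `processed`, using `original` as the reference.

    Both arguments are expected to behave like open netCDF4.Dataset
    objects; `processed` must be writable. Hypothetical helper, not part
    of generate_climos.
    """
    # Correction 1: cdo renamed dimension outlet to x; rename it back.
    if cdo_dim in processed.dimensions and outlet_dim not in processed.dimensions:
        processed.renameDimension(cdo_dim, outlet_dim)

    # Correction 2a: recreate the character dimension that cdo dropped.
    if chars_dim not in processed.dimensions:
        processed.createDimension(chars_dim, len(original.dimensions[chars_dim]))

    # Correction 2b: recreate outlet_name(outlet, nc_chars) and copy its
    # values from the input file. Attributes are not copied here.
    if name_var not in processed.variables:
        source = original.variables[name_var]
        target = processed.createVariable(
            name_var, source.dtype, (outlet_dim, chars_dim)
        )
        target[:] = source[:]
```

With the netCDF4 package, this would be called with something like netCDF4.Dataset(input_path) for the original and netCDF4.Dataset(output_path, "r+") for the cdo output. The guards make the function safe to run on a file that a newer cdo has left untouched.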

Other fixes:

  1. Time-dependent variables: Extend nchelpers.CFDataset.dependent_varnames to be able to return names of variables dependent on a specified set of dimensions (specifically, time).

  2. Convert longitudes: Fix nchelpers.CFDataset.lon_var.

  3. When splitting, all non-time-dependent variables must be included in each split file; otherwise they are dropped. So the split command looks like cdo select,name={all non-time-dependent vars},{time-dependent var} for each time-dependent var.
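Fixes 1 and 3 can be sketched together on a plain mapping of variable name to dimension names, standing in for what nchelpers would read from the file. Everything here is hypothetical: the function names, the lon and lat variable names, the output file naming, and the exclude parameter (used to leave out time_bnds and, since cdo cannot find it, outlet_name). It is not the nchelpers API.

```python
def time_dependent_varnames(var_dims, time_dim="time", exclude=("time_bnds",)):
    """Names of variables that depend on the time dimension.

    `var_dims` maps variable name -> tuple of dimension names. The time
    coordinate variable itself and anything in `exclude` are left out.
    """
    return [
        name for name, dims in var_dims.items()
        if time_dim in dims and name != time_dim and name not in exclude
    ]


def split_commands(var_dims, infile, time_dim="time", exclude=("time_bnds",)):
    """One cdo select command per time-dependent variable.

    Each command also names every non-time-dependent variable, so that
    those variables are carried into the split file.
    """
    static = [
        name for name, dims in var_dims.items()
        if time_dim not in dims and name not in exclude
    ]
    commands = {}
    for var in time_dependent_varnames(var_dims, time_dim, exclude):
        names = ",".join(static + [var])
        # Output file name is an arbitrary choice for the example.
        commands[var] = ["cdo", f"select,name={names}", infile, f"{var}.nc"]
    return commands


if __name__ == "__main__":
    # Assumed layout of a streamflow file; lon and lat are guesses.
    var_dims = {
        "time": ("time",),
        "time_bnds": ("time", "nv"),
        "lon": ("outlet",),
        "lat": ("outlet",),
        "outlet_name": ("outlet", "nc_chars"),
        "streamflow": ("time", "outlet"),
    }
    cmds = split_commands(var_dims, "in.nc", exclude=("time_bnds", "outlet_name"))
    for command in cmds.values():
        print(" ".join(command))
```

For a streamflow file this yields a single command, which matches the observation that splitting is unnecessary in practice for this case.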

corviday commented 6 years ago

Starting to look into this issue.

Some rather old discussion around issues people were having with CDO copy renaming and removing variables seems to indicate that CDO ignores variables it thinks don't actually describe the data, and renames variables it thinks aren't compliant with CF standards.

The CF Standards contain guidelines on how to represent "discrete geometries" (for example, stations with associated timeseries) and sets of "station variables". I thought perhaps CDO wasn't able to understand that the streamflow files represent discrete geometries, and that might be why it was renaming and deleting things, so I wrote a script to do the minor modifications to bring our files up to the standard (add cf_role attributes, etc).

Unfortunately, this doesn't actually seem to matter. CDO is still renaming outlets to x and completely dropping outlet_name.

Much newer discussion indicates that more recent versions of CDO don't ever rename dimensions. Perhaps they also understand the (relatively recently developed) CF standards for discrete geometries? Unfortunately, the prebuilt cdo versions available on Ubuntu right now are all 2+ years old, and I don't think making "build CDO from source" a development requirement on this project is justifiable.

Summary: Modifying data files to see if CDO stops deleting dimensions when everything is perfectly CF-Standards-compliant seems to have been a dead end. It looks like correcting the data after all the CDO operations are complete is the way to go.

corviday commented 6 years ago

Upon further investigation, CDO ignores any variable with type character (a netCDF "classic" string). CDO deletes the nc_chars axis because no variable (that it cares about, anyway) is using it.
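That rule is enough to predict what CDO will drop from a given file. A minimal sketch, assuming each variable is described by a one-letter type code and its dimension names, with "S" assumed to mark character data (as numpy's dtype.kind does for byte strings); the function name and the sample layout are made up.

```python
def predict_cdo_drops(variables):
    """Guess which variables and dimensions CDO will drop.

    `variables` maps name -> (type_code, dims). Character variables
    (type_code "S") are assumed to be ignored, and any dimension used
    only by ignored variables is assumed to disappear with them.
    """
    dropped_vars = [
        name for name, (code, _) in variables.items() if code == "S"
    ]
    # Dimensions still referenced by at least one surviving variable.
    kept_dims = {
        dim
        for code, dims in variables.values() if code != "S"
        for dim in dims
    }
    dropped_dims = []
    for _, dims in variables.values():
        for dim in dims:
            if dim not in kept_dims and dim not in dropped_dims:
                dropped_dims.append(dim)
    return dropped_vars, dropped_dims


if __name__ == "__main__":
    # Assumed streamflow layout; lon is a guess at the location variable.
    layout = {
        "time": ("f", ("time",)),
        "lon": ("f", ("outlet",)),
        "outlet_name": ("S", ("outlet", "nc_chars")),
        "streamflow": ("f", ("time", "outlet")),
    }
    print(predict_cdo_drops(layout))
```

The same lists tell a post-processing step which variables and dimensions to copy back from the input file.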

So it looks like there are two categories of CDO weirdness we should be able to detect and correct after calculations are done:

Plus whatever the issue with _ChunkSizes is; none of my test files seem to have _ChunkSizes attributes so I haven't really done any testing of it.