Closed agstephens closed 3 years ago
Suggestions from PR #109 ...
I think that our (C3S) requirements could be summed up by implementing something like:
In: clisops.core.average:
from roocs_utils.xarray_utils.xarray_utils import get_coord_type, known_coord_types
def average_over_dims(ds, dims=None, ignore_unfound_dims=False):
if not dims: return ds
if not set(dims).issubset(set(known_coord_types)):
raise Exception(f'Unknown dimension requested for averaging, must be within: {known_coord_types}.')
found_dims = []
# Work out real coordinate types for each dimension
for coord in ds.coords:
coord_type = get_coord_type(coord)
if coord_type:
found_dims.append(coord.name)
if ignore_unfound_dims == False and set(found_dims) != set(dims):
raise Exception(f'Requested dimensions were not found in input dataset: {set(dims) - set(found_dims)}.')
da = ...cast dataset to DataArray...
ds_averaged_over_dims = da.mean(dim=found_dims, skip_na=???, keep_attrs=???)
return ...
What do think? If it is really that easy for us create the functionality we need for C3S work then the real issue here is making sure that we structure the code so that the various layers work together properly.
@cehbrecht @ellesmith88 , how should we make a nice interface to the average_bbox
function? E.g.:
We want to manage 3 cases:
average_bbox(lat_bnds=[-40, 23], lon_bnds=[140, 360], start_time="1999-01-01")
average_bbox(entire_dims=['time', 'latitute'])
average_bbox(lat_bnds=[-40, 23], entire_dims=['time', 'longitude'])
This looks quite messy to me. My suggestion is that we just implement the bare minimum of what we need right now. In which case we just need to implement:
clisops/ops/average.py
:
def average_over_dims(ds, dims, ...)
We'll need to talk more with Ouranos. (see PR).
@ellesmith88 Thanks for putting together the first implementation of clisops/ops/average.py
(https://github.com/roocs/clisops/compare/ops_average). It is looking great. I have some minor comments, and I thought I'd put them in this issue...
Terminology (use of "unfound"):
ignore_unfound_dims
.File-naming (in: https://github.com/roocs/clisops/compare/ops_average#diff-1b2988b7b0ed92b0cd0a836847c693d2b4e5749039b3605910dfe0815f318967R74-R79):
extra=""
to constructor of base file namerextra
, it should be added to the end of the file name before the extension_avg-${dims}
, where ${dims}
is one of t, tz, tzy, tzyx, tzx, tzy, tyx, ty, tx, zyx
etc.,replace
option in - because I suspect it will come in handy laterSpecial case for the file-namer - no averaging over time:
extra="_avg-...."
part)average
function is kept nice and simpleThere are 2 points which are more fundamental and we might want to think deeply about:
average
that loops through each of the time slices, does the code that calculates the time slices already understand the reduced shape(s) of the outputs? If it does, then all is good. If it does not, then maybe we need to make time_slices = get_time_slices(avg_ds, split_method)
into something more generic that looks at the shape of the array to work out whether to split it across multiple files.Sorry, these are dense issues - but probably worth having a good think about before we make the leap into our second operator. Let me know your thoughts.
Two thoughts that I had as well:
Issue number 2 (looking at get_time_slices(avg_ds, split_method)
): the code does understand the reduced outputs. I've done a few tests/ used pdb
- but looked at the same dataset with no dimensions averaged and with one dimension averaged. In the case of no dimensions the files are split and in the case of one dimension they are not
Issue number 2 (looking at
get_time_slices(avg_ds, split_method)
): the code does understand the reduced outputs. I've done a few tests/ usedpdb
- but looked at the same dataset with no dimensions averaged and with one dimension averaged. In the case of no dimensions the files are split and in the case of one dimension they are not
@ellesmith88 : that is great. It means we don't have to do any refactoring of that part 👍
@ellesmith88: here are is an idea for how we might turn the Operations into classes with a common re-usable base class. Here is simple prototype:
class Operation(object):
def __init__(self, ds, file_namer='standard', split_method='time:auto', **params):
self._file_namer = file_namer
self._split_method = split_method
self._resolve_dsets(ds)
self._resolve_params(**params)
def _resolve_dsets(self, ds):
self.ds = ds
def _resolve_params(self, **params):
self.params = params
def process(self):
raise NotImplementedError
class Subset(Operation):
def process(self):
return self.ds[:2]
def subset(ds, file_namer='standard', split_method='time:auto', **params):
subsetter = Subset(ds, file_namer=file_namer, split_method=split_method, **params)
result = subsetter.process()
So, the subset
function is just a wrapper to do:
process()
on that instance and return the resultPossible methods on the Operation
class, might be:
def _resolve_dsets(...):
def _resolve_params(...):
def _set_file_namer(...):
def _calculate(...): <-- overridden in each sub-class
def process(): <-- loops through time slices to generate outputs, returns results
I have been looking at clisops.ops.average
and clisops.ops.subset
side-by-side. There is lots of commonality.
Great, agreed that average
and subset
were very similar. I think this basic prototype looks good, it will definitely help keep things a bit neater
Thanks @ellesmith88, if you have time, could you create a first prototype (in your ops_average
) and we talk it through in our 1130 meeting tomorrow?
Thanks @ellesmith88, if you have time, could you create a first prototype (in your
ops_average
) and we talk it through in our 1130 meeting tomorrow?
I'll give it a go 😊
We need to provide an
average
function in clisops with an associateddaops
andclisops
method. This is a high-level GitHub issue regarding that function. It breaks down into more issues.Overall plan:
Our
average
usesxarrray
directly, as documented here: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.htmlWe are not going to support weighted means or complex averages.
This function only supports a simple and complete average across one or more dimensions of a hypercube.
Issues to explore:
Extra arguments: https://github.com/roocs/clisops/issues/28
How do we limit define the inputs: https://github.com/roocs/clisops/issues/29