roocs / clisops

Climate Simulation Operations
https://clisops.readthedocs.io/en/latest/

Define "average" operation interface #30

Closed agstephens closed 3 years ago

agstephens commented 4 years ago

We need to provide an average function in clisops, with associated daops and clisops methods. This is the high-level GitHub issue for that function; it breaks down into several more specific issues.

Overall plan:

  1. Our average uses xarray directly, as documented here: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html (see the short example after this list)

  2. We are not going to support weighted means or complex averages.

  3. This function only supports a simple and complete average across one or more dimensions of a hypercube.
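
For illustration, a minimal sketch of this kind of averaging in plain xarray (the dataset, variable and coordinate names are placeholders, not part of clisops):

import numpy as np
import xarray as xr

# Illustrative synthetic dataset - names and values are placeholders
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(4, 3, 2))},
    coords={"time": np.arange(4), "lat": [-10.0, 0.0, 10.0], "lon": [0.0, 90.0]},
)

# A simple, complete (unweighted) average over one or more dimensions
ds_time_mean = ds.mean(dim="time", skipna=True, keep_attrs=True)
ds_tyx_mean = ds.mean(dim=["time", "lat", "lon"], skipna=True, keep_attrs=True)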

Issues to explore:

  1. Extra arguments: https://github.com/roocs/clisops/issues/28

  2. How do we define and limit the inputs: https://github.com/roocs/clisops/issues/29

agstephens commented 3 years ago

Suggestions from PR #109 ...

I think that our (C3S) requirements could be summed up by implementing something like:

In clisops.core.average:

from roocs_utils.xarray_utils.xarray_utils import get_coord_type, known_coord_types

def average_over_dims(ds, dims=None, ignore_unfound_dims=False):
    if not dims:
        return ds
    if not set(dims).issubset(set(known_coord_types)):
        raise Exception(f'Unknown dimension requested for averaging, must be within: {known_coord_types}.')

    found_dims = []    # names of coordinates whose type matches a requested dim
    found_types = set()

    # Work out the real coordinate type of each coordinate and keep those
    # that match the requested dimensions
    for coord in ds.coords.values():
        coord_type = get_coord_type(coord)
        if coord_type in dims:
            found_dims.append(coord.name)
            found_types.add(coord_type)

    if not ignore_unfound_dims and found_types != set(dims):
        raise Exception(f'Requested dimensions were not found in input dataset: {set(dims) - found_types}.')

    da = ...cast dataset to DataArray...
    ds_averaged_over_dims = da.mean(dim=found_dims, skipna=???, keep_attrs=???)

    return ...

What do you think? If it really is that easy for us to create the functionality we need for the C3S work, then the real issue here is making sure that we structure the code so that the various layers work together properly.
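
For completeness, a hypothetical call to the function proposed above (the dataset ds and the chosen dims are placeholders, purely to show the intended interface):

# Hypothetical usage of the proposed core function
ds_avg = average_over_dims(ds, dims=["time", "latitude"], ignore_unfound_dims=False)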

agstephens commented 3 years ago

@cehbrecht @ellesmith88 , how should we make a nice interface to the average_bbox function? E.g.:

We want to manage 3 cases:

  1. Average over a subset along given dimensions:
    • average_bbox(lat_bnds=[-40, 23], lon_bnds=[140, 360], start_time="1999-01-01")
  2. Average over all values along given dimensions:
    • average_bbox(entire_dims=['time', 'latitude'])
  3. Average over a combination of entire dimensions and subsets of others:
    • average_bbox(lat_bnds=[-40, 23], entire_dims=['time', 'longitude'])

This looks quite messy to me. My suggestion is that we just implement the bare minimum of what we need right now, in which case we just need to implement:

clisops/ops/average.py:

def average_over_dims(ds, dims, ...)
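
As a rough illustration only (the import path, argument names and defaults are assumptions, not an agreed API), this ops-level function could start life as a thin wrapper around the core function sketched in the previous comment:

from clisops.core.average import average_over_dims as core_average_over_dims

def average_over_dims(ds, dims=None, ignore_unfound_dims=False):
    # File naming and output splitting would be layered on here later
    return core_average_over_dims(ds, dims=dims, ignore_unfound_dims=ignore_unfound_dims)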

We'll need to talk more with Ouranos. (see PR).

agstephens commented 3 years ago

@ellesmith88 Thanks for putting together the first implementation of clisops/ops/average.py (https://github.com/roocs/clisops/compare/ops_average). It is looking great. I have some minor comments, and I thought I'd put them in this issue...

  1. Terminology (use of "unfound"):

    • I am not keen on the name of the argument ignore_unfound_dims.
    • Could we refer to "undetected" dims instead of "unfound" dims?
    • I know I am nit-picking (sorry :-) )
  2. File-naming (in: https://github.com/roocs/clisops/compare/ops_average#diff-1b2988b7b0ed92b0cd0a836847c693d2b4e5749039b3605910dfe0815f318967R74-R79):

    • I have a suggestion for slightly modifying the file namer interactions, something like:
      • add argument extra="" to constructor of base file namer
      • if user provides extra, it should be added to the end of the file name before the extension
      • in our case, we should add: _avg-${dims}, where ${dims} is one of t, tz, tzy, tzyx, tzx, tyx, ty, tx, zyx, etc.
      • so it is a short string but easy enough to follow (see the sketch after this list)
      • you could keep the replace option in - because I suspect it will come in handy later
  3. Special case for the file-namer - no averaging over time:

    • if the dataset has not been averaged over time, then it is okay to keep the existing time-component in the output file names (and then append the extra="_avg-...." part)
    • this requires a bit more code so I would suggest a helper function that resolves all the file naming issues in one place so that the content of the main average function is kept nice and simple
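
Regarding point 2, a rough sketch of the suffix-building idea (this helper and its mapping are hypothetical, not part of the existing file namer):

# Hypothetical helper: build the "_avg-<dims>" suffix from the coordinate
# types that were averaged over, in a fixed t/z/y/x order
CODE_BY_TYPE = {"time": "t", "level": "z", "latitude": "y", "longitude": "x"}

def build_avg_suffix(averaged_dim_types):
    # e.g. ["time", "level"] -> "_avg-tz"
    codes = "".join(
        CODE_BY_TYPE[dim] for dim in ("time", "level", "latitude", "longitude")
        if dim in averaged_dim_types
    )
    return f"_avg-{codes}"

# The suffix would then be inserted just before the file extension,
# e.g. "..._avg-ty.nc"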

There are 2 points which are more fundamental and which we might want to think about more deeply:

  1. Should we be refactoring these into more generic operations - maybe by using a class hierarchy and making the resulting objects callable (i.e. function-like, even though they are class instances)? That would allow us to include stages such as: process args; prepare dataset; get file namer; compute; etc.
  2. In the section of average that loops through each of the time slices, does the code that calculates the time slices already understand the reduced shape(s) of the outputs? If it does, then all is good. If it does not, then maybe we need to make time_slices = get_time_slices(avg_ds, split_method) into something more generic that looks at the shape of the array to work out whether to split it across multiple files.

Sorry, these are dense issues - but probably worth having a good think about before we make the leap into our second operator. Let me know your thoughts.

ellesmith88 commented 3 years ago

Two thoughts that I had as well:

ellesmith88 commented 3 years ago

Issue number 2 (looking at get_time_slices(avg_ds, split_method)): the code does understand the reduced outputs. I've done a few tests (using pdb), looking at the same dataset with no dimensions averaged and with one dimension averaged. In the case of no dimensions the files are split, and in the case of one dimension they are not.

agstephens commented 3 years ago

> Issue number 2 (looking at get_time_slices(avg_ds, split_method)): the code does understand the reduced outputs. I've done a few tests (using pdb), looking at the same dataset with no dimensions averaged and with one dimension averaged. In the case of no dimensions the files are split, and in the case of one dimension they are not.

@ellesmith88 : that is great. It means we don't have to do any refactoring of that part 👍

agstephens commented 3 years ago

@ellesmith88: here is an idea for how we might turn the operations into classes with a common re-usable base class. Here is a simple prototype:


class Operation(object):

    def __init__(self, ds, file_namer='standard', split_method='time:auto', **params):
        self._file_namer = file_namer
        self._split_method = split_method
        self._resolve_dsets(ds)
        self._resolve_params(**params)

    def _resolve_dsets(self, ds):
        self.ds = ds

    def _resolve_params(self, **params):
        self.params = params

    def process(self):
        raise NotImplementedError

class Subset(Operation):
    def process(self):
        return self.ds[:2]

def subset(ds, file_namer='standard', split_method='time:auto', **params):
    subsetter = Subset(ds, file_namer=file_namer, split_method=split_method, **params)
    result = subsetter.process()
    return result

So, the subset function is just a wrapper to do:

  1. Create Operation instance
  2. Run process() on that instance and return the result
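
Following the same pattern, a hedged sketch of how an Average operation might slot in (an assumption only, not the implemented clisops API; the real version would delegate to clisops.core.average):

class Average(Operation):

    def _resolve_params(self, **params):
        self.dims = params.get("dims")

    def process(self):
        # Placeholder computation - in practice this would call the core
        # average_over_dims() and handle file naming / output splitting
        return self.ds.mean(dim=self.dims, keep_attrs=True)

def average(ds, file_namer='standard', split_method='time:auto', **params):
    averager = Average(ds, file_namer=file_namer, split_method=split_method, **params)
    return averager.process()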

Possible methods on the Operation class might be:

def _resolve_dsets(...):
def _resolve_params(...):
def _set_file_namer(...):
def _calculate(...): <-- overridden in each sub-class
def process(): <-- loops through time slices to generate outputs, returns results

agstephens commented 3 years ago

I have been looking at clisops.ops.average and clisops.ops.subset side-by-side. There is a lot of commonality.

ellesmith88 commented 3 years ago

Great, agreed that average and subset were very similar. I think this basic prototype looks good; it will definitely help keep things a bit neater.

agstephens commented 3 years ago

Thanks @ellesmith88, if you have time, could you create a first prototype (in your ops_average branch) and we can talk it through in our 11:30 meeting tomorrow?

ellesmith88 commented 3 years ago

> Thanks @ellesmith88, if you have time, could you create a first prototype (in your ops_average branch) and we can talk it through in our 11:30 meeting tomorrow?

I'll give it a go 😊