spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

Adding custom reduction methods #208

Open chuaxr opened 7 years ago

chuaxr commented 7 years ago

For my current purposes, I am frequently interested in computing averages of a variable, subject to some condition, across both time and the spatial domain, treating each (time, lat, lon) point as an independent sample. For example, computing the average of vertical velocity over regions where precipitation exceeds a certain threshold. Since masking is required to isolate the samples of interest, stacking the data in (time, lat, lon) and then taking the mean (the 'stacked average') is not the same as taking the mean in time, then in (lat, lon).
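A minimal sketch of the difference, with random data standing in for vertical velocity and precipitation (names here are illustrative only):

import numpy as np
import xarray as xr

# Illustrative only: random stand-ins for vertical velocity and precipitation.
np.random.seed(0)
dims = ('time', 'lat', 'lon')
w = xr.DataArray(np.random.randn(8, 4, 4), dims=dims)
precip = xr.DataArray(np.random.rand(8, 4, 4), dims=dims)

masked_w = w.where(precip > 0.5)

stacked_avg = masked_w.mean()                  # pool all unmasked samples
sequential_avg = masked_w.mean('time').mean()  # time mean first, then space

# These generally differ: each (lat, lon) point retains a different number of
# unmasked timesteps, so the sequential version weights points unequally.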

One workaround is to simply define the stacked average as a user function. However, this leads to duplication of variables such as w_conditioned_on_precip and stacked_average_of_w_conditioned_on_precip. To avoid that, @spencerkclark helped me add a custom reduction method in my copy of aospy.

It would be ideal for a user to add custom reduction methods as desired. If that is unlikely to happen in the near future, though, it would be nice if this 'stacked average' were added in the next release.

spencerahill commented 7 years ago

I agree, supporting custom reduction methods would be extremely useful; @spencerkclark and I were recently talking offline about exactly that.

We could also consider adding a stacked average directly. I don't immediately understand the operation as you've described it though...maybe you'd be willing to share e.g. your w_conditioned_on_precip function?

Initial thoughts re: supporting custom methods:

calc_suite_specs=dict(
    ..., 
    output_time_regional_reductions=['av', 'std', ('my_reduc', my_custom_reduction)],
    ...)

As the code currently stands, the function would have to accept an xr.DataArray as its sole argument (and one on which the calculation and within-year averaging have already been performed).

Another option would be to create a Reduction class, e.g.:
class Reduction(object):
    def __init__(self, name, function):
        self.name = name
        self.function = function

So these would essentially become part of the user's object library and would be passed in as e.g. output_time_regional_reductions=['av', 'std', my_reduc_obj]. This would be more overhead, but it might be preferable in the sense that the function and its label are glued together, preventing problems arising from e.g. using 'my_reduc' as the label once and the next day forgetting it and using 'my_reduc_method' as the label instead.
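For concreteness, the object-based usage might look something like this in a user's object library (purely illustrative, building on the Reduction(name, function) sketch above):

# Purely illustrative, assuming the Reduction(name, function) sketch above.
def stacked_average(arr):
    """Unweighted mean over all timesteps and grid points at once, applied to
    the already-masked/conditioned DataArray computed by the Var."""
    return arr.mean()

my_reduc_obj = Reduction('my_reduc', stacked_average)

calc_suite_specs = dict(
    # other specs omitted
    output_time_regional_reductions=['av', 'std', my_reduc_obj],
)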

chuaxr commented 7 years ago

This is the function I use for selecting variables corresponding to specified percentiles of precipitation:

def cond_pct_prcp(prcp_nc, prcp_c, var2cond):
    """Condition a variable on precipitation percentile bins.

    Parameters
    ----------
    prcp_nc, prcp_c : precipitation components (combined by ``total_precip``)
    var2cond : variable to be conditioned on the precipitation percentiles
    """
    prcp_tot = total_precip(prcp_nc, prcp_c)
    prcp_pct = pct_prcp(prcp_nc, prcp_c)  # percentiles of precipitation

    # Keep values where total precip falls between consecutive precipitation
    # percentiles (broadcast over the quantile dimension).
    var2cond = var2cond.where(prcp_tot > prcp_pct)
    var2cond = var2cond.where(prcp_tot < prcp_pct.shift(quantile=-1))

    return var2cond

If I'm understanding you right, 'my_reduc' is meant only as the label for the output file. 'my_reduc_obj' would be a way of tying 'my_reduc' and 'my_custom_reduction' together for the sake of consistency. Either seems fine, and it would be nice if the syntax were standardized between 'av'/'std' and the custom methods.

The issue I faced is with the requirement that within-year averaging be performed. In my particular case, averaging within the year would be a time average, which defeats the purpose of stacking. In the workaround I mentioned, there are some changes to _make_full_mean_eddy_ts to also return the raw data for _apply_all_time_reductions to use in the stacked case. Barring a more general solution, I would suggest adding this extra case within the code.

spencerahill commented 7 years ago

'my_reduc' is meant only as the label for the output file. 'my_reduc_obj' would be a way of tying 'my_reduc' and 'my_custom_reduction' together for the sake of consistency.

Yes, exactly.

The issue I faced is with the requirement that within-year averaging be performed. In my particular case, averaging within the year would be a time average, which defeats the purpose of stacking.

Yes, the Calc pipeline inherently centers on within-year averaging first, followed by across-year time reductions; cf. #207. I agree that this is something we should relax: we should support arbitrary time and spatial reductions within each year (including no reduction, which is what you need), and then arbitrary time and spatial reductions across all years.

In the workaround I mentioned, there are some changes to _make_full_mean_eddy_ts to also return the raw data for _apply_all_time_reductions to use in the stacked case. Barring a more general solution, I would suggest adding this extra case within the code.

Your workaround could be a useful starting point, if you don't mind sharing it. We'll have to think about how best to go about it though, amidst a wider refactoring of Calc that is becoming more and more necessary.

chuaxr commented 7 years ago

Well, more accurately, it's @spencerkclark's workaround, so I assume he wouldn't object to sharing it. It's in /nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py, which you should both have read access to.

spencerkclark commented 7 years ago

Another option would be to create a Reduction class [...] This would be more overhead, but it might be preferable in the sense that the function and its label are glued together, preventing problems arising from e.g. using 'my_reduc' as the label once and the next day forgetting it and using 'my_reduc_method' as the label instead.

I agree; I lean towards the object-based approach too for the same reason.

As the code currently stands, the function would have to accept an xr.DataArray as its sole argument (and one on which the calculation and within-year averaging have already been performed).

The cleanest intermediate fix would probably be to rewrite the pipeline slightly such that all time reductions started from the full time series. This would introduce some duplication of work (say if one wanted to compute the mean over annual means ('av') as well as the standard deviation over annual means ('std'), aospy would compute the time series of annual means twice), but that way we could treat both built-in and custom time reductions the same way.

spencerahill commented 7 years ago

I agree; I lean towards the object-based approach too for the same reason.

It is decided then :)

This would introduce some duplication of work (say if one wanted to compute the mean over annual means ('av') as well as the standard deviation over annual means ('std'), aospy would compute the time series of annual means twice)

I think we should avoid this if possible. One idea would be for the Reduction class to have two reductions: within_year and across_year (either could be None). We could then compare the within_year method of all specified reductions in a Calc: once the first one has computed that step, hold onto the result, and then for subsequent ones, use that already-computed one. Although how to do that comparison isn't necessarily trivial, unless the within-year reductions are the exact same object. Maybe within_year and across_year are Reduction objects themselves?
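A rough sketch of that caching idea (names and the caching strategy here are assumptions, not existing aospy API):

# Rough sketch only; within_year/across_year are plain callables here.
class Reduction(object):
    def __init__(self, label, within_year=None, across_year=None):
        self.label = label
        self.within_year = within_year    # e.g. a yearly-averaging function
        self.across_year = across_year    # e.g. mean or std across years

def apply_reductions(arr, reductions):
    """Apply each Reduction, computing each distinct within-year step only once."""
    cache = {}
    results = {}
    for reduc in reductions:
        if reduc.within_year is None:
            intermediate = arr
        else:
            # Cache keyed on the within-year callable itself, i.e. reuse the
            # result only when reductions share "the exact same object".
            if reduc.within_year not in cache:
                cache[reduc.within_year] = reduc.within_year(arr)
            intermediate = cache[reduc.within_year]
        results[reduc.label] = (reduc.across_year(intermediate)
                                if reduc.across_year is not None else intermediate)
    return results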

(This is analogous to the eventual goal of retaining loaded data from disk across Calcs, i.e. #4, but I think is even simpler (no across-process communication to deal with). So it might be useful for that reason also.)

spencerahill commented 6 years ago

FYI, I'm planning on (finally) putting in a few cycles on this at some point this month. I think the initial sketch of the implementation above is still a good starting point.

spencerkclark commented 6 years ago

Awesome! I'd be happy to devote some time as well to help push this forward. This would be a major improvement to aospy.

spencerahill commented 6 years ago

OK, I have sketched out an initial design but could use some input on a few decisions.

How to register reduction methods

Cf. the comments above, I want to register all of the Reductions using their labels as keys, so that we can ensure there are no duplicate labels; otherwise filenames could become ambiguous.

  1. The canonical approach I found for registering classes is to use metaclasses (e.g. here), and for registering class instances, to use a class variable (e.g. here). I had in mind the latter, i.e. having users create instances of some Reduction class rather than subclasses of it. E.g. something along the lines of
class Reduction(object):
    instances = {}
    def __init__(self, func, label):
        self.instances.update({label: self})
        ....
    ...

time_avg = Reduction(lambda x: x.mean(YEAR_STR), 'av')

Does that seem reasonable?

  2. Either way, we need to build into aospy at least those reductions already supported, namely 'av', 'ts', and 'std'. The registry, meanwhile, needs to track these and any user-defined ones. My concern is making sure users don't mangle the built-in ones, or, cf. the comments above, mangle their own. To this end, I had in mind replacing the self.instances.update line above with something like:
@classmethod
def register_reduction(cls, reduction):
    label = reduction.label
    if label in cls.instances:
        raise KeyError("A Reduction object already exists with the desired "
                       "label '{0}', and duplicates are not allowed.  "
                       "The existing object is: "
                       "{1}".format(label, cls.instances[label]))
    else:
        cls.instances[label] = reduction

Call signature of reduction methods

What should the call signature of reduction functions be? The simplest would be for them to have a DataArray as their lone argument. But all but the most naive reductions also require the dt array of time weights. Calc ensures this is included as the internal_names.TIME_WEIGHTS_STR coord of the output DataArray, but we also currently convert its units to days to prevent overflow.

My initial thought is to continue doing this, with the time reductions occurring immediately afterward, so that the user can just count on their reduction function being applied to a DataArray whose TIME_WEIGHTS_STR coord has units days. They can convert back to higher frequency if needed, but otherwise won't have to worry about overflow or have to pass in the dt array separately.
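Under that convention, a simple hedged example of a user reduction (assuming the internal_names constants referenced above are importable as shown):

from aospy.internal_names import TIME_STR, TIME_WEIGHTS_STR

def weighted_time_mean(arr):
    """Time mean weighted by each timestep's duration in days, read from the
    TIME_WEIGHTS_STR coord that Calc attaches (NaN handling omitted for brevity)."""
    dt = arr[TIME_WEIGHTS_STR]
    return (arr * dt).sum(TIME_STR) / dt.sum(TIME_STR)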

Are there time reductions that would require additional arguments? I personally don't have any, but @chuaxr's example above seems to require additional data. That becomes much more complicated; we'd have to think about how to support it.

Nested reductions

The whole point of this endeavor is to relax the current, rigid, first within-year average, then across-years-reduction approach to allow more general reductions. But what is the cleanest way of implementing multi-step reductions such as these? My initial idea for e.g. 'av':

from .utils.times import yearly_average

def _in_year_avg(arr):
    return yearly_average(arr, arr[TIME_WEIGHTS_STR])

time_avg = Reduction(lambda x: _in_year_avg(x).mean(YEAR_STR), 'av')

EDIT: for a linear operation such as an average, this two-step procedure is unnecessary; the above could be replaced with lambda x: x.mean(TIME_STR). But it is necessary for nonlinear operators such as the standard deviation; just replace .mean with .std (assuming as before the desire is the interannual standard deviation, not the standard deviation across all timesteps).
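Continuing the sketch, the interannual standard deviation keeps the two-step form (same hypothetical Reduction and _in_year_avg as above):

# Within-year average first, then standard deviation across years.
time_std = Reduction(lambda x: _in_year_avg(x).std(YEAR_STR), 'std')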


I think that's enough for now...there are other decisions to be made w/r/t regional vs. point-by-point reductions that we can return to.

chuaxr commented 6 years ago

The assumption seems to be that reduction operations will be performed on a time dimension of the data.

At least for my use case, I often wish to simultaneously reduce both in time and space: perhaps by taking the 99th percentile of some value over the domain and last x days. It's not clear if this is something that is meant to be included, although it would be nice :)

spencerahill commented 6 years ago

I often wish to simultaneously reduce both in time and space: perhaps by taking the 99th percentile of some value over the domain and last x days

@chuaxr So long as you can express this logic in a function that accepts the DataArray containing the Calc's value computed at every timestep, this is fine. In other words, if you can express the reduction like this:

def percentile99(arr):
    arr_out = ... # insert desired logic here, which could include 
    # both spatial and temporal operations
    return arr_out

then, based on what I've proposed above, you should be able to represent it via e.g. percentile99_reduc = Reduction(percentile99, 'percent99'). Does that make sense? And if so, would that be sufficient for your needs?
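For what it's worth, one hedged way to fill in that body, assuming daily timesteps and aospy's TIME_STR dimension name, might be:

from aospy.internal_names import TIME_STR

def percentile99(arr, ndays=30):
    """99th percentile over the last `ndays` timesteps and the whole domain.
    Assumes daily data; reduces over all remaining dimensions at once."""
    recent = arr.isel(**{TIME_STR: slice(-ndays, None)})
    return recent.quantile(0.99)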

spencerkclark commented 6 years ago

Thanks for putting together these initial thoughts @spencerahill! I like your idea for registering reduction instances; it seems very clean. One question I have though is will the user need to import their Reduction instances in their main script in order for them to be registered in the class within aospy? This seems sort of counterintuitive if they don't end up being explicitly used in the main script.

Regarding call signatures - I agree that we should just convert the time weights to units of days as soon as we load them. The general call signature topic is going to be a challenge though. I'll describe perhaps my most tricky example below. Maybe we can try discussing things a bit to see if we can come up with a sane solution, but I'm fine if we elect to punt on some of the more complicated issues for now. I don't want to heap tons more work onto your plate and would be happy to contribute PR(s) of my own if we can figure out how we'd like to proceed.

What should the call signature of reduction functions be? [...] Are there time reductions that would require additional arguments? That becomes much more complicated; we'd have to think about how to support it.

This is a really tough one. Let me describe an example of a calculation that I would like to be able to do in aospy to illustrate the challenge. It may require re-thinking parts of the current pipeline (but could open the door to making things massively more flexible).

Example

A type of calculation that I find myself doing quite a bit these days is computing a regression pattern. This is a form of a time reduction (i.e. it removes the time dimension from the data). Without going too much into the details, this reduction requires a couple steps:

  1. I need to create a one-dimensional index (a time series) from a specific variable (say OLR or precipitation); computing this index may require filtering the variable to some region of frequency-wavenumber space, then averaging the variable over some spatial area. Depending on the simulation, the regions of spectral and physical space may change, and so might the variable I base the index on.
  2. I then want to regress an arbitrary variable onto this index (this is where the time reduction occurs); it mainly involves some dimension stacking and matrix multiplication.

Let's say I wanted to compute regression patterns for the zonal and meridional winds onto a precipitation index (i.e. two separate calculations). At the minimum, this would require reduction operations to be able to take multiple arguments, because the reduction requires two variables (the index, and the variable we regress onto it).

Thoughts

At the minimum, this would require reduction operations to be able to take multiple arguments

You might say this isn't so bad. Essentially we would need some way for the user to be able to specify how they would like the name of the reduction to appear in file names (that would be a function of the input arguments), and we would need some way for the user to specify the arguments to the reduction in the main script.

Maybe the reduction class could look like this:

class Reduction(object):
    instances = {}

    def __init__(self, func, label, label_append=None):
        """
        Parameters
        ----------
        func : function
            Reduction operation
        label : str
            Base label for the reduction used in filenames
        label_append : function, optional
            Function to extend the label based on the input arguments
        """
        self.instances.update({label: self})
        self.func = func
        # Store the base label under a private name so it doesn't shadow the
        # label() method below.
        self._label = label
        self.label_append = label_append

    def label(self, *args, **kwargs):
        if self.label_append is None:
            return self._label
        else:
            return '{}-{}'.format(self._label, self.label_append(*args, **kwargs))

We would call label within Calc when constructing the file names. Sticking with the above example, a Regression reduction might look something like this:

def regress(da, index):
    """
    Parameters
    ----------
    da : DataArray
        Variable to regress onto index
    index : DataArray
        1D index in time
    """
    # Compute the regression pattern ...
    return result

def regress_label(da, index):
    return 'onto-{}'.format(index.name)

Regression = Reduction(regress, 'regression', label_append=regress_label)
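To make the labeling concrete, a quick hypothetical check with toy DataArrays (assuming the label method as written above):

import xarray as xr

precip_index = xr.DataArray([0.0, 1.0], dims=['time'], name='precip_index')
ucomp = xr.DataArray([[0.0], [1.0]], dims=['time', 'lat'], name='ucomp')
print(Regression.label(ucomp, precip_index))  # -> 'regression-onto-precip_index'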

And maybe in the main script we could allow for the specification of reductions as strings (which is the standard now) for reductions that don't require input arguments, or tuples, e.g. ('regression', [precipitation_index], {}) for ones that do. E.g.

from custom_reductions import Regression
from variables import precipitation_index, ucomp, vcomp

calc_suite_specs = dict(
    ...,
    output_reductions = ['av', ('regression', [precipitation_index], {})],
    variables = [ucomp, vcomp],
    ...
)

Here the implicit assumption is that, if an aospy.Var object is passed as an argument to the output reduction, then within the pipeline it will be loaded/computed using the current data loader and load_variable parameters.

Perhaps there could be some tweaks to that design, but that maybe doesn't seem too bad, right? There is a complication though...

Complication

What if the variable needed to compute the index requires slightly different load_variable parameters than the variable we are regressing? This does come up in my use case. For instance, precipitation is always output on model native coordinates (obviously because it is a 2D field and not defined in the vertical), but I typically want to regress the horizontal winds on pressure-interpolated levels onto it. How would we specify that distinction in the main script?

This is perhaps somewhat provocative, but it would be cool if I could write something like:

from custom_reductions import Regression
from variables import precipitation_index, ucomp, vcomp

calc_suite_specs = dict(
    ...,
    output_reductions = ['av', 
        ('regression', [precipitation_index.with_load_params(dtype_in_vert='sigma')], {})],
    variables = [ucomp, vcomp],
    ...
)

Here precipitation_index.with_load_params(dtype_in_vert='sigma') would tell aospy to override the values of input_vertical_dtypes provided by the main script with the value 'sigma' whenever precipitation_index needed to be loaded. With this sort of syntax, I could also possibly see a way forward for doing calculations relative to a specific run (like computing the deviation from a common control simulation), but that's maybe a separate issue (#179).
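Purely as a sketch of that idea (nothing like with_load_params exists in aospy today), it could amount to tagging a copy of the Var with overrides for the data loader to consult:

import copy

def with_load_params(var, **overrides):
    """Hypothetical helper: return a copy of an aospy.Var carrying
    load_variable parameter overrides (e.g. dtype_in_vert='sigma')."""
    new_var = copy.copy(var)
    new_var.load_param_overrides = dict(
        getattr(var, 'load_param_overrides', {}), **overrides)
    return new_var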


I'll leave things there for now, but let me know your thoughts and if you have any questions regarding my example.

chuaxr commented 6 years ago

I think we're on the same page, here's a more precise example:

Let's say I have a 10 by 10 (x, t) grid holding the numbers 1 to 100. The 90th percentile of the whole grid is 90 (the 90th number from the start). But if we take the 90th percentile along x and then along t separately, we get 89 (position (9, 9) in the grid).
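A quick numpy version of that example, using a nearest-rank-style percentile so the numbers come out exactly as above:

import numpy as np

# 10 x 10 (x, t) grid holding the numbers 1..100, row-major.
grid = np.arange(1, 101).reshape(10, 10)

# "Stacked" reduction: the 90th-smallest of all 100 values is 90.
stacked = np.sort(grid.ravel())[89]      # -> 90

# Sequential reduction: 90th percentile along one axis, then along the other,
# lands on the value at position (9, 9) of the grid, i.e. 89.
per_row = np.sort(grid, axis=1)[:, 8]    # 9th-smallest value in each row
sequential = np.sort(per_row)[8]         # -> 89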

spencerahill commented 6 years ago

Thanks @spencerkclark for the detailed response. I understand your use case.

will the user need to import their Reduction instances in their main script in order for them to be registered in the class within aospy?

I think the answer is yes but I'll check.

This seems sort of counterintuitive if they don't end up being explicitly used in the main script.

But they only need to import the ones they're using. I.e. if a reduction isn't being used, it's irrelevant whether it's registered or not. Right?

Although I guess, cf. further below in your comment, that if we want to retain the ability to pass them in as strings, this does become a little counter-intuitive, since then you'd have to import the object even though you don't ever explicitly use it (and thus would get warnings from your linter).

Another idea: specify the module where your reductions are defined as an aospy config setting. E.g. in your main script, akin to xarray's set_options: aospy.set_options(reductions_path='/path/to/your/reductions'). Then just grab every Reduction object that's in that module's local vars.
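A rough sketch of how that could work (the reductions_path option and the Reduction registry are assumptions, not existing aospy API):

import importlib.util

def load_user_reductions(reductions_path):
    """Hypothetical: import the user's reductions module from a path and
    collect every Reduction instance defined at its top level."""
    # `Reduction` refers to the class sketched earlier in this thread.
    spec = importlib.util.spec_from_file_location('user_reductions', reductions_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {obj.label: obj for obj in vars(module).values()
            if isinstance(obj, Reduction)}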

And yet another option: drop string-based support; use the registry and require that reduction objects themselves be passed in, except perhaps for aospy's built-in reductions. Need to think more about all this.

reductions as strings (which is the standard now) for reductions that don't require input arguments, or tuples, e.g. ('regression', [precipitation_index], {}) for ones that do

Yes, I think that's a solid way of handling it. But your subsequent comment is an important one:

the implicit assumption is that, if an aospy.Var object is passed as an argument to the output reduction, then within the pipeline it will be loaded/computed using the current data loader and load_variable parameters.

I have been frequently overestimating how difficult things will be to change in aospy as of late, but nevertheless I suspect this will be quite a challenge.

What if the variable needed to compute the index requires slightly different load_variable parameters than the variable we are regressing?

For this use case, would resolving #84 be sufficient? In that case, dtype_in_vert would just be ignored anyways for the 2D variable.


I am definitely convinced that having reductions accept other variables as args/kwargs is something we should do, and hopefully soon. But I can tell this is going to be a huge change even without that. Therefore, I'm suggesting we proceed with the single argument version of Reductions at present, and I'll open Issues to track generalizing this once we've implemented it.

spencerahill commented 6 years ago

I think we're on the same page, here's a more precise example:

@chuaxr thanks for this, although I must confess I'm still not sure which page we're on!

Just to be abundantly clear: are you saying that (1) yes, the reductions you want to perform only require the data already in the Calc, or (2) no, the reductions you want to perform would require additional data that would somehow need to be passed in?

If the answer is yes, that's further motivation to implement the simple version first, and then probably do a 0.3 release, and then return to more involved cases like @spencerkclark's above.

spencerkclark commented 6 years ago

I am definitely convinced that having reductions accept other variables as args/kwargs is something we should do, and hopefully soon. But I can tell this is going to be a huge change even without that. Therefore, I'm suggesting we proceed with the single argument version of Reductions at present, and I'll open Issues to track generalizing this once we've implemented it.

By all means; since I'm the one who needs this complexity, the burden shouldn't be on you to implement it! I agree let's hash out the single argument version first and then move forward with multiple arguments.

Although I guess, cf. further below in your comment, that if we want to retain the ability to pass them in as strings, this does become a little counter-intuitive, since then you'd have to import the object even though you don't ever explicitly use it (and thus would get warnings from your linter).

Exactly, this was the concern I had in mind. As I think about it more, anything that doesn't automatically load all reductions at once somewhat defeats the purpose of the registration process. It will guarantee that no one overwrites the built-in reductions, but will not guarantee that someone won't overwrite one of their own (admittedly I don't think this is likely). For that reason, maybe I lean toward your suggested set_options approach, since it enforces that there are no name collisions among custom and built-in reductions and eliminates the unused imports.

Some more comments on multi-argument reductions/complications are below, but we can save those for future discussion.


reductions as strings (which is the standard now) for reductions that don't require input arguments, or tuples, e.g. ('regression', [precipitation_index], {}) for ones that do

Yes, I think that's a solid way of handling it. But your subsequent comment is an important one:

the implicit assumption is that, if an aospy.Var object is passed as an argument to the output reduction, then within the pipeline it will be loaded/computed using the current data loader and load_variable parameters.

I have been frequently overestimating how difficult things will be to change in aospy as of late, but nevertheless I suspect this will be quite a challenge.

I actually don't see any particular reason why it would be a major challenge 😄. We already do this for each variable specified in a computed Var's variables attribute, e.g. here: https://github.com/spencerahill/aospy/blob/9f9a1bf2a3196228aa7eae553f1b796f33666d63/aospy/calc.py#L391-L394

The trickier case is if the variable needs different load_variable parameters.

For this use case, would resolving #84 be sufficient? In that case, dtype_in_vert would just be ignored anyways for the 2D variable.

We could potentially build in an implicit solution via that route. For example, in GFDLDataLoader the dtype_in_vert parameter alters the location in the post-processing tree where we go looking for the data: https://github.com/spencerahill/aospy/blob/9f9a1bf2a3196228aa7eae553f1b796f33666d63/aospy/data_loader.py#L578-L580. I suppose for non-vertically-defined variables we could look in all possible dtype_in_vert locations?

chuaxr commented 6 years ago

@spencerahill yes, I mean (1) :)

spencerahill commented 6 years ago

Thanks @chuaxr and @spencerkclark. All sounds good. We'll continue the discussion of multi-argument reductions in #286. Meanwhile I'll respond again here and/or open a PR once I've got more to share.