spencerahill / aospy

Python package for automated analysis and management of gridded climate data
Apache License 2.0

How to handle calculations for variables not defined in longitude #194

Open spencerahill opened 7 years ago

spencerahill commented 7 years ago

I am computing the meridional mass streamfunction for a simulation. This quantity takes the (lon, lat, vert, time) meridional wind and pressure and outputs a (lat, vert, time) streamfunction. Essentially all of the logic in region.py assumes data that is defined in both latitude and longitude, and so the calculations are crashing. (The first place is within _add_to_mask, but it's throughout the module and Region class).

So we need to think through this. It seems to me that regional averages could only be meaningfully computed for those regions that span all longitudes, e.g. the Tropics (30S-30N) or Northern Hemisphere (0-90N), but not e.g. the Sahel (10-20N, 20W-40E). So we need to compare the dimensions of the data with those of the Region.
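As a rough sketch of the kind of check I mean (the (west, east) bounds representation here is made up, not the actual Region attributes):

def spans_all_lons(lon_bounds, tol=1e-3):
    """Return True if a (west, east) longitude pair covers the full circle.

    Purely illustrative -- the real Region object may store its bounds
    differently (e.g. as multiple sub-rectangles).
    """
    west, east = lon_bounds
    width = (east - west) % 360 or 360  # treat e.g. 0-360 as a full circle
    return width >= 360 - tol

print(spans_all_lons((0, 360)))   # True: e.g. Tropics, NH
print(spans_all_lons((-20, 40)))  # False: e.g. Sahel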

This is the first time I've come across this use-case (at least in aospy's modern incarnation). @spencerkclark, I know you've analyzed zonal mean circulations some over the past year or so. Have you come across this at all?

spencerahill commented 7 years ago

Before even getting to the regional calculations, this non-zonally-defined data requires another fix: performing the zonal average within the Var's function causes sfc_area and land_mask to get dropped. A quick fix that worked for me was to add the following within Calc._compute_full_ts just above this line:

        # Re-introduce any coords that got dropped.
        if LON_STR not in full_ts:
            for coord in (SFC_AREA_STR, LAND_MASK_STR):
                if coord not in full_ts:
                    # Pull the coord from the original input data, which
                    # still carries it, and average it in longitude.
                    coord_arr = data[0][coord]
                    full_ts = full_ts.assign_coords(
                        **{coord: coord_arr.mean(dim=LON_STR)}
                    )

Essentially, this checks if the longitude dimension exists, and if not (assuming that sfc_area and land_mask have thus been dropped but did exist in the original data), it averages surface area and the land mask in longitude and re-appends them.

I think we could do better here than this quick fix, though. So let's think more carefully about how best to approach it and how to make it more general. One thing that immediately comes to mind is cases like EBMs or axisymmetric GCMs (both of which I'm working with right now outside of aospy) where the input data is not defined in longitude either, so this quick fix would fail.

spencerahill commented 7 years ago

One idea: use the def_lat, def_lon, etc. attributes of Var to check against the dimensions/coords of the output in Calc._compute.
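Roughly something along these lines (just a sketch; the mapping from def_* flags to dim names and the exact hook point in Calc._compute are hand-waved):

# Sketch only: the dim names assumed here are illustrative, not taken from
# aospy's actual internal names.
DIM_FOR_FLAG = {'def_time': 'time', 'def_vert': 'pfull',
                'def_lat': 'lat', 'def_lon': 'lon'}

def expected_dims(var):
    """Dims a Var's output may retain, per its def_* attributes."""
    return {dim for flag, dim in DIM_FOR_FLAG.items()
            if getattr(var, flag, False)}

def check_output_dims(var, arr):
    """Raise if the computed DataArray has dims the Var says it shouldn't."""
    extra = set(arr.dims) - expected_dims(var)
    if extra:
        raise ValueError("Unexpected dims {} in output of Var '{}'".format(
            extra, var.name))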

spencerkclark commented 7 years ago

@spencerahill sorry for taking so long to get back to you on this. To date I've done all my analysis on zonal mean data outside of aospy (also after the time mean). It requires a lot of comparison between simulations [which aospy, as it's currently designed, is also not directly useful for (it's great for generating the data for this analysis, though)].

These are my quick thoughts on the discussion you started above.

Region objects: For determining which Region objects would be applicable for particular Var objects, I think your idea to use the existing def* attributes to denote which dimensions remain after applying calculation functions could be a nice explicit solution. Hopefully it wouldn't be too hard to programmatically determine whether a Region is defined in longitude (e.g. your Sahel example) based on the current method of user input (even for complex regions). Otherwise we might want to consider something more explicit.

Coordinate averaging: It feels a little messy to try to predict exactly what the user would want to do with grid attributes under the hood (e.g. automatically re-appending average versions of them over dropped dimensions). In particular, how would we know whether the dimension was reduced out (causing coordinates to be dropped) or whether those dimensions never existed (as in longitude for the EBMs and axisymmetric GCMs)? Perhaps we could do something more explicit (i.e. give users the tools to solve this problem themselves within their own object libraries)? My ideas aren't particularly well-formed on this, but perhaps an alternative solution could be to use xarray's accessor capability to create some helper functions that act on DataArrays for this purpose?

For instance one could define a coord_preserving_mean function, which would automatically promote all non-index coordinates to variables in a Dataset, take a mean over the specified dimensions, and revert those coordinates back to coordinates within a returned DataArray. Then, for example, wherever one took zonal means in their object libraries, one could use this function rather than xarray's, and coordinates would be preserved.

import xarray as xr

@xr.register_dataarray_accessor('aospy')
class AospyAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def coord_preserving_mean(self, *args, **kwargs):
        """Take a mean while retaining all non-index coordinates."""
        name = self._obj.name
        # Promote non-index coords (e.g. sfc_area) to data variables so that
        # the mean doesn't drop them.
        ds = self._obj.reset_coords()
        names = set(self._obj.coords) - set(self._obj.dims)
        ds = ds.mean(*args, **kwargs)
        # Demote them back to coords and return the original DataArray.
        return ds.set_coords(names)[name]

For example, this would work like:

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: a = xr.DataArray(np.ones((5, 6)), coords=[np.arange(5), np.arange(6)], 
                         dims=['a', 'b'], name='test')

In [4]: a['test_coord'] = xr.DataArray(np.ones((5, 6)), coords=a.coords)

In [5]: a.aospy.coord_preserving_mean('b')

Out [5]:
<xarray.DataArray 'test' (a: 5)>
array([ 1.,  1.,  1.,  1.,  1.])
Coordinates:
  * a           (a) int64 0 1 2 3 4
    test_coord  (a) float64 1.0 1.0 1.0 1.0 1.0

In [6]: a.mean('b')

Out [6]: 
<xarray.DataArray 'test' (a: 5)>
array([ 1.,  1.,  1.,  1.,  1.])
Coordinates:
  * a        (a) int64 0 1 2 3 4

My gut instinct is to push against adding more logic to the main Calc pipeline, but I could also see how this alternative solution might be pushing additional complexity onto the users. What are your thoughts?

spencerahill commented 7 years ago

@spencerkclark thanks a lot for this...apologies; I also might take a while to respond substantively

spencerahill commented 7 years ago

It requires a lot of comparison between simulations [which aospy, as it's currently designed, is also not directly useful for

Side issue, but what specifically do you mean here? Room for improvement on the aospy side?

It feels a little messy to try to predict exactly what the user would want to do with grid attributes under the hood

I agree.


Very cool example of a custom accessor. I must admit I didn't even really understand them until this!

this alternative solution might be pushing additional complexity onto the users

This concerns me too. Let's see what emerges from https://github.com/pydata/xarray/issues/1497 and then we can revisit this.

spencerkclark commented 7 years ago

Side issue, but what specifically do you mean here? Room for improvement on the aospy side?

Possibly yes -- I have some code that I use to systematically load data produced by aospy, which is particularly helpful for setting the stage for this kind of analysis. For instance, it can be convenient to have a Dataset of a collection of variables indexed by a run name (in addition to the usual grid dimensions/coordinates).

E.g. something like:

<xarray.Dataset>
Dimensions:      (run: 3, lat: 64, lon: 128, pfull: 30)
Coordinates:
  * run          (run) object 'run_a' 'run_b' 'run_c'
  * lon          (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
    sfc_area     (lat, lon) float64 3.553e+09 3.553e+09 3.553e+09 3.553e+09 ...
    land_mask    (lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
  * lat          (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 ...
  * pfull        (pfull) float64 3.385 10.78 14.5 19.3 25.44 33.2 42.9 54.88 ...
Data variables:
    var_a        (run, lat, lon) float64 5.845e-06 ...
    var_b        (run, lat) float64 -6.344e+13 ...
    var_c        (run, pfull, lat, lon) float64 0.00162 ...
    var_d        (run, lat, lon) float64 150.1 ...

If you have some parameters in simulations that you tweak in a systematic way (my current use-case), it can alternatively be even more helpful to have those as coordinates (rather than run names).

E.g. something like:

<xarray.Dataset>
Dimensions:      (param_a: 2, lat: 64, param_b: 2, lon: 128, param_c: 4, pfull: 30, param_d: 4)
Coordinates:
  * param_a      (param_a) int64 0 1
  * param_b      (param_b) object 'c' 'd'
  * param_c      (param_c) int64 1 2 3 4
  * param_d      (param_d) int64 5 6 7 8
  * lon          (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
    sfc_area     (lat, lon) float64 3.553e+09 3.553e+09 3.553e+09 3.553e+09 ...
    land_mask    (lat, lon) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
  * lat          (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 ...
  * pfull        (pfull) float64 3.385 10.78 14.5 19.3 25.44 33.2 42.9 54.88 ...
Data variables:
    var_a        (param_a, param_b, param_c, param_d, lat, lon) float64 5.845e-06 ...
    var_b        (param_a, param_b, param_c, param_d, lat) float64 -6.344e+13 ...
    var_c        (param_a, param_b, param_c, param_d, pfull, lat, lon) float64 0.00162 ...
    var_d        (param_a, param_b, param_c, param_d, lat, lon) float64 150.1 ...

Then if I wanted to look at differences in 'var_a' between cases with param_a = 1 and param_a = 0, holding all other parameters fixed, I could simply write: ds['var_a'].sel(param_a=1) - ds['var_a'].sel(param_a=0).

The code I use now to construct these essentially just recreates the expected file names of aospy output, loads one Dataset at a time, assigns appropriate coordinates, and then does some merging to put things all in a single Dataset. It could be nice if we added some utility functions to aospy that could make things like this more straightforward, but I agree, off-topic in this issue.
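For reference, the core of that construction is roughly the following (the paths and file names here are made up; the real code reconstructs aospy's actual output paths):

import pandas as pd
import xarray as xr

runs = ['run_a', 'run_b', 'run_c']
# Hypothetical layout of aospy's saved output, one file per run and variable.
paths = ['path/to/aospy_out/{}/var_a.ann.nc'.format(run) for run in runs]

# Load each run's output and stack it along a new 'run' dimension.
var_a = xr.concat([xr.open_dataset(p) for p in paths],
                  dim=pd.Index(runs, name='run'))

# Repeating this per variable and merging gives the Dataset shown above.
# For a parameter sweep, using something like a pandas.MultiIndex of
# parameter values as the concat dimension and then unstacking gives one
# dimension per parameter.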

darothen commented 7 years ago

@spencerkclark my experiment tool does exactly what you describe in that post. It constructs everything using dask arrays so that calculations are deferred. Earlier in the year I was playing around with ways to extend the NetCDF data model to handle the case where one of the "param" or "run" dimensions was a dataset with different lat/lon dimensions (e.g. a different model participating in CMIP5), and I've scoped out solutions which lean on hierarchical Groups inside a NetCDF file to do so, but it's pretty rough-shod.
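As a bare-bones illustration of the Groups side of that (toy grids and variable names, just to show the mechanics xarray already exposes via group=):

import numpy as np
import xarray as xr

# Toy "models" on different grids, each written to its own NetCDF group.
models = [('model_a', 64), ('model_b', 96)]
for i, (model, nlat) in enumerate(models):
    ds = xr.Dataset(
        {'tas': (('lat',), np.random.rand(nlat))},
        coords={'lat': np.linspace(-90, 90, nlat)},
    )
    # Create the file on the first write, then append subsequent groups.
    ds.to_netcdf('cmip_toy.nc', group=model, mode='w' if i == 0 else 'a')

# Each model is then read back on its own native grid.
model_a = xr.open_dataset('cmip_toy.nc', group='model_a')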

spencerkclark commented 7 years ago

@darothen indeed your experiment tool caught my eye a few weeks ago and I've been meaning to try it out. Maybe later this week!

spencerahill commented 7 years ago

Thanks @spencerkclark for fleshing that out, and @darothen for sharing experiment. I agree, looks like a great tool.


it can be convenient to have a Dataset of a collection of variables indexed by a run name

If you have some parameters in simulations that you tweak in a systematic way (my current use-case), it can alternatively be even more helpful to have those as coordinates (rather than run names)

The code I use now to construct these essentially just recreates the expected file names of aospy output, loads one Dataset at a time, assigns appropriate coordinates, and then does some merging to put things all in a single Dataset.

Now I understand...in fact I also have hacked together similar functionality for use on one of my projects where my simulations are a sweep over two independent parameters (although it's entirely outside of aospy). Clearly it's a common use case :)


I was playing around with ways to extend the NetCDF data model to handle the case where one of the "param" or "run" dimensions was a dataset with different lat/lon dimensions (e.g. a different model participating in CMIP5)

Is this related to https://github.com/pydata/xarray/issues/1482 ? I'd be interested to hear more about this...CMIP-style comparisons are obviously a huge use case and what partly motivated aospy in the first place. aospy shines in looping over models with different grids to repeat calculations across them, but it doesn't have a data structure like you're describing (or even data loading methods) that makes subsequent interaction with those results especially easy.

Do you have an issue in your repo open on this topic where we could discuss further? On the one hand I could see a place for something like this within aospy. On the other hand, I like the idea of it as a more general purpose tool, which seems to be what you've started in on. Also, TBH I am struggling to find the time lately even to fix our existing bugs, let alone add new features!

Last thought: such a tool seems like a great candidate for an official pangeo-data project...


In terms of the actual subject of this Issue: I still need to think about it! 😛

spencerkclark commented 7 years ago

I was playing around with ways to extend the NetCDF data model to handle the case where one of the "param" or "run" dimensions was a dataset with different lat/lon dimensions (e.g. a different model participating in CMIP5)

I'm also very interested in this, and would like to continue the discussion in the appropriate place. @spencerahill see also: https://github.com/pydata/xarray/issues/1092

darothen commented 7 years ago

Sorry to derail your thread! Started a new issue over at darothen/experiment#18 if you want to brainstorm on the Groups idea.

spencerahill commented 7 years ago

Excellent! Thanks Dan.

jdossgollin commented 6 years ago

Flagging this issue -- I would also be interested in things like zonal means.