mcgibbon / sympl

A toolkit for building planetary/Earth system models in Python
http://sympl.readthedocs.io
Other
50 stars 14 forks source link

How to take care of xarray broadcasting #45

Closed JoyMonteiro closed 6 years ago

JoyMonteiro commented 6 years ago

I have run into this "issue" a couple of times, and thought we should think about it.

I have two arrays, one on interface_levels and one on mid_levels. I finite difference the first, multiply it with the second and sum the output. This is a very common operation when you are taking the pressure weighted mean of some quantity in the vertical.

The two arrays have the same shape, but the dimensions have different labels. Multiplying them gives gives an array of shape [mid_levels, interface_levels] and therefore the answer is incorrect.

I figured this out once, forgot about it, and fell for it again. This is bound to happen if we encourage users to use the DataArray wrapper to perform all calculations.

Any suggestions about how to keep these issues to a minimum?

mcgibbon commented 6 years ago

Either we encourage users to use numpy arrays for such operations, or this needs to be solved at the xarray level. Any thoughts on this @shoyer @jhamman? We could make a helper function that changes the dimension for use in climt, but people are bound both to not use it and to want slightly different numerical schemes.

I think the solution is a combination of giving helper functions and encouraging users to do their computation using numpy arrays when taking differences before wrapping back into DataArrays. We do our computation in components using numpy arrays for many reasons, including this one.

mcgibbon commented 6 years ago

For the general case of a DataArray based solution, the best we can do is replace the dimension with a null or random name, since it's impossible to automatically figure out if a given operation is a differencing operation.

shoyer commented 6 years ago

I'm not sure I understand the problem. Are there coordinates associated with mid_levels and interface_levels? If not, can you simply use rename to change the names of one of these dimensions?

To be clear, I'm also not sure that xarray is the right solution to this problem. For calculations in the inner loop of a numerical model you generally don't want the overhead of using xarray anyways. My general suggestion is to only bother with labeled dimensions/coordinates in your high level interfaces, but use NumPy arrays inside. I went into this a little bit in the xarray docs: http://xarray.pydata.org/en/stable/computation.html#wrapping-custom-computation

mcgibbon commented 6 years ago

@shoyer within our own components we strictly take numpy arrays out of DataArrays and perform computation on the numpy arrays inside. We're talking here about when users write models and perform analysis outside of the components we've written, because at that level all they see is DataArray objects unless they explicitly take out the numpy arrays on their own. We're trying to think of a way to make it impossible for the user to make this kind of mistake (performing finite difference operations with DataArray objects).

Your response again makes me feel like we should include a helper function that takes finite differences on DataArray objects, which internally uses numpy arrays, and generally tell users to perform differencing operations on numpy arrays.

mcgibbon commented 6 years ago

it sounds like (array[1:] - array[:-1]).rename({interface_levels: 'mid_levels'}) would work on DataArray objects.

JoyMonteiro commented 6 years ago

Yes, it appears as if the most straightforward way right now is to provide a helper function. However, instead of providing a finite difference helper function, which has the limitations that @mcgibbon mentioned, it might help to provide a weighted mean helper function.

It might be helpful to think of some of these functions that get used often (like weighted vertical mean, area averaged mean etc.,) that we could add.

mcgibbon commented 6 years ago

A finite difference helper function does not have the limitations I mentioned above, since it would be tied to the grid that CliMT uses. However, this would be more appropriate as a function (and issue) on CliMT, and we should move the discussion there.