pygeode / pygeode

Python package for dealing with large gridded datasets
http://pygeode.github.io/
Other
11 stars 8 forks source link

removal of the View class #9

Open neishm opened 9 years ago

neishm commented 9 years ago

Original issue 9 created by neishm on 2011-06-23T19:09:38.000Z:

Background

Right now, we have a special class for defining which subset of data to retrieve, called a View. Views contain both a list of axes, and a list of corresponding slices / indices that define the subset of data we're interested in. This was designed with the intent of abstracting out the details of the order of dimensions and to allow mapping to different domains.

For instance, we can define a particular View on an output variable (say, pressure-interpolated ozone), and this View provides a mechanism for automagically determining the corresponding domains needed for the inputs (ozone on model levels, and pressure on model levels).

In theory, this is a really useful interface to have. We can mix & match different input domains, and the View can usually figure out what to do with no further programming needed (e.g., if our surface pressure was defined on a 6h timestep, but our output field was only available every 18h, the View would figure out how to map the data request from the 18h domain into the 6h domain).

In practice, though, this has turned out to have some flaws.

Problems

-In some cases, we want the input vars to conform to the same shape as the output var, in other cases we want the input shape to be preserved. This means we need a special flag to alter the way the mapping is done. There's also a special flag for strictly asserting that we're mapping to all the input axes, because this was an annoying bug in the early days of development. To be honest, I don't even remember the exact usage of the conform/strict flags, and I think there's still occasionally problems with them.

-Similarly, in many cases it's impossible to do a direct mapping between inputs and output (e.g. when an axis is being transformed). To handle this, we have to use either add_axis / remove / replace_axis / modify_slice / unslice / only_slice to alter the View to suit our inputs. This is a common occurrence in any advanced operator that changes axes around or needs different input domains (deriv, interp, intgr, smooth, composite, and many, many more). We basically lose the benefit of the View abstraction, because we have to coerce the axes anyway.

-The code is fragile. In order to work its magic, the View compares every input and output axis, to see if it's possible for one to map to the other. If a mapping is possible, then the requested slice on the output axis is contorted to fit into the domain of the input axis. This is no small feat - it was a gruelling effort of programming-by-permutation on some test slices until it 'worked'. Moreover, the code will fail in a fundamental way if we have, for instance, 2 output axes of the same type (perhaps we're defining an averaging kernel or a covariance matrix). In that instance, the mapping mechanism will have no way to know which slice is supposed to map to which input, because it just doesn't have enough information to do its job.

-The code is hard to follow. The data goes through so many contortions in a single View.get() that it's near impossible to verify the correctness of the routine just by looking at it. If a young, doe-eyed developer were interested in learning our program, this kind of mess might scare them away indefinitely. I know there are times it scares me.

-The mechanism is too abstract. It's designed to figure out how to relate a set of output axes to an arbitrary set of input axes. In our use of the View (in getview() methods), we don't have arbitrary inputs - we know exactly what we have. We know exactly how the input axes are supposed to map to the outputs, because we've already had to figure that out in the init method! Right now, Var operators go through the following perverse steps: 1) at init time, look at all the input axes, and figure out what the output axes of our operator should be. 2) at the end of init, completely forget all context on how our axes were chosen. 3) when our getview() is invoked, tell the View object to get our inputs for us, somehow. 4) the View then has to work its magic to reverse-engineer how the input and output axes are related, and figure out what input slices are needed. In the best case scenario, this involves duplication of the work done in init. In the worst case, the wrong magic spell is used and we don't get the data we thought we'd get.

Proposed Solution

I say we scrap the View class entirely. It has served its purpose well in the past (as a stepping stone in developing on-the-fly data operators), but it's starting to stick out like a sore thumb.

I propose the following approach to replace the functionality of View: In the init method of operators, we wrap the input vars so they have a consistent axis order, and so they also map nicely to the output axes. By wrapping our inputs, we eliminate most of the magic required from View. We know what our inputs are (since we've asserted control over them), so we know at runtime how to pass slices to them.

For example, in UfuncVar.init, wrap all the input vars so they all have the same axes - add dummy axes if needed, and perform transposes to get a consistent order among all inputs. This step doesn't have to be hard-coded in ufunc, it could be put into a separate helper function (say, conform_all()) that other operators could use as well. In this case, our output axes would now be the same as all our input axes, and our get....() routine can simply pass the requested slices directly to the inputs!

For a more complicated example, let's look at a hypothetical vertical-interpolation routine (I'm going to ignore our more general interpolator for the sake of brevity). We have 3 inputs to our init - data on model levels, a pressure field on model levels, and a 1D array of pressure values to interpolate to. Since our low-level code works best when we have contiguous vertical columns, transpose the input data so that model levels are the fastest-varying (rightmost) axis. Similarly, conform the pressure field so it has the same order of axes as our data - we could use the same conform_all() logic as in ufunc for this. To form our output axes, take the axes from our inputs, and replace the rightmost axis with the 1D pressure axis. There's no ambiguity about our axes anymore - we know exactly where to find what we need. Now, suppose our get....() method is invoked, requesting some particular slices of data. All we need to do is scrape off the rightmost slice (pressure levels), and replace it with a slice(None) to select all model levels. We don't need any special magic to map our slices, because we've asserted control over the inputs at an earlier stage. We can now invoke the get() routines on our input data and pressure field - no muss, no fuss. All that's left is the actual interpolation (but who cares about that).

There will be a small number of core operations (transpose, squeeze, extend, slice) that will need to be properly implemented as a result of these changes. At the moment, they each have a degenerate getview() that invokes the magic of View to do the dirty work for them, which is kind of like asking Superman to open a jar of pickles. Instead, each of these 'core' operators would take ownership of their operation, and do the work themselves. This will make the behaviour of fundamental operators like SlicedVar much easier to follow in the code!

There are some utilities within the view module that we'll want to keep (simplify, expand, loop_mem), but once the View references are stripped from them, they could be rolled into the 'tools' module.

Plan

It would be impossible to remove the View class in one fell swoop, since it touches damn near every other module. It would have to be done in stages (with adequate sanity checks in between):

1) Implement alternate versions of the above-mentioned core operations, which don't rely on View magic. Test these against the old operators, to make sure they work. Once they're fully tested, replace the old operators with them.

2) In the Var class, create a new method called, say, 'getstuff', which takes as arguments a list of slices and a progress bar (pbar). At this stage, it will be a simple stub that makes a View object from the slices, and invokes its own self.getview().

3) Create the above-mentioned conform_all() wrapper, and any other useful wrappers for manipulating input vars. Check that they actually work.

4) In the init() of every PyGeode operator, start using these wrappers to assert control over the input variables (e.g., make the axes consistent).

5) As (4) progresses, change the corresponding getview() command to skip the View-related stuff. Even though we still get a View as an input parameter, just pull out view.slices and use that directly. If (4) was done properly, we should need only trivial changes to get input slices from our output slices. We can then invoke invar.getstuff(slices) directly (instead of view.get(invar)). At the end, we should have no more explcit calls to view.get (or view.map_to).

6) Once all references to View.get() have been removed, we can transition each variable's getview() command into a getstuff() command (overriding the stub in the Var class).

7) Once there are no more getview() implementations left, we can remove the getstuff() stub in the Var class (since it should now be overridden in every single Var subclass).

8) Move any useful utility method from the view module into tools.

9) Delete the View class

Notes

Of course, this is probably not a complete plan, and there'll be a few snags along the way.

We don't have to use 'getstuff', it's just a placeholder for a more cromulent name. I didn't want to use 'getslice' because we already have get_slice() in the Axis class.

This proposal does not address some other known shortcomings in the getview() method, such as no caching of data and not allowing multiple outputs for an operator. That kind of stuff will be a separate proposal. I figured the View class would be an easier(!) thing to fix first, and my hope is the getview (/getstuff?) implementations will become simple enough that we can start addressing these other concerns.

Any comments / suggestions would be appreciated.

Thanks, Mike

neishm commented 9 years ago

Comment #1 originally posted by aph42 on 2011-06-30T08:17:39.000Z:

I agree with the issue that the code is hard to follow; the calls back and forth between view.get and var.get are quite convoluted, and I think there's a lot of code that's there to support deprecated modes of accessing variable data.

Coming up with some defined way of dealing with multiple axes of the same type, and generalizing the mapping between axes are also areas for improvement.

Beyond that, however, I'm not convinced the other `issues' are problems. The 'duplication' of work between init and getview isn't one as far as I can see

I go back and forth on the approach you've suggested. While in principle i like the idea of making use of the elementary variable operations, i'm not really convinced it will simplify the code, either in writing it or in debugging it - for one I can see the variable stack growing quite a bit bigger if each operation needs three layers between it and its inputs in order to properly shape the input data. I'm not really clear on how one would write the interpolation class in your example - the whole problem is that there isn't a simple map between the input and the output axis - that's what the interpolation has to do. Mapping a slice in the interpolated space to a slice in the original space isn't something a generic slice variable knows how to do; that's one part of what a variable's getview() is all about, and hence all the helper operations for modifying slices in the view class. Maybe some more specific pseudo-code would help?

It's also not clear how this would help with two major features we've spoken about: parallelization (and related to that the optimization of the chain of calculation), and the joint calculation of multiple variables (i.e. computing regression coefficients and p-values at the same time, etc.). If we're going to tackle such a major refactoring, we should at least think about those two issues.

neishm commented 9 years ago

Comment #2 originally posted by neishm on 2011-07-18T18:24:12.000Z:

Maybe some more specific pseudo-code would help?

Sure thing. I hope this is somewhat legible.

Here's an interpolator that uses Views to do the mapping:

class Vertical_Interpolation (Var):

-- Init routine --

indata - input data on model levels

pfield - pressure field on model levels

paxis - output pressure axis

def init (self, indata, pfield, paxis):

# << make model levels the fastest varying axis for indata, pfield >>

# Determine the output axes
outaxes = list(indata.axes)[:-1] + [paxis]

self.indata = indata
self.pfield = pfield

Var.__init__(self, outaxes, name=indata.name)

def getview (self, view, pbar=None):

Modify our view to work on the input axes (switch paxis to model levels)

inview = view.replace_axis(self.naxis-1, self.invar.axes[-1])
# Read input data and pressure field for the requested sub-domain
indata = inview.get(self.indata)
pfield = inview.get(self.pfield)

# << manipulate indata/pfield so they have the same order of axes >>

# << do interpolation >>
# << apply any paxis slicing >>

This implementation isn't too complicated, but there's a lot going on in the calls to inview.get(). Every invocation triggers a call to inview.map_to(), which in turn has to loop over every input/output axis and determine which ones match. This happens for every single chunk of data requested. While it's not that much CPU overhead relative to the numerical operations themselves, it seems to me like a brute-force approach to the problem.

Here's the version that avoids using Views:

class Vertical_Interpolation (Var):

-- Init routine --

indata - input data on model levels

pfield - pressure field on model levels

paxis - output pressure axis

def init (self, indata, pfield, paxis):

# Conform indata and pfield so they have the same order of axes
indata, pfield = conform_all (indata, pfield)

# << make model levels the fastest varying axis for indata, pfield >>

# Determine the output axes
outaxes = list(indata.axes)[:-1] + [paxis]

self.indata = indata
self.pfield = pfield

def getstuff (self, slices, pbar=None):

Replace the paxis slice argument with slice(None) to get all model levels

inslice = list(slices)[:-1] + [slice(None)]
# Read input data and pressure field for the requested sub-domain.
indata = indata.getstuff(slices)
pfield = pfield.getstuff(slices)

# << do interpolation >>
# << apply any paxis slicing >>

In init here, we make sure the inputs have the same axes, in the same order. The output axes are also the same, except for the last axis. This takes care of part of the work that was done in view.map_to - determining the order of the input/output axes between the mapping. This ordering is handled once - at init time - instead of repeatedly for each chunk like in the first class definition.

Let's look at the stack. Before, we had alternating Var.getview()/View.get() calls: Interp.getview -> inview.get -> indata.getview -> ...

Now, we get a bit more of a homogeneous stack: Interp.getstuff -> indata.getstuff -> original_indata.getstuff -> ...

We have to handle the wrapping of the original inputs, so our stack is still about the same size as before. However, we're not doing as much work in between each layer, because each mapping is explicitly defined. The mapping from our outputs to our wrapped inputs involves un-slicing the last axis. I'm presuming that the conform_all wrapper would keep track of any ordering changes it applies, and so the mapping in that stage would involve transposing the list of slices by that predetermined order. There is no need here to loop over all input/output axes and reverse-engineer the mapping at runtime.

Suppose our input data has multiple axes of the same type, i.e. maybe we have a set of covariance matrices defined for each model level, and want to interpolate that to a set of covariance matrices on pressure levels. I don't know why you'd want to do that, but you could with the second example. Views will crap out on any inputs that have repeated axes, so they'd need to undergo some serious re-engineering to handle that kind of stuff.

neishm commented 9 years ago

Comment #3 originally posted by a.r.erler on 2011-07-22T00:39:11.000Z:

I find the changes Mike is proposing quite attractive, though I don't have a good sense of how much work it is... sounds like a lot! Getting rid of View would make the call stack A LOT clearer to newcomers; I still find it very confusing, and I can easily see people being scared away by such a messy call stack, that looks like circular references going nowhere, until they magically after many iterations spit out a result. And I actually already ran into the covariance-matrix-problem Mike described: I wanted to average covariance matrices over several time steps (you can do that if you know how many elements are in each, in order to get the global covariance matrix without re-computation - I do this sort of thing quite often). I finally got it to work by fudging the axis instances and the conform/Strict flags, but to this day I don't understand why and how it first didn't and then worked. At the moment I'm in favor of Mikes proposal, but of course we have to think it through.

neishm commented 9 years ago

Comment #4 originally posted by neishm on 2012-05-29T21:25:00.000Z:

This is still an open issue, but I have no plans to try out my previously suggested changes anytime soon. I agree with Peter that if we're going to change the 'View' interface, then we might as well address all the outstanding issues like parallelization, intelligent caching/re-use of computed values, and multi-variable operators. I currently have no idea how to do this in PyGeode.

neishm commented 9 years ago

Comment #5 originally posted by aph42 on 2013-11-06T15:47:10.000Z:

Set target for release 1.0; probably some should not be included.

neishm commented 9 years ago

Comment #6 originally posted by aph42 on 2013-11-07T15:03:40.000Z:

<empty>