Refactor oceanspy subsample module to use custom xarray indexes

rabernat commented 2 years ago

Current Status

An important part of oceanspy's feature space is concerned with providing custom ways to select data from Xarray datasets. These are enumerated on the API docs. The relevant functions are

function	description
cutout(od[, varList, YRange, XRange, ...])	Cutout the original dataset in space and time preserving the original grid structure.
mooring_array(od, Ymoor, Xmoor, **kwargs)	Extract a mooring array section following the grid.
particle_properties(od, times, Ypart, Xpart, ...)	Extract Eulerian properties of particles using nearest-neighbor interpolation.
survey_stations(od, Ysurv, Xsurv[, delta, ...])	Extract survey stations.

Oceanspy needed to implement these functions because xarray's built in indexers (e.g. .sel() or .interp()) were not capable of performing the required operations in the case of curvilinear grids.

Ongoing Xarray Refactor

Supported by a CZI grant, the Xarray team has been hard at work on the so-called Flexible Indexes Refactor

Xarray currently keeps track of indexes associated with coordinates by storing them in the form of a pandas.Index in special xarray.IndexVariable objects.

The limitations of this model became clear with the addition of pandas.MultiIndex support in xarray 0.9, where a single index corresponds to multiple xarray variables. MultiIndex support is highly useful, but xarray now has numerous special cases to check for MultiIndex levels.

A cleaner model would be to elevate indexes to an explicit part of xarray’s data model, e.g., as attributes on the Dataset and DataArray classes. Indexes would need to be propagated along with coordinates in xarray operations, but will no longer would need to have a one-to-one correspondance with coordinate variables. Instead, an index should be able to refer to multiple (possibly multidimensional) coordinates that define it. See GH 1603 for full details

Specific tasks:

Add an indexes attribute to xarray.Dataset and xarray.Dataset, as dictionaries that map from coordinate names to xarray index objects.

Use the new index interface to write wrappers for pandas.Index, pandas.MultiIndex and scipy.spatial.KDTree.

Expose the interface externally to allow third-party libraries to implement custom indexing routines, e.g., for geospatial look-ups on the surface of the Earth.

In addition to the new features it directly enables, this clean up will allow xarray to more easily implement some long-awaited features that build upon indexing, such as groupby operations with multiple variables.

Additional information about the refactor can be found at:

Once the PR 5692 is merged, this feature should be useable for development purposes.

Proposal: Refactor Oceanspy subsample function to be custom Xarray indexes

The whole point of this refactor ("allow third-party libraries to implement custom indexing routines") is to enable projects like OceanSpy to bring their own concepts of indexing directly to xarray datasets. So I thought I would propose we do exactly that. The steps would look something like this.

[ ] We experiment with the flexible indexes API and try to learn how it works using simple prototypes
[ ] We translate the subsample functions one by one to into third-party xarray indexes
[ ] Consider separating these indexes into a standalone package which provides Xarray entrypoints, such that the indexes can be used independently from oceanspy
[ ] Refactor oceanspy by deleting code, reducing the future maintenance burden for oceanspy developers 🎉

Pros

This is a more module design with better separation of concerns between elements
The indexes can be used more widely on all manner of datasets, not just those loaded by oceanspy
Potentially less code to maintain in oceanspy itself

Cons

The work involved in the refactoring
Probably others I can't think of

asiddi24 commented 2 years ago

Thanks for opening an issue, @rabernat ! Does seem like an exciting proposal and a great way to keep maintaining oceanspy in the long run. Will get to this after ocean sciences. Till then, I'll wait to hear what @malmans2 and @Mikejmnez have to say.

rabernat commented 2 years ago

Thanks for the enthusiasm @asiddi24! It's great that you see this as exciting. I see it as more of an unglamorous backend maintenance task that oceanspy users may not even notice...but will ultimately lead to better performance and maintainability.

malmans2 commented 2 years ago

I think they're all good suggestions! OceanSpy definitely needs maintenance/refactoring. I've another couple of suggestions:

Make use of xoak to perform nearest neighbor interpolations and extract stations/moorings/floats. However, it might be that xoak will get superseded by xarray refactoring. I'm not up to speed with the ongoing xarray refactor, but xoak has been working great for me so far.
OceanSpy naming convention is currently based on MITgcm conventions. I think using cf_xarray under the hood would make OceanSpy much more robust and easy to use (especially for users that are not familiar with MITgcm).

rabernat commented 2 years ago

The xoak creator (@benbovy) is also the one leading the xarray index refactor. So I imagine they will converge in some way. Maybe xoak will just provide the index objects themselves, which xarray can then use? Benoit, we would love to hear about your plans for xoak (and get your general feedback on this issue).

benbovy commented 2 years ago

Thanks for pinging me @rabernat.

Maybe xoak will just provide the index objects themselves, which xarray can then use?

Yes that's the plan with Xoak. I think it will still be useful to provide Dataset / DataArray accessors, for example to expose Xarray-compatible low-level API like an .xoak.query() method to get the indices and distances of/to the nearest neighbors.

While I've not looked much into Oceanspy and this a bit outside of my domain of expertise, the subsample functions seem good uses cases for experimenting with Xarray custom indexes, which at this stage would also be really helpful for the Xarray index refactor itself as I'm sure there's still much room for improvement!

Consider separating these indexes into a standalone package which provides Xarray entrypoints, such that the indexes can be used independently from oceanspy

Make use of xoak to perform nearest neighbor interpolations and extract stations/moorings/floats.

Those are sensible points. I think that in the mid/long term it will be better for the ecosystem if we can avoid a jungle of Xarray indexes with lots of overlapping features.

In https://github.com/pydata/xarray/pull/5692 we require that matching indexes for alignment (merge, etc.) must have exactly the same type, which limits interoperability between indexes but makes the implementation much simpler. We might eventually support some kind of "duck" indexes, but it's a considerably harder problem.

hainegroup / oceanspy