pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Extend xarray with custom "coordinate wrappers" #1961

Closed benbovy closed 2 years ago

benbovy commented 6 years ago

Recent and ongoing developments in xarray turn DataArray and Dataset more and more into data wrappers that are extensible at (almost) every level:

Regarding the latter, I’m thinking about extending xarray at an even more abstract level, i.e., the possibility of adding / registering "coordinate wrappers" to DataArray or Dataset objects. Basically, this would mean adding any object that can perform some operation based on one or several coordinates (I haven’t found any better name than "coordinate agent" to describe that).

(EDIT: "coordinate agents" may not be quite right here, so I changed that to "coordinate wrappers".)

Indexes are a specific case of coordinate wrappers that serve the purpose of indexing. This is built into xarray.

While indexing is enough in 80% of cases, I see a couple of use cases where other coordinate wrappers (built outside of xarray) would be nice to have:

In those examples we usually rely on coordinate attributes and/or classes that encapsulate xarray objects to implement the specific features that we need. While it works, it has limitations and I think it can be improved.

Custom coordinate wrappers would be a way of extending xarray that is very consistent with other current (or considered) extension mechanisms.

This is still a very vague idea and I’m sure that there are lots of details that can be discussed (serialization, etc.).

But before going further, I’d like to know your thoughts @pydata/xarray. Do you think it is a silly idea? Do you have in mind other use cases where custom coordinate wrappers would be useful?

benbovy commented 6 years ago

As an example, in xgcm we would have something like

>>> ds = ds_original.xgcm.generate(...)
>>> ds.xgcm.interp('var', axis='X')

instead of

>>> ds = xgcm.generate_grid_ds(ds_original, ...)
>>> grid = xgcm.Grid(ds)
>>> grid.interp(ds['var'], axis='X')

The advantage of the first example is that the information on the grid’s physical axes is bound to the Dataset object (as coordinate wrappers). We therefore don’t need to deal with an instance of another class (i.e., Grid in the second example) to perform grid operations such as interpolation along a given axis; these can instead be implemented in a Dataset accessor (i.e., Dataset.xgcm in the first example).

@rabernat I don't have much experience with xgcm so maybe this isn't a good example?
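To make the first style a bit more concrete, here is a minimal pure-Python sketch. It deliberately uses neither xarray nor xgcm: `ToyDataset`, `ToyAxis`, and `ToyGridAccessor` are hypothetical stand-ins, and the "interpolation" is just a two-point average from cell centers onto the interior cell faces, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAxis:
    """Hypothetical coordinate wrapper: ties a center and a face coordinate to one axis."""
    center: str
    face: str


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset: variables plus attached wrappers."""
    variables: dict                                # name -> (dim, list of values)
    wrappers: dict = field(default_factory=dict)   # axis name -> ToyAxis

    @property
    def grid(self):
        # Plays the role of an accessor such as Dataset.xgcm.
        return ToyGridAccessor(self)


class ToyGridAccessor:
    def __init__(self, ds):
        self._ds = ds

    def interp(self, name, axis):
        """Average neighbouring center values onto the interior faces."""
        wrapper = self._ds.wrappers[axis]
        dim, values = self._ds.variables[name]
        if dim != wrapper.center:
            raise ValueError(f"{name!r} is not defined on {wrapper.center!r}")
        faces = [(a + b) / 2 for a, b in zip(values, values[1:])]
        return wrapper.face, faces


ds = ToyDataset(
    variables={"temp": ("x_c", [1.0, 3.0, 5.0, 9.0])},
    wrappers={"X": ToyAxis(center="x_c", face="x_g")},
)
dim, faces = ds.grid.interp("temp", axis="X")
print(dim, faces)  # x_g [2.0, 4.0, 7.0]
```

The point of the sketch is only that the axis information travels with the dataset object, so the caller never has to build or pass around a separate grid instance.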

I guess we could just use Dataset attributes and/or private instance attributes in the Dataset accessor class for that, but

shoyer commented 6 years ago

This has some similarity to what we would need for a KDTreeIndex (e.g., as discussed in https://github.com/pydata/xarray/issues/1603). If we can use the same interface for both, then it would be natural to support other "derived indexes", too.

What would the proposed interface be here?
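As a rough illustration of a "derived index" built from more than one coordinate, the sketch below answers nearest-point queries over a pair of coordinates. It is a toy under stated assumptions: `ToyNearestIndex` and its `query` method are hypothetical names, it uses a brute-force scan rather than a KD-tree, and it is not tied to any xarray interface.

```python
import math


class ToyNearestIndex:
    """Hypothetical derived index over two coordinates (brute force, not a KD-tree)."""

    def __init__(self, lon, lat):
        if len(lon) != len(lat):
            raise ValueError("lon and lat must have the same length")
        self._points = list(zip(lon, lat))

    def query(self, lon, lat):
        """Return the integer position of the stored point closest to (lon, lat)."""
        return min(
            range(len(self._points)),
            key=lambda i: math.dist(self._points[i], (lon, lat)),
        )


index = ToyNearestIndex(lon=[0.0, 10.0, 20.0], lat=[50.0, 55.0, 60.0])
pos = index.query(lon=11.0, lat=54.0)
print(pos)  # 1
```

A real KD-tree would replace the linear scan in `query`; the surrounding shape (build once from several coordinates, then map a label-like query to integer positions) would stay the same.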

shoyer commented 6 years ago

I guess the common pattern for "coordinate wrappers"/"indexes" looks like:

Possible future features for coordinate wrappers:

I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.

benbovy commented 6 years ago

I don't have a full idea yet of what the interface would be, but taking the repr() in your comment and mixing it with a simplified version of an example of repr(xgcm.Grid) found in the docs, this could look like

<xarray.Dataset (exp_time: 5, x_c: 9, x_g: 9)>
Coordinates:
  * experiment  (exp_time) int64 0 0 0 1 1 
  * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
  * x_g         (x_g) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
  * x_c         (x_c) int64 1 2 3 4 5 6 7 8 9
Indexes:
    exp_time: pandas.MultiIndex[experiment, time] 
Grid axes:
    X: xgcm.Axis[x_c, x_g]

Like Dataset.indexes returns all Index objects, Dataset.xgcm.grid_axes would return all xgcm.Axis objects.

Like Dataset.sel or Dataset.set_index use/act on indexes, Dataset.xgcm.interp or Dataset.xgcm.generate_grid would use/act on grid axes.

3rd-party coordinate wrappers thus make sense only if there are accessors to handle them.

If we add an indexes argument to the Dataset and DataArray constructors, we might even think about adding **kwargs to the constructors as well for, e.g., grid_axes. But I can see that this is something we probably don't want :-).

I use xgcm here because I think it is a nice example of application. This might co-exist with other pairs of custom coordinate wrappers / accessors.

More generally, on the xarray side we would need

benbovy commented 6 years ago

Agreed with all your points @shoyer.

I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.

Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.

shoyer commented 6 years ago

Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.

This seems messier to me. I would rather stick with adding a single OrderedDict to the data model for Dataset and DataArray.

Would it be that confusing to see an xgcm grid or xarray-simlab clock listed in the repr as an "Index"? Letting third-party libraries add their own repr categories seems like possibly going too far.

benbovy commented 6 years ago

Letting third-party libraries add their own repr categories seems like possibly going too far.

Yes you're probably right.

I can imagine in the example above that Dataset.xgcm.grid_axes returns a subset of a flat collection, for convenience.
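A minimal sketch of that idea, assuming a single flat ordered mapping on the data model and a filter applied by the accessor. All names here (`ToyIndex`, `ToyAxis`, `subset`) are hypothetical and do not correspond to xarray or xgcm APIs.

```python
from collections import OrderedDict


class ToyIndex:
    """Hypothetical wrapper that supports label-based lookup."""

    def __init__(self, labels):
        self._positions = {label: i for i, label in enumerate(labels)}

    def get_loc(self, label):
        return self._positions[label]


class ToyAxis:
    """Hypothetical wrapper that only records which coordinates form an axis."""

    def __init__(self, *coords):
        self.coords = coords


# One flat, ordered mapping holds every wrapper, whatever it implements.
wrappers = OrderedDict(
    [
        ("time", ToyIndex([0.0, 0.1, 0.2])),
        ("X", ToyAxis("x_c", "x_g")),
    ]
)


def subset(collection, kind):
    """Return the entries of `collection` that are instances of `kind`, in order."""
    return OrderedDict((k, v) for k, v in collection.items() if isinstance(v, kind))


grid_axes = subset(wrappers, ToyAxis)   # what an accessor property could return
indexes = subset(wrappers, ToyIndex)
print(list(grid_axes))                  # ['X']
print(indexes["time"].get_loc(0.1))     # 1
```

With this layout the data model keeps a single collection, and any grouping by kind is a view computed on demand rather than a second level of nesting.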

It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.

benbovy commented 6 years ago

It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.

That said, as real indexes cover most of the use cases, I'd be fine if we keep calling these indexes.

stale[bot] commented 4 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

benbovy commented 2 years ago

I think we can close this issue. The flexible index refactor now provides a nice framework for the suggestions made here.