Closed benbovy closed 2 years ago
As an example, in xgcm we would have something like
>>> ds = ds_original.xgcm.generate(...)
>>> ds.xgcm.interp('var', axis='X')
instead of
>>> ds = xgcm.generate_grid_ds(ds_original, ...)
>>> grid = xgcm.Grid(ds)
>>> grid.interp(ds['var'], axis='X')  # ds.var would resolve to the Dataset.var method
The advantage in the first example is that the information on the grid’s physical axes is bound to a Dataset object (as coordinate wrappers), so we don’t need to deal with any instance of another class (i.e., Grid in the second example) to perform grid operations like interpolation on a given axis. Such operations can instead be implemented in a Dataset accessor (i.e., Dataset.xgcm in the first example).
@rabernat I don't have much experience with xgcm so maybe this isn't a good example?
I guess we could just use Dataset attributes and/or private instance attributes in the Dataset accessor class for that, but
This has some similarity to what we would need for a KDTreeIndex (e.g., as discussed in https://github.com/pydata/xarray/issues/1603). If we can use the same interface for both, then it would be natural to support other "derived indexes", too.
What would the proposed interface be here?
I guess the common pattern for "coordinate wrappers"/"indexes" looks like:
Possible future features for coordinate wrappers:
I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.
I don't have a full idea yet of what the interface would be, but taking the repr() in your comment and mixing it with a simplified version of an example of repr(xgcm.Grid) found in the docs, this could look like
<xarray.Dataset (exp_time: 5, x_c: 9, x_g: 9)>
Coordinates:
* experiment (exp_time) int64 0 0 0 1 1
* time (exp_time) float64 0.0 0.1 0.2 0.0 0.15
* x_g (x_g) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
* x_c (x_c) int64 1 2 3 4 5 6 7 8 9
Indexes:
exp_time: pandas.MultiIndex[experiment, time]
Grid axes:
X: xgcm.Axis[x_c, x_g]
Like Dataset.indexes returns all Index objects, Dataset.xgcm.grid_axes would return all xgcm.Axis objects.
Like Dataset.sel or Dataset.set_index use/act on indexes, Dataset.xgcm.interp or Dataset.xgcm.generate_grid would use/act on grid axes.
3rd-party coordinate wrappers thus make sense only if there are accessors to handle them.
If we add an indexes argument in Dataset and DataArray constructors, we might even think about adding **kwargs as well in the constructors for, e.g., grid_axes. But I can see it is something that we probably don't want :-).
I use xgcm here because I think it is a nice example of application. This might co-exist with other pairs of custom coordinate wrappers / accessors.
More generally, on the xarray side we would need (1) a way for Dataset or DataArray objects to have coordinate wrappers bound to them, and (2) an AbstractCoordinateWrapper class that would provide a unified interface for dealing with issues of serialization, etc.
Agreed with all your points @shoyer.
> I'm open to other names, but my inclination would be to still call all of these indexes, even if they don't actually implement indexing.
Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.
> Except here where, instead of a flat collection of coordinate wrappers, I was rather thinking about a 1-level nested collection that separates them depending on what they implement. Indexes would represent one of these sub-collections.
This seems messier to me. I would rather stick with adding a single OrderedDict to the data model for Dataset and DataArray.
Would it be that confusing to see an xgcm grid or xarray-simlab clock listed in the repr as an "Index"? Letting third-party libraries add their own repr categories seems like possibly going too far.
> Letting third-party libraries add their own repr categories seems like possibly going too far.
Yes you're probably right.
I can imagine in the example above that Dataset.xgcm.grid_axes returns a subset of a flat collection, for convenience.
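The "subset of a flat collection" idea could be sketched roughly as follows. This is only an illustration in plain Python, not xarray or xgcm code: the CoordinateWrapper, Index and Axis classes and the subset() helper are hypothetical names made up for the sketch, and the single OrderedDict stands in for the one that would be added to the data model.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class CoordinateWrapper(ABC):
    """Hypothetical base class: an object built from one or more coordinates."""

    def __init__(self, *coord_names):
        self.coord_names = tuple(coord_names)

    @abstractmethod
    def describe(self):
        """Return the short text that a repr could show for this wrapper."""


class Index(CoordinateWrapper):
    """Stand-in for a wrapper that implements indexing."""

    def describe(self):
        return "Index[" + ", ".join(self.coord_names) + "]"


class Axis(CoordinateWrapper):
    """Stand-in for a third-party wrapper such as a grid axis."""

    def describe(self):
        return "Axis[" + ", ".join(self.coord_names) + "]"


def subset(wrappers, kind):
    """Return the wrappers of a given type, preserving insertion order."""
    return OrderedDict(
        (name, w) for name, w in wrappers.items() if isinstance(w, kind)
    )


# One flat, ordered collection holds every wrapper, whatever it implements.
wrappers = OrderedDict()
wrappers["exp_time"] = Index("experiment", "time")
wrappers["X"] = Axis("x_c", "x_g")

# An accessor property like grid_axes could simply filter that collection.
grid_axes = subset(wrappers, Axis)
```

With this layout the data model stays flat, and any per-library view (here, grid_axes) is a convenience computed on demand rather than a separate stored category.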
It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.
> It is just that the name "Index" feels a bit wrong to me in this case, and also that xgcm.Axis (and potentially other wrappers) can do things very different than Index classes, which may be confusing.
That said, as real indexes cover most of the use cases, I'd be fine if we keep calling these indexes.
I think we can close this issue. The flexible index refactor now provides a nice framework for the suggestions made here.
Recent and ongoing developments in xarray turn DataArray and Dataset more and more into data wrappers that are extensible at (almost) every level:
DataStore interface
Regarding the latter, I’m thinking about the idea of extending xarray at an even more abstract level, i.e., the possibility of adding / registering "coordinate wrappers" to DataArray or Dataset objects. Basically, it would correspond to adding any object that allows performing some operation based on one or several coordinates (I haven’t found any better name than "coordinate agent" to describe that). (EDIT: "coordinate agents" may not be quite right here, I changed that to "coordinate wrappers")
Indexes are a specific case of coordinate wrappers that serve the purpose of indexing. This is built in xarray.
While indexing is enough in 80% of cases, I see a couple of use cases where other coordinate wrappers (built outside of xarray) would be nice to have:
In those examples we usually rely on coordinate attributes and/or classes that encapsulate xarray objects to implement the specific features that we need. While it works, it has limitations and I think it can be improved.
Custom coordinate wrappers would be a way of extending xarray that is very consistent with other current (or considered) extension mechanisms.
This is still a very vague idea and I’m sure that there are lots of details that can be discussed (serialization, etc.).
But before going further, I’d like to know your thoughts @pydata/xarray. Do you think it is a silly idea? Do you have in mind other use cases where custom coordinate wrappers would be useful?