Open benbovy opened 2 years ago
Following thoughts and discussions in various issues (e.g., #6836), I'd like to suggest another section to the ones in the top comment:
pandas.MultiIndex special cases in Xarray
- pandas.MultiIndex objects as dimension + level coordinates, e.g., like in xr.Dataset(coords={"x": pd_midx})
- but instead treat it as a single duck-array.
- pandas.MultiIndex as dim argument in xarray.concat() (#7148)
- obj.to_index() for all xarray objects?
- Dataset.reset_index() and DataArray.reset_index()
They are a source of many problems and complexities in Xarray internals (many regressions reported since the index refactor were related to those special cases) and I'm not sure that the value they add is really worth the trouble. Also, in the long term the special treatment of PandasMultiIndex vs. other Xarray multi-indexes may add some confusion.
Some of those features are widely used (e.g., the creation of Dataset / DataArray from pandas multi-indexes is used in many places in unit tests), so we would need convenient alternatives and a smooth transition.
Yes yes -- the sooner we can get rid of MultiIndex special cases the better!
Any progress on this? I'd love to see #2233 get resolved.
#5692 is now merged and we can start thinking about the next steps. I’m opening this issue to list and track the remaining tasks. @pydata/xarray, do not hesitate to add a comment below if you think of something that is missing here.
Continue the refactoring of the internals
Although in #5692 everything seems to work with the current pandas index wrappers for dimension coordinates, not all of Xarray's internals have been refactored yet to fully support (or at least be compatible with) custom indexes. Here is a list of Dataset / DataArray methods that still need to be checked / updated (this list may be incomplete):
- as_numpy (#8001)
- broadcast (#6430, #6481)
- drop_sel (#6605, #7699)
- drop_isel
- drop_dims
- drop_duplicates (#8499)
- transpose
- interpolate_na
- ffill
- bfill
- reduce
- map
- apply
- quantile
- rank
- integrate
- cumulative_integrate
- filter_by_attrs
- idxmin
- idxmax
- argmin
- argmax
- concat (partially refactored, may not fully work with multi-dimension indexes)
- polyfit
I ended up following a common pattern in #5692 when adding explicit / flexible index support for various features (it is quite generic, though, the actual procedure may vary from one case to another and many steps may be skipped):
- Index base class. There may be several motivations:
- PandasIndex or PandasMultiIndex wrapper classes for clarity and also if eventually we want to make Xarray less dependent on Pandas)
- Variable’s corresponding method for speed-up or for other reasons, e.g.,
- IndexVariable.concat exists to avoid unnecessary Pandas/Numpy conversions; in #5692 PandasIndex.concat has the same logic and will fully replace the former if/once we get rid of IndexVariable
- PandasIndex.roll reuses pandas.Index indexing and append capabilities
- Index API closely follows DataArray, Dataset and Variable API (i.e., same method names) for consistency
- Index API (if it exists) to create new indexes
- Indexes class (i.e., the .xindexes property returns an instance of this class) provides convenient API for iterating through indexes (e.g., get a list of unique indexes, get all coordinates or dimensions for a given index, etc.)
- Index API, either raise an error or fall back to calling the Variable API (below) depending on the case
- Index.create_variables
- Index.create_variables; it is used to propagate variable metadata (dtype, attrs and encoding)
- Variable API (if it exists)
- filter_indexes_from_coords and assert_no_index_corrupted
- _replace, _replace_with_new_dims or _overwrite_indexes methods
Relax all constraints related to “dimension (index) coordinates” in Xarray
#7989
Indexes repr
- Indexes section to Dataset and DataArray reprs #6795, #7185
- Indexes (i.e., .xindexes property) consistent with the repr of Coordinates (.coords property)
- Index._repr_inline_ for tweaking the inline representation of each index shown in the reprs above #7183
Public API for assigning and (re)setting indexes
There is no public API yet for creating and/or assigning existing indexes to Dataset and DataArray objects.
- indexes parameter in Dataset and DataArray constructors
- data, data_vars or coords arguments in favor of a more explicit way to pass it. #6392, #7214, #7368
- set_xindex and drop_indexes methods #6849, #6971
- set_index and reset_index? See https://github.com/pydata/xarray/issues/4366#issuecomment-920458966
We still need to figure out how best we can (1) assign existing indexes (possibly with their coordinates) and (2) pass index build options.
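To make the two open questions above more concrete, here is a toy model in plain Python. It does not use Xarray at all: `ToyIndex`, `ToyDataset`, `assign_index` and the `tolerance` option are hypothetical stand-ins, and only the method names `set_xindex` and `drop_indexes` are borrowed from the list above. It is a sketch of one possible shape under those assumptions, not a description of how Xarray does or should implement it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index built from one or more coordinates."""

    coord_names: tuple
    options: dict = field(default_factory=dict)

    @classmethod
    def from_variables(cls, variables, *, options):
        # Build an index from a {name: values} mapping plus build options.
        return cls(coord_names=tuple(variables), options=dict(options))


class ToyDataset:
    """Hypothetical container: coordinates plus a coordinate-name -> index map."""

    def __init__(self, coords):
        self.coords = dict(coords)
        self.xindexes = {}

    def set_xindex(self, coord_names, index_cls=ToyIndex, **options):
        # (2) Build options travel as keyword arguments to the index class.
        variables = {name: self.coords[name] for name in coord_names}
        index = index_cls.from_variables(variables, options=options)
        return self.assign_index(index, variables)

    def assign_index(self, index, variables):
        # (1) Assign an existing index together with its coordinates.
        new = ToyDataset({**self.coords, **variables})
        new.xindexes = {**self.xindexes, **{name: index for name in variables}}
        return new

    def drop_indexes(self, coord_names):
        # Remove the index entries but keep the coordinates themselves.
        new = ToyDataset(self.coords)
        new.xindexes = {
            name: idx for name, idx in self.xindexes.items() if name not in coord_names
        }
        return new


ds = ToyDataset({"lat": [10.0, 20.0], "lon": [0.0, 5.0]})
indexed = ds.set_xindex(["lat", "lon"], tolerance=0.5)
shared = indexed.xindexes["lat"]
unindexed = indexed.drop_indexes(["lat", "lon"])
```

In this toy version a single index object is shared by both coordinates, and every method returns a new container instead of mutating the original, so dropping the index leaves the coordinates in place.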
Other public API for index-based operations
To fully leverage the power and flexibility of custom indexes, we might want to update some parts of Xarray’s public API in order to allow passing arbitrary options per index. For example:
- sel: the current method and tolerance may not be relevant for all indexes, pass extra arguments to SciPy's cKDTree.query, etc. #7099
- align: #2217
Also:
- Indexes API as it provides convenient methods that might be useful for end-users
- Index base class into Xarray’s main namespace (i.e., xr.Index)? Also PandasIndex and PandasMultiIndex? The latter may be useful if we deprecate set_index(append=True) and/or if we deprecate “unpacking” pandas.MultiIndex objects to coordinates when given as coords in the Dataset / DataArray constructors.
Documentation
- Indexes API
- Index API: #6975
Index types and helper classes built in Xarray
- Index abstract subclass that would basically dispatch the given arguments to the corresponding, encapsulated PandasIndex instances and then merge the results #7182
- PandasMultiIndex dimension coordinate?
3rd party indexes