Open JulienBrn opened 10 months ago
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
FYI pd.RangeIndex
is worth looking at as prior art
FYI
pd.RangeIndex
is worth looking at as prior art
Thanks for the advice. While it seems that on the pandas side pd.RangeIndex does not take up any space, the code on the xarray side makes me believe it will be computed into its full array (see the code for create_variables, defined here ). Thus, I still do not know how to implement create_variables...
Another question that I had not realized is whether the netcdf file format allows one to store a RegularIndex without saving the full array. At worse, I could pickle for now...
This is a good example of Xarray index that may be worth mentioning in https://github.com/pydata/xarray/discussions/7041.
In the current version of the Xarray data model every index must have at least one corresponding coordinate. Changing the coordinate(s) or drop them after setting their index may invalidate the latter so indeed it either drops the index or raises an error.
That said, in theory you could create a lazy coordinate, for example using dask.array.arange()
. This way the coordinate data won't be loaded in memory until explicitly calling Dataset / DataArray .compute()
or .load()
.
I say "in theory" because in practice we still need to refactor some parts of Xarray to make it work well, see https://github.com/pydata/xarray/pull/8124.
Another question that I had not realized is whether the netcdf file format allows one to store a RegularIndex without saving the full array. At worse, I could pickle for now...
Not sure about that but another option for serialization would be to store the start / stop / step values as attributes of a scalar coordinate.
Actually, if pd.RangeIndex
doesn't store the whole array I think it is possible to directly subclass xarray.indexes.PandasIndex
and implement the extra logic in .from_variables()
to create a pd.RangeIndex
instance before passing it to the constructor.
xarray.indexes.PandasIndex.__init__
won't convert the input array if it is already a pd.Index
, it will just make a shallow copy of the pandas range index..create_variables()
: it will already put the pd.RangeIndex
in a lightweight wrapper for use as coordinate variable data, so normally it won't consume more memory than the wrapped pandas range index.Caveat: it is currently not possible to set a new index without giving the name of an existing coordinate in .set_xindex
. Creating a range index from coordinate labels fully loaded in memory is not ideal and possibly expensive (validation). That said, I think a possible trick is to implement the range index's .from_variables()
method so that it takes a scalar coordinate with "start", "stop" and "step" attributes, like said above. When setting the index, this scalar coordinate will be converted into an indexed 1-d coordinate (you'd need to provide the dimension name as index build option, though).
Prototype of a RangeIndex
based on a pandas.RangeIndex
: notebook
It doesn't really do much apart from supporting multiple ways to set the index from different kinds of coordinates. The rest is delegated to the wrapped pandas.RangeIndex
. It is already possible to use a pandas.RangeIndex
via the default xarray.indexes.PandasIndex
, actually.
This doesn't fully solve @JulienBrn 's problem since pandas.RangeIndex
is limited to integer values, but I guess we could make the Xarray RangeIndex
wrapper (partially) work with float steps by introducing a scale_factor
option and use it appropriately in RangeIndex.sel
, RangeIndex.equal
, RangeIndex.join
, etc. For proper coordinate representation we would also need to subclass xarray.core.indexing.PandasIndexingAdapter
so that it is aware of the scale factor... However the latter is not public API. Ideally, support for floats should be implemented in pandas: https://github.com/pandas-dev/pandas/issues/46484.
This looks great @benbovy . I think a RangeIndex
is useful as a pedagogical example of one of the simplest types of realistically-useful index.
I am closing this issue at the notebook provides a good base on how to do it.
However, I am still concerned about the requirement of having an actual array in the coordinates and overall the design behind dimension coordinates and indexes (what is responsible for what, especially in a xr.merge operation?). Perhaps this should be continued here
However, I am still concerned about the requirement of having an actual array in the coordinates and overall the design behind dimension coordinates and indexes (what is responsible for what, especially in a xr.merge operation?). Perhaps this should be continued https://github.com/pydata/xarray/discussions/8955
This concern seems very similar to what we discussed recently in https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2025423042
Is your feature request related to a problem?
Most of my dimension coordinates fall into three categories:
start + np.arange(n)/fs
for some start, fsI feel the way the latter is currently handled in xarray is suboptimal (unless I'm misusing this great library) as it has the following drawbacks:
step=np.diff(a)[0], assert (np.abs(np.diff(a)-step))<epsilon).all(), fs=1/step
Note: It is not obvious for me from the documentation whether this is more of a "coordinate" enhancement or an "index" enhancement (index being to my knowledge discussed only in this part of the documentation ).
Describe the solution you'd like
A new type of index/coordinate where only the "start" and "fs" are stored. The _repr_inline may look like "RegularIndex(start, end, step=1/fs)". Perhaps another more generic possibility would be a type of coordinate system that is expressed as a transform from
np.arange(s, e)
by the bijective function f (with the inverse of f also provided).RegularIndex(start, end, fs)
would then be an instance withf = lambda x: x/fs, inv(f) = lambda y: y*fs, s=round(start*fs), e = round(end*fs)+1
The advantage of this approach is that joins/alignment/selection/... could be handled generically on thenp.arange(s, e)
and this would also work on non linear spaces (for example log spaces)Describe alternatives you've considered
I have tried writing an Index subclass but I struggle on the
create_variables
method. If I do not return a coordinate for the current dimension, thena.set_xindex(["t"], RegularIndex)
keeps the previous coordinates and if I do, then I need to provide a Variable from the np.array that I do not want to create (for memory efficiency). I have tried to drop the coordinate after setting my custom index, but that seems to remove the index as well...There may be many other problems as I have just quickly tried. Should this be a viable approach I may be open to writing a version myself and post it for review. However, I am relatively new to xarray and I would appreciate to first know if I am on the right track.
Additional context
No response