Open Hoeze opened 3 years ago
Thanks for the issue @Hoeze . Multi-dimensional bool indexing is definitely something we'd like to add.
How does your code differ from the proposals in https://github.com/pydata/xarray/issues/1887? In a brief look through the code — thanks for supplying it — I couldn't immediately see why we need a new dimension?
@max-sixty The reason is that my method is basically a special case of point-wise indexing: http://xarray.pydata.org/en/stable/indexing.html#more-advanced-indexing You can get the same result by calling:
core_dim_locs = {key: value for key, value in core_dim_locs_from_cond(mask, new_dim_name="newdim")}
# pointwise selection
data.sel(
dim_0=outliers_subset["dim_0"],
dim_1=outliers_subset["dim_1"],
dim_2=outliers_subset["dim_2"]
)
(Note that you loose chunk information by this method, that's why it is less efficient)
When you want to select random items from a N-dimensional array, you can either model the result as some sparse array or by stacking the dimensions. (OK, stacking the dimensions means also a sparse COO encoding...)
Ah right, I see now, thanks for explaining.
Allowing pointwise indexing with bool indexes would also be welcome.
I wonder if this is just a better proposal than making N-dimensional boolean indexing an alias for where
: https://github.com/pydata/xarray/issues/1887#issuecomment-823673654
@max-sixty and I have been having some more discussion about whether this is what ds[key]
should do for N-dimensional boolean indexing over in #1887.
But regardless of what we want boolean indexing with []
to do, this would certainly be welcome functionality and should exist in a dedicated method. ds[key]
is already very heavily overloaded in Xarray, so a more explicit option is nice to have, e.g., for the benefit of readability and static type checking. For the same reason, I would rather not put it inside isel()
which already integer based indexing with a different call signature. My tentative suggestion is to call this new method sel_mask()
, since that's what it does -- selection like sel
/isel
except based on a mask.
fyi, I updated the boolean indexing to support additional or missing dimensions: https://gist.github.com/Hoeze/96616ef9d179180b0b7de97c97e00a27 I'm using this on a 4D-array with >300GB to flatten three of the four dimensions and it works, even on 64GB of RAM.
Currently, the docs state that boolean indexing is only possible with 1-dimensional arrays: http://xarray.pydata.org/en/stable/indexing.html
However, I often have the case where I'd like to convert a subset of an xarray to a dataframe. Usually, I would call e.g.:
However, this approach is incredibly slow and memory-demanding, since it creates a MultiIndex of every possible coordinate in the array.
Describe the solution you'd like A better approach would be to directly allow index selection with the boolean array:
This way, it is possible to 1) Identify the resulting coordinates with
np.argwhere()
2) Directly use the underlying array for fancy indexing:variable.data[mask]
Additional context I created a proof-of-concept that works for my projects: https://gist.github.com/Hoeze/c746ea1e5fef40d99997f765c48d3c0d Some important lines are those:
As a result, I would expect something like this: