pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.61k stars 1.08k forks source link

N-dimensional boolean indexing #5179

Open Hoeze opened 3 years ago

Hoeze commented 3 years ago

Currently, the docs state that boolean indexing is only possible with 1-dimensional arrays: http://xarray.pydata.org/en/stable/indexing.html

However, I often have the case where I'd like to convert a subset of an xarray to a dataframe. Usually, I would call e.g.:

data = xrds.stack(observations=["dim1", "dim2", "dim3"])
data = data.isel(~ data.missing)
df = data.to_dataframe()

However, this approach is incredibly slow and memory-demanding, since it creates a MultiIndex of every possible coordinate in the array.

Describe the solution you'd like A better approach would be to directly allow index selection with the boolean array:

data = xrds.isel(~ xrds.missing, dim="observations")
df = data.to_dataframe()

This way, it is possible to 1) Identify the resulting coordinates with np.argwhere() 2) Directly use the underlying array for fancy indexing: variable.data[mask]

Additional context I created a proof-of-concept that works for my projects: https://gist.github.com/Hoeze/c746ea1e5fef40d99997f765c48d3c0d Some important lines are those:

def core_dim_locs_from_cond(cond, new_dim_name, core_dims=None) -> List[Tuple[str, xr.DataArray]]:
    [...]
    core_dim_locs = np.argwhere(cond.data)
    if isinstance(core_dim_locs, dask.array.core.Array):
        core_dim_locs = core_dim_locs.persist().compute_chunk_sizes()

def subset_variable(variable, core_dim_locs, new_dim_name, mask=None):
    [...]
    subset = dask.array.asanyarray(variable.data)[mask]
    # force-set chunk size from known chunks
    chunk_sizes = core_dim_locs[0][1].chunks[0]
    subset._chunks = (chunk_sizes, *subset._chunks[1:])

As a result, I would expect something like this: image

max-sixty commented 3 years ago

Thanks for the issue @Hoeze . Multi-dimensional bool indexing is definitely something we'd like to add.

How does your code differ from the proposals in https://github.com/pydata/xarray/issues/1887? In a brief look through the code — thanks for supplying it — I couldn't immediately see why we need a new dimension?

Hoeze commented 3 years ago

@max-sixty The reason is that my method is basically a special case of point-wise indexing: http://xarray.pydata.org/en/stable/indexing.html#more-advanced-indexing You can get the same result by calling:

core_dim_locs = {key: value for key, value in core_dim_locs_from_cond(mask, new_dim_name="newdim")}

# pointwise selection
data.sel(
    dim_0=outliers_subset["dim_0"], 
    dim_1=outliers_subset["dim_1"],
    dim_2=outliers_subset["dim_2"]
)

(Note that you loose chunk information by this method, that's why it is less efficient)

When you want to select random items from a N-dimensional array, you can either model the result as some sparse array or by stacking the dimensions. (OK, stacking the dimensions means also a sparse COO encoding...)

max-sixty commented 3 years ago

Ah right, I see now, thanks for explaining.

Allowing pointwise indexing with bool indexes would also be welcome.

shoyer commented 3 years ago

I wonder if this is just a better proposal than making N-dimensional boolean indexing an alias for where: https://github.com/pydata/xarray/issues/1887#issuecomment-823673654

shoyer commented 3 years ago

@max-sixty and I have been having some more discussion about whether this is what ds[key] should do for N-dimensional boolean indexing over in #1887.

But regardless of what we want boolean indexing with [] to do, this would certainly be welcome functionality and should exist in a dedicated method. ds[key] is already very heavily overloaded in Xarray, so a more explicit option is nice to have, e.g., for the benefit of readability and static type checking. For the same reason, I would rather not put it inside isel() which already integer based indexing with a different call signature. My tentative suggestion is to call this new method sel_mask(), since that's what it does -- selection like sel/isel except based on a mask.

Hoeze commented 3 years ago

fyi, I updated the boolean indexing to support additional or missing dimensions: https://gist.github.com/Hoeze/96616ef9d179180b0b7de97c97e00a27 I'm using this on a 4D-array with >300GB to flatten three of the four dimensions and it works, even on 64GB of RAM.