Open ilan-gold opened 6 months ago
My main concern here is whether doing adata[adata.obs[...] == ...].X
will hit the optimized path for indexing if X
is backed (since we only optimized for mask indexing, and not positional)
It does not, which kind of defeats the optimization. This now needs doing.
Confirmed by adding a print(row, col)
to BaseCompressedSparseDataset.__getitem__
and running:
import anndata as ad, numpy as np, zarr
from scipy import sparse
rng = np.random.default_rng()
g = zarr.open()
X = sparse.random(10_000, 1_000, density=0.01, format="csr", random_state=rng)
ad.experimental.write_elem(g, "X", X, dataset_kwargs={"chunks": 1_000})
X_backed = ad.experimental.sparse_dataset(g["X"])
bool_idx = np.zeros(X.shape[0], dtype=bool)
bool_idx[5000:] = True
bool_idx[:1000] = True
adata = ad.AnnData(X=X_backed)
adata[bool_idx].X
[ 0 1 2 ... 9997 9998 9999] slice(None, None, None)
<6000x1000 sparse matrix of type '<class 'numpy.float64'>'
with 59999 stored elements in Compressed Sparse Row format>
slight aside: np.where
is designed for np.where(cond, vals_if_true, vals_if_false)
, using np.where(cond)[0]
as replacement for np.flatnonzero(cond)
is awkward.
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
@ilan-gold what’s missing here?
Please describe your wishes and possible alternatives to achieve the desired result.
Working on https://github.com/scverse/anndata/issues/1224, we discovered that
np.where
is called internally byscipy
on boolean masks, but there are places in our code where we also call it, likely unnecessarily. It would be good to review these places to see if they are still necessary and then see if there is an effect on performance.