Option to return sparse arrays from `sc.get.aggregate`

scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.

https://scanpy.readthedocs.io

BSD 3-Clause "New" or "Revised" License


Option to return sparse arrays from `sc.get.aggregate` #2898

Open ivirshup opened 8 months ago

ivirshup commented 8 months ago

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

@Intron7, found a use case 😆

It would be nice for `sc.get.aggregate` to be able to return sparse matrices in cases where we don't expect the aggregated result to be very dense.

Previously discussed in:

https://github.com/scverse/scanpy/issues/2892

Use cases include:

- Taking the max over multiple reports of a gene (`sc.get.aggregate(adata, "probe_target", "max")`, e.g. https://discourse.scverse.org/t/merging-identical-genes-from-10x-fixed-scrna/2142)
  - (note: max is not currently implemented)
- Small aggregations, e.g. only summing neighbors

This would require both API design choices about what the argument is called, and efficient implementations for both dense and sparse results (python-graphblas could be useful here).
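As a rough illustration of what a sparse result could look like for the sum case, here is a minimal sketch using only NumPy and SciPy. The helper names `indicator_matrix` and `sparse_group_sum` are hypothetical and not part of scanpy; the idea is simply to multiply a sparse group-indicator matrix by the sparse data so that nothing is densified.

```python
import numpy as np
from scipy import sparse


def indicator_matrix(labels, n_groups):
    """Build a sparse (n_groups x n_items) matrix with a 1 where item j belongs to group i."""
    n_items = labels.shape[0]
    return sparse.csr_matrix(
        (np.ones(n_items), (labels, np.arange(n_items))),
        shape=(n_groups, n_items),
    )


def sparse_group_sum(X, obs_labels, n_groups):
    """Sum the rows of sparse X within each group; the result stays sparse."""
    return indicator_matrix(obs_labels, n_groups) @ X


# Tiny example: three observations, two groups
X_demo = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]))
labels_demo = np.array([0, 0, 1])
summed = sparse_group_sum(X_demo, labels_demo, n_groups=2)
```

This approach covers sum (and, with rescaling, mean), but an ordinary matrix product cannot express max, which is where a different semiring becomes useful.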

Intron7 commented 8 months ago

@ivirshup Do we need the computation itself to be sparse, or do we require the result to be returned in a sparse format? If it's the latter, we could consider developing a decorator that transforms the output into the desired format, as is done in cuML.
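For reference, the decorator idea could look something like the sketch below. The names `with_output_format`, `output_format`, and `group_sum` are made up for illustration and do not correspond to any existing scanpy or cuML API.

```python
import functools

import numpy as np
from scipy import sparse


def with_output_format(func):
    """Hypothetical decorator: convert the wrapped function's result to the requested format."""

    @functools.wraps(func)
    def wrapper(*args, output_format="dense", **kwargs):
        result = func(*args, **kwargs)
        if output_format == "sparse":
            return sparse.csr_matrix(result)
        if output_format == "dense":
            return result.toarray() if sparse.issparse(result) else np.asarray(result)
        raise ValueError(f"Unknown output_format: {output_format!r}")

    return wrapper


@with_output_format
def group_sum(X, labels, n_groups):
    """Dense per-group sum of the rows of X (illustrative stand-in for an aggregation)."""
    out = np.zeros((n_groups, X.shape[1]))
    np.add.at(out, labels, X)  # unbuffered add, so repeated labels accumulate
    return out


X_demo = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
labels_demo = np.array([0, 0, 1])
dense_out = group_sum(X_demo, labels_demo, 2)
sparse_out = group_sum(X_demo, labels_demo, 2, output_format="sparse")
```

Note that the dense intermediate is still allocated inside the wrapped function, so a wrapper like this changes the return type but not the peak memory use.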

ivirshup commented 8 months ago

The computation should be sparse. Otherwise I'd be fine with users doing the conversion themselves, but the main value add here would be lower memory overhead and an optimized implementation.

ivirshup commented 8 months ago

Here's a demo implementation using python-graphblas for max:

Setup:

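```python
from scipy import sparse
import numpy as np

N_OBS, N_VAR = 2_000, 10_000
# 9,990 classes for 10,000 variables, so 10 variables reuse an existing label
N_CLASSES = N_VAR - int(N_VAR / 1000)
rng = np.random.default_rng(0)

X = sparse.random(N_OBS, N_VAR, density=0.01, format="csr", random_state=rng)
# Every class appears at least once; the remaining variables get random labels
var_labels = np.concatenate(
    [np.arange(N_CLASSES), rng.choice(N_CLASSES, size=N_VAR - N_CLASSES)]
)
```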

Implementation:

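```python
import graphblas as gb
from sklearn.preprocessing import label_binarize

# Sparse one-hot (N_VAR x N_CLASSES) indicator of each variable's label
var_labels_mtx = label_binarize(var_labels, classes=np.arange(N_CLASSES), sparse_output=True)

# Matrix product under the (max, times) semiring: entries are combined with max instead of sum
result = gb.io.to_scipy_sparse(
    gb.semiring.max_times(
        gb.io.from_scipy_sparse(X) @ gb.io.from_scipy_sparse(var_labels_mtx)
    ).new()
)
```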

Test:

```python
import numpy_groupies as npg

npg_result = npg.aggregate(var_labels, X.toarray(), func="max", axis=1)
np.testing.assert_array_equal(npg_result, result.toarray())
```
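For comparison, the same stored-entry max can be sketched without python-graphblas, using only NumPy and SciPy. This is an illustrative alternative rather than a proposed implementation: the function name `sparse_group_max` is hypothetical, and the approach (sort the stored entries by (row, group) key, then reduce each run with `np.maximum.reduceat`) is swapped in for the semiring product above.

```python
import numpy as np
from scipy import sparse


def sparse_group_max(X, var_labels, n_groups):
    """Max over stored entries of sparse X within each column group; returns CSR."""
    coo = X.tocoo()
    if coo.nnz == 0:
        return sparse.csr_matrix((X.shape[0], n_groups), dtype=X.dtype)

    # One integer key per stored entry, identifying its (row, group) output cell
    groups = np.asarray(var_labels)[coo.col]
    keys = coo.row.astype(np.int64) * n_groups + groups

    # Sort so that entries landing in the same output cell are contiguous
    order = np.argsort(keys, kind="stable")
    keys_sorted = keys[order]
    data_sorted = coo.data[order]

    # Start index of each run of equal keys, then the max within each run
    starts = np.flatnonzero(np.r_[True, keys_sorted[1:] != keys_sorted[:-1]])
    maxes = np.maximum.reduceat(data_sorted, starts)
    cells = keys_sorted[starts]

    return sparse.csr_matrix(
        (maxes, (cells // n_groups, cells % n_groups)),
        shape=(X.shape[0], n_groups),
    )


X_demo = sparse.csr_matrix(
    np.array([[0.5, 0.0, 0.9, 0.0], [0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.0, 0.7]])
)
labels_demo = np.array([0, 1, 0, 1])
max_demo = sparse_group_max(X_demo, labels_demo, n_groups=2)
```

Like the semiring version, this only looks at stored entries, so it agrees with a dense max only when the data are non-negative (as with the `sparse.random` values in the setup above); with negative values, the implicit zeros would win in the dense computation.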