scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Option to return sparse arrays from `sc.get.aggregate` #2898

Open ivirshup opened 8 months ago

ivirshup commented 8 months ago

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

@Intron7, found a use case 😆

It would be nice if `sc.get.aggregate` could return sparse matrices in cases where we don't expect the aggregated result to be very dense.

Previously discussed in:

Use cases include:

This would require both API design choices (what the argument is called) and efficient implementations for both dense and sparse results (python-graphblas could be useful here).

Intron7 commented 8 months ago

@ivirshup Do we need the computation itself to be sparse, or do we require the result to be returned in a sparse format? If it's the latter, we could consider developing a decorator that transforms the output into the desired format, as is done in cuML.
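For illustration, a minimal sketch of the decorator idea, assuming a dense aggregation function whose output is converted afterwards. The names `as_sparse_output`, `output_sparse`, and `dense_group_sum` are hypothetical and are not part of scanpy or cuML.

```python
import functools

import numpy as np
from scipy import sparse


def as_sparse_output(func):
    """Hypothetical decorator: optionally convert a dense result to CSR."""

    @functools.wraps(func)
    def wrapper(*args, output_sparse=False, **kwargs):
        result = func(*args, **kwargs)
        # The dense array has already been allocated at this point;
        # only the returned format changes.
        if output_sparse and isinstance(result, np.ndarray):
            return sparse.csr_matrix(result)
        return result

    return wrapper


@as_sparse_output
def dense_group_sum(X, labels, n_groups):
    # Hypothetical stand-in for a dense aggregation: sum the rows of X per label.
    out = np.zeros((n_groups, X.shape[1]), dtype=X.dtype)
    np.add.at(out, labels, X)
    return out


X = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1, 0])

dense = dense_group_sum(X, labels, 2)
as_csr = dense_group_sum(X, labels, 2, output_sparse=True)
```

Note that this approach only changes the return type: the full dense intermediate is still built, so peak memory is not reduced.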

ivirshup commented 8 months ago

The computation should be sparse. Otherwise I'd be fine with users converting the result themselves; the main value added here would be lower memory overhead and an optimized implementation.
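As a rough illustration of a computation that stays sparse end to end, here is a sketch of a group-wise sum using only `scipy.sparse`. The sizes and the names `obs_labels` and `indicator` are made up for this example, and this is not necessarily how `sc.get.aggregate` is implemented.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_obs, n_var, n_groups = 1_000, 200, 50

# Sparse input with 1% density and one group label per observation.
X = sparse.random(n_obs, n_var, density=0.01, format="csr", random_state=rng)
obs_labels = rng.integers(0, n_groups, size=n_obs)

# (n_groups x n_obs) indicator matrix: row g has a 1 for every observation in group g.
indicator = sparse.csr_matrix(
    (np.ones(n_obs), (obs_labels, np.arange(n_obs))),
    shape=(n_groups, n_obs),
)

# Sparse @ sparse yields a sparse result; no dense (n_groups x n_var) array is created.
agg_sum = indicator @ X
```

This works for `sum` because ordinary matrix multiplication already reduces with addition. Other reductions such as `max` need a different reduction operator, which is where a semiring-based library can help.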

ivirshup commented 8 months ago

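Here's a demo implementation of `max` using python-graphblas.

Setup:

```python
from scipy import sparse
import numpy as np

N_OBS, N_VAR = 2_000, 10_000
N_CLASSES = N_VAR - int(N_VAR / 1000)
rng = np.random.default_rng(0)

# Sparse, non-negative test matrix with 1% density
X = sparse.random(N_OBS, N_VAR, density=0.01, format="csr", random_state=rng)
# One label per variable: every class occurs at least once,
# and the remaining variables are assigned to randomly chosen classes
var_labels = np.concatenate([np.arange(N_CLASSES), rng.choice(N_CLASSES, size=N_VAR - N_CLASSES)])
```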

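Implementation:

```python
import graphblas as gb
from sklearn.preprocessing import label_binarize

# One-hot (n_vars x n_classes) indicator matrix of the variable labels
var_labels_mtx = label_binarize(var_labels, classes=np.arange(N_CLASSES), sparse_output=True)

# Matrix product over the max_times semiring: values are multiplied by the
# indicator and each group is reduced with max instead of sum
result = gb.io.to_scipy_sparse(
    gb.semiring.max_times(
        gb.io.from_scipy_sparse(X) @ gb.io.from_scipy_sparse(var_labels_mtx)
    ).new()
)
```

Test:

```python
import numpy_groupies as npg

# Dense reference: group-wise max along the variable axis
npg_result = npg.aggregate(var_labels, X.toarray(), func="max", axis=1)
np.testing.assert_array_equal(npg_result, result.toarray())
```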