Open ivirshup opened 8 months ago
@ivirshup Do we need the computation itself to be sparse, or do we require the result to be returned in a sparse format? If it's the latter, we could consider developing a decorator that transforms the output into the desired format, as is done in cuML.
The computation should be sparse. Otherwise I'd be fine with the user doing it themselves, but the main value add here would be lower memory overhead and an optimized implementation.
Here's a demo implementation using python-graphblas for max:
Setup:
from scipy import sparse
import numpy as np
N_OBS, N_VAR = 2_000, 10_000
N_CLASSES = N_VAR - int(N_VAR / 1000)
rng = np.random.default_rng(0)
X = sparse.random(N_OBS, N_VAR, density=0.01, format="csr", random_state=rng)
var_labels = np.concatenate([np.arange(N_CLASSES), rng.choice(N_CLASSES, size=N_VAR - N_CLASSES)])
Implementation:
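import graphblas as gb
from sklearn.preprocessing import label_binarize
# One-hot indicator matrix of shape (N_VAR, N_CLASSES): row i marks the class of variable i
var_labels_mtx = label_binarize(var_labels, classes=np.arange(N_CLASSES), sparse_output=True)
# max_times semiring: multiply matching entries, then reduce with max instead of sum
result = gb.io.to_scipy_sparse(
    gb.semiring.max_times(
        gb.io.from_scipy_sparse(X) @ gb.io.from_scipy_sparse(var_labels_mtx)
    ).new()
)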
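For comparison, here is a minimal sketch of the same grouped max using only NumPy and SciPy, which could serve as a reference when checking a sparse implementation. `grouped_max_columns` is a hypothetical helper name, not part of scanpy or python-graphblas. As far as I understand the semiring approach, both versions take the max over stored entries only, so they agree with a dense max only when the data is non-negative (as count data usually is).

```python
import numpy as np
from scipy import sparse


def grouped_max_columns(X, labels, n_classes):
    """Max over stored entries of all columns sharing a label (hypothetical helper).

    X: scipy sparse matrix of shape (n_obs, n_var)
    labels: integer array of length n_var with values in [0, n_classes)
    Returns a CSR matrix of shape (n_obs, n_classes).
    """
    coo = sparse.coo_matrix(X)
    if coo.nnz == 0:
        return sparse.csr_matrix((coo.shape[0], n_classes), dtype=coo.dtype)
    rows = coo.row
    groups = np.asarray(labels)[coo.col]  # map each column index to its class
    data = coo.data
    # Sort by (row, group) so entries landing in the same output cell are contiguous
    order = np.lexsort((groups, rows))
    rows, groups, data = rows[order], groups[order], data[order]
    # Mark the first entry of every (row, group) run
    new_run = np.ones(data.size, dtype=bool)
    new_run[1:] = (rows[1:] != rows[:-1]) | (groups[1:] != groups[:-1])
    starts = np.flatnonzero(new_run)
    # Reduce each run with max instead of the sum that a plain matmul would give
    maxima = np.maximum.reduceat(data, starts)
    return sparse.csr_matrix(
        (maxima, (rows[starts], groups[starts])), shape=(coo.shape[0], n_classes)
    )


# Tiny usage example: columns 0 and 2 belong to class 0, column 1 to class 1
X_small = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0], [0.0, 2.0, 0.0]]))
small_result = grouped_max_columns(X_small, np.array([0, 1, 0]), 2)
print(small_result.toarray())  # [[3. 0.] [0. 2.]]
```

The sort makes this O(nnz log nnz) and it never materializes a dense array, but it is only a sketch: it has not been benchmarked against the python-graphblas version above.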
What kind of feature would you like to request?
Additional function parameters / changed functionality / changed defaults?
Please describe your wishes
@Intron7, found a use case 😆
It could be nice for sc.get.aggregate to be able to return sparse matrices where we don't expect the aggregation to return very dense data.
Previously discussed in:
Use cases include:
max for multiple reports of a gene (sc.get.aggregate(adata, "probe_target", "max"), e.g. https://discourse.scverse.org/t/merging-identical-genes-from-10x-fixed-scrna/2142)
This would require both API design choices for what the argument is called, and efficient implementations for both dense and sparse results (python-graphblas could be useful here).