scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

VeloVI uses np.argsort, does not work with sparse spliced/unspliced layers #2687

Closed jberkh closed 1 month ago

jberkh commented 5 months ago

Issue

The scvi.external.velovi.VELOVI class uses np.argsort on the spliced/unspliced matrices, but this returns array([0]) when these matrices are sparse.

Minimally reproducible code

import scvi
import anndata
from scipy.sparse import csr_matrix
import numpy as np

adata = anndata.AnnData(X = np.random.randint(0, 100, 10000).reshape(100, 100))
adata.layers["spliced"] = csr_matrix(adata.X.copy())
adata.layers["unspliced"] = csr_matrix(adata.X.copy())

scvi.external.velovi.VELOVI.setup_anndata(adata, "spliced", "unspliced")
model = scvi.external.velovi.VELOVI(adata)
model.train()
IndexError                                Traceback (most recent call last)
Cell In[1], line 11
      8 adata.layers["unspliced"] = csr_matrix(adata.X.copy())
     10 scvi.external.velovi.VELOVI.setup_anndata(adata, "spliced", "unspliced")
---> 11 model = scvi.external.velovi.VELOVI(adata)
     12 model.train()

File ~/miniconda3/envs/envSCVI-2/lib/python3.10/site-packages/scvi/external/velovi/_model.py:77, in VELOVI.__init__(self, adata, n_hidden, n_latent, n_layers, dropout_rate, gamma_init_data, linear_decoder, **model_kwargs)
     75 sorted_unspliced = np.argsort(unspliced, axis=0)
     76 ind = int(adata.n_obs * 0.99)
---> 77 us_upper_ind = sorted_unspliced[ind:, :]
     79 us_upper = []
     80 ms_upper = []

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Versions:

conda list | grep scvi            
scvi-tools                1.1.2              pyhd8ed1ab_0    conda-forge

(Installed with pip install git+https://github.com/scverse/scvi-tools on 4/4/24 ~12:00, as VeloVI is not available in the current PyPI/conda release.)

martinkim0 commented 5 months ago

Thanks! I'll take a look at this

martinkim0 commented 4 months ago

Just to give a quick update on this: the simplest patch is to explicitly convert the unspliced data to a dense array when it comes in as sparse, since there is no argsort equivalent for csr_matrix. I don't think this is the best solution, though, because it will require more memory than necessary.
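
For concreteness, that dense-conversion patch would look roughly like this (a sketch only; the variable names follow the traceback above rather than the actual source):

from scipy.sparse import issparse
import numpy as np

# Densify once up front so np.argsort sorts along axis 0 as intended.
if issparse(unspliced):
    unspliced = unspliced.toarray()
sorted_unspliced = np.argsort(unspliced, axis=0)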

jberkh commented 4 months ago

I figured as much, given that np.argsort by definition returns a dense result, since even the zero entries get their own (non-zero) indices. Unfortunately, calling .todense() on very large datasets does blow up memory quite a bit, possibly to the point of crashing.

Maybe a solution could be to convert the matrices to dense in batches, something along these lines:

from scipy.sparse import issparse

if issparse(unspliced):
    sorted_unspliced = np.empty(unspliced.shape, dtype=int)
    batch_size = 1000  # batch size of 1000 here
    for i in range(0, unspliced.shape[1], batch_size):
        # densify one block of genes at a time to keep peak memory bounded
        batch = unspliced[:, i : i + batch_size].toarray()
        sorted_unspliced[:, i : i + batch_size] = np.argsort(batch, axis=0)
martinkim0 commented 4 months ago

Thanks for the suggestion. I'm personally leaning towards a solution that natively deals with the sparse matrix instead of having to implicitly convert to dense. I'm also looking into what might be a potential bug in this whole preprocessing step.
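
As a rough illustration of what a sparse-native direction could look like (purely a sketch: it assumes the preprocessing only needs the top ~1% of values per gene, as the traceback suggests, and that the layers are non-negative; top_fraction_means is a hypothetical helper, not part of scvi-tools):

import numpy as np
from scipy.sparse import csc_matrix

def top_fraction_means(x, fraction=0.01):
    # Hypothetical helper: mean of the top `fraction` of values per gene (column),
    # computed from the sparse structure without densifying the whole matrix.
    x = csc_matrix(x)
    n_obs, n_vars = x.shape
    n_top = max(int(n_obs * fraction), 1)
    out = np.zeros(n_vars)
    for j in range(n_vars):
        col = x.data[x.indptr[j] : x.indptr[j + 1]]  # nonzero entries of gene j
        if col.size >= n_top:
            top = np.partition(col, col.size - n_top)[-n_top:]
        else:
            top = np.concatenate([col, np.zeros(n_top - col.size)])  # pad with implicit zeros
        out[j] = top.mean()
    return out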

canergen commented 1 month ago

I’m sorry for the long time it took us to get back. We expect users to preprocess their data with https://scvelo.readthedocs.io/en/stable/scvelo.pp.moments.html, as highlighted in our tutorial; this function returns dense matrices. Please follow our tutorial. To overcome memory constraints, it should be safe to downsample the number of cells after running the moments function, reduce the set of highly variable genes to fewer genes, or focus your analysis on a few cell types.
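
For reference, a minimal sketch of that workflow (assuming scVelo's standard preprocessing and the dense "Ms"/"Mu" layers it writes; the parameter values are illustrative, the keyword names follow the VeloVI tutorial, and adata is your AnnData object):

import scvelo as scv
import scvi

# Standard scVelo preprocessing; pp.moments writes dense "Ms"/"Mu" layers.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Point VELOVI at the dense smoothed layers instead of the raw sparse ones.
scvi.external.velovi.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
model = scvi.external.velovi.VELOVI(adata)
model.train()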