scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Get errors when performing sc.pp.highly_variable_genes! #456

Closed jipeifeng closed 5 years ago

jipeifeng commented 5 years ago

I am following the workflow of 'Best-practices in single-cell RNA-seq: a tutorial' to analyze my single-cell sequencing data sets. I calculated the size factors using the scran package and skipped the batch correction step, as I have only one sample. I then tried to extract highly variable genes with the function sc.pp.highly_variable_genes. Unfortunately, I got an error:

LinAlgError: Last 2 dimensions of the array must be square

Traceback:

```pytb
LinAlgError                               Traceback (most recent call last)
 in 
----> 1 sc.pp.highly_variable_genes(adata)

~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/highly_variable_genes.py in highly_variable_genes(adata, min_disp, max_disp, min_mean, max_mean, n_top_genes, n_bins, flavor, subset, inplace)
     94     X = np.expm1(adata.X) if flavor == 'seurat' else adata.X
     95 
---> 96     mean, var = materialize_as_ndarray(_get_mean_var(X))
     97     # now actually compute the dispersion
     98     mean[mean == 0] = 1e-12  # set entries equal to zero to small value

~/miniconda3/lib/python3.6/site-packages/scanpy/preprocessing/utils.py in _get_mean_var(X)
     16         mean_sq = np.multiply(X, X).mean(axis=0)
     17         # enforece R convention (unbiased estimator) for variance
---> 18         var = (mean_sq - mean**2) * (X.shape[0]/(X.shape[0]-1))
     19     else:
     20         from sklearn.preprocessing import StandardScaler

~/miniconda3/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py in __pow__(self, other)
    226 
    227     def __pow__(self, other):
--> 228         return matrix_power(self, other)
    229 
    230     def __ipow__(self, other):

~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in matrix_power(a, n)
    600     a = asanyarray(a)
    601     _assertRankAtLeast2(a)
--> 602     _assertNdSquareness(a)
    603 
    604     try:

~/miniconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py in _assertNdSquareness(*arrays)
    213         m, n = a.shape[-2:]
    214         if m != n:
--> 215             raise LinAlgError('Last 2 dimensions of the array must be square')
    216 
    217 def _assertFinite(*arrays):
```

Versions of my modules: scanpy==1.3.7 anndata==0.6.17 numpy==1.15.4 scipy==1.2.0 pandas==0.24.0 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1

I downgraded pandas to 0.23.4, but that did not help. However, I figured out where the problem lies:

adata.X /= adata.obs['size_factors'].values[:,None]

This step converts adata.X from a sparse matrix to a numpy.matrix. Before this step, adata.X is

<6242x15065 sparse matrix of type '<class 'numpy.float32'>'
with 19234986 stored elements in Compressed Sparse Row format>

But after performing this step, adata.X looks like this:

matrix([[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0. , 1.203, ..., 0. , 0. , 0. ],
[0. , 1.096, 0. , ..., 0. , 0. , 0. ],
...,
[0. , 0. , 2.042, ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0.926, 0. , 0. ],
[0. , 0. , 2.951, ..., 0. , 0. , 0. ]]),

This format of adata.X causes the error in sc.pp.highly_variable_genes, but I don't know how to fix it.

Looking forward to your response! Thank you!
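
The error in the traceback can be reproduced without scanpy. The following is a minimal sketch, assuming only NumPy; the small array is made-up data standing in for adata.X. It shows that `**` on a numpy.matrix means matrix power, so squaring the 1 x N column means fails, while the same operation on a plain ndarray is element-wise.

```python
import numpy as np

# Made-up, non-square data standing in for adata.X (cells x genes).
dense = np.array([[0.0, 1.0, 2.0],
                  [3.0, 0.0, 4.0]])

# numpy.matrix overloads ** as matrix power, not element-wise power.
as_matrix = np.asmatrix(dense)
mean = as_matrix.mean(axis=0)  # stays 2-D: shape (1, 3)

try:
    mean ** 2  # matrix power of a 1x3 matrix is undefined
    raised = False
except np.linalg.LinAlgError:
    raised = True

# A plain ndarray squares element-wise instead.
mean_arr = np.asarray(dense).mean(axis=0)  # shape (3,)
squared = mean_arr ** 2
```

Converting with `np.asarray` (or back to a sparse format) before calling downstream functions sidesteps the matrix-power semantics.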

jipeifeng commented 5 years ago

Hi, I have fixed the issue. It appears that adding, subtracting, or dividing a scipy.sparse matrix and a numpy.ndarray returns a numpy.matrix. In my case, the command scipy_sparse_matrix /= numpy_array changed the type of the sparse matrix to numpy.matrix, which caused the downstream problems. So you have to convert the matrix back to a sparse format before downstream analysis. I ran 'adata.X = scipy.sparse.csr_matrix(adata.X)' after dividing the measured counts by the size factors. I am posting this here as a warning for anyone performing this type of operation.
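
As an alternative to converting back afterwards, the row-wise division can be written so that the data never leaves sparse format. This is a minimal sketch, not the fix used in this thread: the helper name `divide_rows_sparse` and the example values are hypothetical, and it assumes the size factors are non-zero. It scales each row by left-multiplying with a sparse diagonal matrix of reciprocal size factors.

```python
import numpy as np
import scipy.sparse as sp

def divide_rows_sparse(X, size_factors):
    """Hypothetical helper: divide each row of sparse X by its size factor.

    The result stays sparse (CSR); no dense intermediate is created.
    """
    size_factors = np.asarray(size_factors, dtype=float).ravel()
    # Left-multiplying by diag(1 / s) scales row i by 1 / s[i].
    scale = sp.diags(1.0 / size_factors)
    return sp.csr_matrix(scale @ X)

# Made-up counts (cells x genes) and size factors for illustration.
counts = sp.csr_matrix(np.array([[0.0, 2.0, 4.0],
                                 [3.0, 0.0, 6.0]]))
size_factors = np.array([2.0, 3.0])

normalized = divide_rows_sparse(counts, size_factors)
```

With an AnnData object, this would presumably be used as `adata.X = divide_rows_sparse(adata.X, adata.obs['size_factors'].values)`.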

falexwolf commented 5 years ago

Thank you for sharing this! We should indeed print warnings if numpy.matrix causes problems. I wasn't aware of it.