swolock / scrublet

Detect doublets in single-cell RNA-seq data
MIT License

Memory Error when running scrublet #18

Closed (yeroslaviz closed this issue 4 years ago)

yeroslaviz commented 4 years ago

Hi, I'm getting the following error when trying to run Scrublet on my file:

>>> doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, 
...                                                           min_cells=3, 
...                                                           min_gene_variability_pctl=85, 
...                                                           n_prin_comps=30)
Preprocessing...
/home/scrublet/helper_functions.py:321: RuntimeWarning: divide by zero encountered in true_divide
  w.setdiag(float(target_total) / tots_use)
/home/scrublet/helper_functions.py:252: RuntimeWarning: invalid value encountered in sqrt
  CV_input = np.sqrt(b);
Simulating doublets...
/home/scrublet/helper_functions.py:321: RuntimeWarning: divide by zero encountered in true_divide
  w.setdiag(float(target_total) / tots_use)
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/scrublet/scrublet.py", line 224, in scrub_doublets
    pipeline_zscore(self)
  File "/home/scrublet/helper_functions.py", line 65, in pipeline_zscore
    self._E_sim_norm = np.array(sparse_zscore(self._E_sim_norm, gene_means, gene_stdevs))
  File "/home/scrublet/helper_functions.py", line 173, in sparse_zscore
    return sparse_multiply((E - gene_mean).T, 1/gene_stdev).T
  File "/home/scrublet/helper_functions.py", line 164, in sparse_multiply
    return w * E
  File "/home/scipy/sparse/base.py", line 518, in __mul__
    result = self._mul_multivector(np.asarray(other))
  File "/home/scipy/sparse/base.py", line 536, in _mul_multivector
    return self.tocsr()._mul_multivector(other)
  File "/home/scipy/sparse/compressed.py", line 485, in _mul_multivector
    dtype=upcast_char(self.dtype.char, other.dtype.char))
MemoryError: Unable to allocate 167. GiB for an array with shape (1651, 13589760) and data type float64
>>> 

The tool was run within a conda environment (in case that makes any difference).

My data set has a counts matrix of shape 6794880 rows x 31053 columns, and the gene list contains 31053 genes.

Is there a way to deal with this problem?

thanks

swolock commented 4 years ago

Hey @yeroslaviz, looks like you have more cells than your machine can handle. Do you really have 6,794,880 cells? If so, are they all from the same sample/similar samples?
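
For reference, the size in the MemoryError can be reproduced from the numbers in the report. The sketch below assumes the failing array is a dense float64 matrix of shape (genes kept after filtering) x (simulated doublets), as the traceback's shape (1651, 13589760) suggests, and that the simulated-doublet count is twice the cell count (this matches 13589760 = 2 x 6794880; the factor of 2 is assumed here to be Scrublet's default `sim_doublet_ratio`). The variable names are illustrative only.

```python
# Back-of-the-envelope check of the allocation reported in the MemoryError.
n_cells = 6_794_880          # rows of the counts matrix, from the report
sim_doublet_ratio = 2        # assumption: simulated doublets per observed cell
n_genes_kept = 1651          # first dimension of the array in the error message
bytes_per_float64 = 8

n_sim = n_cells * sim_doublet_ratio
bytes_needed = n_genes_kept * n_sim * bytes_per_float64
gib_needed = bytes_needed / 2**30

print(f"simulated doublets: {n_sim}")
print(f"dense array size:   {gib_needed:.1f} GiB")
```

Because the array is dense, its size grows linearly with the number of barcodes passed in, so reducing the number of rows is the most direct way to bring it down.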

yeroslaviz commented 4 years ago

I had tried to run it on the raw data, which is of course a huge matrix. That is why it ran out of memory.
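
One possible way to shrink the input, sketched below, is to drop barcodes with very few total counts before handing the matrix to Scrublet. This assumes "raw data" means the unfiltered barcode matrix (cells as rows, genes as columns); the divide-by-zero warnings in the traceback would be consistent with barcodes that have zero total counts. The helper name `drop_low_count_barcodes` and the threshold are hypothetical and not part of Scrublet; a proper cell-calling step from the upstream pipeline is likely a better choice where available.

```python
import numpy as np
import scipy.sparse as sp


def drop_low_count_barcodes(counts, min_total_counts):
    """Keep only rows (barcodes) whose total count reaches the threshold.

    Hypothetical helper for illustration; the threshold is data dependent.
    Returns the filtered CSR matrix and the boolean mask of kept rows.
    """
    counts = sp.csr_matrix(counts)
    totals = np.asarray(counts.sum(axis=1)).ravel()
    keep = totals >= min_total_counts
    return counts[np.flatnonzero(keep)], keep


# Tiny toy matrix: 4 barcodes x 3 genes, two of them nearly empty.
toy = sp.csr_matrix(np.array([
    [0, 0, 0],
    [5, 3, 2],
    [1, 0, 0],
    [4, 4, 4],
]))
filtered, keep = drop_low_count_barcodes(toy, min_total_counts=5)
print(filtered.shape, keep)
```

The filtered matrix can then be passed to Scrublet in place of the raw one; the mask makes it possible to map the resulting doublet scores back to the original barcodes.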