jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License

Memory error with scSplit #22

Closed by ahmadsam66 2 years ago

ahmadsam66 commented 2 years ago

Hello

I get the error below. I used samtools instead of freebayes.

Do you have any idea how many SNPs we should have in order to run scSplit count? Can I filter the samtools VCF file to reduce the number of SNPs?

Kind regards, Ahmad

Traceback (most recent call last):
  File "/home/c97000131/apps/scSplit/scSplit", line 699, in <module>
    scSplit()
  File "/home/c97000131/apps/scSplit/scSplit", line 357, in __init__
    getattr(self, args.command)()
  File "/home/c97000131/apps/scSplit/scSplit", line 457, in count
    base_calls_mtx = mixed_VCF().build_base_calls_matrix(args.bam, filtered_vcf, barcodes, args.tag, args.out)
  File "/home/c97000131/apps/scSplit/scSplit", line 38, in build_base_calls_matrix
    ref_base_calls_mtx = pd.DataFrame(0, index=filtered_vcf.index, columns=barcodes, dtype=np.int16)
  File "/home/c97000131/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 761, in __init__
    copy,
  File "/home/c97000131/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1877, in construct_2d_arraylike_from_scalar
    return np.full(shape, arr)
  File "/home/c97000131/miniconda3/lib/python3.7/site-packages/numpy/core/numeric.py", line 343, in full
    a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 315. GiB for an array with shape (24897, 6794880) and data type int16
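
For reference, the figure in the last line of the traceback follows directly from the array shape and dtype: the row count is the number of filtered SNVs (`index=filtered_vcf.index`), the column count is the number of barcodes (`columns=barcodes`), and `int16` takes 2 bytes per entry. A minimal sketch of that arithmetic in plain Python follows; the helper name `dense_matrix_gib` is made up for illustration and is not part of scSplit.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_entry=2):
    """Size in GiB of a dense n_rows x n_cols array (2 bytes per entry for int16)."""
    return n_rows * n_cols * bytes_per_entry / 2**30


# Shape reported in the MemoryError above: 24,897 SNVs x 6,794,880 barcodes.
print(f"{dense_matrix_gib(24897, 6794880):.1f} GiB")  # 315.1 GiB
```

The estimate covers a single matrix only (the allocation that failed was for `ref_base_calls_mtx`); any further matrices built in the same step would add to it. Because the size scales with the product of both dimensions, the length of the barcode list matters as much as the number of SNVs.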

jon-xu commented 2 years ago

Hi Ahmad, thanks for your interest in using our tool. As you can see in the documentation:

  1. Building allele count matrices
  ...
  d) This step could be memory consuming, if the number of SNVs and/or cells are high. As a guideline, building matrices for 60,000 SNVs and 10,000 cells might need more than 30GB RAM to run, please allow enough RAM resource for running the script.
  ...
  f) Typical number of filtered SNVs after this step is usually between 10,000 and 30,000.
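
On the earlier question of reducing the number of SNPs: a VCF can be thinned before running `scSplit count` by dropping low-quality records. The sketch below is one possible way to do that in plain Python. The function name `filter_vcf_by_qual` and the threshold of 30 are assumptions for illustration, not values taken from this thread; the code relies only on the standard VCF layout, in which QUAL is the sixth tab-separated column.

```python
def filter_vcf_by_qual(lines, min_qual=30.0):
    """Yield header lines unchanged and only records with QUAL >= min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        qual = fields[5]  # QUAL is the 6th column of a VCF record
        if qual != "." and float(qual) >= min_qual:
            yield line


# Tiny in-memory example with made-up records.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t55.0\tPASS\t.\n",
    "1\t200\t.\tC\tT\t12.3\tPASS\t.\n",
    "1\t300\t.\tG\tA\t.\tPASS\t.\n",
]
kept = list(filter_vcf_by_qual(example))
n_records = sum(1 for line in kept if not line.startswith("#"))
```

For a real file, the same generator can be fed an open file handle and its output written to a new VCF line by line, so the whole file never has to be held in memory.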

XFWuCN commented 8 months ago

Hello, I have encountered a similar error, but my server still has 47 GB of available memory, and the error message indicates that only 4.86 GB of memory is needed. There is also a FutureWarning related to data types. [Screenshots: pic1, pic2, RAM]

jon-xu commented 8 months ago

Sorry, I'm not sure about the memory allocation issue in numpy. Also, we don't plan to keep up with updates to newer versions of the dependent packages.