single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Uninformative MemoryError message #28

Open Zepeng-Mu opened 3 years ago

Zepeng-Mu commented 3 years ago

Hi, when running vireo I encountered this error:

numpy.core._exceptions.MemoryError: Unable to allocate 178. MiB for an array with shape (1939760, 4, 3) and data type float64

This happens even when I use 100GB memory on a cluster. After inspecting my code carefully, I found that my VCF file only has 2 individuals, but I specified 4 in the vireo command. After fixing this, there's no error even with just 50GB memory. So I think this error happens when --nDonor does not match the number of donors in the VCF file. If this is the case, the current error message is not very informative.
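For context, a quick back-of-the-envelope check (stdlib arithmetic only, using just the shape and dtype from the error message) shows why the reported allocation is so small even though the job runs out of RAM: the failing array is only one of many donor-indexed arrays, and an inflated --nDonor multiplies all of them.

```python
# Shape and dtype taken from the MemoryError message above.
n_snps, n_donor, n_gt = 1_939_760, 4, 3
bytes_needed = n_snps * n_donor * n_gt * 8  # float64 = 8 bytes per element
mib = bytes_needed / 2**20
print(f"{mib:.0f} MiB")  # 178 MiB, matching the "178. MiB" in the error
```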

Thanks!

huangyh09 commented 3 years ago

Thanks for reporting it. I haven't experienced this memory issue before; I will test it for a future release.

BTW, if you don't provide the donor VCF, it works with 50GB, right? For the time being, you can demultiplex into 4 donors without a donor VCF, then align the output donor genotype file GT_donors.vireo.vcf.gz to the two known donors with vireoSNP.vcf.match_VCF_samples(). Example usage here.

Yuanhua

Zepeng-Mu commented 3 years ago

I tried running without a VCF before, and there was also a memory issue, but I'm not sure if it has the same cause as this one. Right now I'm running with nDonor=2 together with the VCF for those two donors, and the result seems fine.

It was only the combination of nDonor=4 and a VCF with 2 donors that produced the error above. This was a mistake in my code; in reality only two samples were mixed in the experiment.

Is it necessary to first demultiplex into 4 samples without a VCF and then match back to the known VCF with 2 samples?

huangyh09 commented 3 years ago

OK, if there are genuinely only 2 donors in the scRNA-seq data, I would suggest using nDonor=2 directly. BTW, how many cells do you have? I wonder whether that is the reason for the large memory usage.
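For reference, a minimal invocation along those lines might look like the following. The paths are placeholders, and `-c`, `-d`, `-N`, and `-o` are the standard vireo CLI flags for the cellSNP folder, donor VCF, donor count, and output directory (double-check against `vireo --help` for your installed version):

```shell
# Placeholder paths; run vireo with the known donor VCF and nDonor=2.
vireo -c cellSNP_out/ -d two_donors.vcf.gz -N 2 -o vireo_out/
```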

Zepeng-Mu commented 3 years ago

I have about 11K cells with 1.9 million SNPs from 2 samples. I have also used 60K cells with 2.7 million SNPs from 4 samples, and there was no memory error. So it may not be a real memory issue.

mperalc commented 1 year ago

Hello, I've recently been getting a similar issue when trying to deconvolute 27,358 cells from 73 donors. The initial SNP file has 8 million SNPs, and after using --minCOUNT 20 and --minMAF 0.2 my sample_subset.vcf.gz file has 72,749 variants. I noticed that reducing the number of variants prevents the error (by increasing --minCOUNT to 40, for example). Curiously, no matter whether I give the job 300 GB or 400 GB of memory, the amount of memory that can't be allocated is the same (4.27 GiB). Here's the error message:

Traceback (most recent call last):
  File "/software/teamtrynka/conda/trynka-base/bin/vireo", line 10, in <module>
    sys.exit(main())
  File "/software/teamtrynka/conda/trynka-base/lib/python3.6/site-packages/vireoSNP/vireo.py", line 209, in main
    nproc=options.nproc)
  File "/software/teamtrynka/conda/trynka-base/lib/python3.6/site-packages/vireoSNP/utils/vireo_wrap.py", line 152, in vireo_wrap
    doublet_prob, ID_prob, doublet_LLR = predict_doublet(modelCA, AD, DP)
  File "/software/teamtrynka/conda/trynka-base/lib/python3.6/site-packages/vireoSNP/utils/vireo_doublet.py", line 39, in predict_doublet
    GT_both = add_doublet_GT(vobj.GT_prob)
  File "/software/teamtrynka/conda/trynka-base/lib/python3.6/site-packages/vireoSNP/utils/vireo_doublet.py", line 127, in add_doublet_GT
    GT_prob[:, s_idx2, :])
numpy.core._exceptions.MemoryError: Unable to allocate 4.27 GiB for an array with shape (2628, 72749, 3) and data type float64

Thanks!
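A short sketch (stdlib arithmetic only, derived from the shape in the traceback) of why 73 donors triggers the failure in add_doublet_GT: vireo enumerates every donor pair as a doublet candidate, so the doublet genotype array grows quadratically with the donor count, while reducing the variant count (e.g. via --minCOUNT) shrinks it only linearly. The constant 4.27 GiB figure is simply the size of this one array, which depends only on the donor and variant counts, not on the job's memory limit.

```python
# Donor and variant counts taken from the report above.
n_donor = 73
n_variants = 72_749
n_pairs = n_donor * (n_donor - 1) // 2       # all donor pairs as doublets
gib = n_pairs * n_variants * 3 * 8 / 2**30   # (2628, 72749, 3) float64
print(n_pairs, f"{gib:.2f} GiB")  # 2628 pairs, 4.27 GiB: matches the traceback
# For comparison, 4 donors yield only 6 doublet pairs, which is consistent
# with the earlier 60K-cell / 2.7M-SNP run with 4 donors not hitting this.
```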