single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Memory overload for large number of patients #65

Open racng opened 2 years ago

racng commented 2 years ago

I am trying to demultiplex 8,000 cells pooled from 63 patients using the following command:

vireo -t GT -N 63  \
--vartrixData=${VARDIR}/vt_alt.mtx,${VARDIR}/vt_ref.mtx,${BC},${VCF}  \
-d ${VCF} -o $OUTDIR 

but the process was killed after reaching 111 GB of virtual memory, according to the kernel message buffer printed by dmesg:

[Wed Jul 27 12:13:53 2022] Out of memory: Killed process 22763 (vireo) total-vm:111413196kB, anon-rss:103289000kB, file-rss:0kB, shmem-rss:0kB, UID:1307212172 pgtables:203096kB oom_score_adj:0

Here are the sizes of different input files:

# Vartrix output
332M vt_alt.mtx
334M vt_ref.mtx
17M vt_var.mtx

# VCF file
820M vcf.gz

I am using only 1 subprocess. How does the memory usage scale with the number of patients or the size of the VCF file?

I also wanted to clarify: does increasing the number of subprocesses with the -p option increase memory usage?

Thanks!

huangyh09 commented 2 years ago

Hi, thanks for reporting the issue.

It indeed looks like a memory issue caused by the large vcf.gz, which will be much larger than 820M once decompressed and loaded into memory. Does this VCF file contain only the relevant SNPs, and roughly how many are there? If not, you can use bcftools to filter out variants that are not included in your Vartrix data. Also, even a subset of SNPs can be enough to separate donors. Memory usage should scale roughly linearly with the number of donors in the VCF file.
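For illustration, the filtering idea is just "keep a VCF record only if its CHROM:POS appears in the Vartrix variant list". In practice bcftools is the right tool; the minimal stdlib sketch below only shows the logic, and the toy records and positions in it are made up:

```python
# Sketch: keep header lines plus records whose (CHROM, POS) is in a kept set.
# In practice you would stream a real VCF and build kept_positions from the
# Vartrix variants file; everything below is illustrative.

def filter_vcf_lines(vcf_lines, kept_positions):
    """Yield header lines and records whose (CHROM, POS) is in kept_positions."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        chrom, pos = line.split("\t")[:2]
        if (chrom, pos) in kept_positions:
            yield line

# toy example
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
]
kept = {("chr1", "100")}
print(list(filter_vcf_lines(vcf, kept)))  # keeps 2 headers + the chr1:100 record
```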

Yes, -p will increase memory usage, as Python parallelism here is implemented with multiple processes rather than threads.

Yuanhua

racng commented 2 years ago

Thanks for the advice! The VCF file contained SNPs generated by joint calling on GVCFs. I used bcftools isec to intersect it with cellSNP's list of common variants, which reduced the number of SNPs from 1.1M to 370K. The vireo process is currently holding steady at 65 GB of RAM, so hopefully it will work.

I saw that the vireo online documentation recommends "filtering out SNPs with too many missing values or genotypes too similar across donors". Do you have examples of how to apply these two filters?
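For reference, here is my rough guess at what those two filters might look like; this is an assumption about the docs' intent, not vireo's actual code. GT, the dosage encoding, and both thresholds are made up for illustration:

```python
import numpy as np

# Assumed setup: GT is an (n_snps, n_donors) array of genotype dosages
# {0, 1, 2}, with np.nan marking missing calls.
def snp_filter_mask(GT, max_missing=0.2):
    """Return a boolean mask of SNPs to keep."""
    missing_frac = np.isnan(GT).mean(axis=1)      # per-SNP fraction missing
    keep_missing = missing_frac <= max_missing    # drop mostly-missing SNPs
    # drop SNPs whose observed genotypes are identical across all donors
    informative = np.nanstd(GT, axis=1) > 0
    return keep_missing & informative

GT = np.array([
    [0, 0, 0],             # identical in all donors -> uninformative
    [0, 1, 2],             # varies across donors -> keep
    [np.nan, np.nan, 1],   # 2/3 missing -> drop
])
print(snp_filter_mask(GT))  # -> [False  True False]
```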

racng commented 2 years ago

I tried reducing the number of SNPs to ~300K, but it still ran out of RAM. I then wrote a Python script to read the Vartrix outputs of ref and alt allele counts per cell, and kept only SNPs for which both the ref and alt alleles were detected somewhere in the scRNA-seq dataset. This reduced the number of SNPs to ~60K, and with this VCF I was able to run vireo on a single thread to demultiplex the 63 patients without memory overload (under 128 GB).
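The filtering step described above might look something like this sketch (not the actual script; toy arrays stand in for the SNP-by-cell count matrices you would load from vt_ref.mtx and vt_alt.mtx, e.g. with scipy.io.mmread):

```python
import numpy as np

# ref and alt: (n_snps, n_cells) count matrices from Vartrix (toy data here).
def both_alleles_seen(ref, alt):
    """True for SNPs where ref AND alt counts are each nonzero in >= 1 cell."""
    return (ref.sum(axis=1) > 0) & (alt.sum(axis=1) > 0)

ref = np.array([[3, 0],   # ref seen, alt seen -> keep
                [0, 0],   # ref never seen -> drop
                [2, 1]])  # alt never seen -> drop
alt = np.array([[0, 1],
                [4, 0],
                [0, 0]])
print(both_alleles_seen(ref, alt))  # -> [ True False False]
```

The kept mask can then be used to subset the VCF before handing it to vireo.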