racng opened this issue 2 years ago
Hi, thanks for sharing the issue.
It indeed looks like a memory issue from the large vcf.gz, which will be much larger than 820M when unzipped and loaded into memory. Does this VCF file only contain the relevant SNPs, and what is the rough number? Otherwise, you may use bcftools to filter out variants that are not included in your Vartrix data. Also, even a subset of SNPs can be enough to separate donors. The memory usage should scale roughly linearly with the number of donors in the VCF file.
Yes, `-p` will increase the memory usage, since Python only supports parallelism through multiple processes.
Yuanhua
Thanks for the advice! The VCF file contained SNPs generated by joint calling on GVCFs. I used bcftools isec to get its intersection with cellSNP's list of common variants, which reduced the number of SNPs from 1.1M to 370K. Currently the vireo process is using 65GB of RAM steadily, so hopefully it will work.
I saw that the vireo online documentation recommends "filtering out SNPs with too many missing values or with genotypes that are too similar across donors". Do you have examples of how to apply these two filters?
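For reference, this is the kind of filtering I have in mind, written as a minimal sketch over a multi-sample donor VCF. The file names and thresholds are placeholders, and it assumes the GT field is the first entry in each sample column:

```python
import gzip

# Hypothetical thresholds: drop SNPs missing in >10% of donors, and keep
# only SNPs where the donors do not all share the same genotype.
MAX_MISSING_FRAC = 0.10
MIN_DISTINCT_GTS = 2

def keep_snp(gt_calls):
    """gt_calls: list of GT strings per donor, e.g. '0/0', '0|1', './.'."""
    n_missing = sum(1 for g in gt_calls if g in ('./.', '.|.', '.'))
    if n_missing / len(gt_calls) > MAX_MISSING_FRAC:
        return False  # too many missing values
    # Normalize phased/unphased calls and count distinct non-missing genotypes.
    observed = {g.replace('|', '/') for g in gt_calls} - {'./.', '.'}
    return len(observed) >= MIN_DISTINCT_GTS  # genotypes differ across donors

with gzip.open('donors.vcf.gz', 'rt') as fin, \
        open('donors.filtered.vcf', 'w') as fout:
    for line in fin:
        if line.startswith('#'):
            fout.write(line)
            continue
        fields = line.rstrip('\n').split('\t')
        # Sample columns start at index 9; GT is the first ':'-separated entry.
        gts = [s.split(':')[0] for s in fields[9:]]
        if keep_snp(gts):
            fout.write(line)
```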
I have tried reducing the number of SNPs to ~300K, but it still ran out of RAM. I then wrote a Python script to read the Vartrix outputs of ref and alt allele counts per cell, and kept only SNPs with both the ref and alt alleles detected somewhere in the scRNA-seq dataset. This reduced the number of SNPs to ~60K, and I was able to run vireo on a single thread with this VCF to demultiplex 63 patients without running out of memory (under 128GB).
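Roughly, the script does the following (a minimal sketch, assuming Vartrix was run with `--ref-matrix` so the ref and alt Matrix Market files are SNP-by-cell count matrices whose rows follow the record order of the input VCF; file names here are placeholders):

```python
import gzip
import numpy as np
from scipy.io import mmread

# Hypothetical file names: ref.mtx/alt.mtx are the Vartrix count matrices
# (SNPs x cells), and donors.filtered.vcf.gz is the VCF given to Vartrix,
# so its record order matches the matrix rows.
ref = mmread('ref.mtx').tocsr()
alt = mmread('alt.mtx').tocsr()

# Keep SNPs where the ref and the alt allele are each seen in at least one cell.
keep = np.asarray((ref.sum(axis=1) > 0) & (alt.sum(axis=1) > 0)).ravel()
print(f'keeping {keep.sum()} of {keep.size} SNPs')

# Subset the donor VCF to the kept records, preserving the original order.
with gzip.open('donors.filtered.vcf.gz', 'rt') as fin, \
        open('donors.informative.vcf', 'w') as fout:
    i = 0
    for line in fin:
        if line.startswith('#'):
            fout.write(line)
            continue
        if keep[i]:
            fout.write(line)
        i += 1
```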
I am trying to demultiplex 8,000 cells pooled from 63 patients using the following command:
but the process was killed while it was using 111GB of virtual memory, according to the kernel message buffer printed by `dmesg`. Here are the sizes of the different input files:
I am using only 1 subprocess. How does the memory usage scale with the number of patients or the size of the VCF file?
I also wanted to clarify: does increasing the number of subprocesses with the `-p` option increase the memory usage?
Thanks!