single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Proper steps for cellSNP and Vireo for large dataset #61

Open vincycheng opened 2 years ago

vincycheng commented 2 years ago

Hi. I have been working with a dataset of 20K cells, and I have a few questions about how to approach it.

Q1: Running cellSNP on the whole BAM file was taking forever (more than 15 days), so I followed a suggestion I saw, split the BAM file by chromosome, and got a separate cellSNP output per chromosome. I then merged them together. Is there a better or preferred way to merge them for Vireo? What I am currently doing is: bcftools merge, then bcftools sort.
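One possible alternative to the merge-then-sort step above, offered as a sketch rather than as the maintainers' recommendation: `bcftools concat` is the bcftools command for joining VCFs that share the same sample columns but cover different regions, while `bcftools merge` is meant for files with different samples. The helper names `chrom_sort_key` and `build_concat_command`, the `chr`-prefixed chromosome labels, and the file paths are all hypothetical; the snippet only builds the command and does not run it.

```python
import shlex


def chrom_sort_key(chrom):
    """Order numbered chromosomes numerically, then the rest alphabetically."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def build_concat_command(vcf_by_chrom, merged_path):
    """Build a `bcftools concat` argument list from a {chromosome: path} dict.

    Assumes every per-chromosome VCF has the same cells in the same order.
    """
    ordered = [vcf_by_chrom[c] for c in sorted(vcf_by_chrom, key=chrom_sort_key)]
    # -O z writes compressed VCF; -o names the output file.
    return ["bcftools", "concat", "-O", "z", "-o", merged_path, *ordered]


# Hypothetical per-chromosome outputs; replace with your own paths.
example = {
    "chr10": "out/chr10.vcf.gz",
    "chr2": "out/chr2.vcf.gz",
    "chrX": "out/chrX.vcf.gz",
    "chr1": "out/chr1.vcf.gz",
}
print(shlex.join(build_concat_command(example, "merged.vcf.gz")))
```

The chromosome order here is only a convenience; it should match the contig order of your reference.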

Q2: For Vireo, I used the merged VCF file (1.8GB) mentioned above as $CELL_DATA, and I also have the $DONOR_GT_FILE (744KB), which I subset using bcftools view as suggested. The issue is that it seems to use a lot of memory, and it is hard for me to estimate how much memory I need to reserve. The command I used is: vireo -c $CELL_DATA -d $DONOR_GT_FILE -o $OUT_DIR

Please advise. Thanks!

huangyh09 commented 2 years ago

Hi, thanks for the questions.

Q1: please use [cellsnp-lite](https://cellsnp-lite.readthedocs.io); it is re-implemented in C/C++ and is much faster and more memory-efficient.

Q2: vireo supports loading the sparse matrices directly, so it won't touch the large cellSNP.cells.vcf.gz. Memory usage is generally OK for 20K cells, and I guess ~20GB of memory should be sufficient. Otherwise, how many SNPs are there in your CELL_DATA folder (you can get the count from cellSNP.base.vcf.gz)?
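To get a feel for the data size asked about above, the SNP count can be read from cellSNP.base.vcf.gz and the matrix dimensions from the header of a Matrix Market (.mtx) file. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the VCF is gzip-compressed and the .mtx file is plain text.

```python
import gzip


def count_vcf_records(vcf_gz_path):
    """Count variant records (non-header, non-empty lines) in a gzipped VCF."""
    n_records = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            # Meta and header lines start with '#'; every other line is one site.
            if line.strip() and not line.startswith("#"):
                n_records += 1
    return n_records


def read_mtx_shape(mtx_path):
    """Return (rows, columns, stored entries) from a Matrix Market header."""
    with open(mtx_path, "rt") as handle:
        for line in handle:
            # The banner and comment lines start with '%'.
            if line.startswith("%") or not line.strip():
                continue
            # The first remaining line is the size line: rows, columns, entries.
            n_rows, n_cols, n_entries = (int(x) for x in line.split()[:3])
            return n_rows, n_cols, n_entries
    raise ValueError(f"no size line found in {mtx_path}")
```

For example, `count_vcf_records("CELL_DATA/cellSNP.base.vcf.gz")` gives the number of SNPs, and `read_mtx_shape("CELL_DATA/cellSNP.tag.DP.mtx")` gives the matrix dimensions and the number of non-zero entries (both paths are examples, adjust to your folder).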

Hope these help. Yuanhua

vincycheng commented 2 years ago

Hi, things seem to work well now after switching to cellsnp-lite. Thanks!

hsymoon commented 3 months ago

Hello, I hit a "MemoryError" when running vireo in mode 2. My command is: vireo -c $sc_vcf -d 2donor.sorted.vcf.gz -o ${OUT_DIR} -N 4 --randSeed 2 --genoTag PL

Here is the output of bcftools +counts $sc_vcf:

Number of samples: 9808
Number of SNPs: 747932
Number of INDELs: 120289
Number of MNPs: 0
Number of others: 0
Number of sites: 868210

Can you help me? Thanks very much.

huangyh09 commented 3 months ago

Hi, thanks for sharing the issue. It looks similar to Q2 above, so instead of passing the sc_vcf file, try passing the CELL_DATA folder written by cellsnp-lite to -c. vireo will then load the sparse matrices directly and skip parsing the VCF file.
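A small sketch of that switch, under stated assumptions: the file names listed below are the ones cellsnp-lite is understood to write into its output folder, so check them against your own folder before relying on them. The helper names are hypothetical, and the vireo flags are the ones already used in this thread. The code only builds the command and does not run vireo.

```python
import shlex
from pathlib import Path

# Assumed cellsnp-lite output names; verify against your own CELL_DATA folder.
EXPECTED_FILES = (
    "cellSNP.samples.tsv",
    "cellSNP.tag.AD.mtx",
    "cellSNP.tag.DP.mtx",
)
BASE_VCF_NAMES = ("cellSNP.base.vcf.gz", "cellSNP.base.vcf")


def missing_cell_data_files(cell_data_dir):
    """List the expected files that are absent from the folder."""
    folder = Path(cell_data_dir)
    missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
    # Either the gzipped or the plain base VCF is accepted.
    if not any((folder / name).is_file() for name in BASE_VCF_NAMES):
        missing.append(BASE_VCF_NAMES[0])
    return missing


def build_vireo_command(cell_data_dir, donor_vcf, out_dir, n_donor, extra_args=()):
    """Build a vireo argument list with -c pointing at the folder, not a VCF."""
    return [
        "vireo",
        "-c", str(cell_data_dir),
        "-d", str(donor_vcf),
        "-o", str(out_dir),
        "-N", str(n_donor),
        *extra_args,
    ]


# Example values; CELL_DATA and vireo_out are placeholder paths.
cmd = build_vireo_command(
    "CELL_DATA", "2donor.sorted.vcf.gz", "vireo_out", 4,
    extra_args=("--randSeed", "2", "--genoTag", "PL"),
)
print(shlex.join(cmd))
```

Calling `missing_cell_data_files("CELL_DATA")` first is a cheap way to confirm that the folder, rather than a single VCF, is what gets passed to `-c`.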