single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Vireo taking ages to run #42

Open lucygarner opened 2 years ago

lucygarner commented 2 years ago

Hi,

I have some single-cell RNA-seq data for which I don't have genotype information.

I ran cellSNP-lite on a merged BAM file containing all of the donors to genotype the single cells as follows:

cellsnp-lite -s data.dir/merged.bam -b data.dir/barcodes.tsv -O results.dir/merged.dir -R vcf/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --genotype --minCOUNT 10 --minMAF 0.1 -p 10

I am now running Vireo as follows:

vireo -c results.dir/merged.dir -N 4 -o results/merged.dir --randSeed=3245 -p 30

However, it has been running for three days and still hasn't finished. I have spoken to others who have used Vireo and they mentioned that it was fast, so I'm not sure if I'm doing something wrong?

This is the log message so far:

[vireo] Loading cell folder ...
[vireo] Demultiplex 41622 cells to 4 donors with 104779 variants.

Many thanks for the help.

Best wishes, Lucy

lucygarner commented 2 years ago

This is the log for cellSNP-lite in case that helps.

[I::main] start time: 2022-03-07 10:57:55
[W::check_args] Max depth set to maximum value (2147483647)
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 7352497 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 41622 single cells.
[I::csp_fetch_core][Thread-2] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-3] 2.00% SNPs processed.
[I::csp_fetch_core][Thread-5] 2.00% SNPs processed.
...
[I::csp_fetch_core][Thread-9] 90.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 92.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 94.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 96.00% SNPs processed.
[I::csp_fetch_core][Thread-9] 98.00% SNPs processed.
[I::main] All Done!
[I::main] end time: 2022-03-08 10:09:17
[I::main] time spent: 83482 seconds.

huangyh09 commented 2 years ago

Hi Lucy,

Thanks for raising the issue. Your dataset does look relatively large, so I wonder if memory is the bottleneck. You can check the memory usage with free -h.

If that is the case, you can change -p 30 to -p 1 in your command line so that only one CPU is used.
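
A minimal sketch of that memory check, assuming a Linux host where /proc/meminfo is readable. The variable names are illustrative and not part of Vireo or cellSNP-lite:

```shell
# Show total, used and available memory in human-readable units,
# if the free utility is installed on this host.
command -v free >/dev/null 2>&1 && free -h

# Read the available memory (in kB) from /proc/meminfo and convert it
# to whole GB for a quick comparison with what the job may need.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
avail_kb=${avail_kb:-0}   # fall back to 0 if the field is missing
avail_gb=$((avail_kb / 1024 / 1024))
echo "Available memory: ${avail_gb} GB"
```

On a cluster, these figures describe the node the command runs on, which may differ from the memory the scheduler actually grants to the job.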

Another option is to set a more stringent cutoff for --minCOUNT in cellsnp-lite, e.g., 30 or 100. It looks like you already have far more variants than needed. This is probably not the fastest way to sort it out, though, as you would need to re-run cellsnp-lite.

Yuanhua
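
A sketch of the stricter cellSNP-lite re-run suggested above, reusing the options from the original command and changing only --minCOUNT. The output folder name is a hypothetical choice made here so that the first run is not overwritten. The command is printed rather than executed, so it can be inspected first:

```shell
MIN_COUNT=30   # stricter than the original 10; 100 is the other value suggested above
OUT_DIR="results.dir/merged_minCOUNT${MIN_COUNT}.dir"   # hypothetical output folder

# Same options as the original run; only --minCOUNT and the output folder differ.
CMD="cellsnp-lite -s data.dir/merged.bam -b data.dir/barcodes.tsv -O ${OUT_DIR} -R vcf/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz --genotype --minCOUNT ${MIN_COUNT} --minMAF 0.1 -p 10"

# Dry run: print the command for inspection.
echo "$CMD"
# eval "$CMD"   # uncomment to launch the run
```

Vireo would then need to be pointed at the new folder with -c.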

lucygarner commented 2 years ago

Hi @huangyh09,

Thank you for the quick response. I am running the command on a large compute cluster but maybe I didn't specify enough memory. How much memory would you recommend specifying?

Why do you suggest using only one CPU (-p 1)? Would using more CPUs not make it quicker?

If this does not work, I will try increasing the --minCOUNT threshold for cellSNP.

Best wishes, Lucy

huangyh09 commented 2 years ago

I see. You could probably start by requesting 50GB of memory; I guess it won't use more than 100GB. Another major factor in memory usage is the number of CPUs, as n copies of the data are held in memory, one for each sub-process. So you may want to use -p 4 as a safer start instead of 30.

Yuanhua
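
A rough sketch of the reasoning above, assuming memory grows with one copy of the data per sub-process. PER_COPY_GB is a hypothetical placeholder that would have to be measured on the actual dataset, and how the memory is requested depends on the cluster scheduler, which is not shown. The Vireo command keeps the original options and changes only -p:

```shell
MEM_GB=50        # memory requested for the job, the starting point suggested above
PER_COPY_GB=10   # hypothetical placeholder: memory needed for one copy of the data
WANTED_PROC=4    # the safer -p value suggested above

# Largest number of sub-processes whose data copies fit into MEM_GB.
MAX_PROC=$((MEM_GB / PER_COPY_GB))
if [ "$MAX_PROC" -lt 1 ]; then MAX_PROC=1; fi

# Use the smaller of the wanted and the affordable number of sub-processes.
if [ "$WANTED_PROC" -lt "$MAX_PROC" ]; then
    N_PROC=$WANTED_PROC
else
    N_PROC=$MAX_PROC
fi

CMD="vireo -c results.dir/merged.dir -N 4 -o results/merged.dir --randSeed=3245 -p ${N_PROC}"

# Dry run: print the command for inspection.
echo "$CMD"
# eval "$CMD"   # uncomment to launch the run
```

If the measured per-copy footprint is larger than the placeholder, the same arithmetic yields a smaller -p value, down to the single process suggested earlier in the thread.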