Run time estimation

PeteHaitch commented 4 years ago

Similar to https://github.com/single-cell-genetics/cellSNP/issues/3 but wondering if things have changed. Running cellSNP v0.1.7 as

  cellSNP --samFile ${CELLRANGERDIR}/"${SAMPLE}"/outs/possorted_genome_bam.bam \
          --outDir ${OUTDIR} \
          --regionsVCF genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz \
          --barcodeFile ${PROJECT_ROOT}/data/emptyDrops/"${SAMPLE}".barcodes.txt \
          --nproc 20 \
          --minMAF 0.1 \
          --minCOUNT 20

with 31,707 barcodes on a 25G BAM file has been going for > 18 days! It's still writing output, too (as of 2020-02-03 5PM):

% ll -t data/cellSNP/cellSNP.cells.vcf.gz.temp_*
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_17_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_3_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_11_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_15_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:01 data/cellSNP/cellSNP.cells.vcf.gz.temp_16_
-rw-r----- 1 hickey grpu_mritchie_1 1.1G Feb  3 16:59 data/cellSNP/cellSNP.cells.vcf.gz.temp_19_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 16:56 data/cellSNP/cellSNP.cells.vcf.gz.temp_12_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:55 data/cellSNP/cellSNP.cells.vcf.gz.temp_8_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_6_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_9_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:51 data/cellSNP/cellSNP.cells.vcf.gz.temp_10_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 16:50 data/cellSNP/cellSNP.cells.vcf.gz.temp_1_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:46 data/cellSNP/cellSNP.cells.vcf.gz.temp_2_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:38 data/cellSNP/cellSNP.cells.vcf.gz.temp_14_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:37 data/cellSNP/cellSNP.cells.vcf.gz.temp_13_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:32 data/cellSNP/cellSNP.cells.vcf.gz.temp_4_
-rw-r----- 1 hickey grpu_mritchie_1 2.5G Feb  3 16:19 data/cellSNP/cellSNP.cells.vcf.gz.temp_7_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 15:04 data/cellSNP/cellSNP.cells.vcf.gz.temp_18_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 13:44 data/cellSNP/cellSNP.cells.vcf.gz.temp_0_
-rw-r----- 1 hickey grpu_mritchie_1 1.8G Feb  2 23:52 data/cellSNP/cellSNP.cells.vcf.gz.temp_5_

I've run cellSNP before and although it took a few days it certainly didn't take this long. I'm wondering:

What particular parts of this (e.g., size of BAM, number of barcodes, number of loci in --regionsVCF, ...) might be causing this huge runtime?
What might I do to speed cellSNP up for subsequent datasets (I'm anticipating several datasets, many larger than this, over the course of the year)?
How can I estimate how much longer this particular process has to run?

Thanks, Pete

huangyh09 commented 4 years ago

Hi Pete,

The bottleneck is still there for large data sets. In your case, it is probably caused by the large number of cell barcodes. Normally, it runs within one or two days for ~10k cells, so 31k cells may increase the running time. The running time also scales linearly with the number of candidate SNPs (i.e., --regionsVCF). I suggest you switch from the SNP_AF5e4 version to SNP_AF5e2, or even to the SNPs you got last time (i.e., the output cellSNP.base.vcf.gz).

For speeding up, you could split the candidate SNPs (e.g., by chromosome or at random) and run the pieces on multiple nodes if you are working on a cluster.
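A minimal sketch of the split-by-chromosome idea, using only gzip and awk. The helper name split_vcf_by_chrom and the output names candidates.<chrom>.vcf.gz are hypothetical and not part of cellSNP. The pieces are compressed with plain gzip; whether a given cellSNP version needs bgzip-compressed input instead is an assumption to check.

```shell
# Hypothetical helper (not part of cellSNP): split a candidate-SNP VCF
# into one gzip-compressed VCF per chromosome, each with the full header.
# Usage: split_vcf_by_chrom INPUT.vcf[.gz] OUTDIR
split_vcf_by_chrom() {
    local vcf="$1" outdir="$2"
    mkdir -p "$outdir"
    # gzip -cdf decompresses .gz input and passes plain text through unchanged.
    gzip -cdf "$vcf" | awk -v dir="$outdir" '
        /^#/ { header = header $0 "\n"; next }    # collect all header lines
        {
            out = dir "/candidates." $1 ".vcf"    # $1 is the CHROM column
            if (!(out in seen)) {                 # first record for this chromosome
                printf "%s", header > out
                seen[out] = 1
            }
            print > out
        }
        END { for (f in seen) close(f) }
    '
    # Compress the pieces so they match the .vcf.gz naming used in this thread.
    gzip -f "$outdir"/candidates.*.vcf
}
```

Each piece could then be passed to its own cellSNP run via --regionsVCF, with a separate --outDir per piece, and the runs submitted as independent cluster jobs. How the per-piece outputs are best combined afterwards is not covered in this thread.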

For estimating the running time, you could read the log file, which shows how many SNPs have been processed.
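As a rough illustration of that estimate: if the running time is roughly linear in the number of candidate SNPs, the remaining time can be extrapolated from the processed count. The helper eta_hours below is hypothetical. The log format is not shown in this thread, so the processed count has to be read from the log by hand, and the numbers in the usage line are made up apart from the 18 days (432 hours) mentioned above.

```shell
# Hypothetical helper: linear extrapolation of the remaining run time.
# Usage: eta_hours PROCESSED_SNPS TOTAL_SNPS ELAPSED_HOURS
# remaining = elapsed * (total - processed) / processed
eta_hours() {
    awk -v p="$1" -v t="$2" -v e="$3" 'BEGIN {
        if (p <= 0 || t < p) { print "NA"; exit }   # no progress yet or bad input
        printf "%.1f\n", e * (t - p) / p
    }'
}

# Example with made-up counts: 1,000,000 of 4,000,000 SNPs done after 432 hours.
# eta_hours 1000000 4000000 432    # prints 1296.0
```

The total number of candidate SNPs can be counted with `gzip -cd candidates.vcf.gz | grep -vc '^#'`, where candidates.vcf.gz stands for whatever file was passed to --regionsVCF. The estimate assumes every SNP costs about the same to process, which may not hold if coverage varies strongly between regions.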

Yuanhua

PeteHaitch commented 4 years ago

Thanks, Yuanhua.

I've started looking into providing a much-reduced set of candidate SNPs. Might I suggest adding to the documentation to explain how cellSNP scales in the number of barcodes, candidate SNPs, and number of reads? It would also be useful to have a reduced set of candidate SNPs for common use cases, e.g., SNP_AF5e4 or SNP_AF5e2 intersected with 3' UTRs (or similar) for use with 10X 3' scRNA-seq data.

micans commented 4 years ago

Thank you for this great tool. We are running cellSNP without a VCF file (10x data mode 2) and it has now been running for a week. Is there any downside to further parallelising by running a separate cellSNP process for each chromosome using --chrom (this would give me greater flexibility in task distribution)? Do you have any other/further recommendations for speeding up processing? Thanks, Stijn
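For illustration only, the per-chromosome layout asked about here could be sketched as below. The helper print_chrom_jobs is hypothetical and only prints one command per chromosome for review or submission to a scheduler. It uses only flags that appear in this thread. The --minMAF and --minCOUNT values are copied from the first post, not from this run, and the thread does not say whether splitting by --chrom has a downside.

```shell
# Hypothetical helper: print one cellSNP command per chromosome (dry run).
# Usage: print_chrom_jobs BAM BARCODES OUTROOT CHROM [CHROM ...]
# Paths are inserted as-is, so paths containing spaces would need quoting.
print_chrom_jobs() {
    local bam="$1" barcodes="$2" outroot="$3"
    shift 3
    local chrom
    for chrom in "$@"; do
        # Each job gets its own output directory so the runs cannot collide.
        printf 'cellSNP --samFile %s --barcodeFile %s --outDir %s/chrom_%s --chrom %s --minMAF 0.1 --minCOUNT 20\n' \
            "$bam" "$barcodes" "$outroot" "$chrom" "$chrom"
    done
}

# Example: print_chrom_jobs possorted_genome_bam.bam barcodes.txt out 1 2 X
```

The chromosome names must match those in the BAM header (for example 1 versus chr1), which is something to check before submitting the jobs.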

single-cell-genetics / cellSNP

Run time estimation #9