Open PeteHaitch opened 4 years ago
Hi Pete,
The bottleneck is still there for large data sets. In your case, it is probably caused by the large number of cell barcodes. Normally, it runs within one or two days for ~10k cells, so 31k cells may increase the running time. The running time also scales linearly with the number of candidate SNPs (i.e., --regionVCF). I suggest you change from the SNP_AF5e4 version to SNP_AF5e2, or even use the one you got last time (i.e., the output cellSNP.base.vcf.gz).
To speed things up, you could split the candidate SNPs (e.g., by chromosome or at random) and run the pieces on multiple nodes if you are working on a cluster.
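The by-chromosome split could look roughly like the sketch below. It is not part of cellSNP: the function name split_vcf_by_chrom and the output naming scheme candidates.<chrom>.vcf.gz are made up for illustration, and it assumes a standard tab-separated VCF whose first column is the chromosome. The output is plain gzip rather than bgzip, so whether a given downstream tool accepts it is also an assumption.

```python
import gzip
from pathlib import Path


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def split_vcf_by_chrom(vcf_path, out_dir):
    """Stream a VCF and write one gzip-compressed VCF per chromosome.

    Returns a dict mapping chromosome name to output path.
    The file naming scheme is hypothetical, chosen for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    header = []   # all '#' lines, copied into every output file
    handles = {}  # chromosome -> open output handle
    outputs = {}  # chromosome -> output path
    try:
        with _open_text(vcf_path) as fh:
            for line in fh:
                if line.startswith("#"):
                    header.append(line)
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    outputs[chrom] = out_dir / f"candidates.{chrom}.vcf.gz"
                    handles[chrom] = gzip.open(outputs[chrom], "wt")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return outputs
```

Each resulting file can then be passed as the candidate SNP list of a separate job. One open file handle is kept per chromosome, which is fine for a typical genome but may need rethinking for references with thousands of contigs.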
For estimating the running time, you could read the log file, which shows how many SNPs have been processed.
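As a rough illustration of that estimate, the sketch below extrapolates the remaining time from the number of SNPs processed so far. It assumes the running time is approximately linear in the number of candidate SNPs, as described above, and that the processed count is read off the log by hand, since the log format is not shown here. The function name estimate_remaining_hours is hypothetical.

```python
def estimate_remaining_hours(processed_snps, total_snps, elapsed_hours):
    """Linear extrapolation of the remaining run time, in hours.

    processed_snps: SNPs reported as done so far (read from the log by hand)
    total_snps:     number of candidate SNPs in the input
    elapsed_hours:  wall-clock time spent so far
    """
    if processed_snps <= 0:
        raise ValueError("need at least one processed SNP to extrapolate")
    if total_snps < processed_snps:
        raise ValueError("total_snps must be >= processed_snps")
    remaining = total_snps - processed_snps
    # Assumes a constant per-SNP cost, which is only an approximation.
    return elapsed_hours * remaining / processed_snps


if __name__ == "__main__":
    # Made-up numbers: 2M of 8M candidate SNPs done after 48 hours.
    print(estimate_remaining_hours(2_000_000, 8_000_000, 48.0))
```

The estimate is only as good as the linearity assumption; regions with very different coverage can make the per-SNP cost uneven.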
Yuanhua
Thanks, Yuanhua.
I've started looking into providing a much-reduced set of candidate SNPs.
Might I suggest adding to the documentation to explain how cellSNP scales in the number of barcodes, candidate SNPs, and number of reads?
It would also be useful to have a reduced set of candidate SNPs for common use cases, e.g., SNP_AF5e4 or SNP_AF5e2 intersected with 3' UTRs (or similar) for use with 10X 3' scRNA-seq data.
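Such an intersection could be sketched as below, purely as an illustration and not as something cellSNP provides. It assumes the regions of interest (e.g., 3' UTRs) are available as BED lines (0-based, half-open intervals), that the candidate SNPs are standard VCF lines (1-based positions), and that both use the same chromosome naming ("1" vs "chr1"). All function names are hypothetical.

```python
import bisect
from collections import defaultdict


def load_bed_intervals(bed_lines):
    """Parse BED lines into sorted, merged (start, end) lists per chromosome."""
    raw = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:  # overlapping or adjacent: merge
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_intervals(merged, chrom, pos):
    """True if the 1-based VCF position falls inside any BED interval."""
    intervals = merged.get(chrom)
    if not intervals:
        return False
    pos0 = pos - 1  # convert to 0-based to match BED coordinates
    # Index of the last interval whose start is <= pos0.
    i = bisect.bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def filter_vcf_lines(vcf_lines, merged):
    """Yield header lines plus records that fall inside the intervals."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_intervals(merged, chrom, int(pos)):
            yield line
```

For real data, established tools for interval intersection are likely a better choice; the sketch is mainly meant to make the coordinate conventions explicit.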
Thank you for this great tool.
We are running cellSNP without a VCF file (10x data mode 2) and it has now been running for a week.
Is there any downside to further parallelising by running a separate cellSNP process for each chromosome using --chrom (this would give me greater flexibility in task distribution)? Do you have any other/further recommendations for speeding up processing?
Thanks,
Stijn
Similar to https://github.com/single-cell-genetics/cellSNP/issues/3 but wondering if things have changed. Running cellSNP v0.1.7 as
with 31,707 barcodes on a 25G BAM file has been going for > 18 days! It's still writing output, too (as of 2020-02-03 5PM):
I've run cellSNP before and although it took a few days it certainly didn't take this long. I'm wondering:
--regionVCF, ...) might be causing this huge runtime?
Thanks, Pete